I'm seeing some strange performance behaviour when processing images captured with V4L2. I've reduced it to the following test case (I've attached the full source code):
int process_image(const unsigned char *p, int size)
{
    int i, j, sum;

    /* Optionally copy the frame out of the capture buffer first */
    if (do_copy) {
        memcpy(copy_buf, p, size);
        p = copy_buf;
    }

    sum = 0;
    if (do_rows) {
        /* Read all the pixels sequentially, row by row */
        for (i = 0; i < size; i++)
            sum |= p[i];
    } else {
        /* Read all the pixels in non-optimal order */
        for (i = 0; i < fmt.fmt.pix.bytesperline; i++) {
            for (j = 0; j < fmt.fmt.pix.height; j++) {
                sum |= p[i + j * fmt.fmt.pix.bytesperline];
            }
        }
    }

    /* Return the result so the reads cannot be optimised away */
    return sum;
}
I'm measuring the performance using 'perf stat -e cpu-clock ./capture-test -c 100', running on a Boundary Devices SABRE Lite with an OV5642 sensor. The kernel is the Boundary Devices kernel from the dora branch of Yocto.
Here are the results I get:
do_copy=0, do_rows=0: 1458.861001 cpu-clock
do_copy=0, do_rows=1: 2207.192003 cpu-clock
do_copy=1, do_rows=0: 624.319663 cpu-clock
do_copy=1, do_rows=1: 461.399335 cpu-clock
There are two strange things about these results. First, doing a memcpy makes things a lot faster. Second, when not doing a memcpy, row-wise traversal is much slower than column-wise traversal.
Does anyone know why this is happening, and how I can make it faster without doing a memcpy?
The kernel is allocating the buffers with dma_alloc_coherent. I've tried changing to kmalloc and dma_map_single/dma_unmap_single in the QBUF/DQBUF ioctls, but that made no difference.
Also note that if I run the same test code on my laptop, then I get the expected behaviour (row-wise traversal is faster, and memcpy is slower).
Original Attachment has been moved to: capture-test.c.zip
V4L2 buffers allocated with V4L2_MEMORY_MMAP are not cacheable.
CPU accesses to such buffers are very slow (every access goes all the way to DDR).
You should use V4L2_MEMORY_USERPTR instead, but I couldn't make it work.
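If every access to the buffer really does go out to DDR, one thing that might be worth trying is reading it with wider loads, so that four pixels cost one access instead of four. This may also be part of why memcpy comes out ahead, since it typically copies in wide chunks. The sketch below is only an illustration under that assumption: or_bytes_wide is a hypothetical helper, not something from capture-test.c, and it has not been measured on the SABRE Lite.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * OR together all bytes of a buffer, using 32-bit loads where possible.
 * Returns the same value as OR-ing the bytes one at a time.
 * Hypothetical helper for illustration; not part of capture-test.c.
 */
static int or_bytes_wide(const unsigned char *p, size_t size)
{
    uint32_t acc = 0;
    size_t i = 0;

    /* Byte loads until p + i is 4-byte aligned. */
    while (i < size && ((uintptr_t)(p + i) & 3u) != 0)
        acc |= p[i++];

    /* One 32-bit load per four pixels. The fixed-size memcpy is the
     * portable way to express this; compilers normally emit a single load. */
    for (; i + 4 <= size; i += 4) {
        uint32_t w;
        memcpy(&w, p + i, sizeof w);
        acc |= w;
    }

    /* Remaining tail bytes. */
    for (; i < size; i++)
        acc |= p[i];

    /* Fold the four byte lanes down into the low byte. */
    acc |= acc >> 16;
    acc |= acc >> 8;
    return (int)(acc & 0xffu);
}
```

In process_image this would stand in for the do_rows loop, i.e. sum = or_bytes_wide(p, size). Whether it actually closes the gap to the memcpy case would need to be checked with the same perf stat run.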
Were you able to resolve the issue eventually?
Regards,
Erez
Did anyone find a solution?
memcpy from V4L2_MEMORY_MMAP buffers is very slow.
V4L2_MEMORY_USERPTR doesn't work, even if I allocate the memory with memalign(page_size, framesize).
I'm using an i.MX6ULL with the mx6s_capture module.
Yes, I know this is an old thread...
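For reference, this is roughly what the user-pointer setup usually looks like on the application side. It is a minimal sketch, not a fix: prepare_userptr_buffer and round_up_to_page are hypothetical helper names, rounding the length up to a whole number of pages is an assumption rather than a documented requirement of mx6s_capture, and the driver may still reject user pointers for other reasons.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <linux/videodev2.h>

/* Round len up to a whole number of pages (an assumption, see above). */
static size_t round_up_to_page(size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    return (len + page - 1) / page * page;
}

/*
 * Allocate a page-aligned frame buffer and describe it in *buf so that
 * it can be passed to VIDIOC_QBUF with V4L2_MEMORY_USERPTR.
 * Returns the allocation (release it with free()) or NULL on failure.
 * Hypothetical helper for illustration only.
 */
static void *prepare_userptr_buffer(struct v4l2_buffer *buf,
                                    unsigned int index, size_t frame_size)
{
    void *mem = NULL;
    size_t len = round_up_to_page(frame_size);

    if (posix_memalign(&mem, (size_t)sysconf(_SC_PAGESIZE), len) != 0)
        return NULL;

    memset(buf, 0, sizeof *buf);
    buf->type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    buf->memory = V4L2_MEMORY_USERPTR;
    buf->index = index;
    buf->m.userptr = (unsigned long)mem;  /* user-space address of the frame */
    buf->length = (__u32)len;             /* size of the allocation in bytes */
    return mem;
}
```

The filled-in struct would then go to ioctl(fd, VIDIOC_QBUF, &buf), after a VIDIOC_REQBUFS call that also sets memory to V4L2_MEMORY_USERPTR. If that REQBUFS call itself fails, the driver most likely does not support user pointers at all, and no amount of alignment will help.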
Hey Philip,
Did you figure out what the problem is, or what's happening? I also tried your code and it gives the same result on an i.MX6 processor.