V4L2 capture buffer performance

philipcraig · ‎12-15-2013

I'm seeing some strange performance behaviour when processing images captured with V4L2. I've reduced it to the following test case (I've attached the full source code):

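int process_image(const unsigned char *p, int size)
{
    int i, j, sum;

    /* Optionally copy the frame out of the capture buffer first */
    if (do_copy) {
        memcpy(copy_buf, p, size);
        p = copy_buf;
    }

    sum = 0;
    if (do_rows) {
        /* Read all the pixels in memory order */
        for (i = 0; i < size; i++)
            sum |= p[i];
    } else {
        /* Read all the pixels in non-optimal order */
        for (i = 0; i < fmt.fmt.pix.bytesperline; i++) {
            for (j = 0; j < fmt.fmt.pix.height; j++) {
                sum |= p[i + j * fmt.fmt.pix.bytesperline];
            }
        }
    }

    /* The excerpt was cut off here; returning sum is assumed (see the attached source) */
    return sum;
}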

I'm measuring the performance using 'perf stat -e cpu-clock ./capture-test -c 100', running on a Boundary Devices SABRE Lite with an OV5642 sensor. The kernel is the Boundary Devices kernel from the dora branch of Yocto.

Here are the results I get:

do_copy=0, do_rows=0: 1458.861001 cpu-clock

do_copy=0, do_rows=1: 2207.192003 cpu-clock

do_copy=1, do_rows=0: 624.319663 cpu-clock

do_copy=1, do_rows=1: 461.399335 cpu-clock

There are two strange things about these results. First, doing a memcpy makes things a lot faster. Second, when not doing a memcpy, row-wise traversal is much slower than column-wise traversal.

Does anyone know why this is happening, and how I can make it faster without doing a memcpy?

The kernel is allocating the buffers with dma_alloc_coherent. I've tried changing to kmalloc and dma_map_single/dma_unmap_single in the QBUF/DQBUF ioctls, but that made no difference.

Also note that if I run the same test code on my laptop, then I get the expected behaviour (row-wise traversal is faster, and memcpy is slower).
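For anyone who wants to reproduce the laptop comparison without a camera, the two traversal orders can be run on an ordinary heap buffer. This is only a sketch: the names sum_rows, sum_columns and time_ms are made up for illustration and are not from capture-test.c, and clock() measures CPU time rather than reproducing what perf reports.

```c
#include <stddef.h>
#include <time.h>

/* OR together every byte, walking the buffer in memory order (row by row). */
int sum_rows(const unsigned char *p, size_t bytesperline, size_t height)
{
    int sum = 0;
    size_t size = bytesperline * height;
    for (size_t i = 0; i < size; i++)
        sum |= p[i];
    return sum;
}

/* Same result, but walking down each column, so consecutive reads are
 * bytesperline bytes apart (the "non-optimal order" from the test case). */
int sum_columns(const unsigned char *p, size_t bytesperline, size_t height)
{
    int sum = 0;
    for (size_t i = 0; i < bytesperline; i++)
        for (size_t j = 0; j < height; j++)
            sum |= p[i + j * bytesperline];
    return sum;
}

typedef int (*traverse_fn)(const unsigned char *, size_t, size_t);

/* Run one traversal and return the CPU time it took, in milliseconds.
 * The traversal result is stored in *out so the call is not optimised away. */
double time_ms(traverse_fn fn, const unsigned char *p,
               size_t bytesperline, size_t height, int *out)
{
    clock_t t0 = clock();
    *out = fn(p, bytesperline, height);
    return 1000.0 * (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Calling time_ms(sum_rows, ...) and time_ms(sum_columns, ...) on a frame-sized malloc buffer gives a cached-memory baseline to set against the numbers from the capture buffers.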

Original Attachment has been moved to: capture-test.c.zip

erezsteinberg · ‎07-20-2015

V4L2 buffers allocated with V4L2_MEMORY_MMAP are not cacheable.

CPU accesses to such buffers are very slow (every access goes all the way to DDR).
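If each access really does go out to DDR, then one untested idea is to issue fewer, wider reads instead of one read per byte. The sketch below ORs the buffer together 32 bits at a time and folds the result down to a byte at the end. The function name sum_words is hypothetical, it has not been measured on an i.MX6, and whether it helps depends on how the compiler and the memory system handle the loads.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* OR together every byte of p, reading four bytes per access where possible.
 * Returns the same value as a byte-by-byte "sum |= p[i]" loop. */
int sum_words(const unsigned char *p, size_t size)
{
    uint32_t acc = 0;
    size_t i = 0;

    for (; i + 4 <= size; i += 4) {
        uint32_t w;
        /* Optimising compilers usually turn this into one 32-bit load;
         * that is an assumption, not a guarantee. */
        memcpy(&w, p + i, sizeof(w));
        acc |= w;
    }

    /* Fold the four byte lanes into the low byte (endianness does not matter). */
    acc |= acc >> 16;
    acc |= acc >> 8;
    int sum = (int)(acc & 0xff);

    /* Handle the last 0..3 bytes one at a time. */
    for (; i < size; i++)
        sum |= p[i];

    return sum;
}
```

The same trick extends to 64-bit or NEON loads, which is probably closer to what an optimised memcpy does internally.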

You should use V4L2_MEMORY_USERPTR instead, but I couldn't make it work.

Were you able to resolve the issue eventually?

Regards,

Erez

firex · ‎08-24-2018

Did anyone find a solution?
memcpy from V4L2_MEMORY_MMAP buffers is very slow.

V4L2_MEMORY_USERPTR doesn't work, even if I allocate the memory with memalign(page_size, framesize);
I'm using an i.MX6ULL and the mx6s_capture module.
Yes, I know this is an old thread...

vladspiridonesc · ‎01-30-2015

Hey Philip,

Did you figure out what the problem is or what's happening? I also tried your code and it gives the same result on an i.MX6 processor.
