V4L2 capture buffer performance

philipcraig — Mon, 16 Dec 2013 03:45:28 GMT

I'm seeing some strange performance behaviour when processing images captured with V4L2. I've reduced it to the following test case (I've attached the full source code):

int process_image(const unsigned char *p, int size)
{
    int i, j, sum;

    /* Optionally copy the frame into an ordinary buffer first */
    if (do_copy) {
        memcpy(copy_buf, p, size);
        p = copy_buf;
    }

    sum = 0;
    if (do_rows) {
        /* Read all the pixels in address order */
        for (i = 0; i < size; i++)
            sum |= p[i];
    } else {
        /* Read all the pixels in non-optimal order */
        for (i = 0; i < fmt.fmt.pix.bytesperline; i++) {
            for (j = 0; j < fmt.fmt.pix.height; j++) {
                sum |= p[i + j * fmt.fmt.pix.bytesperline];
            }
        }
    }

    return sum;
}

I'm measuring the performance using 'perf stat -e cpu-clock ./capture-test -c 100', running on a Boundary Devices SABRE Lite with an OV5642 sensor. The kernel is the Boundary Devices kernel from the dora branch of Yocto.

Here are the results I get:

do_copy=0, do_rows=0: 1458.861001 cpu-clock

do_copy=0, do_rows=1: 2207.192003 cpu-clock

do_copy=1, do_rows=0: 624.319663 cpu-clock

do_copy=1, do_rows=1: 461.399335 cpu-clock

There are two strange things about these results. First, doing a memcpy makes things a lot faster. Second, when not doing a memcpy, row-wise traversal is much slower than column-wise traversal.

Does anyone know why this is happening, and how I can make it faster without doing a memcpy?

The kernel is allocating the buffers with dma_alloc_coherent. I've tried changing to kmalloc and dma_map_single/dma_unmap_single in the QBUF/DQBUF ioctls, but that made no difference.

Also note that if I run the same test code on my laptop, then I get the expected behaviour (row-wise traversal is faster, and memcpy is slower).
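
For anyone who wants to reproduce the "expected" baseline without a camera, here is a minimal sketch that times the same two traversal orders over an ordinary heap buffer. The function names (or_rows, or_columns, compare_traversals) and the fill pattern are illustrative and not taken from the attached capture-test.c. On normal cached memory the row-wise walk would usually be expected to be the faster one.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* OR together every byte, walking memory in address order (row-wise). */
static int or_rows(const unsigned char *p, size_t stride, size_t height)
{
    int sum = 0;
    for (size_t i = 0; i < stride * height; i++)
        sum |= p[i];
    return sum;
}

/* OR together every byte, walking down each column (stride-sized jumps). */
static int or_columns(const unsigned char *p, size_t stride, size_t height)
{
    int sum = 0;
    for (size_t i = 0; i < stride; i++)
        for (size_t j = 0; j < height; j++)
            sum |= p[i + j * stride];
    return sum;
}

/* Milliseconds from a monotonic clock. */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

/*
 * Time both traversals over a malloc'd (cacheable) buffer and print the
 * elapsed times. Returns 0 if both traversals agree, -1 otherwise.
 */
static int compare_traversals(size_t stride, size_t height)
{
    unsigned char *buf = malloc(stride * height);
    if (!buf)
        return -1;
    memset(buf, 0x5a, stride * height); /* arbitrary fill pattern */

    double t0 = now_ms();
    int a = or_rows(buf, stride, height);
    double t1 = now_ms();
    int b = or_columns(buf, stride, height);
    double t2 = now_ms();

    printf("rows: %.3f ms (sum=%d), columns: %.3f ms (sum=%d)\n",
           t1 - t0, a, t2 - t1, b);
    free(buf);
    return a == b ? 0 : -1;
}
```

Calling compare_traversals() with the bytesperline and height of the capture format gives numbers to set against the ones measured on the mmap'ed capture buffers.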

Original Attachment has been moved to: capture-test.c.zip

Re: V4L2 capture buffer performance

vladspiridonesc — Fri, 30 Jan 2015 11:56:44 GMT

Hey Philip,

Did you figure out what the problem is, or what's happening? I also tried your code, and it gives the same result on an i.MX6 processor.

Re: V4L2 capture buffer performance

erezsteinberg — Mon, 20 Jul 2015 13:12:46 GMT

V4L2 buffers allocated with V4L2_MEMORY_MMAP are not cacheable.

CPU accesses to such buffers are very slow (every access goes all the way to DDR).

You should use V4L2_MEMORY_USERPTR ... but I couldn't make it work.

Were you able to resolve the issue eventually?

Regards,

Erez
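
If the capture buffer is indeed uncached, one possible way to limit the cost is to read each row from it exactly once with memcpy into a small cacheable scratch buffer and do the per-pixel work there. The sketch below is only an illustration of that idea: or_pixels_staged and row_buf are hypothetical names, and whether staging a row at a time is faster than the single full-frame memcpy in the original test would have to be measured on the target.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * OR together every byte of a frame, staging one row at a time.
 * `frame` may point at an uncached mapping; `row_buf` must be ordinary
 * (cacheable) memory of at least `stride` bytes.
 */
static int or_pixels_staged(const unsigned char *frame, size_t stride,
                            size_t height, unsigned char *row_buf)
{
    int sum = 0;

    assert(frame != NULL && row_buf != NULL);
    for (size_t y = 0; y < height; y++) {
        /* Single sequential read of the source row */
        memcpy(row_buf, frame + y * stride, stride);
        /* All per-pixel accesses now hit the scratch buffer */
        for (size_t x = 0; x < stride; x++)
            sum |= row_buf[x];
    }
    return sum;
}
```

The scratch buffer only needs bytesperline bytes, so it is far smaller than a full-frame copy_buf.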

Re: V4L2 capture buffer performance

firex — Fri, 24 Aug 2018 14:24:47 GMT

Did anyone find a solution?
With V4L2_MEMORY_MMAP, memcpy is very slow.

V4L2_MEMORY_USERPTR doesn't work even if I allocate the memory with memalign(page_size, framesize).
I'm using an i.MX6ULL with the mx6s_capture module.
Yes, I know this is an old thread...
