
V4L2 capture buffer performance

Question asked by Philip Craig on Dec 15, 2013
Latest reply on Aug 24, 2018 by Andrej Kostrov

I'm seeing some strange performance behaviour when processing images captured with V4L2.  I've reduced it to the following test case (I've attached the full source code):


int process_image(const unsigned char *p, int size)
{
    int i, j, sum;

    if (do_copy) {
        memcpy(copy_buf, p, size);
        p = copy_buf;
    }

    sum = 0;
    if (do_rows) {
        for (i = 0; i < size; i++)
            sum |= p[i];
    } else {
        /* Read all the pixels in non-optimal (column-wise) order */
        for (i = 0; i < fmt.fmt.pix.bytesperline; i++) {
            for (j = 0; j < fmt.fmt.pix.height; j++) {
                sum |= p[i + j * fmt.fmt.pix.bytesperline];
            }
        }
    }
    return sum;
}

I'm measuring the performance using 'perf stat -e cpu-clock ./capture-test -c 100', running on a Boundary Devices SABRE Lite with an OV5642 sensor. The kernel is the Boundary Devices kernel from the dora branch of Yocto.


Here are the results I get:

do_copy=0, do_rows=0: 1458.861001 cpu-clock

do_copy=0, do_rows=1: 2207.192003 cpu-clock

do_copy=1, do_rows=0: 624.319663 cpu-clock

do_copy=1, do_rows=1: 461.399335 cpu-clock


There are two strange things about these results. First, doing a memcpy first makes things a lot faster. Second, when not doing a memcpy, row-wise traversal is much slower than the "non-optimal" column-wise order.


Does anyone know why this is happening, and how I can make it faster without doing a memcpy?


The kernel is allocating the buffers with dma_alloc_coherent. I've tried changing to kmalloc and dma_map_single/dma_unmap_single in the QBUF/DQBUF ioctls, but that made no difference.


Also note that if I run the same test code on my laptop, then I get the expected behaviour (row-wise traversal is faster, and memcpy is slower).
