V4L2 capture buffer performance

philipcraig
Contributor II

I'm seeing some strange performance behaviour when processing images captured with V4L2.  I've reduced it to the following test case (I've attached the full source code):

 

int process_image(const unsigned char *p, int size)
{
    int i, j, sum;

    /* Optionally copy the frame into an ordinary buffer first */
    if (do_copy) {
        memcpy(copy_buf, p, size);
        p = copy_buf;
    }

    sum = 0;
    if (do_rows) {
        /* Read all the pixels in memory order */
        for (i = 0; i < size; i++)
            sum |= p[i];
    } else {
        /* Read all the pixels in non-optimal order */
        for (i = 0; i < fmt.fmt.pix.bytesperline; i++) {
            for (j = 0; j < fmt.fmt.pix.height; j++) {
                sum |= p[i + j * fmt.fmt.pix.bytesperline];
            }
        }
    }

    return sum;
}

 

I'm measuring the performance using 'perf stat -e cpu-clock ./capture-test -c 100', running on a Boundary Devices SABRE Lite with an OV5642 sensor. The kernel is the Boundary Devices kernel from the dora branch of Yocto.

 

Here are the results I get:

do_copy=0, do_rows=0: 1458.861001 cpu-clock

do_copy=0, do_rows=1: 2207.192003 cpu-clock

do_copy=1, do_rows=0: 624.319663 cpu-clock

do_copy=1, do_rows=1: 461.399335 cpu-clock

 

There are two strange things about these results. First, doing a memcpy before processing makes things a lot faster. Second, when not doing a memcpy, row-wise traversal is much slower than column-wise traversal.

 

Does anyone know why this is happening, and how I can make it faster without doing a memcpy?

 

The kernel is allocating the buffers with dma_alloc_coherent. I've tried changing to kmalloc and dma_map_single/dma_unmap_single in the QBUF/DQBUF ioctls, but that made no difference.

 

Also note that if I run the same test code on my laptop, then I get the expected behaviour (row-wise traversal is faster, and memcpy is slower).
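For anyone who wants to repeat the laptop comparison without a camera, here is a minimal sketch of the two traversal orders run against an ordinary buffer. The names or_rows, or_columns and time_ms are made up for this illustration and are not taken from capture-test.c; the timing helper uses clock() rather than perf, so the numbers are only roughly comparable to the ones above.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* OR every byte in address order (row-wise traversal). */
unsigned or_rows(const unsigned char *p, size_t bytesperline, size_t height)
{
    unsigned sum = 0;
    size_t size = bytesperline * height;

    assert(p != NULL);
    for (size_t i = 0; i < size; i++)
        sum |= p[i];
    return sum;
}

/* OR every byte column by column, striding by bytesperline each step. */
unsigned or_columns(const unsigned char *p, size_t bytesperline, size_t height)
{
    unsigned sum = 0;

    assert(p != NULL);
    for (size_t i = 0; i < bytesperline; i++)
        for (size_t j = 0; j < height; j++)
            sum |= p[i + j * bytesperline];
    return sum;
}

typedef unsigned (*traverse_fn)(const unsigned char *, size_t, size_t);

/* CPU time in milliseconds for `reps` calls of fn. The result is
 * accumulated into a volatile so the compiler cannot drop the loops. */
double time_ms(traverse_fn fn, const unsigned char *p,
               size_t bytesperline, size_t height, int reps)
{
    volatile unsigned sink = 0;
    clock_t start = clock();

    for (int r = 0; r < reps; r++)
        sink |= fn(p, bytesperline, height);
    (void)sink;
    return 1000.0 * (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_ms(or_rows, buf, bpl, h, 100) and time_ms(or_columns, buf, bpl, h, 100) on a malloc'd buffer should give the baseline for cached memory; pointing the same functions at the mmap'd capture buffer gives the case being asked about.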

Original Attachment has been moved to: capture-test.c.zip

3 Replies

erezsteinberg
Contributor IV

V4L2 buffers allocated with V4L2_MEMORY_MMAP are not cacheable.

CPU accesses to such buffers are very slow (every access goes all the way to DDR).

You should use V4L2_MEMORY_USERPTR instead, but I couldn't make it work.

Were you able to resolve the issue eventually?

Regards,

Erez

firex
Contributor I

Did anyone find a solution?
With V4L2_MEMORY_MMAP, memcpy is very slow.

V4L2_MEMORY_USERPTR doesn't work, even if I allocate the memory with memalign(page_size, framesize).
I'm using an imx6ull with the mx6s_capture module.
Yes, I know this is an old thread...
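A minimal sketch of the page-aligned allocation described above, using posix_memalign and rounding the length up to a whole number of pages. The names round_up_to_page and alloc_frame are made up for this example. Note the limit of what this provides: the buffer is page-aligned in virtual memory only, and nothing here makes it physically contiguous. Whether that is the reason USERPTR fails with a given driver is an assumption this thread does not confirm.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Round len up to the next multiple of page_size. */
size_t round_up_to_page(size_t len, size_t page_size)
{
    return (len + page_size - 1) / page_size * page_size;
}

/* Allocate a page-aligned buffer of at least framesize bytes.
 * On success returns the buffer and stores the rounded-up length in
 * *alloc_len; on failure returns NULL. Free the buffer with free(). */
void *alloc_frame(size_t framesize, size_t *alloc_len)
{
    long page_size = sysconf(_SC_PAGESIZE);
    void *p = NULL;
    size_t len;

    assert(alloc_len != NULL);
    if (page_size <= 0 || framesize == 0)
        return NULL;

    len = round_up_to_page(framesize, (size_t)page_size);
    if (posix_memalign(&p, (size_t)page_size, len) != 0)
        return NULL;

    *alloc_len = len;
    return p;
}
```

The rounded-up length is what would go into the length field when the buffer is queued, so the driver never sees a partial page at the end of the frame.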

vladspiridonesc
Contributor II

Hey Philip,

Did you figure out what the problem is or what's happening? I also tried your code and it gives the same result on an iMX6 processor.
