I'm seeing some strange performance behaviour when processing images captured with V4L2. I've reduced it to the following test case (I've attached the full source code):
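int process_image(const unsigned char *p, int size)
{
    int i, j, sum;

    /* Optionally copy the frame into an ordinary buffer first */
    if (do_copy) {
        memcpy(copy_buf, p, size);
        p = copy_buf;
    }

    sum = 0;
    if (do_rows) {
        /* Read all the pixels linearly (row by row) */
        for (i = 0; i < size; i++)
            sum |= p[i];
    } else {
        /* Read all the pixels in non-optimal order (column by column) */
        for (i = 0; i < fmt.fmt.pix.bytesperline; i++) {
            for (j = 0; j < fmt.fmt.pix.height; j++) {
                sum |= p[i + j * fmt.fmt.pix.bytesperline];
            }
        }
    }

    /* Return the result so the reads cannot be optimised away */
    return sum;
}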
I'm measuring the performance using 'perf stat -e cpu-clock ./capture-test -c 100', running on a Boundary Devices SABRE Lite with an OV5642 sensor. The kernel is the Boundary Devices kernel from the dora branch of Yocto.
Here are the results I get:
do_copy=0, do_rows=0: 1458.861001 cpu-clock
do_copy=0, do_rows=1: 2207.192003 cpu-clock
do_copy=1, do_rows=0: 624.319663 cpu-clock
do_copy=1, do_rows=1: 461.399335 cpu-clock
There are two strange things about these results. First, doing a memcpy makes things a lot faster, even though it adds an extra pass over the whole frame. Second, when not doing a memcpy, row-wise traversal is much slower than column-wise traversal.
Does anyone know why this is happening, and how I can make it faster without doing a memcpy?
The kernel is allocating the buffers with dma_alloc_coherent. I've tried changing to kmalloc and dma_map_single/dma_unmap_single in the QBUF/DQBUF ioctls, but that made no difference.
Also note that if I run the same test code on my laptop, then I get the expected behaviour (row-wise traversal is faster, and memcpy is slower).
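For anyone who wants a baseline without V4L2 in the picture, here is a rough stand-alone sketch of the same two traversal orders run over an ordinary malloc'd buffer. It is not taken from the attached source: `struct frame_geom`, `sum_rows`, `sum_cols` and `time_traversal` are names made up for this illustration, and the frame geometry is whatever the caller passes in rather than the real `fmt.fmt.pix` values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical frame geometry; in the real test this comes from fmt.fmt.pix. */
struct frame_geom {
    size_t bytesperline;
    size_t height;
};

typedef int (*traverse_fn)(const unsigned char *, const struct frame_geom *);

/* Row-wise: walk the buffer linearly, like the do_rows branch. */
static int sum_rows(const unsigned char *p, const struct frame_geom *g)
{
    size_t size = g->bytesperline * g->height;
    int sum = 0;

    for (size_t i = 0; i < size; i++)
        sum |= p[i];
    return sum;
}

/* Column-wise: stride through the buffer by bytesperline. */
static int sum_cols(const unsigned char *p, const struct frame_geom *g)
{
    int sum = 0;

    for (size_t i = 0; i < g->bytesperline; i++)
        for (size_t j = 0; j < g->height; j++)
            sum |= p[i + j * g->bytesperline];
    return sum;
}

/*
 * Run `iters` passes of one traversal over a normal (cacheable) heap buffer.
 * Returns the CPU time used in milliseconds, or -1.0 if allocation fails.
 */
static double time_traversal(traverse_fn fn, const struct frame_geom *g,
                             int iters, int *out_sum)
{
    size_t size = g->bytesperline * g->height;
    unsigned char *buf = malloc(size);
    int sum = 0;

    assert(iters > 0);
    if (buf == NULL)
        return -1.0;
    memset(buf, 0x01, size);   /* fill with a known pattern */

    clock_t start = clock();
    for (int n = 0; n < iters; n++)
        sum |= fn(buf, g);
    clock_t end = clock();

    free(buf);
    *out_sum = sum;            /* hand the result back so the reads are used */
    return (double)(end - start) * 1000.0 / CLOCKS_PER_SEC;
}
```

Calling `time_traversal(sum_rows, &g, 100, &s)` and `time_traversal(sum_cols, &g, 100, &s)` with the same geometry gives the two numbers to compare. An optimising compiler may hoist or merge the repeated passes, so the timings should be treated as indicative only.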
Original Attachment has been moved to: capture-test.c.zip