Hi everyone,
We are using an i.MX8M Plus to implement a smart camera solution and are constantly running into memory-related performance issues when data comes from internal peripherals.
The problem occurs when trying to do something with the raw image data from the camera, but also when using the VPU decoder, for example. Since the latter is easier to isolate, I made a small test program that uses the VPU decoder to illustrate the issue.
As I said, the performance issues occur when the CPU accesses output buffers that come from some IP block in the SoC, in this case the VPU.
What I see is that using the data is extremely slow.
For example, converting the VPU output to a different YUV format for jpeg encoding takes ~740 ms! To put this in perspective: the jpeg encoding of that same data after conversion only takes ~43 ms. The YUV conversion is a simple copy loop that just puts bytes in a different order (NV16 -> YUV444).
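For reference, the conversion is essentially just this kind of byte shuffle. This is a minimal sketch (not the exact code from the attached tool, and the function name is mine), assuming the usual NV16 layout of a full-resolution luma plane followed by an interleaved U/V plane with half horizontal chroma resolution:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of an NV16 -> packed YUV444 copy loop.
 * NV16 layout: w*h luma bytes, then w*h interleaved U/V bytes
 * (full vertical, half horizontal chroma resolution). */
static void nv16_to_yuv444(const uint8_t *src, uint8_t *dst,
                           size_t w, size_t h)
{
    const uint8_t *y  = src;          /* luma plane         */
    const uint8_t *uv = src + w * h;  /* interleaved chroma */

    for (size_t r = 0; r < h; r++) {
        for (size_t c = 0; c < w; c++) {
            size_t p  = r * w + c;
            size_t cp = r * w + (c & ~(size_t)1); /* chroma pair index */
            *dst++ = y[p];        /* Y */
            *dst++ = uv[cp];      /* U (shared by the pixel pair) */
            *dst++ = uv[cp + 1];  /* V */
        }
    }
}
```

Note that the reads hop between two planes and the chroma reads repeat per pixel pair, so the access pattern is not a single sequential stream, which matters a lot when the source buffer is slow to read.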
The strange thing is that adding a memcpy to the above experiment actually makes it a lot faster. When I first memcpy the VPU output data to a local buffer and use that for the YUV conversion, the time needed drops from ~740 ms to ~52 ms! That is still way too much, but it is 14 times faster even though I actually added code.
Purely for this scenario, adding the memcpy is an acceptable 'workaround' for us, as it makes the performance somewhat bearable. But we also see this issue in camera image processing, where we need to make cut-outs from the raw image, which as far as I know is not possible without the CPU accessing the data.
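The workaround boils down to this pattern (a sketch, with a helper name I made up; the real tool just does the memcpy inline): one sequential streaming read of the slow buffer, after which all further processing hits ordinary cached DRAM.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of the memcpy workaround: copy the (possibly uncached) VPU
 * output into an ordinary heap buffer once, then run the YUV
 * conversion and jpeg encoding on the local copy. The single memcpy
 * reads the slow buffer sequentially; the conversion's many small,
 * repeated reads then operate on cacheable memory. */
static uint8_t *copy_to_local(const uint8_t *vpu_buf, size_t size)
{
    uint8_t *local = malloc(size);
    if (local != NULL)
        memcpy(local, vpu_buf, size);
    return local; /* caller frees; run nv16 -> yuv444 etc. on this */
}
```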
Here is the output of the tool I wrote. It runs a transcoding process from an H264 I-frame to a software-encoded jpeg. It uses the NXP VPU-wrapper library to decode the frame on the VPU:
Method 1: Directly use output buffer from VPU
Total transcode time : 828176 us
VPU decoding : 44217 us
Direct YUV conversion : 740691 us
Jpeg encoding : 43268 us
Method 2: Add memcpy to copy VPU output buffer to local buffer
Total transcode time : 128958 us
VPU decoding : 35270 us
Memcpy YUV conversion : 51094 us
Jpeg encoding : 42594 us
Method 3: Write VPU output to file and use the file to encode a jpeg
Total transcode time : 391426 us
VPU decoding : 35732 us
Store in file : 37746 us
Read back file : 275426 us
Jpeg encoding : 42522 us
The tool runs the transcoding process three times: without memcpy, with memcpy, and via a file.
See the attachment for the little tool I wrote. It also contains the H264 I-frame I used for this test (h264.bin). The tool should build with any SDK that contains at least libjpeg and the VPU-wrapper library.
Are more people experiencing this kind of performance issue?
Is there a known way to prevent these issues (besides zero-copy)?
Any other hints that may help us?
With kind regards,
Tim.
@TimBr
Hello,
Is the memory allocated for the VPU buffer(s) cached?
Usually such data buffers are not cached when cache coherency has to be taken into account.
Perhaps the problem occurs because the VPU buffer(s) is always being "refreshed",
that is, always accessed by the VPU, creating a system bottleneck.
Regards,
Yuri.
Hi Yuri,
How can I see whether a buffer is cached or not? Or better, how can I change it? In this case I get the buffer from a call to VPU_DecGetMem() in the VPU-wrapper library, but in other cases where we see comparable behavior we get it through the V4L2 subsystem.
Regarding the 'refreshed by VPU' point: we only process one frame here and then stop, so it seems unlikely to me that the VPU is still accessing the buffer. Is there a way to be sure?
Regards,
Tim.