Hi,
I am working on the i.MX8QXP C0 revision board.
I have migrated the BSP from L4.14.98 to L5.10.35_2.0.0.
The camera and VPU are enabled and seem to be running fine.
My sensor driver was ported as-is from 4.14 to 5.10 (no configuration changes).
But while running the same GStreamer pipelines I am seeing a huge change in CPU utilization.
There are two pipelines I am trying: with DMABUF (zero-copy) and without DMABUF.
With DMAbuf - gst-launch-1.0 v4l2src device=/dev/video1 io-mode=dmabuf-import ! video/x-raw, width=1280, height=720, framerate=30/1 ! v4l2h264enc output-io-mode=dmabuf ! filesink location=test1.h264
Without DMAbuf - gst-launch-1.0 v4l2src device=/dev/video1 ! video/x-raw, width=1280, height=720, framerate=30/1 ! v4l2h264enc ! filesink location=test1.h264
Below are my observations.
| Pipeline        | L4.14.98 CPU load | L5.10.35 CPU load |
| DMABUF used     | 2%                | 8.3%              |
| DMABUF not used | 6%                | 58.0%             |
So, can you help me understand why the CPU % changed?
--
Thanks & Regards,
Rutvij Trivedi
Hi @Zhiming_Liu,
Thank you for the response and help.
Awaiting your response.
--
Thanks
I have tested your command with the 5.10.35 BSP on the EVK.
The max CPU load with DMA is about 4%; most of the time it is around 3.3%.
The max CPU load without DMA is about 20%; most of the time it is around 12%.
Hi @Zhiming_Liu,
Thanks for reproducing this and for the statistics.
That is strange. I will flash both BSPs and test again in case I missed anything.
--
Thanks
Rutvij Trivedi
Hi @nxf65025,
I am able to get low CPU load with the below GStreamer pipeline:
gst-launch-1.0 -v v4l2src device=/dev/video3 io-mode=dmabuf-import ! 'video/x-raw,format=(string)NV12,width=1280,height=720,framerate=(fraction)30/1' ! queue ! v4l2h264enc output-io-mode=dmabuf ! filesink location=test.h264
Observation (gst):
- 30 FPS with a CPU load of 5-7% (from the top utility)
- In the dmesg logs I can see messages from the windsor (encoder) driver
I have now moved on to the application part, where I have the mxc_v4l2_vpu_enc.out application and am invoking it as below:
./mxc_v4l2_vpu_enc.out camera --key 0 --device /dev/video3 --size 1280 720 --u 4 --fmt nv12 --framerate 30 --framenum 90000 encoder --key 1 --source 0 --size 1280 720 --framerate 30 --bitrate 4194304 --lowlatency 0 ofile --key 2 --source 1 --name camera.h264
Observations (with app):
- 30 FPS with a CPU load of 100% (from the top utility)
- In the logs I can see messages from the windsor (encoder) driver
So now my queries are:
1. Why is there a difference in CPU utilization between GStreamer and the application?
2. How can I modify the mxc_v4l2_vpu test app to lower its CPU usage?
I have seen in the source that it uses V4L2_MEMORY_USERPTR; would changing this to V4L2_MEMORY_DMABUF (as in the GStreamer case) help?
Can you advise?
Any help would be greatly appreciated.
--
Thanks,
Rutvij
The GStreamer plugins we release have lots of optimizations. The VPU test demo doesn't use DMABUF; it uses mmap to get buffers. This could cause the CPU load difference.
Hi @Zhiming_Liu
Are there any application references available to achieve similar functionality?
Is there any sample application or document available for DMABUF implementation?
Currently, I am referring to
https://elinux.org/images/5/53/Zero-copy_video_streaming.pdf
and
https://www.kernel.org/doc/html/v4.9/media/uapi/v4l/dmabuf.html
Also, is there any implementation documentation available for mxc_v4l2_vpu_test?
Also, I have been through the GStreamer source code; it has DMABUF as well as DMABUF-IMPORT. Can you explain the difference between them?
Since the GStreamer plugins are optimized by NXP, is there any documentation on those optimizations? Can you describe them?
Your help would be appreciated.
--
Thanks,
Rutvij
We don't have such documents; you can check the gst source code:
gstreamer1.0-plugins-good/1.18.5.imx-r0/git/sys/v4l2/gstv4l2object.c: g_param_spec_enum ("output-io-mode", "Output IO mode"
Hi @Zhiming_Liu ,
While working on this memory issue, I did some profiling using the attached sample_new.c on both the 4.14 and 5.10.35 BSPs.
Here too I see a difference:
On 4.14 BSP
Starting copy global malloc...
TIME TAKEN = 0.064553
random prints dst = aa aa aa
Starting copy local malloc...
TIME TAKEN = 0.041085
random prints dst = aa aa aa
On 5.10.35
Starting copy global malloc...
TIME TAKEN = 0.223109
random prints dst = aa aa aa
Starting copy local malloc...
TIME TAKEN = 0.159377
random prints dst = aa aa aa
Timer profiling was done using
double cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
This difference creates a performance impact in the actual code base when processing the image after VIDIOC_DQBUF, giving me only 5 FPS; if the malloc'd buffer is replaced with stack-allocated memory, I get the expected 30 FPS.
So can you please advise why there is this difference between the two BSPs?
Is it possible to reproduce this at your end?
I used the board's GCC after flashing the BSPs.
--
Thanks,
Rutvij
Hi, can I get any help here, please?
--
Thanks
Hi Folks,
Any updates?
Is there any benchmark document available?
--
Thanks