Hi,
I am working on the i.MX8QXP C0 revision board.
I have migrated the BSP from L4.14.98 to L5.10.35_2.0.0.
The camera and VPU are enabled and seem to be running fine.
My sensor driver was ported as-is from 4.14 to 5.10 (no configuration changes).
But while running the same GStreamer pipelines I am seeing a huge change in CPU utilization.
There are two pipelines I am trying: with DMABUF (zero-copy) and without DMABUF.
With DMAbuf - gst-launch-1.0 v4l2src device=/dev/video1 io-mode=dmabuf-import ! video/x-raw, width=1280, height=720, framerate=30/1 ! v4l2h264enc output-io-mode=dmabuf ! filesink location=test1.h264
Without DMAbuf - gst-launch-1.0 v4l2src device=/dev/video1 ! video/x-raw, width=1280, height=720, framerate=30/1 ! v4l2h264enc ! filesink location=test1.h264
Below are my observations.
| Pipeline        | L4.14.98 CPU load | L5.10.35 CPU load |
| DMABUF used     | 2%                | 8.3%              |
| DMABUF not used | 6%                | 58.0%             |
So, can you help me understand why the CPU % changed?
--
Thanks & Regards,
Rutvij Trivedi
Hi @Zhiming_Liu,
Thank you for the response and help.
Awaiting your response.
--
Thanks
I have tested your command with the 5.10.35 BSP on the EVK.
The max CPU load with DMA is about 4%; most of the time it is around 3.3%.
The max CPU load without DMA is about 20%; most of the time it is around 12%.
Hi @Zhiming_Liu,
Thanks for reproducing this and for the statistics.
That is strange. I will flash both BSPs and test again in case I missed anything.
--
Thanks
Rutvij Trivedi
Hi @nxf65025,
I am able to get low CPU load with the below GStreamer pipeline:
gst-launch-1.0 -v v4l2src device=/dev/video3 io-mode=dmabuf-import ! 'video/x-raw,format=(string)NV12,width=1280,height=720,framerate=(fraction)30/1' ! queue ! v4l2h264enc output-io-mode=dmabuf ! filesink location=test.h264
Observation (gst):
- 30 FPS with a CPU load of 5-7% (from the top utility)
- In the dmesg logs I can see messages from the windsor (encoder) driver
I have now moved on to the application part, where I have the mxc_v4l2_vpu_enc.out application and am invoking it as below:
./mxc_v4l2_vpu_enc.out camera --key 0 --device /dev/video3 --size 1280 720 --u 4 --fmt nv12 --framerate 30 --framenum 90000 encoder --key 1 --source 0 --size 1280 720 --framerate 30 --bitrate 4194304 --lowlatency 0 ofile --key 2 --source 1 --name camera.h264
Observations (with app):
- 30 FPS with a CPU load of 100% (from the top utility)
- In the logs I can see messages from the windsor (encoder) driver
So now my queries are:
1. Why is there a difference in CPU utilization between GStreamer and the application?
2. How can I modify the mxc_v4l2_vpu test app to lower its CPU usage?
I have seen in the source that it uses V4L2_MEMORY_USERPTR; would changing this to V4L2_MEMORY_DMABUF (as in the GStreamer case) help?
Can you advise?
Any help would be greatly appreciated.
--
Thanks,
Rutvij
The GStreamer plugins we release have lots of optimizations. The VPU test demo doesn't use DMABUF; it uses mmap to get buffers. This could cause the CPU load difference.
Hi @Zhiming_Liu
Are there any application references available to achieve similar functionality?
Is there any sample application or document available for DMABUF implementation?
Currently, I am referring to
https://elinux.org/images/5/53/Zero-copy_video_streaming.pdf
and
https://www.kernel.org/doc/html/v4.9/media/uapi/v4l/dmabuf.html
Also, is there any implementation documentation available for mxc_v4l2_vpu_test?
Also, I have been through the GStreamer source code; it has DMABUF as well as DMABUF-IMPORT. Can you explain the difference between them?
Since the GStreamer plugins are optimized by NXP, is there any documentation on those optimizations? Can you describe them?
Your help would be appreciated.
--
Thanks,
Rutvij
We don't have such documents; you can check the gst source code:
gstreamer1.0-plugins-good/1.18.5.imx-r0/git/sys/v4l2/gstv4l2object.c: g_param_spec_enum ("output-io-mode", "Output IO mode"
Hi @Zhiming_Liu ,
While working on this memory issue, I did some profiling using the attached sample_new.c on both the 4.14 and 5.10.35 BSPs.
Here too I see a difference:
On 4.14 BSP
Starting copy global malloc...
TIME TAKEN = 0.064553
random prints dst = aa aa aa
Starting copy local malloc...
TIME TAKEN = 0.041085
random prints dst = aa aa aa
On 5.10.35
Starting copy global malloc...
TIME TAKEN = 0.223109
random prints dst = aa aa aa
Starting copy local malloc...
TIME TAKEN = 0.159377
random prints dst = aa aa aa
Timer profiling was done using
double cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
This difference creates a performance impact in the actual code base when processing the image after VIDIOC_DQBUF, giving me only 5 FPS; if the malloc'd buffer is replaced with stack-allocated memory, I get the expected 30 FPS.
So can you please advise why there is this difference between the two BSPs?
Is it possible to reproduce this at your end?
I used the board's GCC after flashing the BSPs.
--
Thanks,
Rutvij
Hi, can I get any help here, please?
--
Thanks
Hi Folks,
Any updates?
Is there any benchmark document available?
--
Thanks