Zero-copy GStreamer pipeline CPU usage differences in 4.14 and 5.10.35 BSPs


3,051 Views
r_trivedi123
Contributor IV

Hi,
I am working on an i.MX8QXP C0 revision board.
I have migrated the BSP from 4.14.98 to 5.10.35_2.0.0.

The camera and VPU are enabled and seem to be running fine.

My sensor driver is ported as is from 4.14 to 5.10 (no configuration changes).

But when running the same GStreamer pipelines on both BSPs, I am seeing a huge change in CPU utilization.

I am trying two pipelines, one with DMABUF (zero-copy) and one without:

With DMAbuf - gst-launch-1.0 v4l2src device=/dev/video1 io-mode=dmabuf-import ! video/x-raw, width=1280, height=720, framerate=30/1 ! v4l2h264enc output-io-mode=dmabuf ! filesink location=test1.h264

Without DMAbuf - gst-launch-1.0 v4l2src device=/dev/video1 ! video/x-raw, width=1280, height=720, framerate=30/1 ! v4l2h264enc ! filesink location=test1.h264

Below are my observations.

BSP L4.14.98        CPU Load
DMA Buf used        2%
DMA Buf not used    6%

BSP L5.10.35        CPU Load
DMA Buf used        8.3%
DMA Buf not used    58.0%


So, can you help me understand why the CPU load changed between the two BSPs?

--
Thanks & Regards,
Rutvij Trivedi

0 Kudos
12 Replies

3,016 Views
Zhiming_Liu
NXP TechSupport

Hi @r_trivedi123 

 

I will reproduce this issue and give you feedback

 

Best Regards

Zhiming

0 Kudos

3,013 Views
r_trivedi123
Contributor IV

Hi @Zhiming_Liu,

Thank you for the response and help.

Awaiting your response.

--
Thanks

0 Kudos

2,993 Views
Zhiming_Liu
NXP TechSupport

@r_trivedi123 

I have tested your commands with the 5.10.35 BSP on the EVK.

The maximum CPU load with DMABUF is about 4%; most of the time it is about 3.3%.

The maximum CPU load without DMABUF is about 20%; most of the time it is about 12%.

0 Kudos

2,968 Views
r_trivedi123
Contributor IV

Hi @Zhiming_Liu,
Thanks for reproducing this and sharing the statistics.

That is strange. I will flash both BSPs and test again in case I missed anything.

--
Thanks
Rutvij Trivedi

0 Kudos

2,868 Views
r_trivedi123
Contributor IV

Hi @nxf65025,

I am now able to get a low CPU load with the GStreamer pipeline below:

 

gst-launch-1.0 -v v4l2src device=/dev/video3 io-mode=dmabuf-import ! 'video/x-raw,format=(string)NV12,width=1280,height=720,framerate=(fraction)30/1' ! queue ! v4l2h264enc output-io-mode=dmabuf ! filesink location=test.h264

 

Observations (GStreamer):

- 30 FPS with a CPU load of 5-7% (from the top utility)
- In the dmesg logs I can see messages from the windsor driver (encoder)

I have now moved on to the application side, where I am using the mxc_v4l2_vpu_enc.out application as follows:

./mxc_v4l2_vpu_enc.out camera --key 0 --device /dev/video3 --size 1280 720 --u 4 --fmt nv12 --framerate 30 --framenum 90000 encoder --key 1 --source 0 --size 1280 720 --framerate 30 --bitrate 4194304 --lowlatency 0 ofile --key 2 --source 1 --name camera.h264


Observations (with the application):
- 30 FPS with a CPU load of 100% (from the top utility)
- In the logs I can see messages from the windsor driver (encoder)

So now my questions are:
1. Why is there a difference in CPU utilization between GStreamer and the application?
2. How can I improve the mxc_v4l2_vpu test app to lower its CPU usage?

I have seen in the source that it uses V4L2_MEMORY_USERPTR. Would changing this to V4L2_MEMORY_DMABUF (as in the GStreamer case) help?

Can you advise?

Any help would be greatly appreciated.

--
Thanks,
Rutvij

0 Kudos

2,820 Views
Zhiming_Liu
NXP TechSupport

@r_trivedi123 

The GStreamer we released has lots of optimizations. The VPU test demo doesn't use DMABUF; it uses mmap to get buffers. So this could cause the CPU load difference.

0 Kudos

2,814 Views
r_trivedi123
Contributor IV

Hi @Zhiming_Liu 

Is there any reference application available that achieves similar functionality?
Is there any sample application or document available for a DMABUF implementation?

Currently, I am referring to

https://elinux.org/images/5/53/Zero-copy_video_streaming.pdf

and
https://www.kernel.org/doc/html/v4.9/media/uapi/v4l/dmabuf.html

Also, is there any implementation documentation available for mxc_v4l2_vpu_test?

I have also been through the GStreamer source code. It has DMABUF as well as DMABUF-IMPORT; can you explain the difference between them?

Since GStreamer has been optimized by NXP, is there any documentation on those optimizations, or can you describe them?

Your help would be appreciated.

--
Thanks,
Rutvij

 

0 Kudos

2,804 Views
Zhiming_Liu
NXP TechSupport

We don't have such documents. You can look at the GStreamer source code:

gstreamer1.0-plugins-good/1.18.5.imx-r0/git/sys/v4l2/gstv4l2object.c: g_param_spec_enum ("output-io-mode", "Output IO mode"

 

0 Kudos

2,742 Views
r_trivedi123
Contributor IV

Hi @Zhiming_Liu ,

While working on this memory handling, I profiled the attached sample_new.c source on both the 4.14 and 5.10.35 BSPs.

Here too I am seeing a difference:

On 4.14 BSP

 

 Strting copy global malloc...
TIME TAKEN = 0.064553
random prints dst = aa aa aa

Strting copy local malloc...
TIME TAKEN = 0.041085
random prints dst = aa aa aa 

 



On 5.10.35

 

 Strting copy global malloc..
TIME TAKEN = 0.223109
random prints dst = aa aa aa

Strting copy local malloc...
TIME TAKEN = 0.159377
random prints dst = aa aa aa

 


Timer profiling was done using

 

double cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
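
For anyone who wants to repeat this kind of measurement, here is a minimal sketch of timing a heap-to-heap copy with clock(). It is not the attached sample_new.c, which is not shown in this thread; the function name time_heap_copy and its parameters are hypothetical. Note that clock() reports processor time used by the process, not wall-clock time.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: copy `size` bytes between two malloc'ed buffers
 * `iterations` times and return the CPU seconds consumed.
 * Returns -1.0 if size is 0 or an allocation fails. */
double time_heap_copy(size_t size, int iterations)
{
    if (size == 0)
        return -1.0;

    unsigned char *src = malloc(size);
    unsigned char *dst = malloc(size);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }

    memset(src, 0xaa, size);
    memset(dst, 0, size);  /* touch dst so first-use page faults are not timed */

    clock_t start = clock();
    for (int i = 0; i < iterations; i++)
        memcpy(dst, src, size);
    clock_t end = clock();

    /* Sanity check that the copy really happened. */
    assert(dst[0] == 0xaa && dst[size - 1] == 0xaa);

    free(src);
    free(dst);
    return ((double)(end - start)) / CLOCKS_PER_SEC;
}
```

For example, time_heap_copy(1280 * 720 * 3 / 2, 100) would time 100 copies of one 1280x720 NV12 frame. Running the same binary with the same compiler flags on both BSPs helps separate kernel or memory configuration effects from toolchain differences.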

 


This difference has a performance impact in the actual code base when performing an operation on the image after VIDIOC_DQBUF: I get only 5 FPS. If the buffer is replaced with stack-allocated memory, I get the expected 30 FPS.

So can you please advise why there is this difference between the two BSPs?
Is it possible to reproduce this at your end?
I used the GCC on the board after flashing each BSP.


--
Thanks,
Rutvij



0 Kudos

2,839 Views
r_trivedi123
Contributor IV

Hi @nxf65025,

Can you please help?

0 Kudos

3,023 Views
r_trivedi123
Contributor IV

Hi, can I get any help over here, please?

--
Thanks

0 Kudos

3,029 Views
r_trivedi123
Contributor IV

Hi Folks,

Any updates?

Is any benchmark document available?

 

--
Thanks

0 Kudos