About G2D performance with resizing and color space conversion in i.MX6DQ.

keitanagashima · ‎05-23-2016

Dear All,

My customer measured the performance for the test case below, and the result looks poor.

==Test Case==

VPU --> [1080p@30, YUV(NV12)] --> GPU --> [720p@30, RGBA8888] --> Frame buffer

BSP: L3.0.35_1.1.0

==Result==

The bottleneck was seen on GPU processing with resizing and color space conversion.

It takes about 70 msec!

[Questions]

Q1. Could you tell me the expected GPU2D performance (theoretical or calculated value) for the above case?

(e.g., ?? ms/frame)

Q2. Do you have any workaround to improve the performance?

Best Regards,

Keita

Bio_TICFSL · ‎05-31-2016

Hi Keita,

Here are some performance test numbers,

Platform: MX6DQ, 2D CLOCK: 455M, AXI: 266M, DDR: 528Mx128bit
Source format: RGBA (NV12 for CSC)
Destination: 1024x768 RGB565

Source 2048x2048:
  Fillrect:      526.39 Mpixels/s (1052.79 Mbytes/s)
  Bitblt:        517.39 Mpixels/s (1034.78 Mbytes/s)
  Stretch blt:   read fill rate 1003.42 Mpixels/s (4013.69 Mbytes/s), write fill rate 188.14 Mpixels/s (376.28 Mbytes/s)
  Filterblt:     read fill rate 150.83 Mpixels/s (603.32 Mbytes/s), write fill rate 28.28 Mpixels/s (56.56 Mbytes/s)
  Filterblt CSC: read fill rate 239.76 Mpixels/s (359.64 Mbytes/s), write fill rate 44.95 Mpixels/s (89.91 Mbytes/s)

Source 1280x720:
  Fillrect:      526.39 Mpixels/s (1052.79 Mbytes/s)
  Bitblt:        522.55 Mpixels/s (1045.09 Mbytes/s)
  Stretch blt:   read fill rate 221.01 Mpixels/s (884.03 Mbytes/s), write fill rate 188.59 Mpixels/s (377.19 Mbytes/s)
  Filterblt:     read fill rate 94.46 Mpixels/s (377.86 Mbytes/s), write fill rate 80.61 Mpixels/s (161.22 Mbytes/s)
  Filterblt CSC: read fill rate 128.64 Mpixels/s (192.96 Mbytes/s), write fill rate 109.78 Mpixels/s (219.55 Mbytes/s)
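
To relate these fill rates to the ms/frame figure asked about in Q1, here is a minimal sketch that converts the Filterblt CSC numbers back into elapsed time per blit. It assumes the read fill rate is counted over source pixels and the write fill rate over destination pixels; the function and variable names are only illustrative. Note that the measured destination is 1024x768 RGB565, not 720p RGBA8888, so this is an indication rather than a figure for the 1080p NV12 to 720p RGBA8888 case.

```python
# Recover the elapsed time per blit from the "Filterblt CSC" fill rates above.
# Assumption: read rate = source pixels / time, write rate = destination pixels / time.

DST_PIXELS = 1024 * 768  # destination size used for both measurements

rows = {
    # source size: (source pixels, read Mpixels/s, write Mpixels/s)
    "2048x2048": (2048 * 2048, 239.76, 44.95),
    "1280x720": (1280 * 720, 128.64, 109.78),
}


def blit_ms(pixels, mpixels_per_s):
    """Milliseconds needed to process `pixels` at `mpixels_per_s`."""
    return pixels / (mpixels_per_s * 1e6) * 1e3


times = {}
for src, (src_pixels, read_rate, write_rate) in rows.items():
    from_read = blit_ms(src_pixels, read_rate)
    from_write = blit_ms(DST_PIXELS, write_rate)
    times[src] = (from_read, from_write)
    print(f"{src} -> 1024x768: {from_read:.1f} ms (read side), "
          f"{from_write:.1f} ms (write side)")

FRAME_BUDGET_MS = 1e3 / 30  # 30 fps target from the original test case
print(f"frame budget at 30 fps: {FRAME_BUDGET_MS:.1f} ms")
```

Both sides give the same time for each measurement (about 17.5 ms for the 2048x2048 source and about 7.2 ms for the 1280x720 source), which supports the assumption above. A 1920x1080 source lies between the two measured sizes, so a single scale-plus-CSC blit in that range would fit inside the 33.3 ms budget, although the 4-byte RGBA8888 destination writes more data per pixel than RGB565.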

Hope this helps

Bio_TICFSL · ‎05-24-2016

Hi Keita,

Normally, the limitation comes from many factors, e.g., DDR bandwidth, encoder HW limitations, display capability, video content, etc. For example, the i.MX6Q VPU alone should be able to decode dual-channel 1080p@30fps for some video clips at VPU=264MHz. However, we cannot claim that it supports dual 1080p@30fps decoding, because it cannot decode dual-channel Blu-ray clips, which may have a maximum bitrate of up to 40 Mbps per channel. In addition, even normal video clips may still have problems in dual-channel mode, depending on whether other post-processing (e.g., resizing, or de-interlacing for interlaced video) is needed.

Normally you can analyze the VPU capability simply by looking at the pixel rate for the VPU alone, for the same HW implementation on the same platform. For the same video encoding or decoding, different HW implementations require different memory bandwidth.

Decoder: iMX6Q should be able to do more than 1080p@30fps+D1@30fps at the default clock rate. It may be able to do dual-channel 1080p@30fps for some video clips without any additional resizing, and it is capable of 8xD1@30fps as long as there is no extreme resizing (e.g., D1->1080p for each channel).

Encoder: the iMX6Q VPU encoder can also do more than 1080p@30fps. The unit test result shows it can do 1080p@35fps at the default frequency. However, we normally quote it as 30fps. It is able to do 6xD1.

However, in your case this does not apply. You have NV12 because the VPU only takes YUV420 data as input, so any camera format has to be converted to YUV420.

For the FB time estimates, I looked at section 34 of the Reference Manual.

(1) Scaling and I/P (interlaced-to-progressive) conversion from 1080@60i to 720@60p

De-interlacing and scaling can be done on the fly; the data path is MEM->IPU VDI->IPU Image Conversion (for scaling)->MEM. 3 consecutive fields are used to generate one output frame. Suppose the data format is YUV420.

The total bandwidth could be calculated as (960*1080*1.5(one field)*3 + 1280*720*1.5(one frame output)) * 60fps = 346 MB/s

To save bandwidth, an alternative way of de-interlacing is to use only one field to generate the output frame. Then the total bus loading could be

(960*1080*1.5(one field)*1 + 1280*720*1.5(one frame output)) * 60fps = 168 MB/s
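
As a quick check of the two de-interlacing figures, here is a small sketch that reproduces the arithmetic. The helper name is only illustrative, and it assumes that "MB" here means 2^20 bytes, which is the reading under which the quoted numbers come out.

```python
# Reproduce the de-interlace + scale bandwidth estimates (YUV420 = 1.5 bytes/pixel).
# Assumption: "MB" in the figures means MiB (2**20 bytes).

FIELD_BYTES = 960 * 1080 * 1.5   # one input field, as written in the formula
OUTPUT_BYTES = 1280 * 720 * 1.5  # one 720p output frame


def mib_per_s(bytes_per_output_frame, fps=60):
    """Memory traffic in MiB/s for the given bytes moved per output frame."""
    return bytes_per_output_frame * fps / 2**20


three_field = mib_per_s(3 * FIELD_BYTES + OUTPUT_BYTES)  # 3 fields per output frame
one_field = mib_per_s(1 * FIELD_BYTES + OUTPUT_BYTES)    # 1 field per output frame

print(f"3-field de-interlace: {three_field:.0f} MiB/s")
print(f"1-field de-interlace: {one_field:.0f} MiB/s")
```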

(2) H.264 encoding with 720@60p input, I/P frames only and 20Mbps output.

The encoding needs about 3~3.5x memory access. That means that, to output a 720p@60fps stream as well, the total bandwidth could be

1280*720*1.5*3.5*60fps = 277 MB/s.
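
The encoder estimate can be sketched the same way for both ends of the 3~3.5x range. The function name is hypothetical, and "MB" is again assumed to mean 2^20 bytes.

```python
# Encoder memory traffic for 720p@60 YUV420 input at a given access multiplier.
# Assumption: "MB" in the figure means MiB (2**20 bytes).


def encoder_mib_per_s(width, height, fps, access_factor):
    """Estimated traffic: frame bytes (1.5 bytes/pixel) x access factor x fps."""
    frame_bytes = width * height * 1.5
    return frame_bytes * access_factor * fps / 2**20


low = encoder_mib_per_s(1280, 720, 60, 3.0)
high = encoder_mib_per_s(1280, 720, 60, 3.5)

print(f"720p@60 encode: {low:.0f} to {high:.0f} MiB/s")
```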

Currently there is no way to determine the size through the OpenGL API. A rough calculation from the dimensions and bpp is possible, but it will not be accurate, because you also need to consider the 4K alignment required by the GPU, memory fragmentation, etc.
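
A minimal sketch of such a rough size calculation is below. It assumes "4K alignment" means the allocation is rounded up to a multiple of 4096 bytes, which is only one possible reading; the function name is hypothetical, and fragmentation and any driver-side padding are not modelled.

```python
# Rough buffer size estimate with the allocation rounded up to a 4 KiB multiple.
# Assumption: "4K alignment" is treated as rounding the size up to 4096 bytes.


def aligned_size(raw_bytes, alignment=4096):
    """Round raw_bytes up to the next multiple of alignment."""
    return -(-raw_bytes // alignment) * alignment


nv12_1080p = 1920 * 1080 * 3 // 2  # NV12: 12 bits per pixel
rgba_720p = 1280 * 720 * 4         # RGBA8888: 32 bits per pixel

print(nv12_1080p, "->", aligned_size(nv12_1080p))
print(rgba_720p, "->", aligned_size(rgba_720p))
```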

Regards

keitanagashima · ‎05-30-2016

Hi Bio_TICFSL,

Do you have any update?

Best Regards,

Keita

keitanagashima · ‎05-26-2016

Hi Bio_TICFSL,

Thank you for your information!

However, the bottleneck (70 msec) was measured between just before and just after the g2d_blit call.

Therefore, we would like to know the G2D performance (not VPU performance).

Best Regards,

Keita

i.MX6Dual

i.MX6Quad

Linux