About G2D performance with resizing and color space conversion in i.MX6DQ.


1,207 Views
keitanagashima
Senior Contributor I

Dear All,

My customer measured the performance of the pipeline below, and the result was poor.

==Test Case==

VPU --> [1080p@30, YUV(NV12)] --> GPU --> [720p@30, RGBA8888] --> Frame buffer

BSP: L3.0.35_1.1.0

==Result==

The bottleneck was in the GPU processing step (resizing plus color space conversion).

It takes about 70 msec!
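
To put that number in context, here is a minimal sketch of the arithmetic. It assumes the 70 msec is spent on every frame and that the GPU step is not overlapped with any other stage; the variable names are only for illustration.

```python
# Frame budget at 30 fps versus the 70 ms measured around the GPU step.
TARGET_FPS = 30
MEASURED_MS = 70.0  # figure reported in this post (assumed to be per frame)

frame_budget_ms = 1000.0 / TARGET_FPS    # time available for one frame
achievable_fps = 1000.0 / MEASURED_MS    # rate if the GPU step runs serially
overrun = MEASURED_MS / frame_budget_ms  # frame periods consumed by one blit

print(f"budget  : {frame_budget_ms:.1f} ms/frame")
print(f"measured: {MEASURED_MS:.1f} ms/frame, about {achievable_fps:.1f} fps")
print(f"overrun : {overrun:.1f}x the frame period")
```

In other words, the GPU step alone takes roughly twice the time available per frame at 30 fps.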

[Questions]

Q1. Could you tell me the expected GPU2D performance (a theoretical or calculated value) for the above case?

  (e.g., ?? ms/frame)

Q2. Do you have any workaround to improve the performance?

Best Regards,

Keita

0 Kudos
4 Replies

766 Views
Bio_TICFSL
NXP TechSupport

Hi Keita,

Here are some performance test numbers:

| Platform | Source (RGBA; NV12 for CSC) | Destination | Fillrect | Bitblt | Stretch blt | Filterblt | Filterblt CSC |
|---|---|---|---|---|---|---|---|
| MX6DQ, 2D CLOCK: 455M, AXI: 266M, DDR: 528Mx128bit | 2048x2048 | 1024x768 RGB565 | 526.39 Mpixels/s (1052.79 Mbytes/s) | 517.39 Mpixels/s (1034.78 Mbytes/s) | read fill rate 1003.42 Mpixels/s (4013.69 Mbytes/s), write fill rate 188.14 Mpixels/s (376.28 Mbytes/s) | read fill rate 150.83 Mpixels/s (603.32 Mbytes/s), write fill rate 28.28 Mpixels/s (56.56 Mbytes/s) | read fill rate 239.76 Mpixels/s (359.64 Mbytes/s), write fill rate 44.95 Mpixels/s (89.91 Mbytes/s) |
| MX6DQ, 2D CLOCK: 455M, AXI: 266M, DDR: 528Mx128bit | 1280x720 | 1024x768 RGB565 | 526.39 Mpixels/s (1052.79 Mbytes/s) | 522.55 Mpixels/s (1045.09 Mbytes/s) | read fill rate 221.01 Mpixels/s (884.03 Mbytes/s), write fill rate 188.59 Mpixels/s (377.19 Mbytes/s) | read fill rate 94.46 Mpixels/s (377.86 Mbytes/s), write fill rate 80.61 Mpixels/s (161.22 Mbytes/s) | read fill rate 128.64 Mpixels/s (192.96 Mbytes/s), write fill rate 109.78 Mpixels/s (219.55 Mbytes/s) |
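
As a rough way to turn these rates into a per-frame time for the 1080p NV12 to 720p case, the sketch below extrapolates from the 1280x720 "Filterblt CSC" row. This is only an estimate under stated assumptions: the benchmark destination is RGB565 rather than RGBA8888, the scale factor differs, and the function name `ms_per_frame` is made up for illustration. In that row, source pixels divided by the read rate and destination pixels divided by the write rate both come to about 7.2 ms, so the two rates appear to describe the same blit from each side; the sketch therefore reports both as a range rather than adding them.

```python
# Rates from the 1280x720 -> 1024x768 RGB565 "Filterblt CSC" row.
READ_MPIX_S = 128.64   # source (NV12) pixels consumed, millions per second
WRITE_MPIX_S = 109.78  # destination pixels produced, millions per second

SRC_W, SRC_H = 1920, 1080  # 1080p NV12 input from the VPU
DST_W, DST_H = 1280, 720   # 720p output to the frame buffer


def ms_per_frame(pixels, mpix_per_s):
    """Time in ms to process `pixels` at `mpix_per_s` million pixels/s."""
    return pixels / (mpix_per_s * 1e6) * 1e3


# Sanity check on the benchmark row itself: both sides give the same time.
bench_read_ms = ms_per_frame(1280 * 720, READ_MPIX_S)
bench_write_ms = ms_per_frame(1024 * 768, WRITE_MPIX_S)

# Extrapolation to the 1080p -> 720p case (assumption: rates carry over).
read_ms = ms_per_frame(SRC_W * SRC_H, READ_MPIX_S)
write_ms = ms_per_frame(DST_W * DST_H, WRITE_MPIX_S)

print(f"benchmark row: {bench_read_ms:.2f} ms (read), {bench_write_ms:.2f} ms (write)")
print(f"1080p NV12 read side : {read_ms:.1f} ms/frame")
print(f"720p write side      : {write_ms:.1f} ms/frame")
```

If the rates carry over, the estimate lands somewhere between the two printed values, so a measured 70 msec would suggest looking at factors outside the raw blit rate (for example buffer type, cacheability, or clock settings) rather than at the 2D core throughput itself.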

Hope this helps

0 Kudos

766 Views
Bio_TICFSL
NXP TechSupport

Hi Keita,

Normally, the limitation comes from many factors, e.g., DDR bandwidth, encoder HW limitations, display capability, video content, etc. For example, the iMX6Q VPU alone should, at its limit, be able to do dual-channel 1080p@30fps for some video clips at VPU=264MHz. However, we cannot claim that it supports dual 1080p@30fps decoding, because it cannot decode dual-channel Blu-ray clips, which may have a maximum bitrate of up to 40 Mbps per channel. In addition, even for normal video clips, dual-channel decoding may still be a problem, depending on whether other post-processing (e.g., resizing, or deinterlacing for interlaced video) is needed.

Normally you can analyze the VPU capability simply by looking at the pixel rate for the VPU alone, for the same HW implementation on the same platform. For the same video encoding or decoding, different HW implementations require different memory bandwidth.

Decoder: The iMX6Q should be able to do more than 1080p@30fps + D1@30fps at the default clock rate. It may be able to do dual-channel 1080p@30fps for some video clips without any additional resizing, and it is capable of 8xD1@30fps as long as there is no extreme resizing (e.g., D1->1080p for each channel).

Encoder: The iMX6Q VPU encoder can also do more than 1080p@30fps. The unit test result shows it can do 1080p@35fps at the default frequency. However, we normally quote it as 30fps. It is able to do 6xD1.

However, your case does not depend on this. You have NV12 because the VPU only takes YUV420 data as input, so any camera format has to be converted to YUV420.

For the FB time estimates, I looked at section 34 of the Reference Manual.

(1) Scaling and IP (interlaced-to-progressive) conversion from 1080@60i to 720@60p

De-interlacing and scaling can be done on the fly. The data path is MEM -> IPU VDI -> IPU Image Conversion (for scaling) -> MEM. Three consecutive fields are used to generate one output frame. Suppose the data format is YUV420.

The total bandwidth can be calculated as (960*1080*1.5 (one field) * 3 + 1280*720*1.5 (one frame output)) * 60fps = 346 MB/s

To save bandwidth, an alternative way of de-interlacing is to use only one field to generate each output frame. The total bus loading is then

(960*1080*1.5 (one field) * 1 + 1280*720*1.5 (one frame output)) * 60fps = 168 MB/s
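
The two de-interlacing figures can be reproduced with a few lines of Python. This is just a check of the arithmetic; the helper name `bytes_yuv420` is illustrative, and the results match the quoted numbers when MB is read as 2^20 bytes.

```python
MIB = 1024 * 1024
YUV420_BYTES_PER_PIXEL = 1.5  # 8-bit luma plus 2x2-subsampled chroma
FPS = 60


def bytes_yuv420(width, height):
    """Size in bytes of one YUV420 image of the given dimensions."""
    return width * height * YUV420_BYTES_PER_PIXEL


field_in = bytes_yuv420(960, 1080)   # one input field, as in the formula
frame_out = bytes_yuv420(1280, 720)  # one 720p output frame

three_field = (3 * field_in + frame_out) * FPS  # three fields per output frame
one_field = (1 * field_in + frame_out) * FPS    # one field per output frame

print(f"three-field: {three_field / MIB:.1f} MiB/s")
print(f"one-field  : {one_field / MIB:.1f} MiB/s")
```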

(2) H.264 encoding with 720@60p input, I/P frames only, and 20Mbps output.

Encoding needs about 3 to 3.5x memory access. That means that for a 720p@60fps output stream the total bandwidth could be

1280*720*1.5*3.5*60fps = 277 MB/s.
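
The same kind of check for the encoder estimate, evaluated at both ends of the 3 to 3.5x range. This is a sketch of the arithmetic only; the function name `encoder_bandwidth` is illustrative, and MB is again read as 2^20 bytes.

```python
MIB = 1024 * 1024


def encoder_bandwidth(width, height, fps, access_factor):
    """Bytes per second for YUV420 input with the given memory access factor."""
    frame_bytes = width * height * 1.5  # YUV420: 1.5 bytes per pixel
    return frame_bytes * access_factor * fps


low = encoder_bandwidth(1280, 720, 60, 3.0)
high = encoder_bandwidth(1280, 720, 60, 3.5)

print(f"3.0x access: {low / MIB:.1f} MiB/s")
print(f"3.5x access: {high / MIB:.1f} MiB/s")
```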

Currently there is no way to determine the size through the OpenGL API. A rough calculation from the dimensions and bpp is possible, but it will not be accurate, because you also need to consider the 4K alignment required by the GPU, memory fragmentation, etc.
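
A minimal sketch of that rough calculation follows. It assumes the 4K alignment applies only to the total buffer size; a real driver may also align the stride, the height, or each plane separately, so treat the result as a lower bound. The helper names are hypothetical.

```python
ALIGNMENT = 4096  # the 4K alignment mentioned above


def align_up(n, alignment=ALIGNMENT):
    """Round n up to the next multiple of alignment."""
    return (n + alignment - 1) // alignment * alignment


def estimated_buffer_bytes(width, height, bits_per_pixel):
    """Rough buffer size: raw size from dimensions and bpp, rounded up to 4K."""
    raw = width * height * bits_per_pixel // 8
    return align_up(raw)


nv12_1080p = estimated_buffer_bytes(1920, 1080, 12)  # NV12 is 12 bits/pixel
rgba_720p = estimated_buffer_bytes(1280, 720, 32)    # RGBA8888 is 32 bits/pixel

print(f"1080p NV12    : {nv12_1080p} bytes")
print(f"720p RGBA8888 : {rgba_720p} bytes")
```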

Regards

0 Kudos

766 Views
keitanagashima
Senior Contributor I

Hi Bio_TICFSL,

Do you have any update?

Best Regards,

Keita

0 Kudos

766 Views
keitanagashima
Senior Contributor I

Hi Bio_TICFSL,

Thank you for your information!

However, the 70 msec bottleneck was measured between the points just before and just after the g2d_blit call.

Therefore, we would like to know the G2D performance (not VPU performance).

Best Regards,

Keita

0 Kudos