Dear All,
My customer ran the performance test below, and the result looks poor.
==Test Case==
VPU --> [1080p@30, YUV(NV12)] --> GPU --> [720p@30, RGBA8888] --> Frame buffer
BSP: L3.0.35_1.1.0
==Result==
The bottleneck was seen on GPU processing with resizing and color space conversion.
It takes about 70 msec!
[Questions]
Q1. Could you tell me the expected GPU2D performance (theoretical or calculated value) for the case above?
(e.g., ?? ms/frame)
Q2. Do you have any workaround to improve the performance?
Best Regards,
Keita
Hi Keita,
Here are some performance test numbers:
Platform | Source (RGBA, NV12 for CSC) | Destination | Fillrect | Bitblt | Stretch blt | Filterblt | Filterblt CSC |
MX6DQ 2D CLOCK: 455M, AXI:266M, DDR: 528Mx128bit | 2048x2048 | 1024x768 RGB565 | 526.39 Mpixels/s (1052.79 Mbytes/s) | 517.39 Mpixels/s (1034.78 Mbytes/s) | read fill rate 1003.42 Mpixels/s (4013.69 Mbytes/s), write fill rate 188.14 Mpixels/s (376.28 Mbytes/s) | read fill rate 150.83 Mpixels/s (603.32 Mbytes/s), write fill rate 28.28 Mpixels/s (56.56 Mbytes/s) | read fill rate 239.76 Mpixels/s (359.64 Mbytes/s), write fill rate 44.95 Mpixels/s (89.91 Mbytes/s) |
MX6DQ 2D CLOCK: 455M, AXI:266M, DDR: 528Mx128bit | 1280x720 | 1024x768 RGB565 | 526.39 Mpixels/s (1052.79 Mbytes/s) | 522.55 Mpixels/s (1045.09 Mbytes/s) | read fill rate 221.01 Mpixels/s (884.03 Mbytes/s), write fill rate 188.59 Mpixels/s (377.19 Mbytes/s) | read fill rate 94.46 Mpixels/s (377.86 Mbytes/s), write fill rate 80.61 Mpixels/s (161.22 Mbytes/s) | read fill rate 128.64 Mpixels/s (192.96 Mbytes/s), write fill rate 109.78 Mpixels/s (219.55 Mbytes/s) |
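As a rough sanity check, the Filterblt CSC write fill rates in the table can be turned into a per-frame time for a 1280x720 output. This is only a sketch under stated assumptions: it assumes the write fill rate is the limiting figure, and the table was measured with an RGB565 destination rather than RGBA8888, so the real number for the case in question may differ.

```python
# Rough per-frame time estimate from the Filterblt CSC write fill rates above.
# Assumption: output pixel count / write fill rate approximates the blit time.
# Caveat: the table used an RGB565 destination, not RGBA8888.

def frame_time_ms(width, height, mpixels_per_s):
    """Time in ms to write width*height pixels at the given fill rate."""
    return width * height / (mpixels_per_s * 1e6) * 1e3

OUT_W, OUT_H = 1280, 720  # 720p output frame

# Write fill rates (Mpixels/s) copied from the Filterblt CSC column.
rates = {
    "2048x2048 source": 44.95,
    "1280x720 source": 109.78,
}

estimates = {name: frame_time_ms(OUT_W, OUT_H, r) for name, r in rates.items()}
for name, ms in estimates.items():
    print(f"{name}: {ms:.1f} ms/frame")

budget_ms = 1000 / 30  # frame period at 30 fps
print(f"30 fps budget: {budget_ms:.1f} ms")
```

With these numbers the estimate comes out at roughly 20.5 ms and 8.4 ms per frame, against a 33.3 ms budget at 30 fps. If the assumptions hold, both are well below the reported 70 ms.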
Hope this helps
Hi Keita,
Normally, the limit comes from many factors, e.g., DDR bandwidth, encoder HW limitations, display capability, video content, etc. For example, the iMX6Q VPU alone should be able to decode dual-channel 1080p@30fps for some video clips at VPU=264MHz. However, we cannot claim that it supports dual 1080p@30fps decoding, because it cannot decode dual-channel Blu-ray clips, which may have a maximum bitrate of up to 40Mbps per channel. In addition, even normal video clips may still have problems in dual-channel operation, depending on whether other post-processing, e.g., resizing, or deinterlacing for interlaced video, is needed.
Normally you can analyze the VPU capability simply by looking at the pixel rate for the VPU alone, for the same HW implementation on the same platform. For the same video encoding or decoding, different HW implementations require different memory bandwidth.
Decoder: iMX6Q should be able to do more than 1080p@30fps+D1@30fps at the default clock rate. It may be able to do dual-channel 1080p@30fps for some video clips without any additional resizing, and it is capable of 8xD1@30fps without extreme resizing (e.g., D1->1080p for each channel).
Encoder: The iMX6Q VPU encoder can also do more than 1080p@30fps. The unit test result shows it can do 1080p@35fps at the default frequency; however, we normally quote it as 30fps. It is also able to do 6xD1.
However, in your case this does not apply. You have NV12 because the VPU takes only YUV420 data as input, so any camera format must first be converted to YUV420.
For the frame buffer time estimates, I looked at section 34 of the Reference Manual.
(1) Scaling and IP conversion from 1080@60i to 720@60p
[] De-interlacing and scaling can be done on the fly; the data path is MEM->IPU VDI->IPU Image Conversion (for scaling)->MEM. Three consecutive fields are used to generate one output frame. Suppose the data format is YUV420.
The total bandwidth can be calculated as (960*1080*1.5 (one field)*3 + 1280*720*1.5 (one frame output)) * 60fps = 346MB/s
To save bandwidth, an alternative way of de-interlacing is to use only one field to generate the output frame. Then the total bus load would be
(960*1080*1.5 (one field)*1 + 1280*720*1.5 (one frame output)) * 60fps = 168MB/s
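The two bandwidth figures above can be reproduced with a short calculation. This is a minimal sketch that simply restates the arithmetic from the post; the function name is made up for illustration, and the results only match 346 and 168 when "MB" is read as 2^20 bytes, which is an assumption inferred from the numbers.

```python
# Reproduce the de-interlace + scale bandwidth estimates from the text.
# Assumption: "MB" in the text means 2**20 bytes (the figures match only then).

MIB = 1024 * 1024
YUV420_BYTES_PER_PIXEL = 1.5

def deinterlace_bandwidth_mib(fields_read, fps=60):
    """Bytes read (input fields) plus bytes written (720p frame) per second, in MiB."""
    field_bytes = 960 * 1080 * YUV420_BYTES_PER_PIXEL   # one input field, as in the text
    frame_bytes = 1280 * 720 * YUV420_BYTES_PER_PIXEL   # one output frame
    return (field_bytes * fields_read + frame_bytes) * fps / MIB

three_field = deinterlace_bandwidth_mib(3)  # three fields per output frame
one_field = deinterlace_bandwidth_mib(1)    # single-field alternative

print(f"3 fields: {three_field:.0f} MiB/s")
print(f"1 field:  {one_field:.0f} MiB/s")
```

Running it gives about 346 MiB/s for the three-field case and about 168 MiB/s for the single-field case, so the single-field method roughly halves the bus load.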
(2) H.264 encoding with 720@60p input, I/P frames only, and 20Mbps output.
[] The encoding needs about 3~3.5x memory access. That means that to output a 720p@60fps stream as well, the total bandwidth would be
1280*720*1.5*3.5*60fps = 277Mbytes/s.
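The encoder estimate can be checked the same way for both ends of the 3~3.5x range. Again this is only a sketch of the arithmetic in the post, with a hypothetical helper name, and it assumes "Mbytes" means 2^20 bytes, since that is what makes the 277 figure come out.

```python
# Reproduce the H.264 encoder bandwidth estimate for 720p@60fps YUV420 input.
# Assumption: "Mbytes" in the text means 2**20 bytes.

MIB = 1024 * 1024

def encoder_bandwidth_mib(access_factor, width=1280, height=720, fps=60):
    """Frame size in bytes (YUV420, 1.5 bytes/pixel) times access factor times fps."""
    frame_bytes = width * height * 1.5
    return frame_bytes * access_factor * fps / MIB

low = encoder_bandwidth_mib(3.0)   # lower end of the 3~3.5x range
high = encoder_bandwidth_mib(3.5)  # upper end, the figure used in the text

print(f"3.0x: {low:.0f} MiB/s")
print(f"3.5x: {high:.0f} MiB/s")
```

The upper end gives about 277 MiB/s, matching the figure above; the lower end gives about 237 MiB/s.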
Currently there is no way to determine the size through the OpenGL API. An approximate calculation from size/bpp is possible, but it will not be accurate, as it needs to account for the 4K alignment required by the GPU, memory fragmentation, etc.
Regards
Hi Bio_TICFSL,
Do you have any update?
Best Regards,
Keita
Hi Bio_TICFSL,
Thank you for your information!
However, the bottleneck (70 msec) was measured between timestamps taken immediately before and after the g2d_blit call.
Therefore, we would like to know the G2D performance (not the VPU performance).
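For reference, the kind of before/after measurement described above can be sketched as follows. This is an illustration only: `fake_blit` is a hypothetical stand-in that just sleeps, not the real g2d call, and `time_call_ms` is a made-up helper name. If the call being measured only queues work, the timed region would also need to include waiting for completion.

```python
import time

def time_call_ms(fn, *args, repeats=10):
    """Average wall-clock time of fn(*args) in milliseconds over several calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) * 1e3 / repeats

def fake_blit():
    # Hypothetical stand-in for the blit call; it only sleeps for 5 ms.
    time.sleep(0.005)

avg_ms = time_call_ms(fake_blit)
print(f"average: {avg_ms:.1f} ms per call")
```

Averaging over several calls reduces the effect of one-off costs such as first-call setup.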
Best Regards,
Keita