Is IPU really that slow on mx53?

tselmeci · ‎11-21-2013

Hi all!

Another issue while developing an IPTV player on the MX53QSB...

I'm using libipu.so to convert decoder's output and to display it on the framebuffer. So the configuration is the following:

- ipu_lib_input_param_t: input width/height is the decoded frame's w/h, colorspace is YUV420, crop win is the entire frame;

- ipu_lib_overlay_param_t: no overlay is configured;

- ipu_lib_output_param_t: width/height is the output's w/h (it scales up; for now I don't care about the aspect ratio), output window's w/h matches the framebuffer's w/h;

mxc_ipu_lib_task_init(...) is being used in OP_STREAM_MODE.

mxc_ipu_lib_task_buf_update(...) is being used to display the decoded frame. This call receives the physical address where the decoder (vpu_DecStartOneFrame(...)) has put the most recent frame, so I don't memcpy(...) anything; I just use the framebuffer allocated for the VPU decoder.

Results:

1) If I connect the Seiko LCD (800x480) and decode a full HD video (1920x1080), a call to mxc_ipu_lib_task_buf_update(...) takes about 1-2 ms, which is excellent considering that the IPU has to convert the colorspace and scale the input down. Similarly, when a 720x576 video is played, display times stay within 1-2 ms;

2) If I connect an HD TV through the SII902x HDMI interface at 1920x1080, a call to mxc_ipu_lib_task_buf_update(...) takes about 45-50 ms. That is very long, since it limits the video frame rate to ~20 fps, which is far from adequate. Whether the input image is 720x576 or 1920x1080, displaying is always this slow;
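The per-call times quoted above translate directly into a frame-rate ceiling. A minimal sketch in C, assuming the call is timed with two clock_gettime(CLOCK_MONOTONIC, ...) samples taken before and after it; elapsed_us and max_fps are hypothetical helper names, not part of libipu:

```c
/* Microseconds between two CLOCK_MONOTONIC samples, passed as the
 * tv_sec / tv_nsec fields of each struct timespec. */
long long elapsed_us(long long start_sec, long start_nsec,
                     long long end_sec, long end_nsec)
{
    long long ns = (end_sec - start_sec) * 1000000000LL
                 + ((long long)end_nsec - (long long)start_nsec);
    return ns / 1000;
}

/* Highest whole frame rate reachable if every frame costs frame_us
 * microseconds in the display call alone. */
long long max_fps(long long frame_us)
{
    return frame_us > 0 ? 1000000LL / frame_us : 0;
}
```

With these numbers, a 50 ms call caps playback at 20 FPS and a 45 ms call at 22 FPS, while 30 FPS requires each frame to finish within about 33.3 ms.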

With the RGB565 framebuffer format, the amount of data displayed per frame is ~768,000 bytes on the LCD and ~4,147,000 bytes on the HD TV. The ratio is ~5.4, but the IPU slows down by far more than that. What can be the reason? Running out of bandwidth or CPU horsepower?
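The frame sizes and the ratio quoted here can be reproduced with a few lines of C. This is only an illustration of the arithmetic; frame_bytes and size_ratio_x10 are hypothetical names, and RGB565 is taken as 16 bits per pixel:

```c
#include <stdint.h>

/* Bytes needed for one packed frame at the given bits per pixel. */
uint32_t frame_bytes(uint32_t width, uint32_t height, uint32_t bits_per_pixel)
{
    return width * height * bits_per_pixel / 8u;
}

/* Ratio of two frame sizes, scaled by 10 to stay in integer math
 * (a result of 54 means 5.4x). */
uint32_t size_ratio_x10(uint32_t big, uint32_t small)
{
    return small ? (big * 10u) / small : 0u;
}
```

For 800x480 this gives 768,000 bytes, for 1920x1080 it gives 4,147,200 bytes, and the ratio is 5.4. A jump from 1-2 ms to 45-50 ms per call is far larger than that, which is what makes the slowdown suspicious.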

I've had a try with TVE output, got the same results.

How can I achieve 30 FPS at HD resolution on an HD output?

Any hints appreciated, thanks, regards,

qiang_li-mpu_se · ‎12-03-2013

Hi Tamas,
The issue is most likely caused by the IPU IC module. The IC has a limit on output resolution: width and height must each be less than 1024 pixels.

In the normal case, after the VPU has decoded the video, the IPU lib or the V4L2 output driver is used to resize the frame and do the colorspace conversion (CSC) to fit the output display. The IC module performs the resize and CSC. When the display resolution is less than 1024x1024, such as an 800x480 display, the IC can finish the resize and CSC in one task. When your display is 1920x1080, both width and height are larger than 1024 pixels, so the IC uses four tasks for each frame: the whole 1920x1080 output frame is split into four regions, and the IC needs one task to resize or convert each region. This hurts performance.

When you play 1080p video on a 1080p display and put the video on fb2 in YUV format, the IC module can be skipped for better performance, since no resize, rotation, or CSC is needed. To check for this mode when you use the IPU lib for video rendering, add a debug message in the IPU lib file "mxc_ipu_hl_lib.c", function _ipu_task_check(), at "if (ipu_priv_handle->output.task_mode == NULL_MODE)". If this code runs, the IC has been bypassed for resize and CSC.

If you use V4L2 output for video rendering, check the kernel file "mxc_v4l2_output.c", function mxc_v4l2out_streamon(): if "dev_info(dev, "Bypassing IC\n");" has run, the IC has been bypassed as well.

When you render video in YUV format to fb2 (the overlay), the IPU DP can do the CSC before the data is sent to the display, even if your display uses an RGB format. For better performance, you'd better bypass the IC module, or do only the resize in the IC; resize + CSC costs more time in the IPU IC.
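The one-task versus four-task split described above can be modeled with a small C sketch. This is a simplified model under stated assumptions, not the actual logic of the IPU library or driver: it assumes the frame is cut into stripes of at most 1024 pixels along each axis, and IC_MAX_OUTPUT, stripes and ic_tasks_per_frame are hypothetical names.

```c
/* Assumed per-task output limit of the IC, taken from the reply above. */
#define IC_MAX_OUTPUT 1024u

/* Number of stripes needed along one axis: ceil(size / limit). */
unsigned stripes(unsigned size)
{
    return (size + IC_MAX_OUTPUT - 1u) / IC_MAX_OUTPUT;
}

/* Simplified model: tasks per frame = horizontal stripes x vertical stripes. */
unsigned ic_tasks_per_frame(unsigned out_width, unsigned out_height)
{
    return stripes(out_width) * stripes(out_height);
}
```

Under this model an 800x480 output needs one task and a 1920x1080 output needs four, which matches the behavior described in the reply. How the real driver handles sizes at or near the limit may differ.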

tselmeci · ‎12-04-2013

Hello Qiang!

Thanks for the reply.

At the moment I can easily give up image resizing, it's not that important.

So this is what you suggest I do:

-1) decode the video frame (whatever size it has);

-1.1) the decoded frame resides in a buffer with HW address;

-1.2) the frame is in YUV format;

-2) ask IPU to copy the data to /dev/fb2;

-2.1) is fb2 always in YUV mode?

-2.2) what if fb2's resolution is smaller than the video frame's res?

-2.3) this implies the IPU can't be used in OP_STREAM_MODE (since it uses three fbs); it must be used in OP_NORMAL_MODE with ipu_lib_output_param_t.fb_disp.fb_num = 2, is that correct?

-2.4) using only one framebuffer causes screen flickering, which is automatically avoided in OP_STREAM_MODE (auto panning);

Are the above correct?

If the IPU is only used to copy data from one DMA buffer to another, why shouldn't I use libpxp instead? Its input can be the decoded frame's address, and its output the HW address of fb2. Perhaps that's simpler than using the IPU...

tselmeci · ‎12-09-2013

It seems that only fb0 is visible, or at least transparency/alpha/color keying must be set on fb0/fb2 (background, foreground). I don't know yet; it will turn out later.

Here's how to set /dev/fb0 to YUV420P mode:

    /* Read the current variable screen info of the framebuffer. */
    if (ioctl(mstate.fb_handle, FBIOGET_VSCREENINFO, &(mstate.fb_vsi)) < 0)
        return -1;

    mstate.fb_vsi.xres = DISPLAY_RES_X;
    mstate.fb_vsi.yres = DISPLAY_RES_Y;
    mstate.fb_vsi.xres_virtual = DISPLAY_RES_X;
    /* Virtual height is 3x the visible height: room for three frames. */
    mstate.fb_vsi.yres_virtual = DISPLAY_RES_Y * 3;
    mstate.fb_vsi.activate |= FB_ACTIVATE_FORCE;
    /* Planar YUV 4:2:0 averages 12 bits per pixel. */
    mstate.fb_vsi.nonstd = IPU_PIX_FMT_YUV420P;
    mstate.fb_vsi.bits_per_pixel = 12;

    /* Apply the new mode. */
    if (ioctl(mstate.fb_handle, FBIOPUT_VSCREENINFO, &(mstate.fb_vsi)) < 0)
        return -1;
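For a framebuffer configured this way (12 bits per pixel, virtual height three times the visible height), the plane offsets of each frame can be computed as in the sketch below. It assumes the line stride equals the width, even dimensions, and frames stored back to back in Y, U, V plane order; the struct and function names are hypothetical.

```c
#include <stddef.h>

/* Byte offsets of the three planes of one YUV420P frame inside a
 * framebuffer that holds several frames back to back. */
struct yuv420p_offsets {
    size_t y;          /* start of the luma plane */
    size_t u;          /* start of the first chroma plane */
    size_t v;          /* start of the second chroma plane */
    size_t frame_size; /* bytes per frame: width * height * 3 / 2 */
};

struct yuv420p_offsets yuv420p_plane_offsets(size_t width, size_t height,
                                             size_t buffer_index)
{
    struct yuv420p_offsets o;
    size_t luma = width * height; /* one byte per pixel */
    size_t chroma = luma / 4;     /* each chroma plane is subsampled 2x2 */

    o.frame_size = luma + 2 * chroma;
    o.y = buffer_index * o.frame_size;
    o.u = o.y + luma;
    o.v = o.u + chroma;
    return o;
}
```

At 1920x1080 one frame occupies 3,110,400 bytes, so three frames need about 9.3 MB of framebuffer memory.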

qiang_li-mpu_se · ‎12-09-2013

Hi Tamas,

I think you should refer to the sample code in "imx-test-11.09.01\test\mxc_vpu_test\display.c".

That sample code outputs VPU-decoded video to a display (fb0, fb1, fb2) with the IPU lib or the V4L2 output driver. In the normal case we always put video on the overlay (output.show_to_fb = 1; output.fb_disp.fb_num = 2;). The IPU lib then enables fb2 and sets its format based on the "output" parameters.

The i.MX53 has no PXP, so libpxp can't work on it.

tselmeci · ‎12-10-2013

Well, I've figured out that I don't need IPU at all.

The framebuffer has been set to YUV420 mode, so just copying the frame data generated by VPU does the job. And it works!

There's only one terrible drawback.

The VPU's output framebuffers are allocated by IOGetPhyMem(...), which allocates a DMA buffer (dma_alloc_coherent). To access these buffers from userspace, IOGetVirtMem must be used, and yes, I can access the decoded frames through the virtual address it provides. Here comes the terrible drawback: memcpy(...) on this virtual address is unacceptably slow. This entire discussion is about this.

This is what the IPU was also supposed to do in my case: simply copy the frame data to the framebuffer.

Please help me feed the framebuffer with YUV420P data fast enough. I don't need anything else; I just want direct access to the framebuffer at a reasonable speed. IOGetVirtMem creates an awfully slow mapping. Furthermore, the memory allocated by dma_alloc_coherent(...) is itself very slow; I've tested this in kernel code.

If I malloc(...) a custom userspace buffer and memcpy(...) it over to the framebuffer's mmap'd address, everything is fast and works like a charm. But the VPU puts the decoded frame into a DMA buffer, which is slow. If there's a trick to force the VPU to put the decoded frame into a userspace buffer, I'd prefer that.

qiang_li-mpu_se · ‎12-16-2013

Copying the answer from the CRM SR here.

The buffer allocated with dma_alloc_coherent() is physically contiguous memory mapped in non-cached, write-buffered mode. CPU access to it is therefore slow, but the data in memory is guaranteed to be always correct, which is what the hardware needs. In this case you should avoid memory copies and just let the hardware use the buffer base address directly.

malloc() can't be used in this case, because it can't guarantee that the memory is physically contiguous.

Another option is to use pgprot_writethru() to map physically contiguous memory in cached, write-through mode. This can improve memory copy performance, but before the hardware uses the buffer you must sync the data between cache and memory manually with the dma_sync_single_for_cpu() and dma_sync_single_for_device() APIs.

tselmeci · ‎12-16-2013

During development I've done several experiments and learned how DMA works on ARM and Linux. It turned out that on ARM, memory can be uncached and/or unbuffered. For DMA, uncached memory is preferred: the DMA region doesn't take part in caching, so the memory contents can't change without every DMA "peer" seeing it. Roughly speaking, anyway. That's the reason why both the IPU and the VPU use dma_alloc_coherent everywhere. This seems to be the appropriate approach as long as the whole chain of decoding, processing, and displaying the image can be executed by the hardware. That is the case for decoding SD MPEG2, converting it to RGB565, and displaying it on an LCD, which puts insignificant load on the system.

My biggest concern is that when it comes to decoding HD-resolution H264, converting the colorspace, and displaying with the IPU, performance drops dramatically, and with the "traditional" tools and libs Freescale provides for the iMX53QSB it's not possible to play back an HD movie at 25-30 FPS. Investigating this issue revealed the nature of DMA memory: it's very slow when a userspace application reads from or writes to it, because it's intended only for hardware use. Although I've found some ways to speed it up a bit (including your proposal above), IPU performance is still an unbreakable bottleneck. This is for sure, and you have also admitted it in an email responding to one of my SRs.

I've finally found a proper way to decode and play back HD H264 in real time, but it's very far from the principles introduced in the IPU library documentation. To be honest, the IPU is completely discarded, and I've had to apply some tricks. The funny thing is that HD playback now works, but smaller resolutions still suffer from other (IPU) issues. Hopefully I've found the cure for those as well, but they're completely undocumented and took many hours of investigating the MXC framebuffer, IPU, and VPU kernel module sources.

tselmeci · ‎11-27-2013

I've got some partial results.

1) memcpy between two malloc'd buffers: high copy speed;

2) memcpy between a malloc'd buffer and an IOGetPhyMem/IOGetVirtMem'd buffer: slow;

Inspecting the source code of imx-lib and the kernel (2.6.35.3) mxc module, it turns out that IOGetPhyMem allocates memory in the DMA region and IOGetVirtMem simply mmaps that area to userspace using the file descriptor /dev/mxc_vpu. I'm not too familiar with mmap's internals, but I suppose that when memcpy accesses this buffer, it uses the virtual address, which must be converted to a physical address; this may involve MMU operations, a file descriptor, the VFS, etc., so I can imagine it's slow.

However, I saw in Freescale's VPU-test sources that they also use IOGetPhyMem/IOGetVirtMem for IPU buffers.

Can the DMA region itself be so slow?

Can the mmap make access to the allocated region so slow?

BrilliantovKiri · ‎11-27-2013

Hello, Tamas!

In my program I use only physical addresses: buffer from sensor -> ipu_lib_input_param_t.user_def_paddr[0] (IPU input) -> ipu_lib_output_param_t.user_def_paddr[0] (IPU output) -> vpu_mem_desc.phy_addr (VPU source)

If I use memcpy on virtual addresses, performance decreases.

tselmeci · ‎11-27-2013

Update:

I've patched the kernel to do some measurements.

1) the memory region allocated by IOGetPhyMem (which calls the kernel's VPU driver, which uses dma_alloc_coherent(...)) has a bandwidth of 80-90 MB/s;

2) if I allocate 1 MB in the kernel with kmalloc, its bandwidth is 300-400 MB/s;

Does anybody have any idea why memory from the DMA region is so slow?
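To put these bandwidth figures in perspective, the sketch below converts a measured copy into MB/s and estimates how long a copy of a given size takes at a given bandwidth. The helper names are hypothetical, 1 MB is taken as 1,000,000 bytes, and the 1080p YUV420P frame size (1920 x 1080 x 1.5 = 3,110,400 bytes) is used only as an example workload.

```c
#include <stdint.h>

/* Throughput in MB/s (1 MB = 1,000,000 bytes), i.e. bytes per microsecond. */
uint64_t bandwidth_mb_per_s(uint64_t bytes, uint64_t elapsed_us)
{
    return elapsed_us ? bytes / elapsed_us : 0;
}

/* Microseconds needed to move `bytes` at `mb_per_s`, rounded up. */
uint64_t copy_time_us(uint64_t bytes, uint64_t mb_per_s)
{
    return mb_per_s ? (bytes + mb_per_s - 1) / mb_per_s : 0;
}
```

Under these assumptions, copying one 1080p YUV420P frame at 80-90 MB/s takes roughly 35-39 ms, which is already above the 33.3 ms per-frame budget of 30 FPS. At 300-400 MB/s the same copy would take roughly 8-10 ms.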

BrilliantovKiri · ‎11-22-2013

Hello, Tamas!

Very interesting question; I am also waiting for an answer to it.

I am writing an application that encodes video from a CMOS sensor, and I use the IPU colorspace conversion to prepare the captured buffer for the codec: it converts YUV422 to YUV420 and resizes if needed. Currently I work with a 1280x720@25 stream.

Unfortunately, the performance is not stable:

Processed 25 frames in 1556671 usec (16.06 FPS)

Processed 25 frames in 2433715 usec (10.27 FPS)

Processed 25 frames in 3258164 usec (7.67 FPS)

Processed 25 frames in 2094156 usec (11.94 FPS)

Processed 25 frames in 2931335 usec (8.53 FPS)

Processed 25 frames in 1769604 usec (14.13 FPS)

With the test sensor I get stable performance.
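The FPS values in the log follow from the frame count and the elapsed time. A minimal sketch of that calculation in integer math; fps_x100 is a hypothetical helper name, not taken from the application that produced the log:

```c
/* Average frame rate in hundredths of a frame per second, rounded to
 * nearest (a result of 1606 means 16.06 FPS). */
unsigned long long fps_x100(unsigned long long frames,
                            unsigned long long elapsed_us)
{
    if (elapsed_us == 0)
        return 0;
    /* frames / (elapsed_us / 1e6) * 100 = frames * 1e8 / elapsed_us */
    return (frames * 100000000ULL + elapsed_us / 2) / elapsed_us;
}
```

For a 25 FPS source, each batch of 25 frames should take about 1,000,000 usec, so the logged batches of 1.5 to 3.3 seconds correspond to roughly 8-16 FPS.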
