As part of our product we are required to deinterlace a 30 fps 1080i stream (received over Ethernet) after it has been decoded by the VPU, and blend it with our user interface for display over HDMI at an output resolution of 1920x1080 (ideally at 60 Hz, but 30 Hz would suffice). We have tried many different approaches, but none of them appears to leave enough headroom on the MMDC to do any meaningful graphics rendering (subtitles) in addition to decoding and deinterlacing the video stream.
The technique that comes closest to being viable is to take an NV12-formatted buffer from the VPU and pass it to the IPU's VDIC block, which then writes the deinterlaced frame directly to memory in the overlay framebuffer (again in NV12 format). We then page flip the overlay after each new frame is rendered. With this technique we do not have to perform any memcpys on the CPU side: the VPU fills a buffer with the decoded frame, and the IPU fills the framebuffer memory with the deinterlaced frame.
To optimize the transfer to the IPU we've incorporated a kernel change (created by 'wolfgar' for Kodi) that increases the IPU read size for the VDIC IDMAC channel to 64 bytes. We're also clocking the VPU at 352 MHz to help reduce the VPU cost. Keeping the buffers in NV12 format through the entire flow seems to help, since the MEM-to-IPU and IPU-to-MEM transfers are smaller (12 bpp).
Unfortunately, even this approach does not leave enough headroom for us to render our user interface elements (roll-up/pop-up subtitles) without maxing out the MMDC, which leads to video playback issues because the VPU and IPU are starved for bus cycles. Using the overlay framebuffer comes with its own penalty, since it doubles the memory bandwidth the IPU requires to output the blended framebuffers to HDMI, but this seems to be more efficient on the MMDC side than trying to blend everything using the GPU.
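
To put rough numbers on the flow described above, here is a back-of-envelope sketch in Python. It is not a measurement. The 32 bpp UI framebuffer, the 60 Hz display refresh, and the assumption that the VDIC reads only one frame's worth of field data per output frame are all illustrative guesses, and the estimate ignores bus overhead, bitstream traffic, and the VPU's own reference-frame reads.

```python
# Rough MMDC traffic estimate for the VPU -> VDIC -> overlay -> HDMI path.
# Back-of-envelope only; see the assumptions flagged in the comments.

WIDTH, HEIGHT = 1920, 1080
NV12_BYTES_PER_PIXEL = 1.5   # NV12 is 12 bpp
UI_BYTES_PER_PIXEL = 4       # ASSUMPTION: 32 bpp UI framebuffer
VIDEO_FPS = 30
DISPLAY_HZ = 60              # ASSUMPTION: HDMI refresh at 60 Hz


def frame_bytes(bytes_per_pixel):
    """Size in bytes of one 1920x1080 frame at the given pixel depth."""
    return int(WIDTH * HEIGHT * bytes_per_pixel)


def estimate_mb_per_s():
    """Return estimated traffic per pipeline stage in MB/s (1 MB = 1e6 bytes)."""
    nv12 = frame_bytes(NV12_BYTES_PER_PIXEL)
    ui = frame_bytes(UI_BYTES_PER_PIXEL)
    traffic = {
        "vpu_write": nv12 * VIDEO_FPS,
        # ASSUMPTION: one frame's worth of field data read per output frame;
        # a motion-adaptive mode may read more.
        "vdic_read": nv12 * VIDEO_FPS,
        "vdic_write": nv12 * VIDEO_FPS,
        "display_read_overlay": nv12 * DISPLAY_HZ,
        "display_read_ui": ui * DISPLAY_HZ,
    }
    return {stage: total / 1e6 for stage, total in traffic.items()}


if __name__ == "__main__":
    stages = estimate_mb_per_s()
    for stage, mb in stages.items():
        print(f"{stage:22s} {mb:8.1f} MB/s")
    print(f"{'total':22s} {sum(stages.values()):8.1f} MB/s")
```

Under these assumptions the two display reads (overlay plus UI plane) account for more traffic than the decode and deinterlace stages combined, which is consistent with the overlay penalty noted above.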
Here are a few other approaches we've tried that also do _not_ work:
1) Configure the VDIC to use the IC to scale the image to 1280x720 before sending it back to memory. Then use the GPU to upscale the image back to 1920x1080 during compositing with the rest of the UI.
2) A variation of the first technique, but instead composite the UI and the video frame at 1280x720 and then use the IC to scale the entire framebuffer to 1920x1080 before display.
3) Avoid the VDIC and instead perform a BOB deinterlace on the ARM core.
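
For readers unfamiliar with the term in approach 3, a line-doubling BOB deinterlace has roughly the following shape. This is an illustrative Python sketch only: the function name and the list-of-rows frame layout are assumptions, and a real implementation on the ARM core would operate on the NV12 planes in C or NEON. It assumes an even frame height (true for 1080 lines).

```python
def bob_deinterlace(frame, top_field=True):
    """Rebuild a full-height frame from a single field by line doubling.

    frame: list of rows holding both fields woven together
           (even rows = top field, odd rows = bottom field).
    """
    start = 0 if top_field else 1
    field = frame[start::2]          # keep only the rows of the chosen field
    out = []
    for row in field:
        out.append(list(row))        # original field line
        out.append(list(row))        # duplicated line fills the missing row
    return out


# Usage: a tiny 4-row "frame" whose pixel values equal the row index.
woven = [[0, 0], [1, 1], [2, 2], [3, 3]]
top = bob_deinterlace(woven, top_field=True)
bottom = bob_deinterlace(woven, top_field=False)
```

Even in this simplest form, the CPU has to read half of every frame and write a full frame back, so all of that traffic still crosses the MMDC in addition to the VPU and display traffic.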
We are hoping for assistance with strategies for reducing the MMDC bottleneck. Here are a few questions whose answers could potentially improve the situation:
1) The VPU/VDOA documentation states that the VPU can output buffers in a tiled NV12 format and that the VDOA can be used to convert those buffers into an IPU supported format before relaying the data to the IPU. A quick experiment using the VPU to output a tiled format seemed to result in higher utilization. Should such a configuration utilize the MMDC more efficiently? Are additional steps required beyond setting the 'map type' flag to a tiled format on the decoder open parameters?
2) When the VDIC writes the deinterlaced frame back to memory, mmdc_prof shows the average write burst size as 16 bytes. Is there any way to configure VDIC/IDMAC/IPU to output larger bursts?
3) Are there other strategies we could utilize that will help us reduce MMDC utilization?
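
To illustrate why the burst size in question 2 matters to us, the sketch below counts the write bursts needed per second for one 1920x1080 NV12 frame stream at 30 fps. The 64-byte figure is hypothetical (chosen to match the read size we already use on the VDIC IDMAC channel); we do not know whether the write side can be configured that way, and the actual effect on MMDC efficiency depends on the DDR controller.

```python
NV12_FRAME_BYTES = 1920 * 1080 * 3 // 2   # 12 bpp -> 3,110,400 bytes
VIDEO_FPS = 30


def bursts_per_second(frame_bytes, fps, burst_bytes):
    """Number of bursts per second needed to write `fps` frames of
    `frame_bytes` each, rounding up to whole bursts per frame."""
    bursts_per_frame = -(-frame_bytes // burst_bytes)  # ceiling division
    return bursts_per_frame * fps


observed = bursts_per_second(NV12_FRAME_BYTES, VIDEO_FPS, 16)      # mmdc_prof average
hypothetical = bursts_per_second(NV12_FRAME_BYTES, VIDEO_FPS, 64)  # HYPOTHETICAL target

print(observed, hypothetical, observed / hypothetical)
```

The same amount of data is written either way; the only difference is the number of separate write transactions the MMDC has to service.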