Camera Capture through IPU and VPU with minimum GStreamer usage

ilkodossev
Contributor III

People,

In a Linux system, GStreamer does a sufficient job capturing a couple of video streams from a camera and delivering them to a file or network sink; for more and bigger streams, however, it does too much memcpy-ing for good performance.

Native MX6 cameras can stream directly through the IPU and VPU to the display, but this pipeline is not available for PCI-E multi-camera adapters.

The architecture I am currently working with is the Intersil TW6869 PCI-E chip, which can deliver up to 8 streams over DMA, in RGB16 or UYVY format, at 30 FPS.

I am able to pass the DMA frame buffer for each camera directly from my Capture Driver to the IPU for conversion to either NV12 or I420 planar format. The IPU returns the result in DMA buffers of its own, which I allocate at the start of streaming. But passing these buffers afterwards to User Space for processing by GStreamer turns out to be very performance-degrading; my estimate is that at least 4 "memcpy" operations take place on buffers which are 3/4 the size of a 640*480*16bpp frame.

My idea is to feed these NV12 (or I420) buffers directly to the VPU for encoding into a compressed format, such as H.264, suitable for streaming over network connections.

However, the interface to the VPU driver "mxc_vpu.ko" is much more complicated than the interface to the "mxc_ipu.ko" driver: one user-mode library, "libvpu.so", sits on top of the VPU Kernel driver, and above it is the FSL GST plugin for the Video Encoder. I am certain that such an architecture is feasible (at least in theory), but:

1. Is this IPU-to-VPU data & control connectivity practically usable for MX6 Quad or Dual platforms?... and

2. If yes, then what would be the most efficient way to take control of the data streams from the Camera Capture Driver code?

Thanks for your opinions and advice,

Ilko Dossev

Qualnetics

10 Replies
ilkodossev
Contributor III

I found a solution for pipeline (A) -- exporting DMA buffers through V4L2 to another GST plugin.

As it is, the FSL plugins use two different methods to detect whether passed buffers are DMA-able -- one is proprietary to FSL, the other is more universal.

The proprietary method uses "_gst_reserved", and patching any GST plugin requires reference to the FSL sources or library code.

I am describing my more universal implementation in a separate post within iMX community:

"How to pass DMA buffers from a Driver thru GStreamer 'v4l2src' plugin to Freescale IPU/VPU plugins"

Much credit is due to Peng Zhou for his to-the-point comments and responses. They helped me keep my focus in the right direction.

ilkodossev
Contributor III

The streaming direction is the opposite of the usual path; typically, the display is the destination. For my goals, the final sink for the stream is the WiFi driver.

6 cameras, capturing 640*480 frames into ping-pong buffers at 30 FPS on an iMX6Quad platform, are almost working! :smileyhappy:

This is the currently working implementation:
CamCap => [RGB DMA buffer] => IPU => [YUV420 DMA buffer] => GStreamer => [YUV420 DMA buffer] => VPU => [AVC DMA buffer] => GStreamer => final sink

To my greatest chagrin, GStreamer operations involve a lot of memcpy from and to DMA buffers, which is a performance killer.

A useful practical implementation would be this:
CamCap => [RGB DMA buffer] => IPU => [YUV420 DMA buffer] => VPU => [AVC DMA buffer] => GStreamer => final sink

Feeding the IPU output directly to the VPU input would save a ton of memcpy operations, leaving only one at the end, when the compressed data are passed to GStreamer.

The ideal dream-stream would be setting up the IPU and VPU to work in sequence, passing data internally:
CamCap => [RGB DMA buffer] => IPU => [...] => VPU => [AVC DMA buffer] => GStreamer => final sink

There is one catch here, though -- the time budget for processing a frame is about 2.5 milliseconds, and the combined IPU/VPU team has to deliver within this limit!

(A total of roughly 360 processing operations per second for the whole system -- six cameras at 30 FPS, each frame needing an IPU pass and a VPU pass -- leaves less than 3 milliseconds of processing time for each step.)

I moved a lot of source code from the "libvpu" library to my Kernel Driver, and I am able to initialize the VPU properly from my KO.

Next step -- which would finalize the "useful practical implementation" -- is to set up VPU processing of input frames properly (a minimal sketch of this loop follows the sequence below):

  vpu_EncStartOneFrame

  ...wait-for-VPU-interrupt...

  vpu_EncGetOutputInfo
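
For reference, here is a minimal sketch of that per-frame loop against the imx-lib VPU API (function and field names are from my reading of vpu_lib.h / vpu_io.h in the Linux-3.0.35 BSP; treat it as an outline, not the exact code in my driver):

  #include <string.h>
  #include "vpu_lib.h"
  #include "vpu_io.h"

  /* Encode one IPU-produced YUV420 frame; returns the compressed size or -1. */
  static int encode_one_frame(EncHandle handle, FrameBuffer *src_frame)
  {
      EncParam      param;
      EncOutputInfo out;

      memset(&param, 0, sizeof(param));
      param.sourceFrame = src_frame;          /* YUV420 DMA buffer from the IPU */

      if (vpu_EncStartOneFrame(handle, &param) != RETCODE_SUCCESS)
          return -1;

      /* Block until the VPU raises its frame-complete interrupt;
         100 ms is an arbitrary timeout for this sketch. */
      while (vpu_IsBusy())
          vpu_WaitForInt(100);

      if (vpu_EncGetOutputInfo(handle, &out) != RETCODE_SUCCESS)
          return -1;

      /* out.bitstreamBuffer / out.bitstreamSize now describe the
         compressed frame in the bitstream DMA buffer. */
      return (int)out.bitstreamSize;
  }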


It seems that the wait for the interrupt does not always complete properly.
Any ideas what might be wrong?

I.D.

eaglezhou
NXP Employee

Hi, Ilko

    Regarding what you mentioned -- "Native MX6 Cameras can create stream directly through IPU and VPU to the Display" --
    I am not very clear on why additional memcpy operations take place on the YUV buffers in the "PCI-E multi-camera" case;
    all YUV buffers should be passed by pointer, even from kernel to user space, instead of being memcpy'd.
    So I still don't understand why you need to port the VPU library from user space into kernel space.

    Can you point to where the memcpy operations happen in GStreamer?
    What does your final gst-plugin pipeline (gst-launch command) look like?

    Back to your questions:

      1. The two APIs below are only used for ringbuffer mode, which is not used by our plugin.
         In ringbuffer mode, an additional operation is required to copy input data into the internal bitstream buffer.
            vpu_EncGetBitstreamBuffer
            vpu_EncUpdateBitstreamBuffer
      2. "It seems that not all the time waiting for the interrupt completes properly"
           vpu_EncStartOneFrame
           ...wait-for-VPU-interrupt...
           vpu_EncGetOutputInfo
         All frame interrupts are supposed to complete properly.
         Please double-check whether this issue also happens on Freescale's official release, and not only in your own kernel port.
      3. The IPU supports many conversion formats from RGB to YUV (I420, NV12, TNVP).
         In general, the TNVP (tile) format should have the best performance.

Eagle

ilkodossev
Contributor III

I am using LTIB for iWave, Linux-3.0.35. The cameras are not native to the iMX6 platform; the Camera Capture chip is a TW6869 on the PCI-e bus.

Currently working implementation:

CamCap => [RGB DMA buffer] => IPU => [YUV420 DMA buffer] => GStreamer => [YUV420 DMA buffer] => VPU => [AVC DMA buffer] => GStreamer => final sink

gst-launch -v --gst-debug-level=1 v4l2src device=/dev/video5 ! video/x-raw-yuv,format=(fourcc)I420,width=(int)640,height=(int)480,framerate=30/1 ! vpuenc codec=6 seqheader-method=3 bitrate=500000 gopsize=15 quant=10 framerate-nu=30 framerate-de=1 force-framerate=true ! mpegtsmux ! tcpserversink port=5005

The Camera Capture Driver invokes the IPU Kernel Driver "mxc_ipu.ko" directly and submits the conversion tasks. The IPU driver sources are in the ".../linux-3.0.35/drivers/mxc/ipu3/" folder.

The YUV420 buffer, as delivered by the IPU, is exported to GStreamer by remapping its physical address to a User VAddr. Full efficiency & performance here.
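
For illustration, a minimal sketch of that export in the capture driver's mmap handler, using the standard remap_pfn_range() kernel API (the handler name and the buffer bookkeeping are hypothetical; this is not my exact code):

  #include <linux/fs.h>
  #include <linux/mm.h>
  #include <asm/pgtable.h>

  /* Map a physically contiguous IPU output buffer into the calling
     process so GStreamer can read it without any memcpy. */
  static int capture_mmap(struct file *file, struct vm_area_struct *vma)
  {
      unsigned long phys = 0;   /* filled in from the driver's buffer bookkeeping */
      unsigned long len  = vma->vm_end - vma->vm_start;

      vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

      if (remap_pfn_range(vma, vma->vm_start, phys >> PAGE_SHIFT,
                          len, vma->vm_page_prot))
          return -EAGAIN;

      return 0;
  }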

GStreamer invokes the VPU through this stack of modules:

  libgstvideo4linux2.so -- GST plugin; source files in the ".../ltib/rpm/BUILD/gst-plugins-good-0.10.30/sys/v4l2/" folder; the VPU-related source is in "gstv4l2object.c"

  libvpu.so.4 -- VPU Application Driver; source files in the ".../ltib/rpm/BUILD/imx-lib-3.0.35-4.0.0/vpu/" folder; portions of these sources are ported to the Kernel

  mxc_vpu.ko -- VPU Kernel Driver; source files in the ".../linux-3.0.35/drivers/mxc/vpu/" folder

The GStreamer plugin "libgstvideo4linux2.so" is responsible for the memcpy. Even a plain visual inspection of the source file "gstv4l2object.c" reveals the extent of its use.

In addition, for each frame subject to VPU compression, the GStreamer plugin executes the following inefficient sequence:

  GetVpuPhysMem -- remap to VAddr -- copy YUV frame -- invoke VPU -- get compressed frame -- unmap VAddr -- FreeVpuPhysMem

This is the performance bottleneck I want to eliminate, by invoking the VPU directly, in the same manner in which I invoke the IPU directly from the Camera Capture Driver.

VPU interrupts work perfectly well in this scenario; the native libvpu.so.4 and mxc_vpu.ko drivers manage interrupts smoothly.

The TNVP format mentioned above does not appear to be accepted as a FOURCC by either GStreamer or the mxc_ipu.ko Kernel Driver. The IPUv3 software deals with all kinds of YUV formats, yet a TNVP FOURCC is not seen in any of the IPUv3 source files. Could it be that this format is only applicable to the VPU decoder? If I am to try TNVP, what GStreamer command line would be suggested?

Practical efficient implementation under construction:

CamCap => [RGB DMA buffer] => IPU => [YUV420 DMA buffer] => VPU => [AVC DMA buffer] => GStreamer => final sink

gst-launch -v --gst-debug-level=1 v4l2src device=/dev/video5 ! video/x-h264,format=(fourcc)H264,width=(int)640,height=(int)480,framerate=30/1 ! mpegtsmux ! tcpserversink port=5005

This is the work-in-progress I need help with. In this architecture, the YUV420 output from the IPU is fed directly to the VPU, and only then is the compressed AVC frame passed to GST.

In all cases, there has to be one memcpy from the VPU output DMA buffer to the GST user buffer, as the frame header must be prepended to I-frames when they are delivered.

Only in this ported code are the VPU interrupts failing; this is definitely a problem with the port and not with the original implementation.

Because the H264 FOURCC is not recognized by the libgstvideo4linux2.so plugin, I had to apply a patch to the source file "gstv4l2object.c" and rebuild the library.

It would be interesting to know whether, in this direct-call-to-VPU scenario, ringbuffer mode may be applicable.

That is, would it be possible to link IPU and VPU execution together, so that when the camera delivers an RGB frame, that frame can be fed to the IPU and the ultimate output delivered as a compressed frame by the VPU? That would be the ideal dream-stream implementation.

I.D.

eaglezhou
NXP Employee

Hi, Ilko

 

  So it seems you are modifying the open-source v4l2src plugin.

  I think you can refer to our mfw_gst_v4lsrc plugin to check how it avoids the YUV buffer copy.

  mfw_v4lsrc allocates a physical frame buffer and passes it (through the field _gst_reserved[]) to the VPU;

  the VPU plugin avoids the buffer copy by checking the field '_gst_reserved' to see whether the buffer is physical memory.

  As for ringbuffer mode, I think it is less efficient,

  and additional effort is required since its API behavior is different.

  For your use case (you just want to avoid memcpy), I think I420/NV12 are enough.

Eagle

ilkodossev
Contributor III

Thanks for the clarification regarding the ringbuffer! It certainly helps to know not to invest any effort in that direction.

The reference to "mfw_gst_v4lsrc.c" is useful, too.


Yes, I patched 'v4l2src' to let it recognize H264 fourcc, but nothing beyond that.

I am not touching either of the FSL plugins -- neither the VPU nor the IPU plugin -- except for some performance traces.

I looked up, however, the private FSL mechanism used to detect DMA-capable mmap-ed buffers.

IS_DMABLE_BUFFER and DMABLE_BUFFER_PHY_ADDR macros serve this purpose, in both IPU and VPU plugins.

Certainly ".../ltib/rpm/BUILD/gst-fsl-plugins-3.0.7/src/misc/v4l_source/src/mfw_gst_v4lsrc.c" will provide useful reference.

I doubt it would be proper to use these macros directly in the 'v4l2src' plugin in the ".../gst-plugins-good-0.10.30/" branch, so I will have to port the implementation.
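
Roughly, the idea those macros implement (a conceptual sketch only, not the actual FSL definitions from "mfw_gst_utils.c/h"; the names, tag value, and slot layout here are hypothetical) is to stash a tagged descriptor holding the physical address in one of GstBuffer's reserved slots, and have the downstream hardware plugin check for the tag:

  #include <gst/gst.h>

  #define MY_DMABLE_MAGIC 0x44414d41u       /* hypothetical tag value */

  typedef struct {
      guint32 magic;       /* proves the descriptor is valid       */
      gulong  phys_addr;   /* physical address of the mapped data  */
  } MyDmaMeta;

  /* Producer side: mark the buffer as DMA-able. */
  static void my_buffer_mark_dmable (GstBuffer *buf, MyDmaMeta *meta, gulong phys)
  {
      meta->magic     = MY_DMABLE_MAGIC;
      meta->phys_addr = phys;
      buf->_gst_reserved[0] = meta;         /* slot choice is illustrative */
  }

  /* Consumer side: recover the physical address, or 0 if the tag is absent. */
  static gulong my_buffer_get_phys (GstBuffer *buf)
  {
      MyDmaMeta *meta = (MyDmaMeta *) buf->_gst_reserved[0];
      return (meta && meta->magic == MY_DMABLE_MAGIC) ? meta->phys_addr : 0;
  }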

ALTERNATIVE
I figured out the problems with the VPU interrupts, and now I can invoke the IPU and VPU in cascade and obtain H264-compressed data.

But at this time I do not know how to tell the 'v4l2src' plugin -- and GST -- about the captured H264 frame.

Just passing the buffer does not work; the pipeline seems open and ready to stream, but no data goes to the network sink.

Definitely, proper code needs to be added to the "vidioc_dqbuf" callback, and maybe 'v4l2src' needs more patches.

At this time, I have two alternative possible solutions:

(A) modify the 'v4l2src' plugin to make its operation similar to the FSL plugins, so that the DMA address is detected; or
(B) figure out how to export H264 data to GStreamer via the 'v4l2src' plugin -- again, 'v4l2src' may require attention.

Performance-wise, it seems that (B) will fare better.

I.D.

eaglezhou
NXP Employee

Hi, Ilko

    I think option (A) is more modular and needs less effort than (B). What you need to do is just port the method used for the physical address transfer.

    For option (B), more effort is needed to merge the capture and encoder operations into the same plugin.

Eagle

ilkodossev
Contributor III

Some progress has been made since I posted this question on the forums a few days ago.

My initial step was to move Encoder functionality from "libvpu.so.4" to Kernel space.

In the process, I had to add a few exports to "mxc_vpu.ko" so that access from my Kernel Driver can bypass some IOCTL calls.

The initialization functions, and these few below, are now in my Kernel Driver (a minimal open/init sketch follows the list):
vpu_EncOpen
vpu_EncClose
vpu_EncGetInitialInfo
vpu_EncRegisterFrameBuffer
vpu_EncGiveCommand
vpu_EncStartOneFrame
vpu_EncGetOutputInfo
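
For context, a minimal open/init sketch against that API (the EncOpenParam field names are from my reading of "vpu_lib.h" and may vary slightly between imx-lib releases; the 640*480, 30 FPS, 500 kbps, GOP 15 values simply match my camera setup):

  #include <string.h>
  #include "vpu_lib.h"

  /* Open an H.264 encoder instance and query how many frame buffers
     must later be registered with vpu_EncRegisterFrameBuffer(). */
  static RetCode open_avc_encoder(EncHandle *handle,
                                  PhysicalAddress bs_buf, int bs_size,
                                  EncInitialInfo *init_info)
  {
      EncOpenParam op;
      RetCode ret;

      memset(&op, 0, sizeof(op));
      op.bitstreamBuffer     = bs_buf;    /* DMA bitstream buffer (phys addr) */
      op.bitstreamBufferSize = bs_size;
      op.bitstreamFormat     = STD_AVC;   /* H.264 */
      op.picWidth            = 640;
      op.picHeight           = 480;
      op.frameRateInfo       = 30;
      op.bitRate             = 500;       /* kbps */
      op.gopSize             = 15;

      ret = vpu_EncOpen(handle, &op);
      if (ret != RETCODE_SUCCESS)
          return ret;

      return vpu_EncGetInitialInfo(*handle, init_info);
  }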

So far - so good; but more questions remain to be resolved.

First, what is the purpose and usability of these two Encoder functions:

vpu_EncGetBitstreamBuffer

vpu_EncUpdateBitstreamBuffer

These two do not appear to be invoked while GStreamer operates the IPU and VPU plugins/drivers.

Second, and most important: what would be the best setup for the encoding, minimizing allocations of DMA/VMem buffers and achieving maximum performance?

The pipeline for sending a decoded video stream to the display seems highly optimized for performance, whatever the source might be; but the opposite direction -- capturing and storing a video stream to a sink which is not the display -- still needs performance improvements.

I.D.

admin
Specialist II

Hi Ilko,

I think this issue would get more visibility and responses if you moved it to the i.MX Community.  You should see a "Move" option on the right-hand side of this page.

Regards,

Grant

ilkodossev
Contributor III

I definitely cannot find that "Move" option, else I'd have used it already. :smileysad:
