Most hardware nowadays is capable of DMA transfers of large amounts of data. Notably, Camera Capture Drivers (or any driver which delivers data from an outside source to memory) can request a DMA-capable buffer and put data into it.
While the V4L2 architecture allows any driver to configure its DMA memory usage, it does not facilitate export of such information to Clients. Specifically, the GStreamer plugin implementation suffers terribly from the need to do an incredible amount of byte-pushing in User Space in order to pass buffer data from plugin to plugin. The incurred performance penalty is enormous!
The implementation I worked on involved the TW6869 Camera capture chip, which is DMA-capable; the captured RGB (or YUV) buffer had to be converted and compressed before further processing. Naturally, the IPU mirroring capabilities and the VPU compression capabilities determined the pipeline configuration. GST came as the natural pipeline builder, together with its performance penalty mentioned above.
Hence, the goal is to connect the "v4l2src" plugin with the FSL plugins in a manner which allows recognition and use of DMA buffers, thus avoiding memcpy of the data altogether.
Freescale's plugins do employ an internal architecture for passing DMABLE info to each other, but in addition to it, these plugins use a more universal method to detect a DMA address.
The methods employed by the Freescale plugins require patches in both the "Video4Linux2" subsystem and GStreamer's "libgstvideo4linux2.so" library.
Credit is due to Peng Zhou from Freescale for providing me with to-the-point information regarding Freescale's DMA implementation.
V4L2 uses the "v4l2_buffer" structure (videodev2.h) to communicate with its clients. This structure, however, does not natively indicate to the Client whether the buffer just captured has a DMA address.
Two members in the structure need attention: "flags" and "reserved".
For "flags", a new flag has to be defined:
#define V4L2_BUF_FLAG_DMABLE 0x40000000 /* buffer is DMA-mapped, 'reserved' contains DMA phy address */
And, accordingly, the Driver has to set the flag and fill in the "reserved" member in both its "vidioc_querybuf" and "vidioc_dqbuf" implementations.
This done, all V4L2 clients will get the DMA information from the Driver, for each of its buffers.
GST plugin patch
This patch is more difficult to implement. It is not enough to acquire a DMA address from the V4L2 driver; that information must also be passed along to the next plugin in the pipeline.
Plugins communicate with each other using the "GstBuffer" structure, which, like its V4L2 cousin, has no provisions for passing DMA information.
Naturally, in a pipeline where FSL plugins are involved, an FSL-compatible method should be implemented. These are the macros used in FSL plugins:
The most FSL-compatible way to implement DMABLE would be to use the "_gst_reserved" extension in the structure. If maximum compatibility with FSL is the goal, then take a closer look at the "gstbufmeta.c" and "gstbufmeta.h" files in the "gst-fsl-plugins-3.0.7/libs/gstbufmeta/" folder. Alternatively, GST_BUFFER_FLAG_LAST may be used to mark a DMA buffer and GST_BUFFER_OFFSET to carry its physical address.
The sources of GStreamer's "libgstvideo4linux2.so" need patches in several places. First of all, right after the calls to VIDIOC_QUERYBUF and VIDIOC_DQBUF, the information from "v4l2_buffer" must be put into the "GstBuffer" instance. Next, the "need_copy" and "always_copy" flags have to be overridden whenever GST_BUFFER_FLAG_LAST is set. For reasons I cannot explain, the macro PROP_DEF_ALWAYS_COPY in the GST sources is set to TRUE, and it is the default for "always_copy"! The final, and absolutely essential, patch is to prevent overwriting of the GST_BUFFER_OFFSET member with the sequential buffer number. Very! Bad! Things! will happen to the system if an improper value is passed along as a physical address! This last fix can be made conditional upon the presence of GST_BUFFER_FLAG_LAST.
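The three changes to the library boil down to a small piece of logic. The sketch below models it in plain C with stand-in types, so that it compiles without the GStreamer headers. Everything named "fake_..." or "FAKE_..." is hypothetical: "struct fake_gst_buffer" stands for GstBuffer, FAKE_FLAG_DMA for GST_BUFFER_FLAG_LAST, and the two functions for code inside the v4l2src sources. It illustrates the idea and is not the actual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define V4L2_BUF_FLAG_DMABLE 0x40000000u  /* flag proposed in this article */
#define FAKE_FLAG_DMA        0x1u         /* stand-in for GST_BUFFER_FLAG_LAST */

/* Stand-in for the v4l2_buffer members of interest. */
struct fake_v4l2_buffer {
    uint32_t flags;
    uint32_t reserved;   /* DMA physical address when DMABLE is set */
    uint32_t sequence;   /* sequential buffer number */
};

/* Stand-in for GstBuffer: the flags and the offset field. */
struct fake_gst_buffer {
    uint32_t flags;
    uint64_t offset;     /* stand-in for GST_BUFFER_OFFSET */
};

/* Patches 1 and 3: copy the DMA information into the outgoing buffer and
 * keep the sequence number from clobbering the physical address. */
static void import_v4l2_info(struct fake_gst_buffer *gb,
                             const struct fake_v4l2_buffer *vb)
{
    if (vb->flags & V4L2_BUF_FLAG_DMABLE) {
        gb->flags |= FAKE_FLAG_DMA;
        gb->offset = vb->reserved;       /* physical address must survive */
    } else {
        gb->flags &= ~FAKE_FLAG_DMA;
        gb->offset = vb->sequence;       /* unpatched behaviour */
    }
}

/* Patch 2: a DMA buffer is never copied, whatever the copy flags say. */
static bool must_copy(const struct fake_gst_buffer *gb,
                      bool always_copy, bool need_copy)
{
    if (gb->flags & FAKE_FLAG_DMA)
        return false;
    return always_copy || need_copy;
}
```

Keeping the address write and the sequence-number write in the same branch makes it hard to reintroduce the dangerous overwrite by accident.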
Once these patches are in place, the pipeline is ready to roll, and roll it will, really fast! My measurements, made by injecting timing traces into FSL's "vpuenc" plugin, demonstrated that the time for acquisition of the GST input buffer was reduced on average from 3+ milliseconds to 3-5 microseconds; and this improvement does not include the savings from avoiding the copy inside "libgstvideo4linux2.so" caused by the "always_copy" flag!
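A timing trace of this kind can be as simple as two monotonic clock readings around the code of interest. The sketch below is an assumption about how such a trace might look, not the code actually injected into "vpuenc"; "elapsed_us" is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Microseconds elapsed between two clock readings (hypothetical helper).
 *
 * Typical use around the code being measured:
 *   struct timespec t0, t1;
 *   clock_gettime(CLOCK_MONOTONIC, &t0);
 *   ... acquire the input buffer ...
 *   clock_gettime(CLOCK_MONOTONIC, &t1);
 *   printf("%lld us\n", (long long)elapsed_us(&t0, &t1));
 */
static int64_t elapsed_us(const struct timespec *start,
                          const struct timespec *end)
{
    assert(start != NULL && end != NULL);
    int64_t ns = ((int64_t)end->tv_sec - (int64_t)start->tv_sec) * 1000000000LL
               + ((int64_t)end->tv_nsec - (int64_t)start->tv_nsec);
    return ns / 1000;
}
```

Working in nanoseconds first and dividing once keeps the result correct when the second reading crosses a whole-second boundary.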
The most important conclusion/question from the implementation above ought to be:
"Isn't it time for both V4L2 and GST to enable DMA buffer recognition and physical pointer passing?"
It is my firm belief that such a feature would result in great performance improvements for all kinds of video/audio streaming.