Hi @SiddavatamVishnu,
1. Is shmsink/shmsrc inherently copy‑heavy and memory‑bandwidth limited?
Yes. shmsink and shmsrc are fundamentally copy‑heavy because they serialize full raw video buffers through shared memory rather than passing DMA‑BUF file descriptors. Every frame must be fully copied into the shared memory region by the producer, and each consumer must then copy the frame out again into its own pipeline. At 1080p NV12 that means one ~3 MB copy in, plus another ~3 MB copy out per consumer, for every frame, which quickly saturates memory bandwidth on embedded platforms like the i.MX8. Because these elements cannot transfer zero‑copy buffer handles, their architecture inherently scales poorly when distributing high‑resolution raw frames.
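To put rough numbers on this, here is a back‑of‑the‑envelope sketch. The 30 fps rate and the three consumers are illustrative assumptions, not values from your setup:

```shell
# NV12 stores 12 bits per pixel (full-res Y plane + half-res interleaved UV).
FRAME_BYTES=$((1920 * 1080 * 3 / 2))   # bytes per 1080p NV12 frame
FPS=30
CONSUMERS=3

# One copy into shared memory (producer) plus one copy out per consumer.
COPIES_PER_FRAME=$((1 + CONSUMERS))
BANDWIDTH_MB_S=$((FRAME_BYTES * FPS * COPIES_PER_FRAME / 1000000))

echo "bytes per frame:  $FRAME_BYTES"      # 3110400, i.e. ~3 MB
echo "memcpy bandwidth: ~$BANDWIDTH_MB_S MB/s"
```

With those assumptions the shared‑memory region alone eats on the order of 373 MB/s of pure memcpy traffic, before any actual processing happens, and it grows linearly with every additional consumer.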
2. Is there a way to share DMABUF across processes without copying?
Sharing DMA‑BUF across processes is technically possible, but only through mechanisms that pass file descriptors directly between processes, such as UNIX domain sockets (SCM_RIGHTS ancillary messages). However, GStreamer elements must explicitly support importing and exporting DMA‑BUF FDs for this to work. Most glue elements, including shmsink, shmsrc, appsink, and appsrc, do not support FD passing, so they always fall back to memory‑based copies. To achieve true inter‑process zero‑copy DMA‑BUF sharing, you would need custom code or specialized elements that handle DMA‑BUF FD passing, plus a custom allocator and buffer‑lifetime management. This is feasible but significantly complex and not supported by standard GStreamer elements.
3. Would encoding once and sharing a compressed stream be the only scalable solution?
For multi‑process architectures on i.MX8, encoding once and distributing a compressed stream is indeed the only scalable and practical solution. Hardware video encoders such as v4l2h264enc impose minimal CPU cost and drastically reduce bandwidth requirements, making it cheap to hand the stream to multiple consumer processes. Each consumer can then either forward the encoded data directly or decode it once using the hardware decoder. This approach avoids the high memory footprint and repeated buffer copies inherent in raw‑frame fan‑out designs, making it the architecture most commonly used in commercial embedded products.
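A minimal sketch of that encode‑once pattern, using shmsink/shmsrc for the now‑small compressed stream. The device path, socket path, and display sink are illustrative assumptions and the exact encoder/decoder element names depend on your BSP:

```shell
# Producer: capture once, encode once in hardware, then fan the compressed
# H.264 stream out over shared memory (cheap now: encoded frames are small).
gst-launch-1.0 v4l2src device=/dev/video0 \
  ! 'video/x-raw,format=NV12,width=1920,height=1080,framerate=30/1' \
  ! v4l2h264enc \
  ! h264parse config-interval=-1 \
  ! shmsink socket-path=/tmp/h264.sock wait-for-connection=false

# Each consumer: read the compressed stream and decode once in hardware.
gst-launch-1.0 shmsrc socket-path=/tmp/h264.sock is-live=true \
  ! 'video/x-h264,stream-format=byte-stream,alignment=au' \
  ! h264parse \
  ! v4l2h264dec \
  ! waylandsink
```

Note that shmsrc outputs no caps of its own, so the explicit `video/x-h264` capsfilter on the consumer side is required for h264parse to lock on.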
4. Are there NXP/i.MX specific mechanisms (DMA‑BUF export, v4l2 memory sharing, imx plugins) better suited for this?
NXP’s i.MX8 platform provides solid mechanisms for zero‑copy GPU/VPU processing, such as V4L2 DMA‑BUF import/export, the G2D/GC7000 GPU, and zero‑copy‑aware GStreamer plugins, but these mechanisms only work inside a single process. Zero‑copy through DMA‑BUF works well between elements like v4l2src, v4l2convert, and v4l2h264enc, but does not extend across processes unless you manually implement FD passing. The NXP‑specific GStreamer plugins (e.g., imxv4l2videosrc, imxg2dvideoscale) likewise assume in‑process zero‑copy pipelines. So while these tools help within one pipeline, they do not solve the problem of multi‑process raw video distribution.
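For reference, this is what the in‑process zero‑copy path looks like with the upstream v4l2 elements. The device path is an illustrative assumption, and on an NXP BSP you may use the imx equivalents (e.g., imxvideoconvert_g2d) instead of v4l2convert:

```shell
# All inside one process: buffers travel between capture, m2m converter,
# and encoder as DMA-BUF handles, so pixel data is never memcpy'd by the CPU.
gst-launch-1.0 v4l2src device=/dev/video0 io-mode=dmabuf \
  ! 'video/x-raw,format=NV12,width=1920,height=1080' \
  ! v4l2convert output-io-mode=dmabuf-import \
  ! v4l2h264enc \
  ! fakesink
```

The moment you split any of these elements into a separate process, the DMA‑BUF handles stop flowing and you are back to copies.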
5. Is there a recommended design pattern for this use case on embedded systems?
Yes, the widely recommended pattern is to avoid sharing raw frames across processes and instead perform a single hardware encode in the producer, then distribute the compressed stream to as many consumers as needed. This minimizes memory bandwidth, avoids unnecessary raw buffer copies, and keeps processing efficiency aligned with the hardware capabilities of the i.MX8. If true raw access is required by more than one subsystem, then the system is typically designed to keep all raw‑processing components inside a single GStreamer pipeline using tee, while other processes interact with the system through control-plane IPC rather than consuming raw video. This pattern achieves optimal performance, scalability, and system isolation without overloading memory or CPU resources.
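A minimal sketch of that single‑pipeline tee pattern. The specific branches (encode‑and‑share, analytics appsink, local display) are illustrative assumptions; the point is that tee hands the same buffer to each branch by reference, with no raw‑frame copies:

```shell
# One process owns all raw-frame work; tee fans buffers out by reference.
# Each branch needs a queue so a slow branch cannot stall the others.
gst-launch-1.0 v4l2src device=/dev/video0 io-mode=dmabuf \
  ! 'video/x-raw,format=NV12,width=1920,height=1080' \
  ! tee name=t \
  t. ! queue ! v4l2h264enc ! h264parse \
     ! shmsink socket-path=/tmp/h264.sock wait-for-connection=false \
  t. ! queue ! appsink name=analytics emit-signals=true \
  t. ! queue ! waylandsink
```

Other processes then consume only the compressed shmsink output (or control‑plane IPC), never the raw frames.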