Hi @SiddavatamVishnu,
1. Is shmsink/shmsrc inherently copy‑heavy and memory‑bandwidth limited?
Yes. shmsink and shmsrc are fundamentally copy‑heavy because they serialize full raw video buffers through shared memory rather than passing DMA‑BUF file descriptors. Every frame must be fully copied into the shared memory region by the producer, and each consumer must then copy the frame out again into its own pipeline. At 1080p NV12 that means one ~3 MB copy in, plus another ~3 MB copy out per consumer, for every frame, which quickly saturates memory bandwidth on embedded platforms like the i.MX8. Because these elements cannot transfer zero‑copy buffer handles, their architecture inherently scales poorly when distributing high‑resolution raw frames.
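To put rough numbers on this, here is a back‑of‑the‑envelope sketch. The 30 fps rate and the three consumers are illustrative assumptions, not values from your setup:

```shell
# NV12 stores 12 bits per pixel (full-res Y plane + half-res interleaved UV).
FRAME_BYTES=$((1920 * 1080 * 3 / 2))   # bytes per 1080p NV12 frame
FPS=30
CONSUMERS=3

# One copy into shared memory (producer) plus one copy out per consumer.
COPIES_PER_FRAME=$((1 + CONSUMERS))
BANDWIDTH_MB_S=$((FRAME_BYTES * FPS * COPIES_PER_FRAME / 1000000))

echo "bytes per frame:  $FRAME_BYTES"      # 3110400, i.e. ~3 MB
echo "memcpy bandwidth: ~$BANDWIDTH_MB_S MB/s"
```

With those assumptions the shared‑memory region alone eats on the order of 373 MB/s of pure memcpy traffic, before any actual processing happens, and it grows linearly with every additional consumer.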
2. Is there a way to share DMABUF across processes without copying?
Sharing DMA‑BUF across processes is technically possible, but only through mechanisms that pass file descriptors directly between processes, such as UNIX domain sockets (SCM_RIGHTS ancillary messages). However, GStreamer elements must explicitly support importing and exporting DMA‑BUF FDs for this to work. Most glue elements, including shmsink, shmsrc, appsink, and appsrc, do not support FD passing, so they always fall back to memory‑based copies. To achieve true inter‑process zero‑copy DMA‑BUF sharing, you would need custom code or specialized elements that handle DMA‑BUF FD passing, plus a custom allocator and buffer‑lifetime management. This is feasible but significantly complex and not supported by standard GStreamer elements.
3. Would encoding once and sharing a compressed stream be the only scalable solution?
For multi‑process architectures on i.MX8, encoding once and distributing a compressed stream is indeed the only scalable and practical solution. Hardware video encoders such as v4l2h264enc impose minimal CPU cost and drastically reduce bandwidth requirements, making it cheap to hand the stream to multiple consumer processes. Each consumer can then either forward the encoded data directly or decode it once using the hardware decoder. This approach avoids the high memory footprint and repeated buffer copies inherent in raw‑frame fan‑out designs, making it the architecture most commonly used in commercial embedded products.
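A minimal sketch of that encode‑once pattern, using shmsink/shmsrc for the now‑small compressed stream. The device path, socket path, and display sink are illustrative assumptions and the exact encoder/decoder element names depend on your BSP:

```shell
# Producer: capture once, encode once in hardware, then fan the compressed
# H.264 stream out over shared memory (cheap now: encoded frames are small).
gst-launch-1.0 v4l2src device=/dev/video0 \
  ! 'video/x-raw,format=NV12,width=1920,height=1080,framerate=30/1' \
  ! v4l2h264enc \
  ! h264parse config-interval=-1 \
  ! shmsink socket-path=/tmp/h264.sock wait-for-connection=false

# Each consumer: read the compressed stream and decode once in hardware.
gst-launch-1.0 shmsrc socket-path=/tmp/h264.sock is-live=true \
  ! 'video/x-h264,stream-format=byte-stream,alignment=au' \
  ! h264parse \
  ! v4l2h264dec \
  ! waylandsink
```

Note that shmsrc outputs no caps of its own, so the explicit `video/x-h264` capsfilter on the consumer side is required for h264parse to lock on.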
4. Are there NXP/i.MX specific mechanisms (DMA‑BUF export, v4l2 memory sharing, imx plugins) better suited for this?
NXP’s i.MX8 platform provides solid mechanisms for zero‑copy GPU/VPU processing, such as V4L2 DMA‑BUF import/export, the G2D/GC7000 GPU, and zero‑copy‑aware GStreamer plugins, but these mechanisms only work inside a single process. Zero‑copy through DMA‑BUF works well between elements like v4l2src, v4l2convert, and v4l2h264enc, but does not extend across processes unless you manually implement FD passing. The NXP‑specific GStreamer plugins (e.g., imxv4l2videosrc, imxg2dvideoscale) likewise assume in‑process zero‑copy pipelines. So while these tools help within one pipeline, they do not solve the problem of multi‑process raw video distribution.
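For reference, this is what the in‑process zero‑copy path looks like with the upstream v4l2 elements. The device path is an illustrative assumption, and on an NXP BSP you may use the imx equivalents (e.g., imxvideoconvert_g2d) instead of v4l2convert:

```shell
# All inside one process: buffers travel between capture, m2m converter,
# and encoder as DMA-BUF handles, so pixel data is never memcpy'd by the CPU.
gst-launch-1.0 v4l2src device=/dev/video0 io-mode=dmabuf \
  ! 'video/x-raw,format=NV12,width=1920,height=1080' \
  ! v4l2convert output-io-mode=dmabuf-import \
  ! v4l2h264enc \
  ! fakesink
```

The moment you split any of these elements into a separate process, the DMA‑BUF handles stop flowing and you are back to copies.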
5. Is there a recommended design pattern for this use case on embedded systems?
Yes, the widely recommended pattern is to avoid sharing raw frames across processes and instead perform a single hardware encode in the producer, then distribute the compressed stream to as many consumers as needed. This minimizes memory bandwidth, avoids unnecessary raw buffer copies, and keeps processing efficiency aligned with the hardware capabilities of the i.MX8. If true raw access is required by more than one subsystem, then the system is typically designed to keep all raw‑processing components inside a single GStreamer pipeline using tee, while other processes interact with the system through control-plane IPC rather than consuming raw video. This pattern achieves optimal performance, scalability, and system isolation without overloading memory or CPU resources.
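A minimal sketch of that single‑pipeline tee pattern. The specific branches (encode‑and‑share, analytics appsink, local display) are illustrative assumptions; the point is that tee hands the same buffer to each branch by reference, with no raw‑frame copies:

```shell
# One process owns all raw-frame work; tee fans buffers out by reference.
# Each branch needs a queue so a slow branch cannot stall the others.
gst-launch-1.0 v4l2src device=/dev/video0 io-mode=dmabuf \
  ! 'video/x-raw,format=NV12,width=1920,height=1080' \
  ! tee name=t \
  t. ! queue ! v4l2h264enc ! h264parse \
     ! shmsink socket-path=/tmp/h264.sock wait-for-connection=false \
  t. ! queue ! appsink name=analytics emit-signals=true \
  t. ! queue ! waylandsink
```

Other processes then consume only the compressed shmsink output (or control‑plane IPC), never the raw frames.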