HARDWARE AND SOFTWARE DETAILS
i.MX8M Plus and Linux BSP LF6.12.34_2.1.0
Goal
I’m building a command-line pipeline (no GUI) where inference, overlay, and display all run in GStreamer and NNStreamer on an i.MX8MP.
Current experiment (command line)
I tried this pipeline:
GST_DEBUG=GStreamer:4,tensor_filter:6,tensor_transform:6,tensor_decoder:7 \
gst-launch-1.0 --no-position \
  v4l2src device=/dev/video4 num-buffers=200 ! \
  video/x-raw,width=1920,height=1080,format=NV12,framerate=30/1 ! \
  imxvideoconvert_g2d ! \
  video/x-raw,width=320,height=320,format=RGBA ! \
  videoconvert ! \
  video/x-raw,width=320,height=320,format=BGR ! \
  tensor_converter ! \
  tensor_transform mode=arithmetic option=typecast:int8,add:-128 ! \
  tensor_filter framework=tensorflow-lite model=${MODEL} custom=Delegate:External,ExtDelegateLib:${VX_LIB} ! \
  tensor_transform mode=arithmetic option=typecast:float32,add:128.0,mul:0.004982381127774715 ! \
  tensor_transform mode=transpose option=1:0:2 ! \
  tensor_decoder mode=bounding_boxes option1=yolov8 option2=${LABELS} option3=0 option4=1920:1080 option5=320:320 ! \
  cairooverlay name=overlay ! \
  videoconvert ! \
  autovideosink
Log file: LINK
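For reference, the exact dimension string NNStreamer negotiates for the model output can be dumped by truncating the pipeline right after tensor_filter and running with -v. A minimal sketch (videotestsrc standing in for the camera, same MODEL/VX_LIB variables as above):

gst-launch-1.0 -v \
  videotestsrc num-buffers=1 ! \
  video/x-raw,width=320,height=320,format=BGR ! \
  tensor_converter ! \
  tensor_transform mode=arithmetic option=typecast:int8,add:-128 ! \
  tensor_filter framework=tensorflow-lite model=${MODEL} custom=Delegate:External,ExtDelegateLib:${VX_LIB} ! \
  fakesink

The other/tensor(s) caps printed on the tensor_filter src pad show how the (1, 7, 2100) output is laid out in NNStreamer's dimension ordering, which is relevant to the problem described below.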
Problem
My YOLOv8 TFLite model outputs (1, 7, 2100), but NNStreamer’s YOLOv8 decoder on i.MX8MP expects 7 × 2100 × 1.
I received this explanation:
The YOLOv8 TFLite model outputs (1,7,2100), but NNStreamer’s YOLOv8 decoder for i.MX8MP expects 7×2100×1. This BSP version only supports transpose on 4D tensors, so the model output needs dequantization, reshape to (1,7,2100,1), then transpose.
input: int8 [1, 320, 320, 3]
output: int8 [1, 7, 2100]
output scale / zero point: 0.004982381127774715 / −128 (matching the dequantization tensor_transform in the pipeline)
output correctly contains 3 classes + 4 bbox values
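Based on that explanation and the I/O details above, my current (unverified) understanding is that no separate reshape element may be needed in the pipeline itself: if I read NNStreamer's dimension ordering correctly, the (1, 7, 2100) TFLite output is already padded to a four-dimensional 2100:7:1:1 string, so swapping the first two dimensions with a full four-index transpose option (the last index has to stay 3) should produce the 7:2100:1 layout the decoder expects. The two tensor_transform elements after tensor_filter would then become the following sketch; please correct me if the ordering is wrong:

  tensor_transform mode=arithmetic option=typecast:float32,add:128.0,mul:0.004982381127774715 ! \
  tensor_transform mode=transpose option=1:0:2:3 ! \

In other words, keep the dequantization as-is and only extend the transpose option from 1:0:2 to the full four indices, with the decoder line unchanged. If an explicit reshape really is required on this BSP, I'd appreciate knowing which element to use for it.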
Current (slow) approach
Right now the application flow is:
GStreamer → BGR → OpenCV
NPU inference
OpenCV post-processing
Back to RTSP pipeline
This forces multiple software videoconvert stages, and even under ideal conditions we reach only ~20 FPS, although the model alone can run at 60+ FPS.
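For context, the current flow corresponds roughly to two pipeline descriptions handed to OpenCV, along the lines of the sketch below (simplified; the real application and its RTSP tail differ, and vpuenc_h264 stands in for whatever encoder is actually used). Every format change here runs through software videoconvert on full 1080p frames:

  # capture side: frames pulled into OpenCV as BGR for NPU inference + post-processing
  v4l2src device=/dev/video4 ! video/x-raw,width=1920,height=1080,format=NV12,framerate=30/1 ! \
  videoconvert ! video/x-raw,format=BGR ! appsink

  # output side: annotated BGR frames pushed back, re-converted for the encoder
  appsrc ! video/x-raw,width=1920,height=1080,format=BGR,framerate=30/1 ! \
  videoconvert ! vpuenc_h264 ! h264parse ! rtph264pay ! udpsink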
Proposed new approach
I want to split the pipeline:
Path A — Inference
Convert NV12 → BGR only here
Run NNStreamer
Path B — Overlay + Display
Keep original NV12/YUY2 frames
Draw bounding boxes directly on NV12 (preferably using hardware)
→ Feed NV12 to encoder / RTSP
→ Avoid software videoconvert completely
I’d first like to prototype this using pure gst-launch, then apply the approach in Python (possibly using OpenGL for NV12 overlay).
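As a starting point for the gst-launch prototype, I'm thinking of something along these lines. Untested sketch: element and pad-property names follow the NXP BSP / NNStreamer examples, the transpose option is my guess from above, and waylandsink is only a stand-in for the eventual NV12 → encoder → RTSP tail:

gst-launch-1.0 --no-position \
  imxcompositor_g2d name=mix sink_0::zorder=1 sink_1::zorder=2 ! \
  waylandsink \
  v4l2src device=/dev/video4 ! \
  video/x-raw,width=1920,height=1080,format=NV12,framerate=30/1 ! \
  tee name=t \
  t. ! queue max-size-buffers=2 ! mix.sink_0 \
  t. ! queue max-size-buffers=2 leaky=downstream ! \
    imxvideoconvert_g2d ! video/x-raw,width=320,height=320,format=RGBA ! \
    videoconvert ! video/x-raw,width=320,height=320,format=BGR ! \
    tensor_converter ! \
    tensor_transform mode=arithmetic option=typecast:int8,add:-128 ! \
    tensor_filter framework=tensorflow-lite model=${MODEL} custom=Delegate:External,ExtDelegateLib:${VX_LIB} ! \
    tensor_transform mode=arithmetic option=typecast:float32,add:128.0,mul:0.004982381127774715 ! \
    tensor_transform mode=transpose option=1:0:2:3 ! \
    tensor_decoder mode=bounding_boxes option1=yolov8 option2=${LABELS} option3=0 option4=1920:1080 option5=320:320 ! \
    mix.sink_1

The intent is that the full-resolution NV12 frames never touch software videoconvert: only the 320×320 inference branch does one small RGBA→BGR conversion, and the decoder's transparent RGBA boxes are blended back over the NV12 branch with the G2D compositor. Whether the compositor output can then be handed to the VPU encoder as NV12 instead of a display sink is exactly the part I'd like confirmed.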
What I need help on
How to reshape/transpose (1,7,2100) TFLite output into the format required by NNStreamer’s YOLOv8 decoder on i.MX8MP
Best practice for overlay on NV12/YUY2
General advice: Is the split-pipeline (inference on BGR, overlay on NV12) a reasonable architectural direction on i.MX8MP?