IMX8M Plus NPU — Batched TFLite inference shows no per-image speedup

Hi NXP,

We're running a quantized (uint8) TFLite segmentation model on the IMX8M Plus NPU using the libvx_delegate.so external delegate. The model is a MobileNet-based UNet with input shape [batch, 256, 256, 3].

To test whether larger batch sizes improve throughput, we've created a separate .tflite file for each batch size (1, 2, 4, 8, 16, 32, 64), each with the batch dimension fixed at conversion time. We then measured both total inference time and per-image time across 10 trials.
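For reference, the measurement described above can be sketched roughly as follows. This is a minimal illustration, not the exact script used for the numbers in this post: `time_inference` and `make_tflite_invoke` are hypothetical helper names, the delegate path `/usr/lib/libvx_delegate.so` is an assumed install location, and the warm-up call is an assumption based on the first invocation through the delegate usually being much slower than the rest.

```python
import statistics
import time
from typing import Callable, Dict, Tuple


def time_inference(invoke: Callable[[], None], batch_size: int,
                   trials: int = 10, warmup: int = 1) -> Dict[str, float]:
    """Time `invoke` over several trials and derive the per-image time."""
    # Warm-up runs are excluded from the statistics (assumption: the first
    # call through the delegate includes one-off graph preparation).
    for _ in range(warmup):
        invoke()

    totals = []
    for _ in range(trials):
        start = time.perf_counter()
        invoke()
        totals.append(time.perf_counter() - start)

    mean_total = statistics.mean(totals)
    return {
        "batch_size": batch_size,
        "total_s": mean_total,
        "per_image_s": mean_total / batch_size,
    }


def make_tflite_invoke(model_path: str,
                       delegate_path: str = "/usr/lib/libvx_delegate.so"
                       ) -> Tuple[Callable[[], None], int]:
    """Build an invoke callable for one fixed-batch .tflite file.

    Imports are local so the timing helper above can be used on a machine
    without tflite_runtime installed.
    """
    import numpy as np
    import tflite_runtime.interpreter as tflite

    delegate = tflite.load_delegate(delegate_path)
    interpreter = tflite.Interpreter(model_path=model_path,
                                     experimental_delegates=[delegate])
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    shape = tuple(int(d) for d in inp["shape"])
    # Dummy input: timing does not depend on pixel values.
    interpreter.set_tensor(inp["index"], np.zeros(shape, dtype=inp["dtype"]))
    return interpreter.invoke, shape[0]
```

On the board, each batch size would be measured with something like `invoke, b = make_tflite_invoke("model_b8.tflite")` followed by `time_inference(invoke, b)`, where the file name is a placeholder.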

Expected behaviour: Per-image time should decrease as batch size increases, since the NPU should be able to amortize overhead and better pipeline execution.

Observed behaviour: Total inference time scales roughly linearly with batch size, giving a flat per-image time across all batch sizes. Batching provides no throughput benefit.
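One way to make this observation quantitative is to fit total time as `fixed + per_image * batch_size`. If the fitted fixed term is small compared with the per-image term, there is essentially no per-call overhead left to amortize, which is the same thing as a flat per-image time. The sketch below is a plain least-squares fit; `fit_overhead` is a hypothetical helper name and no measured values are included here.

```python
from typing import Sequence, Tuple


def fit_overhead(batch_sizes: Sequence[int],
                 total_s: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of total_s ~ fixed + per_image * batch_size.

    Returns (fixed, per_image), both in seconds.
    """
    n = len(batch_sizes)
    mean_b = sum(batch_sizes) / n
    mean_t = sum(total_s) / n
    var_b = sum((b - mean_b) ** 2 for b in batch_sizes)
    cov = sum((b - mean_b) * (t - mean_t)
              for b, t in zip(batch_sizes, total_s))
    per_image = cov / var_b               # slope: marginal cost of one image
    fixed = mean_t - per_image * mean_b   # intercept: cost paid once per call
    return fixed, per_image
```

The per-image time at batch size B is then `fixed / B + per_image`, so it only drops with B when `fixed` is a meaningful share of the total.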

Our questions:

1) Does the VX delegate / NPU on the IMX8M Plus actually support batched execution, or does it process images sequentially regardless of batch size?
2) Is there a recommended way to maximise throughput, e.g. multiple interpreter instances in separate threads, or a different batching API?
3) Are there any eIQ / TIM-VX settings that need to be enabled for true batch parallelism?

Any insight appreciated. Thanks.

Re: IMX8M Plus NPU — Batched TFLite inference shows no per-image speedup

Hi @gabriel_gosden,

Thank you for contacting NXP Support!

1) The i.MX8M Plus NPU does not support true batched execution. Even if a TFLite model is converted with a fixed batch dimension, the VX/TIM‑VX stack processes each image in the batch sequentially. This is why total inference time scales linearly with batch size and per‑image latency remains flat.

2) The recommended approach is to use a batch size of one and focus on pipelining and parallelism on the host side. Typical strategies include running preprocessing, NPU inference, and postprocessing in separate threads, and in some cases using multiple TFLite interpreter instances to overlap CPU work and reduce idle time. Any throughput gains come from better pipeline utilization, not from NPU batch parallelism.
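As an illustration of the host-side pipelining described in point 2, here is a minimal sketch using Python threads and bounded queues. It is not an official NXP sample: `run_pipeline`, `_stage` and the `depth` parameter are hypothetical names, and the three stage callables are stand-ins for your own preprocessing, a batch-size-one `interpreter.invoke()` wrapper, and postprocessing. It assumes the inference call blocks on the hardware without holding the Python GIL, so the CPU stages can run while the NPU is busy. Error handling is omitted.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List

_STOP = object()  # sentinel that tells each stage to shut down


def _stage(fn: Callable[[Any], Any],
           inbox: "queue.Queue", outbox: "queue.Queue") -> None:
    """Apply fn to every item from inbox and forward the result."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # propagate shutdown downstream
            return
        outbox.put(fn(item))


def run_pipeline(frames: Iterable[Any],
                 preprocess: Callable[[Any], Any],
                 infer: Callable[[Any], Any],
                 postprocess: Callable[[Any], Any],
                 depth: int = 2) -> List[Any]:
    """Run the three stages in separate threads, one frame at a time."""
    stages = [preprocess, infer, postprocess]
    # Bounded queues keep a slow stage from letting frames pile up in memory.
    queues = [queue.Queue(maxsize=depth) for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=_stage,
                         args=(fn, queues[i], queues[i + 1]), daemon=True)
        for i, fn in enumerate(stages)
    ]

    def feed() -> None:
        for frame in frames:
            queues[0].put(frame)
        queues[0].put(_STOP)

    threads.append(threading.Thread(target=feed, daemon=True))
    for t in threads:
        t.start()

    # Drain results on the calling thread; each stage is a single FIFO
    # worker, so output order matches input order.
    results = []
    while True:
        item = queues[-1].get()
        if item is _STOP:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

With this layout the NPU stage always sees a batch size of one, and any gain comes from preprocessing frame N+1 and postprocessing frame N-1 while frame N is being inferred.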

3) There are no eIQ, VX, or TIM‑VX settings that enable true batch parallelism on the i.MX8M Plus. This limitation is architectural rather than configurable. Your measurement results indicate correct and efficient NPU usage, and batching simply does not provide throughput benefits on this hardware.

Best Regards,

Chavira