i.MX 8M Plus NPU Warmup Time

The i.MX 8M Plus is a powerful quad-core Arm® Cortex®-A53 applications
processor running at up to 1.8 GHz with an integrated neural processing unit
(NPU) delivering up to 2.3 TOPS. As the first i.MX processor with a machine
learning accelerator, the i.MX 8M Plus processor delivers substantially higher
performance for ML inference at the edge.

When NPU and CPU performance are compared on the i.MX 8M Plus, inference can appear to take much longer on the NPU. This is because the ML accelerator spends more time on its initialization steps.

This initialization phase is known as warmup and is needed only once, at the start of the application. After it completes, inference runs at the accelerated speed expected of a dedicated NPU.

Warmup time usually affects only the first inference run. Depending on the ML model type, however, it may be noticeable for the first few runs. Run some preliminary tests to decide how many runs to count as warmup. Once the warmup phase is clearly delimited, the remaining runs can be treated as pure inference and averaged to obtain the inference time.
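The measurement described above can be sketched in Python as follows. This is a minimal illustration and not the method used in AN12964: the helper names (time_runs, split_warmup), the median-based threshold, and the factor of 1.5 are all assumptions made for this example, and run_inference stands for whatever call executes one inference in the application.

```python
import statistics
import time
from typing import Callable, List, Tuple


def time_runs(run_inference: Callable[[], None], runs: int) -> List[float]:
    """Return the wall-clock duration of each call, in milliseconds."""
    durations_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        durations_ms.append((time.perf_counter() - start) * 1000.0)
    return durations_ms


def split_warmup(durations_ms: List[float], factor: float = 1.5) -> Tuple[int, float]:
    """Return (number of warmup runs, mean steady-state time in ms).

    Assumption for this sketch: a leading run counts as warmup while it is
    slower than `factor` times the median of all runs. Tune `factor` after
    looking at the raw timings for the model under test.
    """
    baseline = statistics.median(durations_ms)
    warmup_runs = 0
    for duration in durations_ms:
        if duration <= factor * baseline:
            break
        warmup_runs += 1
    steady_state = durations_ms[warmup_runs:]
    return warmup_runs, statistics.mean(steady_state)


if __name__ == "__main__":
    # Hypothetical timings in ms, for illustration only (not measured data).
    # On a board they would come from time_runs(run_inference, runs=7).
    example_ms = [5000.0, 40.0, 3.0, 3.1, 2.9, 3.0, 3.0]
    warmup_runs, mean_ms = split_warmup(example_ms)
    print(f"warmup runs: {warmup_runs}, steady-state mean: {mean_ms:.2f} ms")
```

Using the median as the baseline keeps a single very slow first run from skewing the threshold, but it assumes that most of the measured runs are already past warmup, so the number of runs should be well above the expected warmup count.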

Currently, warmup time can be reduced for subsequent application runs by using a caching mechanism. Measure the impact of this feature on overall system performance to decide whether the application benefits from it.
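As a rough sketch, the caching mechanism could be switched on from a Python application as shown below. The text above does not name the mechanism, so this example assumes the graph-cache environment variables VIV_VX_ENABLE_CACHE_GRAPH_BINARY and VIV_VX_CACHE_BINARY_GRAPH_DIR described in NXP's i.MX Machine Learning User's Guide; verify both names against the BSP release in use. The helper enable_graph_cache and the cache path are hypothetical.

```python
import os
from pathlib import Path
from typing import Dict


def enable_graph_cache(cache_dir: str) -> Dict[str, str]:
    """Request NPU graph caching for this process and return the settings.

    Assumption: the variable names below are the ones listed in NXP's
    i.MX Machine Learning User's Guide. They must be set before the
    inference runtime is loaded for them to have any effect.
    """
    cache_path = Path(cache_dir).resolve()
    cache_path.mkdir(parents=True, exist_ok=True)
    settings = {
        "VIV_VX_ENABLE_CACHE_GRAPH_BINARY": "1",
        "VIV_VX_CACHE_BINARY_GRAPH_DIR": str(cache_path),
    }
    os.environ.update(settings)
    return settings


if __name__ == "__main__":
    # Hypothetical cache location; any writable directory would do.
    for name, value in enable_graph_cache("./npu_graph_cache").items():
        print(f"{name}={value}")
```

Under these assumptions the first application run still pays the full warmup cost while the cache is populated, so the benefit should be measured on the second and later runs.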

Work is in progress to reduce the initial warmup time by speeding up initialization without the caching mechanism. Check future releases for related updates.

A dedicated application note, AN12964, explains the impact of warmup time on overall performance. It covers the theoretical aspects and provides sample code as a starting point for evaluating performance when multiple models run sequentially.