Hi,
We built our i.MX8MP target image and the SDK with Yocto, using the Linux 5.10.52_2.1.0 release.
We used the eIQ ArmNN and ONNX Runtime inference engines to perform inference on one of our networks (fp32 data), exported in ONNX format.
For inference with ArmNN, we slightly modified the mnist_tf.cpp sample program to adapt it to our specific network. The result is functionally correct with the three available backends (CpuRef, CpuAcc and VsiNpu). Performance-wise, CpuRef is very slow, which is expected. What is surprising is that the VsiNpu backend, which executes on the GPU/NPU, is 13 times slower than the CpuAcc backend, which executes on the Arm CPU with Neon. When running mnist_tf.cpp with its original network (simple_mnist_tf.prototxt), the VsiNpu backend is also slower.
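For reference, the only difference between the backend measurements is the BackendId list passed to armnn::Optimize(). Below is a minimal sketch of the measurement; the model path, tensor names/shapes and the use of the ArmNN ONNX parser are illustrative placeholders, not our exact code:

```cpp
#include <armnn/ArmNN.hpp>
#include <armnnOnnxParser/IOnnxParser.hpp>
#include <chrono>
#include <iostream>
#include <vector>

int main()
{
    // Parse the ONNX model (path and tensor names are placeholders)
    armnnOnnxParser::IOnnxParserPtr parser = armnnOnnxParser::IOnnxParser::Create();
    armnn::INetworkPtr network = parser->CreateNetworkFromBinaryFile("our_network.onnx");
    auto inputBinding  = parser->GetNetworkInputBindingInfo("input");
    auto outputBinding = parser->GetNetworkOutputBindingInfo("output");

    armnn::IRuntime::CreationOptions options;
    armnn::IRuntimePtr runtime = armnn::IRuntime::Create(options);

    // The only change between test runs: "CpuRef", "CpuAcc" or "VsiNpu"
    std::vector<armnn::BackendId> backends = { "VsiNpu" };
    armnn::IOptimizedNetworkPtr optNet =
        armnn::Optimize(*network, backends, runtime->GetDeviceSpec());

    armnn::NetworkId netId;
    runtime->LoadNetwork(netId, std::move(optNet));

    std::vector<float> inputData(inputBinding.second.GetNumElements(), 0.0f);
    std::vector<float> outputData(outputBinding.second.GetNumElements());

    armnn::InputTensors inputTensors{
        { inputBinding.first, armnn::ConstTensor(inputBinding.second, inputData.data()) } };
    armnn::OutputTensors outputTensors{
        { outputBinding.first, armnn::Tensor(outputBinding.second, outputData.data()) } };

    // First run = warm-up (graph preparation, NPU setup, ...)
    runtime->EnqueueWorkload(netId, inputTensors, outputTensors);

    // Second run is the one we measure and compare between backends
    auto t0 = std::chrono::steady_clock::now();
    runtime->EnqueueWorkload(netId, inputTensors, outputTensors);
    auto t1 = std::chrono::steady_clock::now();
    std::cout << "Inference time: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
              << " ms" << std::endl;
    return 0;
}
```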
We also ran the prebuilt ArmNN ONNX tests as described in §5.3.4 of the i.MX Machine Learning User's Guide (https://www.nxp.com/docs/en/user-guide/IMX-MACHINE-LEARNING-UG.pdf). The OnnxMobileNet-Armnn test is also more than 3 times slower with the VsiNpu backend than with the CpuAcc backend.
We did the same thing with the ONNX Runtime inference engine: we ran the provided sample code (C_Api_Sample.cpp) either on its original network (SqueezeNet in that case) or adapted to run our network, and observed the same behaviour: in all cases the backend targeting the GPU/NPU is much slower than the backend targeting the CPU with Neon.
In all cases, the inference is performed twice and only the second run is measured, so that warm-up time is excluded.
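For ONNX Runtime, the test has the same structure: one warm-up run, then one measured run. The sketch below uses the C++ API for brevity; the model path, tensor names/shape and especially the execution-provider append calls are placeholders (the exact provider functions depend on the eIQ ONNX Runtime build and are not verified here):

```cpp
#include <onnxruntime_cxx_api.h>
#include <chrono>
#include <iostream>
#include <vector>

int main()
{
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "ep-benchmark");
    Ort::SessionOptions session_options;

    // With default options the inference runs on the CPU execution provider.
    // To target the GPU/NPU, the corresponding provider is appended here; the exact
    // call depends on the eIQ ONNX Runtime build (placeholders, not verified):
    //   OrtSessionOptionsAppendExecutionProvider_VsiNpu(session_options, 0);
    //   OrtSessionOptionsAppendExecutionProvider_Nnapi(session_options, 0);

    Ort::Session session(env, "our_network.onnx", session_options);  // placeholder path

    // Placeholder tensor names and shape, adapted to our network in the real test
    const char* input_names[]  = { "input" };
    const char* output_names[] = { "output" };
    std::vector<int64_t> shape = { 1, 3, 224, 224 };
    std::vector<float> input_data(1 * 3 * 224 * 224, 0.0f);

    Ort::MemoryInfo mem_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    Ort::Value input_tensor = Ort::Value::CreateTensor<float>(
        mem_info, input_data.data(), input_data.size(), shape.data(), shape.size());

    // First run is the warm-up (graph partitioning / provider compilation happens here)
    session.Run(Ort::RunOptions{nullptr}, input_names, &input_tensor, 1, output_names, 1);

    // Second run is the one we measure and compare between execution providers
    auto t0 = std::chrono::steady_clock::now();
    auto outputs = session.Run(Ort::RunOptions{nullptr},
                               input_names, &input_tensor, 1, output_names, 1);
    auto t1 = std::chrono::steady_clock::now();

    std::cout << "Inference time: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
              << " ms" << std::endl;
    return 0;
}
```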
Tests with our model converted to fp16 data, performed with the ONNX Runtime inference engine (ArmNN does not seem to support fp16), show the same results, except that the NNAPI backend, which also targets the GPU/NPU, performs about the same as the CPU backends.
So, at best the GPU/NPU backends perform like the CPU backends, and in the worst cases they are several times slower.
We would like to know what could be the reason for this behaviour.
Regards
The NPU does not support floating-point tensor input/output. It supports 8/16-bit integer tensor data formats and an 8/16/32-bit integer operations pipeline.
So the NPU is not efficient on floating-point calculations, but it performs well on quantized (8/16-bit integer) models; quantizing the model is the way to benefit from NPU acceleration.