eIQ inference performance issue with GPU

mbrundler
Contributor II

Hi,

We built our i.MX8MP target image and the SDK with Yocto, using Linux version 5.10.52_2.1.0.

We used the eIQ ArmNN and ONNX Runtime inference engines to run inference on one of our networks (fp32 data), exported in ONNX format.

For inference with ArmNN, we slightly modified the mnist_tf.cpp sample program to adapt it to our specific network. The result is functionally correct with all three available backends (CpuRef, CpuAcc and VsiNpu). Performance-wise, CpuRef is terribly slow, which is expected. What is surprising is that the VsiNpu backend, which executes on the GPU/NPU, is 13 times slower than the CpuAcc backend, which executes on the Arm CPU with Neon. When running mnist_tf.cpp with its original network (simple_mnist_tf.prototxt), the VsiNpu backend is also slower.

We also ran the ready-made ArmNN "Onnx tests" described in §5.3.4 of the i.MX Machine Learning User's Guide (https://www.nxp.com/docs/en/user-guide/IMX-MACHINE-LEARNING-UG.pdf). The OnnxMobileNet-Armnn test is also more than 3 times slower with the VsiNpu backend than with the CpuAcc backend.

We did the same with the ONNX Runtime inference engine, running the provided sample code (C_Api_Sample.cpp) both on its original network (squeezenet in that case) and adapted to our network, and observed the same behaviour: in all cases the backend targeting the GPU/NPU is much slower than the backend targeting the CPU with Neon.

In all cases, inference is performed twice and only the second run is measured, to exclude warmup time.

Tests with our model converted to fp16, performed with the ONNX Runtime inference engine (ArmNN does not seem to support fp16), show the same results, except that the Nnapi backend, which also targets the GPU/NPU, matches the performance of the CPU backends.

So we observe that the GPU/NPU backends at best match the CPU backends, and are several times slower in the worst cases.

We would like to know what could be the reason for this behaviour.

Regards
2 Replies

Qmiller
NXP Employee

The NPU does not support floating-point tensor input/output. It supports 8- and 16-bit integer tensor data formats, with an 8-, 16- and 32-bit integer operations pipeline.


mbrundler
Contributor II

So the NPU is not efficient at floating-point computation, but

  • would the GPU perform better?
  • if so, is there a way to request that the computation be scheduled on the GPU?