eIQ inference performance issue with GPU

mbrundler
Contributor II

Hi,

We built our i.MX8MP target image and the SDK with Yocto, using the Linux 5.10.52_2.1.0 release.

We used the eIQ Arm NN and ONNX Runtime inference engines to perform inference on one of our networks (fp32 data), exported in ONNX format.

For inference with Arm NN, we slightly modified the mnist_tf.cpp sample program to adapt it to our specific network. The result is functionally correct with the three available backends (CpuRef, CpuAcc and VsiNpu). Performance-wise, CpuRef is terribly slow, which is expected. What is surprising is that the VsiNpu backend, which executes on the GPU/NPU, is 13 times slower than the CpuAcc backend, which executes on the Arm CPU with Neon. When using the mnist_tf.cpp program with its original network (simple_mnist_tf.prototxt), the VsiNpu backend is also slower.
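For reference, backend selection in our modified mnist_tf.cpp follows the usual Arm NN pattern of passing a backend preference list to armnn::Optimize(). The sketch below is a minimal reconstruction rather than our exact code: the model path is a placeholder, and "VsiNpu" is the backend identifier string we pass to select the GPU/NPU backend.

```cpp
#include <armnn/ArmNN.hpp>
#include <armnnOnnxParser/IOnnxParser.hpp>

#include <utility>
#include <vector>

int main()
{
    // Parse the ONNX model (path is a placeholder).
    armnnOnnxParser::IOnnxParserPtr parser = armnnOnnxParser::IOnnxParser::Create();
    armnn::INetworkPtr network = parser->CreateNetworkFromBinaryFile("our_network.onnx");

    // Create the Arm NN runtime.
    armnn::IRuntime::CreationOptions options;
    armnn::IRuntimePtr runtime = armnn::IRuntime::Create(options);

    // Backend preference list: the first backend that supports a layer is used.
    // We benchmark {CpuRef}, {CpuAcc} and {"VsiNpu"} (the GPU/NPU backend)
    // separately, one list per run.
    std::vector<armnn::BackendId> backends = { armnn::BackendId("VsiNpu") };

    armnn::IOptimizedNetworkPtr optNet =
        armnn::Optimize(*network, backends, runtime->GetDeviceSpec());

    // Load the optimised network; inference then goes through EnqueueWorkload()
    // with input/output tensors bound as in mnist_tf.cpp.
    armnn::NetworkId networkId;
    runtime->LoadNetwork(networkId, std::move(optNet));
    return 0;
}
```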

We also executed the ready-made Arm NN "Onnx test" described in the i.MX Machine Learning User's Guide (https://www.nxp.com/docs/en/user-guide/IMX-MACHINE-LEARNING-UG.pdf), §5.3.4. The OnnxMobileNet-Armnn test is also more than three times slower with the VsiNpu backend than with the CpuAcc backend.

We did the same thing with the ONNX Runtime inference engine. We ran the provided sample code (C_Api_Sample.cpp) either on its original network (SqueezeNet in that case) or adapted to run our network, and observed the same behaviour: in all cases, the backend targeting the GPU/NPU is much slower than the backend targeting the CPU with Neon.

In all cases, the inference is performed twice and only the second run is measured, to account for the warm-up time. The measurement pattern is sketched below.
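This is a minimal sketch of that pattern, written against the ONNX Runtime C++ API rather than the C API used by C_Api_Sample.cpp; the model path, tensor names and shape are placeholders for our network, and the append call for the VSI NPU execution provider is only hinted at in a comment because its exact name depends on the eIQ onnxruntime build.

```cpp
#include <onnxruntime_cxx_api.h>

#include <chrono>
#include <iostream>
#include <vector>

int main()
{
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "eiq-perf");
    Ort::SessionOptions so;
    // The default session runs on the CPU execution provider. The eIQ build adds
    // extra providers (ACL, Arm NN, VSI NPU, NNAPI); the exact append call for the
    // VSI NPU provider is build-specific, so it is only sketched here:
    // OrtSessionOptionsAppendExecutionProvider_VsiNpu(so, 0);  // assumption

    // Model path, tensor names and shape below are placeholders for our network.
    Ort::Session session(env, "our_network.onnx", so);

    std::vector<int64_t> shape{1, 3, 224, 224};
    std::vector<float> data(1 * 3 * 224 * 224, 0.0f);
    Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    Ort::Value input = Ort::Value::CreateTensor<float>(
        mem, data.data(), data.size(), shape.data(), shape.size());

    const char* input_names[]  = {"input"};
    const char* output_names[] = {"output"};

    // First run: warm-up (graph compilation / weight upload on the accelerator).
    session.Run(Ort::RunOptions{nullptr}, input_names, &input, 1, output_names, 1);

    // Second run: the one we actually measure.
    auto t0 = std::chrono::steady_clock::now();
    session.Run(Ort::RunOptions{nullptr}, input_names, &input, 1, output_names, 1);
    auto t1 = std::chrono::steady_clock::now();

    std::cout << "inference time: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
              << " ms" << std::endl;
    return 0;
}
```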

Tests with our model converted to fp16 data, performed with the ONNX Runtime inference engine (Arm NN does not seem to support fp16), show the same results, except that the NNAPI backend, which also targets the GPU/NPU, performs on par with the CPU backends.
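For the fp16/NNAPI test, enabling the NNAPI execution provider looks roughly like the sketch below; it assumes the nnapi_provider_factory.h header is shipped by the eIQ onnxruntime build, and the flags argument may not exist in older onnxruntime versions.

```cpp
#include <onnxruntime_cxx_api.h>
// Provider factory header for the NNAPI execution provider; whether it is
// installed by a given eIQ onnxruntime build is an assumption.
#include <nnapi_provider_factory.h>

void EnableNnapi(Ort::SessionOptions& so)
{
    // Append the NNAPI execution provider before creating the session.
    // Recent onnxruntime versions take a flags argument (0 = defaults);
    // older versions take only the session options pointer.
    Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_Nnapi(so, 0));
}
```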

So we observe that, at best, the GPU/NPU backends perform like the CPU backends, and in the worst cases they are several times slower.

We would like to know what could be the reason for this behaviour.

Regards

 

 

2 Replies

Zhiming_Liu
NXP TechSupport

The NPU tensors cannot support float input/output. The NPU supports 8/16-bit integer tensor data formats and an 8-, 16- and 32-bit integer operations pipeline.

mbrundler
Contributor II

So, the NPU is not efficient at floating-point calculations, but

  • would the GPU perform better?
  • if so, is there a way to request that calculations be scheduled on the GPU?