iMX8M Plus: onnxruntime_perf_test is slower on NPU than CPU

Solved

makotosato
Contributor II
 

Hi all,

I have the 8MPLUSLPD4-EVK Evaluation Kit and I am running onnxruntime_perf_test as described in the "i.MX Machine Learning User's Guide, Rev. LF5.10.72_2.2.0, 17", but onnxruntime_perf_test is slower on the NPU than on the CPU.

The i.MX Yocto Project image (hardknott-5.10.72-2.2.0) is running on the EVK.

Running on NPU

/usr/bin/onnxruntime-1.8.2/onnxruntime_perf_test /usr/bin/onnxruntime-1.8.2/squeezenet/model.onnx -r 1 -e vsi_npu 

Session creation time cost: 0.126173 s
Total time cost (including warm-up): 1.1651 s
Total inference requests: 2
Warm-up inference time cost: 744.977 ms
Average inference time cost (excluding warm-up): 420.121 ms
Total inference run time: 0.420148 s
Avg CPU usage: 0 %
Peak working set size: 81121280 bytes

Running on CPU

/usr/bin/onnxruntime-1.8.2/onnxruntime_perf_test /usr/bin/onnxruntime-1.8.2/squeezenet/model.onnx -r 1 -e cpu 

Session creation time cost: 0.0570905 s
Total time cost (including warm-up): 0.11501 s
Total inference requests: 2
Warm-up inference time cost: 58.0624 ms
Average inference time cost (excluding warm-up): 56.9481 ms
Total inference run time: 0.0569692 s
Avg CPU usage: 91 %
Peak working set size: 46661632 bytes

Is this the expected behavior?

I have attached the log from running onnxruntime_perf_test with the -v option.
1 Solution
HiramRTR
NXP Employee

You are observing this behavior because the model you are running on the NPU is an FP32 model. You can verify this by loading the ONNX model in Netron. The NPU is designed for accelerated inference on INT8, so what you see is expected behavior. You need to quantize the FP32 model and then deploy it on the NPU as the example suggests; then you will see improved performance.
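
If you prefer to check this from the command line instead of Netron, a rough Python sketch along these lines can be used (assuming the onnx package is installed on the host; the check itself is just one way to spot a quantized graph, not an official tool, and it reuses the model path from the commands above):

import onnx

# Load the same SqueezeNet model that was passed to onnxruntime_perf_test.
model = onnx.load("/usr/bin/onnxruntime-1.8.2/squeezenet/model.onnx")

# A quantized model contains ops such as QuantizeLinear/QLinearConv, while the
# stock FP32 SqueezeNet graph only contains float ops like Conv and Relu.
op_types = {node.op_type for node in model.graph.node}
quant_ops = {"QuantizeLinear", "DequantizeLinear", "QLinearConv", "ConvInteger"}
print("quantized ops present:", bool(op_types & quant_ops))

# Also print the element type of each graph input
# (in the TensorProto enum: FLOAT (FP32) = 1, UINT8 = 2, INT8 = 3).
for inp in model.graph.input:
    print(inp.name, onnx.TensorProto.DataType.Name(inp.type.tensor_type.elem_type))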

I would suggest you take a look at the following example from the ONNX Runtime GitHub repo:

https://github.com/microsoft/onnxruntime-inference-examples/blob/main/quantization/notebooks/imagene...

There they show how to go from a PyTorch MobileNetV2 FP32 model to a quantized ONNX model. You can then take the output model and run it on the i.MX 8M Plus NPU.
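
As a rough illustration of the flow shown in that notebook, the sketch below uses the onnxruntime.quantization API for static (calibration-based) quantization. The file names, the input name "data", and the random calibration tensors are placeholders of mine; in practice you would feed real preprocessed images, as the notebook does:

import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader, QuantType, quantize_static)

class RandomCalibrationReader(CalibrationDataReader):
    """Feeds a few sample inputs so ONNX Runtime can estimate quantization
    ranges. Replace the random tensors with real preprocessed images."""
    def __init__(self, input_name="data", count=8):
        self._samples = iter(
            [{input_name: np.random.rand(1, 3, 224, 224).astype(np.float32)}
             for _ in range(count)])

    def get_next(self):
        # Return one input dict per call, then None when the data is exhausted.
        return next(self._samples, None)

quantize_static(
    model_input="model.onnx",          # the FP32 model exported from PyTorch
    model_output="model_quant.onnx",   # hypothetical output file name
    calibration_data_reader=RandomCalibrationReader(),
    activation_type=QuantType.QUInt8,  # 8-bit activations
    weight_type=QuantType.QInt8,       # 8-bit weights
)

You can then copy the quantized model to the board and rerun onnxruntime_perf_test with -e vsi_npu against it to compare timings.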

 

I hope this helps!


2 Replies

makotosato
Contributor II

Thank you very much.
I will try that example.

Also, I think you should update the Machine Learning User's Guide.

Best regards.
