iMX8M Plus: onnxruntime_perf_test is slower on NPU than CPU

makotosato
Contributor II
 

Hi all,

I have the 8MPLUSLPD4-EVK Evaluation Kit and I am running onnxruntime_perf_test as described in the "i.MX Machine Learning User's Guide" (Rev. LF5.10.72_2.2.0, 17).
However, onnxruntime_perf_test is slower on the NPU than on the CPU.

The i.MX Yocto Project image (hardknott-5.10.72-2.2.0) is running on the EVK.

Running on NPU

/usr/bin/onnxruntime-1.8.2/onnxruntime_perf_test /usr/bin/onnxruntime-1.8.2/squeezenet/model.onnx -r 1 -e vsi_npu 

Session creation time cost: 0.126173 s
Total time cost (including warm-up): 1.1651 s
Total inference requests: 2
Warm-up inference time cost: 744.977 ms
Average inference time cost (excluding warm-up): 420.121 ms
Total inference run time: 0.420148 s
Avg CPU usage: 0 %
Peak working set size: 81121280 bytes

Running on CPU

/usr/bin/onnxruntime-1.8.2/onnxruntime_perf_test /usr/bin/onnxruntime-1.8.2/squeezenet/model.onnx -r 1 -e cpu 

Session creation time cost: 0.0570905 s
Total time cost (including warm-up): 0.11501 s
Total inference requests: 2
Warm-up inference time cost: 58.0624 ms
Average inference time cost (excluding warm-up): 56.9481 ms
Total inference run time: 0.0569692 s
Avg CPU usage: 91 %
Peak working set size: 46661632 bytes

Is this the expected behavior?

I have attached the log from running onnxruntime_perf_test with the -v option.
Accepted Solution
HiramRTR
NXP Employee

The reason you are observing this behavior is that the model you are running on the NPU is an FP32 model. You can verify this by loading the ONNX model in Netron. The NPU is designed for accelerated INT8 inference, so what you see is actually expected behavior. What you need to do is quantize the FP32 model and then deploy it on the NPU as the example suggests; you will then see improved performance.
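As a supplement to the Netron check, here is a minimal sketch (not part of the original reply) that inspects the model with the onnx Python package; the model path is the one from the benchmark above, and the comments describe typical conventions of quantized models rather than guarantees:

# inspect_model.py -- rough check of whether an ONNX model is FP32 or quantized
from collections import Counter

import onnx

model = onnx.load("/usr/bin/onnxruntime-1.8.2/squeezenet/model.onnx")

# Weight (initializer) element types: all FLOAT suggests an FP32 model,
# while INT8/UINT8 weights indicate a quantized model.
print(Counter(onnx.TensorProto.DataType.Name(init.data_type)
              for init in model.graph.initializer))

# Quantized models also contain QuantizeLinear/DequantizeLinear or QLinear* nodes.
print(Counter(node.op_type for node in model.graph.node))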

I would suggest you take a look at the following example from the ONNX Runtime GitHub repo:

https://github.com/microsoft/onnxruntime-inference-examples/blob/main/quantization/notebooks/imagene...

It shows how to go from a PyTorch MobileNetV2 FP32 model to a quantized ONNX model. You can then take the output model and run it on the i.MX 8M Plus NPU.
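For orientation, a minimal sketch of post-training static quantization with the onnxruntime.quantization API is shown below (this sketch is not from the original reply: the file names, the input name "data", the input shape, and the random calibration tensors are placeholders, so substitute real preprocessed images and the input name reported by Netron):

# quantize_model.py -- post-training static quantization sketch
import numpy as np
from onnxruntime.quantization import (CalibrationDataReader, QuantType,
                                      quantize_static)

class RandomCalibrationReader(CalibrationDataReader):
    """Placeholder calibration reader; replace the random tensors with
    real preprocessed images so the calibration ranges are meaningful."""
    def __init__(self, input_name="data", num_samples=10):
        self._samples = iter(
            {input_name: np.random.rand(1, 3, 224, 224).astype(np.float32)}
            for _ in range(num_samples)
        )

    def get_next(self):
        # Return one feed dict per call, then None when calibration data is exhausted.
        return next(self._samples, None)

quantize_static(
    "model.onnx",               # FP32 input model
    "model_int8.onnx",          # quantized output model
    RandomCalibrationReader(),  # calibration data reader
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
)

The resulting model_int8.onnx can then be benchmarked on the NPU with the same onnxruntime_perf_test command as in the original post, using -e vsi_npu.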

 

I hope this helps!

makotosato
Contributor II

Thank you very much.
I will try that example.

Also, I think you should update the i.MX Machine Learning User's Guide.

Best regards.
