/usr/bin/onnxruntime-1.8.2/onnxruntime_perf_test /usr/bin/onnxruntime-1.8.2/squeezenet/model.onnx -r 1 -e vsi_npu
Session creation time cost: 0.126173 s
Total time cost (including warm-up): 1.1651 s
Total inference requests: 2
Warm-up inference time cost: 744.977 ms
Average inference time cost (excluding warm-up): 420.121 ms
Total inference run time: 0.420148 s
Avg CPU usage: 0 %
Peak working set size: 81121280 bytes
/usr/bin/onnxruntime-1.8.2/onnxruntime_perf_test /usr/bin/onnxruntime-1.8.2/squeezenet/model.onnx -r 1 -e cpu
Session creation time cost: 0.0570905 s
Total time cost (including warm-up): 0.11501 s
Total inference requests: 2
Warm-up inference time cost: 58.0624 ms
Average inference time cost (excluding warm-up): 56.9481 ms
Total inference run time: 0.0569692 s
Avg CPU usage: 91 %
Peak working set size: 46661632 bytes
You are observing this behavior because the model you are running on the NPU is an FP32 model; you can verify this by loading the ONNX model in Netron. The NPU is designed for accelerated inference on INT8, so what you see is expected. You need to quantize the FP32 model and then deploy it on the NPU as the example suggests; then you will see improved performance.
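If you prefer a scriptable check over Netron, a short sketch with the onnx Python package can print the model's input tensor types (the model path below is simply the one from the perf_test command above; adjust as needed):

import onnx

# Load the model used in the benchmark and print each graph input's element type.
model = onnx.load("/usr/bin/onnxruntime-1.8.2/squeezenet/model.onnx")
for inp in model.graph.input:
    elem_type = inp.type.tensor_type.elem_type
    type_name = onnx.TensorProto.DataType.Name(elem_type)  # e.g. "FLOAT" means FP32
    print(inp.name, type_name)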
I would suggest you take a look at the quantization example in the ONNX Runtime GitHub repo: it shows how to go from a PyTorch MobileNetV2 FP32 model to a quantized ONNX model. You can then take the output model and run it on the i.MX 8 NPU.
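For SqueezeNet specifically, a minimal static-quantization sketch with the ONNX Runtime Python API could look like the following. Note the assumptions, which are not from the original post: the random calibration data, the 1x3x224x224 input shape, and the output file name are placeholders, and you should use real preprocessed images and the shape your model actually expects.

import numpy as np
import onnxruntime
from onnxruntime.quantization import CalibrationDataReader, quantize_static


class SqueezeNetDataReader(CalibrationDataReader):
    """Feeds a small set of preprocessed samples to the quantizer for calibration."""

    def __init__(self, model_path, num_samples=32):
        session = onnxruntime.InferenceSession(model_path)
        self.input_name = session.get_inputs()[0].name
        # Placeholder data; replace with real images run through your preprocessing.
        self.samples = (
            np.random.rand(1, 3, 224, 224).astype(np.float32)
            for _ in range(num_samples)
        )

    def get_next(self):
        sample = next(self.samples, None)
        return None if sample is None else {self.input_name: sample}


fp32_model = "/usr/bin/onnxruntime-1.8.2/squeezenet/model.onnx"
int8_model = "/usr/bin/onnxruntime-1.8.2/squeezenet/model_int8.onnx"

# Default quantization settings; you may need to tune them for the vsi_npu
# execution provider (for example per-channel weights or uint8 activations).
quantize_static(fp32_model, int8_model, SqueezeNetDataReader(fp32_model))

After that you can re-run onnxruntime_perf_test on the quantized model with -e vsi_npu and compare against the FP32 numbers above.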
I hope this helps!
Thank you very much.
I will try that example.
Also, I think you should update the Machine Learning User's Guide.
Best regards.