/usr/bin/onnxruntime-1.8.2/onnxruntime_perf_test /usr/bin/onnxruntime-1.8.2/squeezenet/model.onnx -r 1 -e vsi_npu
Session creation time cost: 0.126173 s
Total time cost (including warm-up): 1.1651 s
Total inference requests: 2
Warm-up inference time cost: 744.977 ms
Average inference time cost (excluding warm-up): 420.121 ms
Total inference run time: 0.420148 s
Avg CPU usage: 0 %
Peak working set size: 81121280 bytes
/usr/bin/onnxruntime-1.8.2/onnxruntime_perf_test /usr/bin/onnxruntime-1.8.2/squeezenet/model.onnx -r 1 -e cpu
Session creation time cost: 0.0570905 s
Total time cost (including warm-up): 0.11501 s
Total inference requests: 2
Warm-up inference time cost: 58.0624 ms
Average inference time cost (excluding warm-up): 56.9481 ms
Total inference run time: 0.0569692 s
Avg CPU usage: 91 %
Peak working set size: 46661632 bytes
You are observing this behavior because the model you are running on the NPU is an FP32 model; you can verify this by loading the ONNX model in Netron. The NPU is designed for accelerated inference on INT8, so what you see is expected. You need to quantize the FP32 model and then deploy it on the NPU as the example suggests; then you will see improved performance.
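If you prefer a scriptable check over Netron, a short sketch with the onnx Python package can print the model's input tensor types (the model path below is simply the one from the perf_test command above; adjust as needed):

import onnx

# Load the model used in the benchmark and print each graph input's element type.
model = onnx.load("/usr/bin/onnxruntime-1.8.2/squeezenet/model.onnx")
for inp in model.graph.input:
    elem_type = inp.type.tensor_type.elem_type
    type_name = onnx.TensorProto.DataType.Name(elem_type)  # e.g. "FLOAT" means FP32
    print(inp.name, type_name)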
I would suggest you take a look at the quantization example in the ONNX Runtime GitHub repo: it shows how to go from a PyTorch MobileNetV2 FP32 model to a quantized ONNX model. You can then take the output model and run it on the i.MX 8 NPU.
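For SqueezeNet specifically, a minimal static-quantization sketch with the ONNX Runtime Python API could look like the following. Note the assumptions, which are not from the original post: the random calibration data, the 1x3x224x224 input shape, and the output file name are placeholders, and you should use real preprocessed images and the shape your model actually expects.

import numpy as np
import onnxruntime
from onnxruntime.quantization import CalibrationDataReader, quantize_static


class SqueezeNetDataReader(CalibrationDataReader):
    """Feeds a small set of preprocessed samples to the quantizer for calibration."""

    def __init__(self, model_path, num_samples=32):
        session = onnxruntime.InferenceSession(model_path)
        self.input_name = session.get_inputs()[0].name
        # Placeholder data; replace with real images run through your preprocessing.
        self.samples = (
            np.random.rand(1, 3, 224, 224).astype(np.float32)
            for _ in range(num_samples)
        )

    def get_next(self):
        sample = next(self.samples, None)
        return None if sample is None else {self.input_name: sample}


fp32_model = "/usr/bin/onnxruntime-1.8.2/squeezenet/model.onnx"
int8_model = "/usr/bin/onnxruntime-1.8.2/squeezenet/model_int8.onnx"

# Default quantization settings; you may need to tune them for the vsi_npu
# execution provider (for example per-channel weights or uint8 activations).
quantize_static(fp32_model, int8_model, SqueezeNetDataReader(fp32_model))

After that you can re-run onnxruntime_perf_test on the quantized model with -e vsi_npu and compare against the FP32 numbers above.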
I hope this helps!
Thank you very much.
I will try that example.
Also, I think you should update the Machine Learning User's Guide.
Best regards.