Hi,
We're working on quantizing an ONNX model to run on the NPU on the iMX8MP, because we don't get sufficient performance running the float32 version (but it does work).
We're able to use `quantize_dynamic()` to quantize the model without errors or warnings, and it runs on the CPU using the CPUExecutionProvider, but it cannot run with the NnapiExecutionProvider, giving the following error:
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : model_builder.cc:238 RegisterInitializers NNAPI does not support scalar initializer
I haven't been able to find any information on this error and don't really understand how to go about addressing it; I don't fully understand what a scalar initializer is. Any additional suggestions or info would be appreciated.
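For anyone hitting the same error: an initializer is a constant tensor stored in the ONNX graph, and a scalar initializer is one with zero dimensions (an empty `dims` list). A quick way to locate them is the helper below (the `find_scalar_initializers` function is our own illustration; the commented `onnx.load` usage is the standard `onnx` package API):

```python
def find_scalar_initializers(initializers):
    """Return the names of initializers with an empty dims list (rank-0 scalars).

    `initializers` is any iterable of objects with .name and .dims attributes,
    e.g. onnx_model.graph.initializer from the `onnx` package.
    """
    return [init.name for init in initializers if len(init.dims) == 0]

# With a real model (requires the `onnx` package):
#   import onnx
#   model = onnx.load("model_quant.onnx")
#   print(find_scalar_initializers(model.graph.initializer))
```

Reshaping any reported scalars into 1-element 1-D tensors is one possible workaround to try, since the NNAPI EP's complaint is specifically about rank-0 initializers.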
Hello Colin,
The i.MX 8M Plus NPU doesn't support float32, so yes, performance would not be good. Did you try using the eIQ model tools to quantize the model and run it on the NPU? Furthermore, we recommend the TFLite runtime for running the model, for best performance. Can you share the model and the exact steps you followed? Did you try the quantize_static option?
Regards
From my recollection the eIQ tools are exclusively for image recognition workloads, so unfortunately are of no use to us. Please let me know if I have misunderstood that.
I'm afraid I can't share the model itself, but it is a fairly standard residual convolutional network with nothing particularly exotic going on. The code we used to quantize is as follows.
import sys

from onnxruntime.quantization import QuantType, quantize_dynamic

model_in = sys.argv[1]   # path to the float32 ONNX model
model_out = sys.argv[2]  # path for the quantized output

# quantize_dynamic writes the quantized model to model_out
quantize_dynamic(
    model_in,
    model_out,
    optimize_model=False,
    weight_type=QuantType.QUInt8,
)
We have been trying static quantization today, using broadly similar code with the addition of a calibration data reader, but have come up against a different issue. We are working on two methods in tandem: a Python script and a C++ programme. After static quantization:
Python:
* We cannot get the model to run at all on the NPU using the Python onnxruntime module; the error states "The graph is not acyclic".
* It runs well on the CPU, taking about 8 ms per inference. This is fast enough that (pending accuracy analysis) we might even consider abandoning NPU inference altogether, given the extreme difficulty in getting good results.
C++:
* When running in C++ instead, it works on both the NPU and the CPU, but it is extraordinarily slow for reasons we haven't been able to discern: over 1000 ms per inference on the NPU, and about 100 ms on the CPU.
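In outline, our static quantization script has the following shape (a minimal sketch, not our exact code: the reader class is simplified, and the input name and file paths are illustrative; officially the reader should subclass `onnxruntime.quantization.CalibrationDataReader`, but `quantize_static` only ever calls its `get_next()` method):

```python
class CalibrationReader:
    """Minimal calibration data reader for onnxruntime's quantize_static.

    get_next() must return a dict mapping model input names to arrays,
    then None once the calibration data is exhausted.
    """
    def __init__(self, batches):
        self._it = iter(batches)

    def get_next(self):
        return next(self._it, None)

# Real usage (requires onnxruntime and calibration batches as numpy arrays):
#   from onnxruntime.quantization import QuantType, quantize_static
#   reader = CalibrationReader({"input": batch} for batch in calib_batches)
#   quantize_static("model.onnx", "model_int8.onnx", reader,
#                   weight_type=QuantType.QUInt8)
```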
I can share the code if you like but it is arguably a different problem so perhaps a new thread would be better. Please let me know.
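One caveat for anyone comparing our numbers: the first inference on the NPU typically includes graph compilation, so warm-up runs should be excluded from any latency measurement. A minimal Python timing sketch (the `run_once` callable is a stand-in for a real `session.run` call):

```python
import time

def time_inference(run_once, warmup=3, iters=20):
    """Return average latency in milliseconds, excluding warm-up runs.

    On the NPU the first run typically includes graph compilation,
    so it must not be counted in the steady-state figure.
    """
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    return (time.perf_counter() - start) * 1000.0 / iters

# Real usage (requires onnxruntime):
#   sess = onnxruntime.InferenceSession(
#       "model_int8.onnx", providers=["NnapiExecutionProvider"])
#   print(time_inference(lambda: sess.run(None, {"input": x})))
```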
Hello,
Based on the description, I am assuming it is an audio or time-series model. Yes, the eIQ Portal supports image-based models, though quantization might still work. Have you evaluated the TFLite runtime?
Can you share the list of operators this model uses? Can you share a similar model, or the name of a reference model? As mentioned previously, the ONNX Runtime is not supported in the same way that we have enabled TFLite, so it would be hard to provide feedback without knowing more details.
Regards
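For reference, the operator list asked for above can be extracted with a few lines of Python (the counting helper is our own illustration; the commented usage assumes the `onnx` package):

```python
from collections import Counter

def count_op_types(nodes):
    """Count operator types in an iterable of ONNX graph nodes.

    Each node needs an .op_type attribute, e.g. onnx_model.graph.node
    from the `onnx` package.
    """
    return Counter(node.op_type for node in nodes)

# With a real model (requires the `onnx` package):
#   import onnx
#   model = onnx.load("model.onnx")
#   for op, n in sorted(count_op_types(model.graph.node).items()):
#       print(f"{op}: {n}")
```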