We have a fully quantized (uint8) model to be run on the i.MX 8M Plus.
When run on the CPU, inference returns exactly the neural activations we expect algebraically (identical to those from the training phase).
On the NPU, however, inference (via the NNAPI delegate) gives different results, with different activations and, in some rare cases, completely incorrect activations.
This is probably due to the accumulation of multiple internal approximations in some operation(s). We obviously want the inference output on the NPU to match the CPU and the training phase (on the server) exactly. Any advice?
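For concreteness, this is the integer-only arithmetic we expect, a minimal sketch of the standard TFLite uint8 scheme (all quantization parameters below are made up for illustration, not taken from our model):

```python
import numpy as np

# Hypothetical per-tensor quantization parameters (illustrative only).
in_scale, in_zp   = 0.02, 128    # uint8 input
w_scale,  w_zp    = 0.005, 131   # uint8 weights
out_scale, out_zp = 0.05, 128    # uint8 output

def quantized_dot(q_in, q_w, bias):
    """Reference integer dot product + requantization (TFLite uint8 scheme):
    acc = sum((q_in - in_zp) * (q_w - w_zp)) + bias, then rescale by
    M = in_scale * w_scale / out_scale and add the output zero-point."""
    acc = int(np.dot(q_in.astype(np.int32) - in_zp,
                     q_w.astype(np.int32) - w_zp)) + bias
    m = in_scale * w_scale / out_scale
    q_out = int(round(acc * m)) + out_zp
    return max(0, min(255, q_out))       # saturate to the uint8 range

q_in = np.array([120, 140, 100, 255], dtype=np.uint8)
q_w  = np.array([130, 131, 140,  90], dtype=np.uint8)
print(quantized_dot(q_in, q_w, bias=50))   # one output activation
```

Any runtime that follows this scheme bit-exactly should reproduce the same activations; the differences we see on the NPU suggest it deviates somewhere in this chain.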
Is there technical information about the NPU and how it handles int8, uint8, and the corresponding accumulations int8×int8 and uint8×uint8? (Already asked here: https://community.nxp.com/t5/i-MX-Processors/iMX-8M-Plus-NPU-info-and-Arm-Compute-Library/m-p/124132...)
Thanks,
V.
Hi Bio_TICFSL,
thank you for your answer.
Our own neural net achieves >99% Top-1 accuracy when executed on the CPU. We obviously use the CPU/GPU during quantization-aware training for loss calculation. It is critical to maintain the same >99% Top-1 accuracy on the NPU.
To do this, we could try, during training, to take the NPU's extra precision into account and somehow simulate it, but we obviously need to understand very well how it works and how it can affect us. If you have other methods in mind to make the NPU give us exactly the results we expect from training, please tell us (training on the NPU is not very practical).
Furthermore, is this rounding error the only source of difference between CPU and NPU?
Can you better explain this with an example?
"While the CPU uses 32-bit registers, the NPU uses a 16-bit register for the normalized multiplier and a 48-bit post-multiplier output during quantized inference. This way, the CPU suffers from double rounding error, while the NPU does not."
Thank you very much, very useful.
Regards,
NB
Running inference with the same .tflite model using (Arm NN + vsi_npu) instead of (TFLite + NNAPI) gives exactly the same (wrong) results.
This strongly suggests the problem is in one of the lower blocks of the stack: NNRT, OVXLIB, OpenVX, or the hardware itself.
As per page 12 of this manual dated 31 March 2021, https://www.nxp.com/docs/en/user-guide/IMX-MACHINE-LEARNING-UG.pdf
you can easily benchmark mobilenet_v1_1.0_224_quant.tflite on the CPU and on the NPU (--use_nnapi=true).
These are the results of inference:
The CPU activations are the "correct" ones, obtained by doing the calculations algebraically on any other computing platform for the same input image. This means that some approximation is introduced by the NNAPI delegation or by the NPU itself. Considering that this is an already-quantized model, this is not good.
Hypotheses:
- per-tensor vs. per-channel quantization?
- asymmetric vs. symmetric quantization?
- int8 <-> uint8 conversions?
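Regarding the last hypothesis: a correct int8 <-> uint8 conversion should by itself be lossless, since shifting the zero-point by 128 represents exactly the same real values. A minimal sketch (scale and zero-point values are hypothetical):

```python
import numpy as np

# uint8 and int8 quantizations represent the same real values when the
# zero-points differ by exactly 128 and the scales are equal:
#   real = scale * (q_uint8 - zp_u) = scale * (q_int8 - (zp_u - 128))
scale, zp_u = 0.02, 140           # hypothetical uint8 params
q_u = np.array([0, 100, 140, 255], dtype=np.uint8)

# Lossless conversion: subtract 128 and shift the zero-point by 128.
q_s = (q_u.astype(np.int16) - 128).astype(np.int8)
zp_s = zp_u - 128

real_u = scale * (q_u.astype(np.float32) - zp_u)
real_s = scale * (q_s.astype(np.float32) - zp_s)
assert np.array_equal(real_u, real_s)   # identical real values
print(q_s, zp_s)
```

So if the conversion is implemented this way, it cannot explain the discrepancy; a mismatch could only arise if a runtime converted the values without adjusting the zero-point accordingly.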
I kindly ask NXP to clarify the source of the error in the calculations for mobilenet_v1_1.0_224_quant.tflite. We can then train better models.
Thanks,
V.
Hello NeuralBlue,
Please check the answer provided here:
"The HW precision of CPU and NPU is different. While the CPU uses 32-bit registers, the NPU uses a 16-bit register for the normalized multiplier and a 48-bit post-multiplier output during quantized inference. This way, the CPU suffers from double rounding error, while the NPU does not."
Therefore the outputs are not equal.
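A rough numeric sketch of why the two paths can disagree. The requantization multiplier is itself rounded to a fixed number of bits (first rounding), and the product is then rounded again (second rounding); using a 16-bit instead of a 32-bit multiplier can flip borderline results. The bit-widths follow the description above, but the multiplier representation and rounding mode here are simplifications, not the exact hardware behavior:

```python
def quantize_multiplier(m, bits):
    """Approximate a real rescale factor m as m0 * 2**(shift - bits),
    with m0 an integer in [2**(bits-1), 2**bits). First rounding step."""
    shift = 0
    while m < 0.5:
        m *= 2.0
        shift -= 1
    while m >= 1.0:
        m *= 0.5
        shift += 1
    m0 = round(m * (1 << bits))
    if m0 == (1 << bits):        # rounding pushed m0 out of range
        m0 >>= 1
        shift += 1
    return m0, shift

def requantize(acc, m, bits):
    """Rescale an accumulator by m using a `bits`-bit multiplier; the
    wide product (e.g. 32x16 -> 48 bit) is rounded to nearest. Second
    rounding step. Rounding mode simplified to round-half-up."""
    m0, shift = quantize_multiplier(m, bits)
    prod = acc * m0
    total_shift = bits - shift
    return (prod + (1 << (total_shift - 1))) >> total_shift

acc, m = 5250, 0.002             # acc * m = 10.5 exactly: a borderline case
print(requantize(acc, m, bits=31))   # 32-bit-style multiplier path
print(requantize(acc, m, bits=15))   # 16-bit-style multiplier path
```

For this borderline accumulator the two multiplier widths land on different integers, which is exactly the kind of one-LSB activation difference that can accumulate across layers.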
Try to measure the overall accuracy difference of the model between CPU and NPU on a larger dataset (not a single example). We performed this kind of accuracy validation for PCQ models, with the following results:
Model (PCQ)           | CPU (4 cores; TF Lite) Top-1; Top-5 | VSI NPU (TF Lite; NNAPI) Top-1; Top-5
Mobilenet v1 1.0 224  | 70.80%; 88.20%                      | 68.48%; 88.01%
Mobilenet v2 1.0 224  | 70.74%; 89.77%                      | 70.75%; 89.75%
Efficientnet lite4 v2 | 77.30%; 94.00%                      | 76.40%; 93.70%
Resnet v2 101 299     | 75.92%; 93.20%                      | 76.25%; 93.31%
The highest difference we see for the Top-1 prediction is 2.32% for Mobilenet v1 (Top-5 is 0.19%), and 0.3% for the Top-5 prediction (Efficientnet model).
Ignore the Python use case; the behavior is due to the difference in HW precision between CPU and NPU.
Let me know if further clarifications are needed.
Regards
Currently we are not aware of another root cause for the difference in accuracy.
I will check internally whether we can share more details. Do you have an NDA? If you have an NDA, it is better to open an internal ticket for this.
Regards