NPU versus CPU Results and Training for Tensorflow lite


2,149 Views
dwightk
Contributor III

Hello everyone,

We are having an issue on which we are hoping for assistance from the community.

The model is Tflite with float32 inputs and outputs. The rest of the layers are fully quantized with either int8 or int32 values.

If the model is run on the CPU it gives a correct result, but when we run it on the NPU it gives a totally different, incorrect result. The output is way off and unusable; it has no discernible similarity to the CPU version. The classes and predictions are completely wrong, as opposed to being less precise or accurate but still in the ballpark.

I have read elsewhere that it might have to do with rounding and precision issues, and that this has to be taken care of during training.
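To illustrate the rounding issue mentioned above, here is a minimal sketch of the int8 quantize/dequantize round trip that TFLite-style quantization performs. The scale and zero-point values are made up for illustration; the point is that each round trip loses at most half a quantization step, which should degrade precision slightly rather than change the predicted classes entirely.

```python
import numpy as np

def fake_quant_int8(x, scale, zero_point):
    """Simulate a TFLite-style int8 quantize -> dequantize round trip.

    scale and zero_point here are illustrative values, not taken from
    any real model.
    """
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([0.005, 0.1234, 0.9999], dtype=np.float32)
y = fake_quant_int8(x, scale=0.02, zero_point=0)
err = np.abs(x - y)  # bounded by scale / 2 per element
```

Rounding error of this size would explain slightly lower accuracy, but not completely unrelated outputs, which is why a delegate bug was suspected.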

Can anyone point us to the appropriate materials or documentation relating to the use of the NPU. We've tried two different models and as soon as it switches to NPU the model doesn't behave consistently.
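A simple way to quantify the CPU/NPU divergence described above is a small comparison harness. This is a hedged sketch: the `run_cpu` / `run_npu` callables are hypothetical wrappers (e.g. around a `tflite_runtime` Interpreter with and without the VX delegate), not an NXP-provided API.

```python
import numpy as np

def compare_backends(run_cpu, run_npu, input_tensor):
    """Run the same input through two backends and report divergence.

    run_cpu / run_npu are hypothetical callables that each take an
    input tensor and return the model's output scores.
    """
    cpu_out = np.asarray(run_cpu(input_tensor), dtype=np.float32)
    npu_out = np.asarray(run_npu(input_tensor), dtype=np.float32)
    return {
        "max_abs_diff": float(np.max(np.abs(cpu_out - npu_out))),
        "top1_match": bool(np.argmax(cpu_out) == np.argmax(npu_out)),
    }
```

A small `max_abs_diff` with `top1_match` true suggests ordinary quantization noise; a false `top1_match` across many inputs points at a delegate-level bug rather than precision loss.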

7 Replies

2,138 Views
JosephAtNXP
NXP TechSupport

Hi,

Thank you for your interest in NXP Semiconductor products,

This finding is from our research team; you could reanalyze the problem based on it.

Regards


1,978 Views
Olivier_B
Contributor II

Hi @JosephAtNXP

Like the others who asked here: is this patch available? I'm having a similar issue on the kirkstone branch.

Thanks!


1,971 Views
JosephAtNXP
NXP TechSupport

Hi,

You can get the patch, but you'd need to create another case (preferably a private technical one), and we'd ask for your model in order to test the patch with it, because some models (e.g. YOLOv5) improve with it and others don't.

 

Thank you,

2,104 Views
dwightk
Contributor III

Hi Joseph,

We have previously seen this link. What we are seeing is not diminished accuracy but completely different results. For example, where the CPU predicts class scores of [2,2,0,0,0,0,0,0,0], the NPU predicts [59,34,47,28,18...]. There is zero accuracy.

 

We also found the following link:

https://community.nxp.com/t5/i-MX-Processors/Yolov5-Tflite-CPU-vs-VX-delegate-NPU/m-p/1557873#M19785...

 

In this thread the user states that there is a patch from NXP to correct a bug:

It took me two different tickets but I finally received a patch from NXP support that I applied to op_map.cc in the vx-delegate.  That resolved the issue for us and we can now get good results from the NPU that are very close to the CPU results.

 

Are you able to comment on the existence of this patch in op_map.cc?

Was this incorporated into newer versions of the BSP?


2,092 Views
dwightk
Contributor III

Hi all,

 

I tried the new BSP:

Linux imx8mpevk 6.1.22+g66e442bc7fdc #1 SMP PREEMPT Mon Jun 12 12:31:27 UTC 2023 aarch64 GNU/Linux

 

This fixed the problem. However, if I want to stay on my previous BSP (5.10 hardknott), can I copy over just the vx_delegate and tensorflow .so libraries to the older BSP version? Are there other dependencies?
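One way to answer the dependency question yourself is to inspect what the delegate library links against on the target. This is a minimal sketch: the helper shells out to `ldd` and parses its `name => path` lines; the `/usr/lib/libvx_delegate.so` path is an assumption about where the delegate lives on the image.

```python
import re
import subprocess

def parse_ldd(ldd_output):
    """Extract dependency library names from ldd's 'name => path' lines."""
    return re.findall(r"^\s*(\S+\.so[\w.]*)\s*=>", ldd_output, re.MULTILINE)

def list_shared_deps(lib_path):
    """Run ldd on a shared object (e.g. /usr/lib/libvx_delegate.so on the
    target -- path is an assumption) and return its dependency names."""
    out = subprocess.run(["ldd", lib_path], capture_output=True, text=True)
    return parse_ldd(out.stdout)
```

Any GPU/NPU driver libraries (and their versions) that show up in this list would also need to match on the hardknott image, which is why copying only the two .so files may not be enough.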

There are some issues with using the new BSP for us due to our custom Uboot, so I wanted to try and make this patch work.

Thanks.


2,030 Views
JosephAtNXP
NXP TechSupport

Hi,

Our software team would like to review your model; could you send it?

We want to ensure that the patch works on hardknott, since it was made for a different BSP, and that it works for your model; the patch was intended to improve one particular model.

Thank you,


1,951 Views
dwightk
Contributor III

We have switched to Mickledore (6.1.36), and accuracy is now better on the NPU than on the CPU. Inference speed is also better once warmed up, though warm-up time is slower.

 
