Hi,
When I run the pretrained, quantized YOLOv8l TFLite model (only the input and output tensors are float; all other layers are int), invoking it on the CPU versus the NPU gives different results with the same Python or C++ script.
In Python, this is the YOLOv8 output on the CPU; it produces a (1, 84, 8400) tensor:
[[[ 8.4694298e-03 2.5408305e-02 2.5408305e-02 ... 8.5541326e-01
9.0199518e-01 9.4857705e-01]
[ 1.6938869e-02 2.1173587e-02 2.1173587e-02 ... 9.8245484e-01
9.8245484e-01 9.8245484e-01]
[ 1.6938869e-02 4.2347182e-02 4.6581902e-02 ... 3.6418584e-01
3.4724694e-01 3.3454281e-01]
...
[-8.3819032e-09 -8.3819032e-09 -8.3819032e-09 ... -8.3819032e-09
-8.3819032e-09 -8.3819032e-09]
[-8.3819032e-09 -8.3819032e-09 -8.3819032e-09 ... -8.3819032e-09
-8.3819032e-09 -8.3819032e-09]
[-8.3819032e-09 -8.3819032e-09 -8.3819032e-09 ... -8.3819032e-09
-8.3819032e-09 -8.3819032e-09]]]
On the NPU:
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
[[[0.02933936 0.06706139 0.0461047 ... 1.0352602 1.0352602 1.0352602 ]
[0.00838267 0. 0.01676535 ... 0.85084134 0.96400744 0.8047366 ]
[0.07125273 0.14250545 0.06706139 ... 0.7879713 0.7418666 0.821502 ]
...
[0. 0. 0. ... 0. 0. 0. ]
[0. 0. 0. ... 0. 0. 0. ]
[0. 0. 0. ... 0. 0. 0. ]]]
C++ shows the same thing: using the same code on the CPU and the NPU, changing only the line that loads the VX delegate, produces different results.
We were very careful with the preprocessing to make sure each pixel is the same float value in both Python and C++, but running the same script on the CPU and the NPU still gives different results (the only change is commenting out the line that loads libvx_delegate). If anyone has any thoughts on this, we would really appreciate it!
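For reference, a minimal sketch of the single-line difference between the two runs. The delegate path and helper names below are assumptions for illustration (the VX delegate library location varies by BSP), not the exact code from our scripts:

```python
# Sketch of the CPU/NPU toggle described above: the scripts are identical
# except for the delegate passed to the TFLite interpreter.
# The delegate path is an assumption; it varies by BSP.

VX_DELEGATE_LIB = "/usr/lib/libvx_delegate.so"

def build_delegates(use_npu, load_delegate=None, delegate_lib=VX_DELEGATE_LIB):
    """Return the experimental_delegates list for the interpreter.

    On the target board, load_delegate is
    tflite_runtime.interpreter.load_delegate; it is injected here so the
    toggle logic can be read (and exercised) without tflite installed.
    """
    if not use_npu:
        return []  # CPU path: XNNPACK is used by default
    return [load_delegate(delegate_lib)]

def run_model(model_path, image, use_npu=False):
    # Imported lazily so the sketch is readable off-device.
    from tflite_runtime.interpreter import Interpreter, load_delegate
    interp = Interpreter(
        model_path=model_path,
        experimental_delegates=build_delegates(use_npu, load_delegate),
    )
    interp.allocate_tensors()
    inp = interp.get_input_details()[0]
    interp.set_tensor(inp["index"], image)
    interp.invoke()
    return interp.get_tensor(interp.get_output_details()[0]["index"])
```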
best
Hi @yidema,
Thank you for contacting NXP Support.
It is important to note that we do not expect exactly the same results from CPU and NPU execution, due to architectural differences and how each computes floating-point values.
Could you please tell me your BSP version?
Please try our latest BSP version and let me know whether it resolves the issue; if not, we can look into other solutions.
Thank you and have a wonderful day!
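Building on the expectation above (small numeric drift between CPU and NPU is normal; large drift is not), a check like this hedged sketch can help decide whether the outputs differ by more than a few quantization steps. The scale and slack values are illustrative assumptions, not values from the model:

```python
import numpy as np

def max_abs_diff(cpu_out, npu_out):
    """Largest element-wise difference between the two runs."""
    return float(np.max(np.abs(np.asarray(cpu_out, dtype=np.float64)
                               - np.asarray(npu_out, dtype=np.float64))))

def outputs_consistent(cpu_out, npu_out, out_scale=1.0 / 255, slack=4):
    """True if the runs agree to within `slack` quantization steps.

    out_scale should be the model's actual output dequantization scale;
    1/255 and slack=4 are placeholder assumptions for illustration.
    Differences far beyond a few steps (like the logs above) suggest an
    execution problem rather than expected rounding.
    """
    return max_abs_diff(cpu_out, npu_out) <= slack * out_scale
```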
Hi,
As you suggested, we updated to the new BSP version and it solved the issue; the NPU is now even more accurate than the CPU. We really appreciate the fast help!
best
yide
In addition, we also tested YOLOv5 in C++: same thing. Changing only the VX delegate line, with everything else identical, gives different results:
On CPU:
Tensorflow Test
Reading image
IMAGE SIZE IS 281776
Reading image
IMAGE SIZE IS 348944
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Creating dst
Creating dst2
Creating dst3
Creating dst4
Creating dst5
Creating dst6
Creating dst7
IDX is 7
Rect is 1340.86 643.08 1782.28 1080.39
IDX is 5
Rect is 1484.66 141.859 1570.49 179.698
On NPU:
Tensorflow Test
Reading image
IMAGE SIZE IS 281776
Reading image
IMAGE SIZE IS 348944
Vx delegate: allowed_builtin_code set to 0.
Vx delegate: error_during_init set to 0.
Vx delegate: error_during_prepare set to 0.
Vx delegate: error_during_invoke set to 0.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
ERROR: hybrid data type is not supported in conv2d.
ERROR: hybrid data type is not supported in conv2d.
ERROR: hybrid data type is not supported in conv2d.
ERROR: hybrid data type is not supported in conv2d.
Creating dst
Creating dst2
Creating dst3
Creating dst4
Creating dst5
Creating dst6
Creating dst7
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
IDX is 0
Rect is 1776 999 1776 999
IDX is 1
Rect is 1776 999 1776 999
IDX is 0
Rect is 1872 249.75 1872 1856.25
IDX is 0
Rect is 1872 641.25 1872 1464.75
IDX is 4
Rect is 1776 1107 1776 1107
We also want to mention that for both YOLOv8l and YOLOv5, the CPU results are correct and the NPU results are wrong.
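To quantify the mismatch visible in the logs above (note that several NPU rects are degenerate, e.g. 1776 999 1776 999 has zero area), an IoU comparison between the CPU and NPU detections can help. This is a hedged sketch; the (x1, y1, x2, y2) box format and the helper names are assumptions, not part of the original scripts:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def best_match_ious(cpu_boxes, npu_boxes):
    """For each CPU box, the IoU of the closest NPU box (0.0 if none).

    Values near 1.0 mean the delegates agree; values near 0.0 (as with
    the zero-area NPU boxes in the log) flag a real divergence.
    """
    return [max((iou(c, n) for n in npu_boxes), default=0.0)
            for c in cpu_boxes]
```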