Hi,
When I run the pretrained, quantized YOLOv8l TFLite model (only the input and output tensors are float; all other layers are int), invoking it on the CPU versus the NPU gives different results with the same Python or C++ script.
In Python, this is the YOLOv8 output on the CPU; it produces a (1, 84, 8400) tensor:
[[[ 8.4694298e-03 2.5408305e-02 2.5408305e-02 ... 8.5541326e-01
9.0199518e-01 9.4857705e-01]
[ 1.6938869e-02 2.1173587e-02 2.1173587e-02 ... 9.8245484e-01
9.8245484e-01 9.8245484e-01]
[ 1.6938869e-02 4.2347182e-02 4.6581902e-02 ... 3.6418584e-01
3.4724694e-01 3.3454281e-01]
...
[-8.3819032e-09 -8.3819032e-09 -8.3819032e-09 ... -8.3819032e-09
-8.3819032e-09 -8.3819032e-09]
[-8.3819032e-09 -8.3819032e-09 -8.3819032e-09 ... -8.3819032e-09
-8.3819032e-09 -8.3819032e-09]
[-8.3819032e-09 -8.3819032e-09 -8.3819032e-09 ... -8.3819032e-09
-8.3819032e-09 -8.3819032e-09]]]
On the NPU:
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
[[[0.02933936 0.06706139 0.0461047 ... 1.0352602 1.0352602 1.0352602 ]
[0.00838267 0. 0.01676535 ... 0.85084134 0.96400744 0.8047366 ]
[0.07125273 0.14250545 0.06706139 ... 0.7879713 0.7418666 0.821502 ]
...
[0. 0. 0. ... 0. 0. 0. ]
[0. 0. 0. ... 0. 0. 0. ]
[0. 0. 0. ... 0. 0. 0. ]]]
C++ shows the same thing: using the same code on the CPU and the NPU, changing only the line that loads the VX delegate, produces different results.
We were very careful with the preprocessing to make sure each pixel is the same float value in both Python and C++, but running the same script on the CPU and the NPU still gives different results (the only change is commenting out the line that loads libvx_delegate). If anyone has any thoughts on this, we would really appreciate it!
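For reference, a minimal sketch of the single-line difference between the two runs. The delegate path and helper names below are assumptions for illustration (the VX delegate library location varies by BSP), not the exact code from our scripts:

```python
# Sketch of the CPU/NPU toggle described above: the scripts are identical
# except for the delegate passed to the TFLite interpreter.
# The delegate path is an assumption; it varies by BSP.

VX_DELEGATE_LIB = "/usr/lib/libvx_delegate.so"

def build_delegates(use_npu, load_delegate=None, delegate_lib=VX_DELEGATE_LIB):
    """Return the experimental_delegates list for the interpreter.

    On the target board, load_delegate is
    tflite_runtime.interpreter.load_delegate; it is injected here so the
    toggle logic can be read (and exercised) without tflite installed.
    """
    if not use_npu:
        return []  # CPU path: XNNPACK is used by default
    return [load_delegate(delegate_lib)]

def run_model(model_path, image, use_npu=False):
    # Imported lazily so the sketch is readable off-device.
    from tflite_runtime.interpreter import Interpreter, load_delegate
    interp = Interpreter(
        model_path=model_path,
        experimental_delegates=build_delegates(use_npu, load_delegate),
    )
    interp.allocate_tensors()
    inp = interp.get_input_details()[0]
    interp.set_tensor(inp["index"], image)
    interp.invoke()
    return interp.get_tensor(interp.get_output_details()[0]["index"])
```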
best
Hi @yidema,
Thank you for contacting NXP Support.
It is important to note that we do not expect exactly the same results from CPU and NPU execution, due to architectural differences and how each computes floating-point values.
Could you please tell me your BSP version?
Please try our latest BSP version and let me know whether it resolves the issue; if not, we can look into other solutions.
Thank you and have a wonderful day!
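Building on the expectation above (small numeric drift between CPU and NPU is normal; large drift is not), a check like this hedged sketch can help decide whether the outputs differ by more than a few quantization steps. The scale and slack values are illustrative assumptions, not values from the model:

```python
import numpy as np

def max_abs_diff(cpu_out, npu_out):
    """Largest element-wise difference between the two runs."""
    return float(np.max(np.abs(np.asarray(cpu_out, dtype=np.float64)
                               - np.asarray(npu_out, dtype=np.float64))))

def outputs_consistent(cpu_out, npu_out, out_scale=1.0 / 255, slack=4):
    """True if the runs agree to within `slack` quantization steps.

    out_scale should be the model's actual output dequantization scale;
    1/255 and slack=4 are placeholder assumptions for illustration.
    Differences far beyond a few steps (like the logs above) suggest an
    execution problem rather than expected rounding.
    """
    return max_abs_diff(cpu_out, npu_out) <= slack * out_scale
```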
Hi,
As you suggested, we updated to the new BSP version and it solved the issue; the NPU is now even more accurate than the CPU. We really appreciate the fast help!
best
yide
In addition, we also tested YOLOv5 in C++: same thing. Changing only the VX delegate line, with everything else identical, gives different results:
On CPU:
Tensorflow Test
Reading image
IMAGE SIZE IS 281776
Reading image
IMAGE SIZE IS 348944
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Creating dst
Creating dst2
Creating dst3
Creating dst4
Creating dst5
Creating dst6
Creating dst7
IDX is 7
Rect is 1340.86 643.08 1782.28 1080.39
IDX is 5
Rect is 1484.66 141.859 1570.49 179.698
On NPU:
Tensorflow Test
Reading image
IMAGE SIZE IS 281776
Reading image
IMAGE SIZE IS 348944
Vx delegate: allowed_builtin_code set to 0.
Vx delegate: error_during_init set to 0.
Vx delegate: error_during_prepare set to 0.
Vx delegate: error_during_invoke set to 0.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
ERROR: hybrid data type is not supported in conv2d.
ERROR: hybrid data type is not supported in conv2d.
ERROR: hybrid data type is not supported in conv2d.
ERROR: hybrid data type is not supported in conv2d.
Creating dst
Creating dst2
Creating dst3
Creating dst4
Creating dst5
Creating dst6
Creating dst7
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
IDX is 0
Rect is 1776 999 1776 999
IDX is 1
Rect is 1776 999 1776 999
IDX is 0
Rect is 1872 249.75 1872 1856.25
IDX is 0
Rect is 1872 641.25 1872 1464.75
IDX is 4
Rect is 1776 1107 1776 1107
We also want to mention that for both YOLOv8l and YOLOv5, the CPU results are correct and the NPU results are wrong.
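To quantify the mismatch visible in the logs above (note that several NPU rects are degenerate, e.g. 1776 999 1776 999 has zero area), an IoU comparison between the CPU and NPU detections can help. This is a hedged sketch; the (x1, y1, x2, y2) box format and the helper names are assumptions, not part of the original scripts:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def best_match_ious(cpu_boxes, npu_boxes):
    """For each CPU box, the IoU of the closest NPU box (0.0 if none).

    Values near 1.0 mean the delegates agree; values near 0.0 (as with
    the zero-area NPU boxes in the log) flag a real divergence.
    """
    return [max((iou(c, n) for n in npu_boxes), default=0.0)
            for c in cpu_boxes]
```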