Hi,
HW: imx8mp-evk.
SW: LF_v5.10.72-2.2.0_images_IMX8MPEVK
PC: Ubuntu 20.04
Reference document: i.MX_Machine_Learning_User's_Guide.pdf
We are deploying YOLOv8 on the i.MX8MP, but we are encountering issues.
URL: https://github.com/NXP/eiq-model-zoo.git
branch: main
commit: 58c2b002e9f64f39b8c43e896e00446298544a33
We are following the README.md in eiq-model-zoo/tasks/vision/object-detection/yolov8.
1) The script run by 'bash recipe.sh' was not found.
2) Running "yolo export model=yolov8n.pt imgsz=640 format=tflite int8 separate_outputs=True" reported an error:
///
imx8mp@E480:~/git/imx8mp/cyberbee/NFS/gst/ultralytics_yolov8$ yolo export model=yolov8n.pt imgsz=640 format=tflite int8 separate_outputs=True
Traceback (most recent call last):
File "/usr/local/bin/yolo", line 8, in <module>
sys.exit(entrypoint())
File "/home/xhwang/.local/lib/python3.8/site-packages/ultralytics/cfg/__init__.py", line 903, in entrypoint
check_dict_alignment(full_args_dict, overrides)
File "/home/xhwang/.local/lib/python3.8/site-packages/ultralytics/cfg/__init__.py", line 485, in check_dict_alignment
raise SyntaxError(string + CLI_HELP_MSG) from e
SyntaxError: 'separate_outputs' is not a valid YOLO argument.
Arguments received: ['yolo', 'export', 'model=yolov8n.pt', 'imgsz=640', 'format=tflite', 'int8', 'separate_outputs=True']. Ultralytics 'yolo' commands use the following syntax:
yolo TASK MODE ARGS
''''''''''''''''''''''''
Thanks,
Joshua
Hi Zhiming,
The problem with YOLO inference is still unresolved, and we need your help.
IMAGE: LF_v6.6.52-2.2.0_images_IMX8MPEVK
I tried NPU inference, but it is very slow: the CPU takes about 50 ms, while the NPU requires about 3500 ms. Is there a problem with my configuration?
NPU:
python3 main.py --model yolov8n_full_integer_quant.tflite --img image.jpg --conf-thres 0.5 --iou-thres 0.5
INFO: Vx delegate: allowed_cache_mode set to 0.
INFO: Vx delegate: device num set to 0.
INFO: Vx delegate: allowed_builtin_code set to 0.
INFO: Vx delegate: error_during_init set to 0.
INFO: Vx delegate: error_during_prepare set to 0.
INFO: Vx delegate: error_during_invoke set to 0.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
##########Inference time: 3446.0 ms
img_width 256 img_height 256
[[[ 2.6509 15.906 15.906 ... 145.8 178.94 243.89]
[ 7.9528 11.929 11.929 ... 198.82 185.57 189.54]
[ 6.6274 33.137 35.788 ... 214.73 145.8 59.646]
...
[ 0 0 0 ... 0 0 0]
[ 0 0 0 ... 0 0 0]
[ 0 0 0 ... 0 0 0]]]
[32.7551794052124, 240.449116230011, 771.6738510131836, 469.714515209198] 0.8750193 5
[57.9184627532959, 394.2247134447098, 162.1633529663086, 508.8573968410492] 0.7973549 0
[675.8167366683483, 455.7349169254303, 134.20414835214615, 419.387948513031] 0.7507562 0
[222.8777128458023, 402.6124691963196, 123.0204713344574, 447.3471450805664] 0.6316708 0
CPU:
python3 main.py --model yolov8n_full_integer_quant.tflite --img image.jpg --conf-thres 0.5 --iou-thres 0.5
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
##########Inference time: 48.3 ms
img_width 256 img_height 256
[[[ 2.6509 15.906 15.906 ... 143.15 180.26 245.21]
[ 6.6274 11.929 11.929 ... 193.52 185.57 189.54]
[ 6.6274 33.137 35.788 ... 212.08 132.55 58.321]
...
[ 0 0 0 ... 0 0 0]
[ 0 0 0 ... 0 0 0]
[ 0 0 0 ... 0 0 0]]]
[32.7551794052124, 240.449116230011, 771.6738510131836, 469.714515209198] 0.8750193 5
[57.9184627532959, 394.2247134447098, 162.1633529663086, 508.8573968410492] 0.7973549 0
[678.6126579344273, 455.7349169254303, 128.61230581998825, 419.387948513031] 0.7507562 0
[225.67363411188126, 402.6124691963196, 117.4286288022995, 447.3471450805664] 0.6316708 0
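For reference, my understanding is that the first inference through the VX delegate includes graph compilation (warm-up), so I will also time a second invoke separately. A minimal timing sketch, assuming the same tflite_runtime setup as main.py (model and delegate paths are placeholders):

import time
import numpy as np
import tflite_runtime.interpreter as tflite

# Sketch only: same model as above, VX delegate from /usr/lib.
interpreter = tflite.Interpreter(
    model_path="yolov8n_full_integer_quant.tflite",
    experimental_delegates=[tflite.load_delegate("/usr/lib/libvx_delegate.so")],
)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))

for i in range(3):
    t0 = time.monotonic()
    interpreter.invoke()
    print(f"invoke {i}: {(time.monotonic() - t0) * 1000:.1f} ms")
# The first invoke includes VX delegate graph compilation; later invokes
# should reflect the steady-state NPU latency.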
Thanks,
Joshua
Hello,
Please download the code from https://github.com/DeGirum/ultralytics_yolov8 and export the model with:
yolo export model=yolov8n.pt imgsz=640 format=tflite int8 separate_outputs=True
The resulting yolov8n_full_integer_quant.tflite is located in the yolov8n_saved_model directory.
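If it helps, the exported model's quantization and output layout can be checked with the TFLite interpreter; a minimal sketch, assuming TensorFlow is installed on the export host:

import tensorflow as tf

# Sketch: inspect the exported int8 model's inputs and outputs.
interpreter = tf.lite.Interpreter(
    model_path="yolov8n_saved_model/yolov8n_full_integer_quant.tflite")
interpreter.allocate_tensors()
for d in interpreter.get_input_details():
    print("input :", d["shape"], d["dtype"], d["quantization"])
for d in interpreter.get_output_details():
    print("output:", d["shape"], d["dtype"], d["quantization"])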
Best Regards,
Zhiming
Hi Zhiming,
Thank you for your reply!
I am using ultralytics_yolov8.
https://github.com/DeGirum/ultralytics_yolov8
branch:master
commit 75cab2e0c68723d4344c69a3bcd85265a582ab3d
hwang@E480:~/git/imx8mp/cyberbee/NFS/gst/ultralytics_yolov8$ yolo export model=yolov8n.pt imgsz=640 format=tflite int8 separate_outputs=True
Traceback (most recent call last):
File "/usr/local/bin/yolo", line 8, in <module>
sys.exit(entrypoint())
File "/home/xhwang/.local/lib/python3.8/site-packages/ultralytics/cfg/__init__.py", line 903, in entrypoint
check_dict_alignment(full_args_dict, overrides)
File "/home/xhwang/.local/lib/python3.8/site-packages/ultralytics/cfg/__init__.py", line 485, in check_dict_alignment
raise SyntaxError(string + CLI_HELP_MSG) from e
SyntaxError: 'separate_outputs' is not a valid YOLO argument.
Arguments received: ['yolo', 'export', 'model=yolov8n.pt', 'imgsz=640', 'format=tflite', 'int8', 'separate_outputs=True']. Ultralytics 'yolo' commands use the following syntax:
yolo TASK MODE ARGS
Where TASK (optional) is one of {'pose', 'detect', 'segment', 'obb', 'classify'}
MODE (required) is one of {'track', 'val', 'export', 'benchmark', 'train', 'predict'}
ARGS (optional) are any number of custom 'arg=value' pairs like 'imgsz=320' that override defaults.
See all ARGS at https://docs.ultralytics.com/usage/cfg or with 'yolo cfg'
1. Train a detection model for 10 epochs with an initial learning_rate of 0.01
yolo train data=coco8.yaml model=yolo11n.pt epochs=10 lr0=0.01
2. Predict a YouTube video using a pretrained segmentation model at image size 320:
yolo predict model=yolo11n-seg.pt source='https://youtu.be/LNwODJXcvt4' imgsz=320
3. Val a pretrained detection model at batch-size 1 and image size 640:
yolo val model=yolo11n.pt data=coco8.yaml batch=1 imgsz=640
4. Export a YOLO11n classification model to ONNX format at image size 224 by 128 (no TASK required)
yolo export model=yolo11n-cls.pt format=onnx imgsz=224,128
5. Streamlit real-time webcam inference GUI
yolo streamlit-predict
6. Ultralytics solutions usage
yolo solutions count or in ['heatmap', 'queue', 'speed', 'workout', 'analytics', 'trackzone'] source="path/to/video/file.mp4"
7. Run special commands:
yolo help
yolo checks
yolo version
yolo settings
yolo copy-cfg
yolo cfg
yolo solutions help
Docs: https://docs.ultralytics.com
Solutions: https://docs.ultralytics.com/solutions/
Community: https://community.ultralytics.com
GitHub: https://github.com/ultralytics/ultralytics
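One thing I notice in the traceback is that it imports ultralytics from /home/xhwang/.local/lib/python3.8/site-packages rather than from the cloned DeGirum checkout. A quick check (just a sketch) of which package the yolo entry point actually resolves to:

# Sketch: confirm which ultralytics package the yolo CLI is importing.
import ultralytics
print(ultralytics.__version__)
print(ultralytics.__file__)  # should point into the DeGirum checkout if that fork is installed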
Thanks,
Joshua
Hello,
The code has changed; you can refer to the commit below. I think 'yolo export model=yolov8n.pt imgsz=640 format=tflite int8' is enough.
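For reference, the same export can also be done from the Python API; a minimal sketch, assuming the ultralytics package is installed:

from ultralytics import YOLO

# Sketch: equivalent of `yolo export model=yolov8n.pt imgsz=640 format=tflite int8`.
model = YOLO("yolov8n.pt")
model.export(format="tflite", imgsz=640, int8=True)
# The int8 model (yolov8n_full_integer_quant.tflite) is written into the
# yolov8n_saved_model directory.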
Best Regards,
Zhiming
Hi Zhiming,
Thank you very much for your help. The conversion issue has been resolved.
I have now run into a new problem: inference on the i.MX8MP is very slow!
Example program running:
ultralytics_yolov8/examples/YOLOv8-ONNXRuntime-CPP
Model conversion:
yolo export model=yolov8n.pt imgsz=640 format=onnx int8
compile:
mkdir build && cd build; cmake -D AARCH=TRUE ..; make
result:
params.cudaEnable 0
[YOLO_V8(CUDA)]: Cuda warm-up cost 2205.53 ms.
start Detector
img_path ../bus.jpg
[YOLO_V8(CUDA)]: 96.488ms pre-process, 2129.51ms inference, 17.911ms post-process.
res 4
label person 0.87 0.870000
label person 0.86 0.860000
label bus 0.86 0.860000
label person 0.82 0.820000
How can I optimize it? How do I use GPU or NPU acceleration?
Thanks,
Joshua
Hello,
To use the hardware accelerators, please refer to Section 2.6.5 "Using hardware accelerators" in this guide: https://www.nxp.com/docs/en/user-guide/IMX-MACHINE-LEARNING-UG.pdf
Best Regards,
Zhiming
From the reference document, Section 3.1 "ONNX Runtime software stack":
ONNX Runtime only supports the CPU, which may be the reason it is so slow.
I tried the TFLite model instead, referring to "ultralytics_yolov8/examples/YOLOv8-OpenCV-int8-tflite-Python":
1. Default interface:
interpreter = tflite.Interpreter(model_path=self.tflite_model)
##########Inference time: 1267.3 ms
2. Multi-threaded optimization:
interpreter = tflite.Interpreter(model_path=self.tflite_model, experimental_delegates=None, num_threads=4)
##########Inference time: 513.8 ms
3. How do I configure NPU and GPU inference?
Thanks,
Joshua
Hello,
Please pass the VX delegate library, /usr/lib/libvx_delegate.so, via experimental_delegates.
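For example, a minimal sketch assuming the tflite_runtime package on the i.MX8MP target (the delegate is loaded with load_delegate and passed as a list; the model path is a placeholder):

import tflite_runtime.interpreter as tflite

# Sketch: run the model on the NPU through the VX delegate.
vx_delegate = tflite.load_delegate("/usr/lib/libvx_delegate.so")
interpreter = tflite.Interpreter(
    model_path="yolov8n_full_integer_quant.tflite",
    experimental_delegates=[vx_delegate],
)
interpreter.allocate_tensors()
# Note: the first invoke() compiles the graph for the NPU and is slow;
# subsequent invokes run at steady-state NPU speed.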
Best Regards,
Zhiming