Dear team,
I'm using the Linux 6.1.1_1.0.0 SDK on an i.MX8M+ custom board.
I tried to run the validation script available in the Ultralytics YOLOv5 repository on the open-source yolov5s model with a detection dataset.
Class Images Instances P R mAP50 mAP50-95: 100%| 128/128 [04:46<00:00, 2.24s/it]
all 128 929 0.726 0.581 0.679 0.427
Speed: 3.9ms pre-process, 2181.9ms inference, 27.2ms NMS per image at shape (1, 3, 640, 640)
The observed inference time is about 2 seconds. Is this because the default inference backend of the TFLite implementation is the CPU?
How can we enable/use the GPU/NPU hardware accelerator through the VX Delegate on the i.MX8M+?
Any help is appreciated.
Regards
Amal
Hi @Amal_Antony3331,
For this situation, I have two suggestions:
The first one is to use the benchmark_model tool included in the current BSP.
This tool will give you the average inference time for your model.
You can use it as follows:
1. Go to the TensorFlow Lite examples folder:
$ cd /usr/bin/tensorflow-lite-2.x.x/examples
2. Run the benchmark on the CPU with 4 threads:
$ ./benchmark_model --graph=yolov5s-32fp-256.tflite --num_runs=50 --num_threads=4
3. Run the benchmark on the NPU through the VX Delegate:
$ ./benchmark_model --graph=yolov5s-32fp-256.tflite --num_runs=50 --external_delegate_path=/usr/lib/libvx_delegate.so
With steps 2 and 3 you will see a difference in inference time of around 100 ms (around 125 ms on the CPU versus around 30 ms on the NPU).
The second suggestion is to use GStreamer + NNStreamer. With these tools you will be able to run the model on a live video stream, and you can also modify the pipeline to run on a static image.
The important part of this pipeline that allows it to run on the NPU is "custom=Delegate:External,ExtDelegateLib:libvx_delegate.so".
Pipeline example:
$ gst-launch-1.0 --no-position v4l2src device=/dev/video3 ! \
  video/x-raw,width=640,height=480,framerate=30/1 ! \
  tee name=t t. ! queue max-size-buffers=2 leaky=2 ! \
  imxvideoconvert_g2d ! video/x-raw,width=256,height=256,format=RGBA ! \
  videoconvert ! video/x-raw,format=RGB ! \
  tensor_converter ! \
  tensor_filter framework=tensorflow-lite model=yolov5s_quant_256.tflite \
    custom=Delegate:External,ExtDelegateLib:libvx_delegate.so ! \
  tensor_decoder mode=bounding_boxes option1=yolov5 option2=coco_label.txt \
    option4=640:480 option5=256:256 ! \
  mix. t. ! queue max-size-buffers=2 ! \
  imxcompositor_g2d name=mix sink_0::zorder=2 sink_1::zorder=1 ! waylandsink
I hope this information will be helpful.
Have a great day!
Hi @brian14
Thank you so much for the response.
One more query for clarification: I'm using a custom-trained YOLOv5 model. So if I run benchmark_model to get the average inference time, which dataset/labels is this operation performed on?
Also, since benchmark_model can use the VX delegate, val.py (from the yolov5 repository) should also be able to use it, right?
Hi @Amal_Antony3331,
The benchmark_model is part of the benchmark tools provided by the TensorFlow framework. For benchmark_model you don't need to specify input data or a labels file.
You will find more information about benchmark tools for TensorFlow Lite at the following link:
Performance measurement | TensorFlow Lite
Also, you will find information about benchmark_model command line and how to use it at this link:
tensorflow/tensorflow/lite/tools/benchmark/README.md at master · tensorflow/tensorflow · GitHub
Finally, I'm not sure about val.py. I'm assuming that you are exporting the YOLOv5 model to .tflite format and running it with GStreamer and NNStreamer on videos or photos.
I hope this information will be helpful.
Have a great day!
Hi @brian14
Thanks for the response.
Could you please guide me on how to enable the VX delegate in custom Python code written for inferencing and benchmarking a .tflite model?
Hi @Amal_Antony3331,
I'm not sure I fully understand your request, but I can think of two scenarios:
For the first one, I'm assuming that you are trying to automate inferencing and benchmarking for the same model. In that case, you could write a Python or bash script that runs a GStreamer pipeline or calls benchmark_model as in the first example (see the sketch after the link below).
You can follow this guide:
Python GStreamer Tutorial (brettviren.github.io)
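For illustration, a minimal Python sketch along these lines could automate both benchmark runs; the benchmark_model path and model name below are taken from the earlier steps in this thread and may need adjusting for your BSP and model:

import subprocess

# Paths below follow the earlier steps in this thread; adjust them to your setup.
BENCHMARK = "/usr/bin/tensorflow-lite-2.x.x/examples/benchmark_model"
MODEL = "yolov5s-32fp-256.tflite"
VX_DELEGATE = "/usr/lib/libvx_delegate.so"

def run_benchmark(extra_args, label):
    # Run benchmark_model with the given extra arguments and print its output.
    cmd = [BENCHMARK, f"--graph={MODEL}", "--num_runs=50"] + extra_args
    print(f"--- {label} ---")
    subprocess.run(cmd, check=True)

# CPU baseline with 4 threads, then NPU through the VX delegate.
run_benchmark(["--num_threads=4"], "CPU, 4 threads")
run_benchmark([f"--external_delegate_path={VX_DELEGATE}"], "NPU via VX delegate")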
On the other hand, you may need to develop custom Python code based on a model such as YOLO, MobileNet, ResNet, etc., and then convert it to TFLite. For this case, you can follow these guides:
Model conversion overview | TensorFlow Lite
Tensorflow Lite Converter Example!! | by Maheshwar Ligade | techwasti | Medium
Finally, you can check our eIQ software; with this tool you will be able to convert your models, quantize them, or train a model.
eIQ® Toolkit | NXP Semiconductors
Note: Keep in mind that you always need to export the model to .tflite format to properly apply the VX Delegate.
Best regards, Brian.
Hi @brian14
Let me provide some more points to clarify my concern.
I have a custom .tflite model that we trained on specific classes according to our requirements.
I tried to run the same .tflite model on two BSP versions:
Case 1:
root@imx8mpevk:/usr/lib/python3.9/yolov5# uname -a
Linux imx8mpevk 5.10.35-lts-5.10.y+gdd2583ce6e52 #1 SMP PREEMPT Tue Jun 8 14:42:10 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
I'm able to run val.py with the VX delegate:
root@imx8mpevk:/usr/lib/python3.9/yolov5# python3 val.py --weights custom_model-int8.tflite --data data/coco128_custom.yaml --img 64
As expected, I'm also getting some delegate logs. See below:
YOLOv5 🚀 v7.0-23-g5dc1ce4 Python-3.9.4 torch-1.7.1 CPU
Loading custom_model-int8.tflite for TensorFlow Lite inference...
Initialize supported_builtins
Check Resize(0)
Check Resize(0)
Check StridedSlice
Check StridedSlice
Check StridedSlice
Check StridedSlice
Check StridedSlice
Check StridedSlice
Check StridedSlice
Check StridedSlice
Check StridedSlice
vx_delegate Delegate::Init
Initialize supported_builtins
Delegate::Prepare node:0xaaaae35aae40
Applied VX delegate.
Speed: 3.7ms pre-process, 223.5ms inference, 4.3ms NMS per image at shape (1, 3, 640, 640)
Inference time observed is 223.5 ms
Case 2:
root@imx8mpevk:~/benchmark/yolov5# uname -a
Linux imx8mpevk 6.1.1+g29549c7073bf #1 SMP PREEMPT Thu Mar 2 14:54:17 UTC 2023 aarch64 GNU/Linux
I tried running val.py with all conditions remaining the same as above.
root@imx8mpevk:~/benchmark/yolov5# python3 val.py --weights custom_model-int8.tflite --data data/custom_coco128.yaml --img 640
YOLOv5 🚀 v7.0-23-g5dc1ce4 Python-3.10.6 torch-1.11.0 CPU
Loading custom_model-int8.tflite for TensorFlow Lite inference...
Forcing --batch-size 1 square inference (1,3,640,640) for non-PyTorch models
Speed: 4.1ms pre-process, 2070.1ms inference, 2.8ms NMS per image at shape (1, 3, 640, 640)
Here the observed inference time is 2070 ms, and there are no logs related to vx_delegate.
If the exported .tflite model works on one BSP version, it should also work on the latest BSP version, right?
What could be the possible reason for this behavior?
Thanks in advance
Hi @Amal_Antony3331,
Thank you for your clarification.
The reason could be that you are not specifying the external delegate. You can do this by passing the external delegate as an argument.
In our BSP examples for TensorFlow Lite you will find the example label_image.py. You can base your implementation of the external delegate on that code, or use the example directly with the arguments --image to set an image path, --model_file to set the model, and --ext_delegate to set the external delegate.
Example of the argument parser in Python used in label_image.py for the external delegate:
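A minimal sketch of that pattern, assuming the tflite_runtime package shipped with the BSP (this is not the exact label_image.py source, and the argument defaults are omitted):

import argparse
import tflite_runtime.interpreter as tflite  # "from tensorflow import lite as tflite" also works

parser = argparse.ArgumentParser()
parser.add_argument('-i', '--image', help='path of the image to run')
parser.add_argument('-m', '--model_file', help='path of the .tflite model')
parser.add_argument('-e', '--ext_delegate', help='external delegate library, e.g. /usr/lib/libvx_delegate.so')
args = parser.parse_args()

# Load the external delegate only when it is passed on the command line;
# otherwise the interpreter falls back to the CPU.
ext_delegate = []
if args.ext_delegate:
    print('Loading external delegate from', args.ext_delegate)
    ext_delegate = [tflite.load_delegate(args.ext_delegate)]

interpreter = tflite.Interpreter(model_path=args.model_file,
                                 experimental_delegates=ext_delegate)
interpreter.allocate_tensors()
# ... set the input tensor from the image, interpreter.invoke(), read the output ...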
I made a test and these are the results:
$ python3 label_image.py
$ python3 label_image.py --ext_delegate=/usr/lib/libvx_delegate.so
You can see the difference between running without the external delegate and running with the NPU on the i.MX8M Plus: from 157.3 ms down to 4.0 ms.
In conclusion, you will need to review label_image.py and bring the argument parser and the code section that loads the external delegate into your val.py file.
Link to the label_image.py code: tensorflow/tensorflow/lite/examples/python/label_image.py at master · tensorflow/tensorflow · GitHub
I hope this answer will be helpful.
Best regards, Brian.
Hi @brian14
I have modified common.py (https://github.com/ultralytics/yolov5/blob/master/models/common.py#L457) as follows:
'Linux': '/usr/lib/libvx_delegate.so' instead of 'Linux': 'libedgetpu.so.1',
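For context, the relevant block in common.py after this change looks roughly like the following (paraphrased; see the link above for the exact code):

if edgetpu:  # taken when the model file name ends with *_edgetpu.tflite
    LOGGER.info(f'Loading {w} for TensorFlow Lite Edge TPU inference...')
    delegate = {
        'Linux': '/usr/lib/libvx_delegate.so',   # was: 'libedgetpu.so.1'
        'Darwin': 'libedgetpu.1.dylib',
        'Windows': 'edgetpu.dll'}[platform.system()]
    interpreter = Interpreter(model_path=w, experimental_delegates=[load_delegate(delegate)])
else:  # plain TFLite, no delegate
    LOGGER.info(f'Loading {w} for TensorFlow Lite inference...')
    interpreter = Interpreter(model_path=w)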
I then ran the val.py script with the open-source yolov5s.tflite model (renamed to yolov5s-int8_edgetpu.tflite):
$ python3 val.py --weights yolov5s-int8_edgetpu.tflite --data data/coco128.yaml --img 640
It is observed that some logs related to the VX delegate are printed, but the mAP value is zero.
root@imx8mpevk:~/benchmark/yolov5# python3 val.py --weights yolov5s-int8_edgetpu.tflite --data data/coco128.yaml --img 640
/usr/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension:
warn(f"Failed to load image Python extension: {e}")
val: data=data/coco128.yaml, weights=['yolov5s-int8_edgetpu.tflite'], batch_size=32, imgsz=640, conf_thres=0.001, iou_thres=0.6, max_det=300, task=val, device=, workers=8, single_cls=False, augment=False, verbose=False, save_txt=False, save_hybrid=False, save_conf=False, save_json=False, project=runs/val, name=exp, exist_ok=False, half=False, dnn=False
YOLOv5 🚀 v7.0-23-g5dc1ce4 Python-3.10.6 torch-1.11.0 CPU
Loading yolov5s-int8_edgetpu.tflite for TensorFlow Lite Edge TPU inference...
Vx delegate: allowed_cache_mode set to 0.
Vx delegate: device num set to 0.
Vx delegate: allowed_builtin_code set to 0.
Vx delegate: error_during_init set to 0.
Vx delegate: error_during_prepare set to 0.
Vx delegate: error_during_invoke set to 0.
Forcing --batch-size 1 square inference (1,3,640,640) for non-PyTorch models
val: Scanning /home/root/benchmark/datasets/coco128/labels/train2017.cache... 126 images, 2 backgrounds, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]
Class Images Instances P R mAP50 mAP50-95: 0%| | 0/128 [00:00<?, ?it/s]
W [HandleLayoutInfer:278]Op 162: default layout inference pass.
W [HandleLayoutInfer:278]Op 162: default layout inference pass.
W [HandleLayoutInfer:278]Op 162: default layout inference pass.
W [HandleLayoutInfer:278]Op 162: default layout inference pass.
W [HandleLayoutInfer:278]Op 162: default layout inference pass.
W [HandleLayoutInfer:278]Op 162: default layout inference pass.
Class Images Instances P R mAP50 mAP50-95: 16%|█▋ | 21/128 [00:27<00:28, 3.71it/s] WARNING ⚠️ NMS time limit 0.550s exceeded
Class Images Instances P R mAP50 mAP50-95: 18%|█▊ | 23/128 [00:28<00:57, 1.84it/s] WARNING ⚠️ NMS time limit 0.550s exceeded
Class Images Instances P R mAP50 mAP50-95: 26%|██▌ | 33/128 [00:32<00:24, 3.93it/s] WARNING ⚠️ NMS time limit 0.550s exceeded
Class Images Instances P R mAP50 mAP50-95: 29%|██▉ | 37/128 [00:34<00:32, 2.79it/s] WARNING ⚠️ NMS time limit 0.550s exceeded
Class Images Instances P R mAP50 mAP50-95: 34%|███▎ | 43/128 [00:38<00:33, 2.53it/s] WARNING ⚠️ NMS time limit 0.550s exceeded
Class Images Instances P R mAP50 mAP50-95: 37%|███▋ | 47/128 [00:39<00:27, 2.96it/s] WARNING ⚠️ NMS time limit 0.550s exceeded
Class Images Instances P R mAP50 mAP50-95: 41%|████ | 52/128 [00:42<00:28, 2.64it/s] WARNING ⚠️ NMS time limit 0.550s exceeded
Class Images Instances P R mAP50 mAP50-95: 80%|███████▉ | 102/128 [00:57<00:09, 2.74it/s] WARNING ⚠️ NMS time limit 0.550s exceeded
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 128/128 [01:04<00:00, 1.98it/s]
all 128 929 0 0 0 0
Speed: 4.2ms pre-process, 344.9ms inference, 135.9ms NMS per image at shape (1, 3, 640, 640)
Thank you for your reply.
From your reply I can see that you are using a model compiled for the Edge TPU. This is the Tensor Processing Unit for Google devices, and it is not compatible with the i.MX8M Plus NPU.
I think you will need to work more on common.py to effectively use our NPU. I can see that the common.py you are using is prepared to work with other embedded systems, especially with the TPU in the Coral device by Google.
My suggestion is still to use the label_image.py script described in my last reply and implement the YOLOv5 model in TensorFlow Lite format.
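As a quick, separate sanity check (a minimal sketch, independent of the yolov5 code; the model path, the use of tflite_runtime, and the random input are assumptions), you could compare the raw outputs of your .tflite model on the CPU and on the VX delegate for the same input:

import numpy as np
import tflite_runtime.interpreter as tflite

MODEL = 'custom_model-int8.tflite'          # adjust to your model
VX_DELEGATE = '/usr/lib/libvx_delegate.so'  # VX delegate path from this thread

def run_once(delegates):
    # Build an interpreter (with or without the delegate) and run one inference
    # on a fixed random input so the two runs are comparable.
    interpreter = tflite.Interpreter(model_path=MODEL, experimental_delegates=delegates)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    rng = np.random.default_rng(0)
    if np.issubdtype(inp['dtype'], np.integer):
        info = np.iinfo(inp['dtype'])
        data = rng.integers(info.min, info.max, size=inp['shape'], dtype=inp['dtype'])
    else:
        data = rng.random(size=tuple(inp['shape']), dtype=np.float32).astype(inp['dtype'])
    interpreter.set_tensor(inp['index'], data)
    interpreter.invoke()
    return interpreter.get_tensor(out['index'])

cpu_out = run_once([])
npu_out = run_once([tflite.load_delegate(VX_DELEGATE)])
print('max abs difference (CPU vs NPU):',
      np.abs(cpu_out.astype(np.float32) - npu_out.astype(np.float32)).max())

If the two outputs differ substantially, that points to how the model runs on the NPU rather than to the val.py evaluation itself.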
Have a great day!