Apply Deeplab+yolo examples on NPU - VERY SLOW

horst127
Contributor III

Hey,

For a few days now I have been working on running a custom model on the i.MX 8M Plus NPU. I am struggling with a custom object detection model that takes about 400 ms on the NPU and 800 ms on the CPU. Three resize layers fall back to the CPU (they only take about 20 ms in total), and the rest of the time is spent on the NPU (the first sequence of operations alone takes about 200 ms!).

However, since that model cannot be reproduced in a public forum, I ran the deeplab_v3 and yolo_v4 examples included in the eIQ Toolkit instead. As the system image on the i.MX 8 I used the newest release from your website. I quantized all of your models to int8 using the eIQ GUI and ran the following command:

$ /usr/bin/tensorflow-lite-2.4.1/examples# ./benchmark_model --graph=/home/user/deeplab_bilinear_best_int.tflite --use_nnapi=true

STARTING!
Log parameter values verbosely: [0]
Graph: [/home/bryan/deeplab_bilinear_best_int.tflite]
Use NNAPI: [1]
NNAPI accelerators available: [vsi-npu]
Loaded model /home/user/deeplab_bilinear_best_int.tflite
INFO: Created TensorFlow Lite delegate for NNAPI.
WARNING: Operator RESIZE_BILINEAR (v3) refused by NNAPI delegate: Operator refused due performance reasons.
WARNING: Operator RESIZE_BILINEAR (v3) refused by NNAPI delegate: Operator refused due performance reasons.
Explicitly applied NNAPI delegate, and the model graph will be partially executed by the delegate w/ 2 delegate kernels.
The input model file size (MB): 2.72458
Initialized session in 11.44ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=12650423

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=247367 curr=247257 min=245131 max=248995 avg=247147 std=637

Inference timings in us: Init: 11440, First inference: 12650423, Warmup (avg): 1.26504e+07, Inference (avg): 247147
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=3.95312 overall=50.3867

Summarized Profiler:

Operator-wise Profiling Info for Regular Benchmark Runs:
============================== Run Order ==============================
[node type] [start] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name]
TfLiteNnapiDelegate 0.000 125.411 125.517 50.774% 50.774% 0.000 1 [XXXX]:71

RESIZE_BILINEAR 125.518 15.918 16.052 6.494% 57.268% 0.000 1 [XXXX]:64
TfLiteNnapiDelegate 141.572 18.538 18.528 7.495% 64.763% 0.000 1 [XXX1]:72
RESIZE_BILINEAR 160.102 86.804 87.107 35.237% 100.000% 0.000 1 [Identity]:70

DeepLab takes around 250 ms, while yolov4 also takes about 190 ms. That sounds very slow to me for hardware that is supposed to run neural networks. Is that normal behaviour? If not, what is wrong? PS: mobilenet_v1_1.0_224_quant.tflite runs at the documented ~2 ms.

I would be grateful for any hints.

Zhiming_Liu
NXP TechSupport

Can you share deeplab_bilinear_best_int.tflite? I think this is still a model issue.

horst127
Contributor III

And here is the model created with TensorFlow code (instead of using the NXP eIQ GUI):

import pathlib
from os import listdir

import numpy as np
import tensorflow as tf
from PIL import Image

# image_path, load_path and save_path are defined elsewhere: the folder with
# calibration images, the Keras model to convert, and the output .tflite file.

def load_images_float32(img_path):
    # Load all calibration images, resize them to the 512x512 model input and stack them.
    files = sorted(listdir(img_path))
    image_list = [np.asarray(Image.open(img_path + file_path).resize(size=(512, 512)),
                             dtype=np.float32)
                  for file_path in files]
    return tf.stack(image_list)

def representative_data_gen():
    # Yield single-image batches as the representative dataset for post-training quantization.
    images = load_images_float32(image_path)
    for input_value in tf.data.Dataset.from_tensor_slices(images).batch(1).take(100):
        yield [input_value]

tflite_filepath = pathlib.Path(save_path)
model = tf.keras.models.load_model(load_path)

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS,
                                       tf.lite.OpsSet.SELECT_TF_OPS]

tflite_model = converter.convert()
tflite_filepath.write_bytes(tflite_model)
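
As a side note, below is a minimal sketch of how the converted file can be sanity-checked on the host before benchmarking; the model filename is only a placeholder, and the uint8 expectation simply follows from the converter settings above:

import numpy as np
import tensorflow as tf

# Placeholder path; point this at the actual converted model.
interpreter = tf.lite.Interpreter(model_path="deeplab_bilinear_best_int.tflite")
interpreter.allocate_tensors()

# Check that the input and output tensors really are uint8 after conversion.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
print("input:", input_details[0]["shape"], input_details[0]["dtype"])
print("output:", output_details[0]["shape"], output_details[0]["dtype"])

# Run one dummy inference with an all-zero image of the expected input shape.
dummy = np.zeros(input_details[0]["shape"], dtype=np.uint8)
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
print("output shape:", interpreter.get_tensor(output_details[0]["index"]).shape)

If the printed dtypes are not uint8, the conversion did not produce the fully quantized inputs and outputs requested above.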

 

Zhiming_Liu
NXP TechSupport

@horst127 

OK, I will do some tests and give you feedback.

horst127
Contributor III

Thank you very much

horst127
Contributor III

Sure, the file is attached. This one was created with the eIQ Toolkit; a TFLite version created with TensorFlow Python code is given in the other answer.

Zhiming_Liu
NXP TechSupport

Hi @horst127 

I have looked at your model structure and the mobilenet structure; your network structure is complex compared with mobilenet.

You should consider designing a lightweight network structure and performing network tailoring and knowledge distillation to achieve ideal performance on embedded devices.

horst127
Contributor III

Thanks for your reply. But this is not my custom model, it is an NXP example model provided with the eIQ Toolkit. I assumed that the example models are suited to the corresponding hardware? Even tiny yolo v4 (also provided with the eIQ Toolkit) runs at about 300 ms (I think because a number of operations are not supported), which is about a factor of 10 slower than numbers reported for a GPU. Is that normal?

Zhiming_Liu
NXP TechSupport

"I assumed that the example models are suited to the corresponding hardware?"

--> The newest eIQ Toolkit contains deeplab_bilinear_float.tflite and deeplab_nearest_float.tflite.

As chapter 6.1 "Image segmentation" in the eIQ user guide says, you still need to quantize the model to leverage its performance benefits:

1. Navigate to the workspace\models\deeplab_v3 folder.
2. Convert the "deeplab" model to RTM as follows:
   deepview-converter deeplab_nearest_best.h5 deeplab_nearest_best.rtm
3. Quantize the model to leverage its performance benefits as follows:
   deepview-converter --default_shape 1,512,512,3 --quantize ^
     --quantize_format uint8 --quant_normalization signed --samples imgs ^
     deeplab_nearest_best.h5 deeplab_nearest_best_uint8.rtm
4. Run the Python script to see the result of the image segmentation as follows:
   python runner_demo.py -m deeplab_nearest_best.rtm -i imgs\image1.jpg ^
     -o image1_out_nearest_best_rtm.jpg http://127.0.0.1:10818/v1

 

horst127
Contributor III

Did you get results different from mine?

Zhiming_Liu
NXP TechSupport

Hi @horst127 

Can you share how to get these detailed benchmark results?

============================== Run Order ==============================
[node type] [start] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name]
TfLiteNnapiDelegate 0.000 125.411 125.517 50.774% 50.774% 0.000 1 [XXXX]:71

RESIZE_BILINEAR 125.518 15.918 16.052 6.494% 57.268% 0.000 1 [XXXX]:64
TfLiteNnapiDelegate 141.572 18.538 18.528 7.495% 64.763% 0.000 1 [XXX1]:72
RESIZE_BILINEAR 160.102 86.804 87.107 35.237% 100.000% 0.000 1 [Identity]:70

I am running ./benchmark_model but I can't see the detailed results.

horst127
Contributor III

Hi,

you can run

$ ./benchmark_model --graph=<path_to_tflite> --use_nnapi=true --enable_op_profiling=true

Zhiming_Liu
NXP TechSupport

DeepLab and mobilenet have different purposes, so the structures of the models are different: DeepLab targets semantic segmentation while mobilenet targets image classification. The other difference is the input size: mobilenet has a 224x224 input, while the deeplab example has a 512x512 pixel input. This has a pretty big impact on inference time.

Since this is provided as an example by AuZone, can you share your expectation for this TfLiteNnapiDelegate time? I will contact the R&D team to check whether this is possible.

horst127
Contributor III

I do not have any expected times for deeplab. But if we consider the popular yolov4 (tiny) model, you can find a lot of reported GPU and CPU runtimes on the internet. As mentioned in my first post, running the provided example (320x320 input) on the i.MX 8M Plus NPU is about 10 times slower than those (or even more), at about 200 ms per frame. That seems very odd to me.

horst127
Contributor III

I did the quantization, that's why I also sent you the TensorFlow code. Anyway, I copied your command (step 3) exactly, only replacing .rtm with .tflite, and it takes 200 ms, as I mentioned. To profile on the target, I ran $ modelrunner -H 10819 -e tflite -c 1. Do you get different runtimes?