Hey,
for a few days now I have been trying to run a custom model on the i.MX 8M Plus NPU. I am struggling with a custom object detection model that takes about 400 ms on the NPU and 800 ms on the CPU. Three resizing layers fall back to the CPU (only about 20 ms in total), and the REST of the time is spent on the NPU (the first sequence of operations alone takes about 200 ms!).
However, since that model cannot be reproduced in a public forum, I ran the deeplab_v3 and yolo_v4 examples included in the eIQ Toolkit instead. As the system image on the i.MX 8M Plus I used the newest release from your website. I quantized all your models to int8 using your eIQ GUI and ran:
$ /usr/bin/tensorflow-lite-2.4.1/examples# ./benchmark_model --graph=/home/user/deeplab_bilinear_best_int.tflite --use_nnapi=true
STARTING!
Log parameter values verbosely: [0]
Graph: [/home/bryan/deeplab_bilinear_best_int.tflite]
Use NNAPI: [1]
NNAPI accelerators available: [vsi-npu]
Loaded model /home/user/deeplab_bilinear_best_int.tflite
INFO: Created TensorFlow Lite delegate for NNAPI.
WARNING: Operator RESIZE_BILINEAR (v3) refused by NNAPI delegate: Operator refused due performance reasons.
WARNING: Operator RESIZE_BILINEAR (v3) refused by NNAPI delegate: Operator refused due performance reasons.
Explicitly applied NNAPI delegate, and the model graph will be partially executed by the delegate w/ 2 delegate kernels.
The input model file size (MB): 2.72458
Initialized session in 11.44ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=12650423
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=247367 curr=247257 min=245131 max=248995 avg=247147 std=637
Inference timings in us: Init: 11440, First inference: 12650423, Warmup (avg): 1.26504e+07, Inference (avg): 247147
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=3.95312 overall=50.3867
Summarized Profiler:
Operator-wise Profiling Info for Regular Benchmark Runs:
============================== Run Order ==============================
[node type] [start] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name]
TfLiteNnapiDelegate 0.000 125.411 125.517 50.774% 50.774% 0.000 1 [XXXX]:71
RESIZE_BILINEAR 125.518 15.918 16.052 6.494% 57.268% 0.000 1 [XXXX]:64
TfLiteNnapiDelegate 141.572 18.538 18.528 7.495% 64.763% 0.000 1 [XXX1]:72
RESIZE_BILINEAR 160.102 86.804 87.107 35.237% 100.000% 0.000 1 [Identity]:70
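To put the profile above into numbers, the two RESIZE_BILINEAR rows that stay on the CPU account for a large share of the average run (the values below are simply copied from the table):

```python
# Average per-run times (ms) taken from the profiler table above.
npu_ms = 125.517 + 18.528          # the two TfLiteNnapiDelegate partitions
cpu_resize_ms = 16.052 + 87.107    # the two RESIZE_BILINEAR ops on the CPU
total_ms = npu_ms + cpu_resize_ms

print(round(total_ms, 3))                         # 247.204 (matches avg=247147 us)
print(round(100 * cpu_resize_ms / total_ms, 1))   # 41.7 -> % of time in CPU fallback
```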
Deeplab took around 250 ms, and yolo_v4 took about 190 ms. That sounds very slow to me for hardware that is supposed to run neural networks. Is that normal behaviour? If not, what is wrong? PS: The mobilenet_v1_1.0_224_quant.tflite runs at the documented ~2 ms.
I am happy for any hints.
Can you share deeplab_bilinear_best_int.tflite? I think this is still a model issue.
And here is the model created with TF code (instead of using the NXP eIQ GUI):

import pathlib
from os import listdir

import numpy as np
import tensorflow as tf
from PIL import Image

def load_images_float32(img_path):
    files = sorted(listdir(img_path))
    image_list = [np.asarray(Image.open(img_path + file_path).resize(size=(512, 512)), dtype=np.float32)
                  for file_path in files]
    return tf.stack(image_list)

def representative_data_gen():
    images = load_images_float32(image_path)
    for input_value in tf.data.Dataset.from_tensor_slices(images).batch(1).take(100):
        yield [input_value]

tflite_filepath = pathlib.Path(save_path)
model = tf.keras.models.load_model(load_path)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS,
                                       tf.lite.OpsSet.SELECT_TF_OPS]
tflite_model = converter.convert()
tflite_filepath.write_bytes(tflite_model)
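One thing worth checking with a conversion like the one above: with TFLITE_BUILTINS plus SELECT_TF_OPS, individual ops can silently stay in float and fall back to the CPU. A minimal sketch (using a hypothetical toy model instead of the real deeplab network) that restricts conversion to integer builtins and then verifies the resulting tensor types:

```python
import numpy as np
import tensorflow as tf

# Tiny stand-in model; in practice you would load the real Keras model here.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8, 8, 3)),
    tf.keras.layers.Conv2D(4, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(2),
])

def representative_data_gen():
    # Random calibration data just for the sketch; use real images in practice.
    for _ in range(10):
        yield [np.random.rand(1, 8, 8, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Restrict to integer-only builtins so conversion fails loudly instead of
# leaving float ops in the graph.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_model = converter.convert()

# Verify that the converted model really takes/returns uint8 tensors.
interpreter = tf.lite.Interpreter(model_content=tflite_model)
input_dtype = interpreter.get_input_details()[0]["dtype"]
output_dtype = interpreter.get_output_details()[0]["dtype"]
print(input_dtype, output_dtype)
```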
OK, I will do some tests and give you feedback.
Thank you very much
Hi @horst127
I have looked into your model structure and the mobilenet structure; your network structure is complex compared with mobilenet.
You should consider designing a lightweight network structure and performing network pruning and knowledge distillation to achieve good performance on embedded devices.
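As a rough illustration of the distillation idea (my own sketch, not NXP code): a small student is trained against the teacher's temperature-softened outputs, combining the usual hard-label loss with a KL term, as in the standard Hinton-style formulation:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; T > 1 softens the distribution.
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft part: KL(teacher_T || student_T), scaled by T^2.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    soft = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1).mean() * T * T
    # Hard part: ordinary cross-entropy against the ground-truth labels.
    p = softmax(student_logits)
    hard = -np.log(p[np.arange(len(labels)), labels]).mean()
    return alpha * soft + (1 - alpha) * hard
```

When student and teacher logits coincide, the KL term vanishes and only the weighted hard-label cross-entropy remains.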
Thanks for your reply. But this is not my custom model; it is an NXP example model provided with the eIQ Toolkit. I assumed that the example models are suited for the corresponding HW? Even the tiny yolo v4 (also provided with the eIQ Toolkit) runs at about 300 ms (I think because a bunch of operations are not supported), which is about 10 times slower than the numbers reported for a GPU. Is that normal?
I assumed that the example models are suited for the corresponding HW?
--> The newest eIQ tools contain deeplab_bilinear_float.tflite and deeplab_nearest_float.tflite.
As chapter 6.1 "Image segmentation" in the eIQ user guide says, you still need to quantize the model to leverage its performance:
1. Navigate to the workspace\models\deeplab_v3 folder.
2. Convert the "deeplab" model to RTM as follows:
deepview-converter deeplab_nearest_best.h5 deeplab_nearest_best.rtm
3. Quantize the model to leverage its performance benefits as follows:
deepview-converter --default_shape 1,512,512,3 --quantize ^ --quantize_format
uint8 --quant_normalization signed --samples imgs ^ deeplab_nearest_best.h5
deeplab_nearest_best_uint8.rtm
4. Run the Python script to see the result of the image segmentation as follows:
python runner_demo.py -m deeplab_nearest_best.rtm -i imgs\image1.jpg ^ -o
image1_out_nearest_best_rtm.jpg http://127.0.0.1:10818/v1
Do you get results different from mine?
Hi @horst127
Can you share how you got these benchmark results?
============================== Run Order ==============================
[node type] [start] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name]
TfLiteNnapiDelegate 0.000 125.411 125.517 50.774% 50.774% 0.000 1 [XXXX]:71
RESIZE_BILINEAR 125.518 15.918 16.052 6.494% 57.268% 0.000 1 [XXXX]:64
TfLiteNnapiDelegate 141.572 18.538 18.528 7.495% 64.763% 0.000 1 [XXX1]:72
RESIZE_BILINEAR 160.102 86.804 87.107 35.237% 100.000% 0.000 1 [Identity]:70
I am running ./benchmark_model but I can't see the detailed results.
Hi,
you can run
$ ./benchmark_model --graph=<path_to_tflite> --use_nnapi=true --enable_op_profiling=true
Deeplab and mobilenet have different purposes, so the structures of the models differ – deeplab targets semantic segmentation while mobilenet targets image classification. The other difference is the input size: mobilenet uses a 224x224 input, while the deeplab example uses 512x512 pixels. This has a pretty big impact on inference time.
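Just comparing the two input resolutions mentioned above already shows a sizeable gap in per-frame work:

```python
# Input resolutions from the two examples discussed above.
mobilenet_px = 224 * 224   # 50176 pixels per frame
deeplab_px = 512 * 512     # 262144 pixels per frame
ratio = deeplab_px / mobilenet_px
print(round(ratio, 2))  # 5.22 -> deeplab processes over 5x more input pixels
```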
Since this is provided as an example by AuZone: can you share your expectations for this TfLiteNnapiDelegate time? I will contact the R&D team to check if this is possible.
I do not have any expected times for deeplab. But if we consider the popular yolov4 (tiny) model, you can find many reported GPU and CPU runtimes on the internet. As mentioned in my first post, running this provided example (320x320 input) on the i.MX 8M Plus NPU is about 10 times slower (roughly 200 ms per frame), or even more. That seems very odd to me.
I did the quantization; that is why I also sent you the TensorFlow code. Anyway, I copied your command (step 3) exactly, but replaced .rtm with .tflite, and it takes 200 ms, as I mentioned. To profile on the target, I ran $ modelrunner -H 10819 -e tflite -c 1. Do you get other runtimes?