I tried 12 TFLite models extracted from the Android APK of the AI Benchmark application,
on an i.MX 8M Plus, running Google's prebuilt TensorFlow Lite linux_aarch64 benchmark tool from https://www.tensorflow.org/lite/performance/measurement
The set consists of float32 and 8-bit quantized versions of the twelve TFLite models.
One quantized model causes the VsiNPU stack to crash, and another quantized model failed to run; I created a separate ticket for those.
For float32, mobilenet_v2 takes only about 10 ms to run on a Qualcomm Snapdragon GPU, while on the VsiNPU it takes 388.826 ms
(roughly 38 times slower than the other SoC).
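For reference, using the ~10 ms Snapdragon figure quoted above, the ratio works out to 388826 us / ~10000 us ≈ 38.9, which is where the ~38x number comes from.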
./linux_aarch64_benchmark_model --graph=ai_benchmark/mobilenet_v2_float.tflite --use_nnapi=true
STARTING!
Log parameter values verbosely: [0]
Graph: [ai_benchmark/mobilenet_v2_float.tflite]
Use NNAPI: [1]
NNAPI accelerators available: [vsi-npu]
Loaded model ai_benchmark/mobilenet_v2_float.tflite
INFO: Created TensorFlow Lite delegate for NNAPI.
Explicitly applied NNAPI delegate, and the model graph will be completely executed by the delegate.
The input model file size (MB): 14.0018
Initialized session in 4.694ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=1114535
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=390045 curr=387986 min=387615 max=390176 avg=388826 std=1024
Inference timings in us: Init: 4694, First inference: 1114535, Warmup (avg): 1.11454e+06, Inference (avg): 388826
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=2.64453 overall=51.043
===========
Same model running on Android 10 on a Qualcomm SoC:
Inference timings in us: Init: 161397, First inference: 10590, Warmup (avg): 12686.5, Inference (avg): 12838.2
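For a CPU-only baseline on the i.MX 8M Plus (to separate the NNAPI/VsiNPU path from raw CPU speed), the same prebuilt tool can be run with the delegate disabled; this is only a suggested invocation, and the thread count is an assumption:

./linux_aarch64_benchmark_model --graph=ai_benchmark/mobilenet_v2_float.tflite --use_nnapi=false --num_threads=4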
This document may explain this problem.
https://www.nxp.com.cn/docs/en/application-note/AN12964.pdf
The software for AN12964 can be downloaded from here
@jimmychan Did you read the log or even try it out yourself?
Inference timings in us: Init: 4694, First inference: 1114535, Warmup (avg): 1.11454e+06, Inference (avg): 388826
The Google TensorFlow benchmark already skips the first warmup inference, which takes 1114535 us; the reported average of 388826 us (38 times slower than Qualcomm) already excludes that warmup time.
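To see where the 388 ms per inference is actually spent on the VsiNPU path, the same benchmark tool can also dump a per-operator profile; this is a suggested invocation using the tool's standard op-profiling flag:

./linux_aarch64_benchmark_model --graph=ai_benchmark/mobilenet_v2_float.tflite --use_nnapi=true --enable_op_profiling=true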