i.MX8M Plus NPU: Poor float32 performance on VsiNPU

keithmok
Contributor II

I tried 12 TFLite models extracted from an Android APK application called AI Benchmark.
I ran them on an i.MX8M Plus using Google's prebuilt TensorFlow Lite benchmark application for Linux aarch64 from https://www.tensorflow.org/lite/performance/measurement

The set consists of float32 and 8-bit quantized versions of twelve TFLite models.

One quantized TFLite model causes the VsiNPU stack to crash, and another quantized model fails to run; I created a separate ticket for those.

For float32, the MobileNet V2 float model takes only 10 ms to run on a Qualcomm Snapdragon GPU, while on the VsiNPU it takes 388.826 ms
(about 38 times slower than another SoC).

./linux_aarch64_benchmark_model  --graph=ai_benchmark/mobilenet_v2_float.tflite  --use_nnapi=true 

STARTING!

Log parameter values verbosely: [0]

Graph: [ai_benchmark/mobilenet_v2_float.tflite]

Use NNAPI: [1]

NNAPI accelerators available: [vsi-npu]

Loaded model ai_benchmark/mobilenet_v2_float.tflite

INFO: Created TensorFlow Lite delegate for NNAPI.

Explicitly applied NNAPI delegate, and the model graph will be completely executed by the delegate.

The input model file size (MB): 14.0018

Initialized session in 4.694ms.

Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.

count=1 curr=1114535

 

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.

count=50 first=390045 curr=387986 min=387615 max=390176 avg=388826 std=1024

 

Inference timings in us: Init: 4694, First inference: 1114535, Warmup (avg): 1.11454e+06, Inference (avg): 388826

Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.

Peak memory footprint (MB): init=2.64453 overall=51.043
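
To see where that time goes, the same benchmark binary can break the run down per operator. A possible next step, assuming this prebuilt binary was built with the upstream TFLite benchmark tool's profiling option (--enable_op_profiling exists upstream; I have not verified it in this exact build):

./linux_aarch64_benchmark_model --graph=ai_benchmark/mobilenet_v2_float.tflite --use_nnapi=true --enable_op_profiling=true

The per-op summary would show whether some float32 ops are being rejected by the vsi-npu accelerator and falling back to the CPU.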

===========
Same model running on Android 10 on a Qualcomm SoC:
Inference timings in us: Init: 161397, First inference: 10590, Warmup (avg): 12686.5, Inference (avg): 12838.2
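
For a like-for-like reference on the i.MX8M Plus itself, a CPU-only run would show how much of the gap is specific to the NNAPI/VsiNPU path. A sketch using the upstream tool's XNNPACK flags (--use_xnnpack and --num_threads are upstream options; their presence in this prebuilt binary is an assumption):

./linux_aarch64_benchmark_model --graph=ai_benchmark/mobilenet_v2_float.tflite --use_nnapi=false --use_xnnpack=true --num_threads=4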

2 Replies

jimmychan
NXP TechSupport

keithmok
Contributor II

@jimmychan Did you read the log or even try it out yourself?

Inference timings in us: Init: 4694, First inference: 1114535, Warmup (avg): 1.11454e+06, Inference (avg): 388826


The Google TensorFlow benchmark already skips the first warmup inference, which takes 1114535 us; the reported average of 388826 us (38 times slower than Qualcomm) is computed with the warmup time/iteration already excluded.
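
If there is any doubt, the warmup/measurement split can be forced explicitly with the tool's own options (--warmup_runs and --num_runs exist in the upstream TFLite benchmark tool; assuming this prebuilt binary supports them):

./linux_aarch64_benchmark_model --graph=ai_benchmark/mobilenet_v2_float.tflite --use_nnapi=true --warmup_runs=1 --num_runs=50

Inference (avg) is then computed only over the 50 measured runs, with the single warmup iteration excluded.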
