Unable to duplicate nnapi performance with tensorflow lite benchmark_model

dennis3
Contributor V

The guide i.MX TensorFlow Lite on Android User's Guide for Android 11_2.2.0, section 2.2, shows the steps to run the benchmark utility.  I've followed those steps to compile the utility and run the tests, but our NNAPI results are not fast.

Section 2.2, Table 1 of the document shows approximately 32 ms for 4 CPU threads, which is very close to our result, but our NNAPI result is much longer.  Table 1 says we should see around 4 ms, but we see 385 ms.  The v1 model does appear to be accelerated, with results similar to those advertised in the document, but the v2 model is not.

Can anyone suggest how to debug why we are not seeing the expected results?
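
For what it's worth, the flags below (all of which show up in the verbose parameter dump, so this build should accept them) look like reasonable things to try when narrowing this down.  Treat them as a sketch; I'm not claiming they change the numbers:

# Pin the run to the NPU and ask for full delegation, so any CPU fallback is reported instead of happening silently
./benchmark_model --graph=mobilenet_v2_1.0_224.tflite --use_nnapi=true --nnapi_accelerator_name=vsi-npu --require_full_delegation=true

# Per-operator timings, to see which ops are slow or not delegated
./benchmark_model --graph=mobilenet_v2_1.0_224.tflite --use_nnapi=true --enable_op_profiling=true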

 

# MODEL v1 results are as expected
./benchmark_model --verbose=true --graph=mobilenet_v1_1.0_224_quant.tflite --use_nnapi=true
STARTING!
Log parameter values verbosely: [1]
Min num runs: [50]
Min runs duration (seconds): [1]
Max runs duration (seconds): [150]
Inter-run delay (seconds): [-1]
Num threads: [1]
Use caching: [0]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Graph: [mobilenet_v1_1.0_224_quant.tflite]
Input layers: []
Input shapes: []
Input value ranges: []
Input value files: []
Allow fp16: [0]
Require full delegation: [0]
Enable op profiling: [0]
Max profiling buffer entries: [1024]
CSV File to export profiling data to: []
#threads used for CPU inference: [1]
Max number of delegated partitions: [0]
Min nodes per partition: [0]
External delegate path: []
External delegate options: []
Use gpu: [0]
Allow lower precision in gpu: [1]
Enable running quant models in gpu: [1]
GPU backend: []
Use Hexagon: [0]
Hexagon lib path: [/data/local/tmp]
Hexagon profiling: [0]
Use NNAPI: [1]
NNAPI execution preference: []
Model execution priority in nnapi: []
NNAPI accelerator name: []
NNAPI accelerators available: [vsi-npu,nnapi-reference]
Disable NNAPI cpu: [0]
Allow fp16 in NNAPI: [0]
Use xnnpack: [0]
Loaded model mobilenet_v1_1.0_224_quant.tflite
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite delegate for NNAPI.
Explicitly applied NNAPI delegate, and the model graph will be completely executed by the delegate.
The input model file size (MB): 4.27635
Initialized session in 72.658ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=6667292

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=265 first=3537 curr=3701 min=3472 max=4557 avg=3700.26 std=72

Inference timings in us: Init: 72658, First inference: 6667292, Warmup (avg): 6.66729e+06, Inference (avg): 3700.26
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=3.92969 overall=4.29688

# MODEL v2 is very slow but advertised speed is fast 
./benchmark_model --verbose=true --graph=mobilenet_v2_1.0_224.tflite --use_nnapi=true                              
STARTING!
Log parameter values verbosely: [1]
Min num runs: [50]
Min runs duration (seconds): [1]
Max runs duration (seconds): [150]
Inter-run delay (seconds): [-1]
Num threads: [1]
Use caching: [0]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Graph: [mobilenet_v2_1.0_224.tflite]
Input layers: []
Input shapes: []
Input value ranges: []
Input value files: []
Allow fp16: [0]
Require full delegation: [0]
Enable op profiling: [0]
Max profiling buffer entries: [1024]
CSV File to export profiling data to: []
#threads used for CPU inference: [1]
Max number of delegated partitions: [0]
Min nodes per partition: [0]
External delegate path: []
External delegate options: []
Use gpu: [0]
Allow lower precision in gpu: [1]
Enable running quant models in gpu: [1]
GPU backend: []
Use Hexagon: [0]
Hexagon lib path: [/data/local/tmp]
Hexagon profiling: [0]
Use NNAPI: [1]
NNAPI execution preference: []
Model execution priority in nnapi: []
NNAPI accelerator name: []
NNAPI accelerators available: [vsi-npu,nnapi-reference]
Disable NNAPI cpu: [0]
Allow fp16 in NNAPI: [0]
Use xnnpack: [0]
Loaded model mobilenet_v2_1.0_224.tflite
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite delegate for NNAPI.
Explicitly applied NNAPI delegate, and the model graph will be completely executed by the delegate.
The input model file size (MB): 13.9786
Initialized session in 134.637ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=1322377

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=385412 curr=385390 min=384878 max=385597 avg=385314 std=132

Inference timings in us: Init: 134637, First inference: 1322377, Warmup (avg): 1.32238e+06, Inference (avg): 385314
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=5.09766 overall=6.85938

# MODEL w/ cpu 4 threads is similar to advertised results
./benchmark_model --verbose=true --graph=mobilenet_v2_1.0_224.tflite --num_threads=4 --use_xnnpack=true                                   
STARTING!
Log parameter values verbosely: [1]
Min num runs: [50]
Min runs duration (seconds): [1]
Max runs duration (seconds): [150]
Inter-run delay (seconds): [-1]
Num threads: [4]
Use caching: [0]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Graph: [mobilenet_v2_1.0_224.tflite]
Input layers: []
Input shapes: []
Input value ranges: []
Input value files: []
Allow fp16: [0]
Require full delegation: [0]
Enable op profiling: [0]
Max profiling buffer entries: [1024]
CSV File to export profiling data to: []
#threads used for CPU inference: [4]
Max number of delegated partitions: [0]
Min nodes per partition: [0]
External delegate path: []
External delegate options: []
Use gpu: [0]
Allow lower precision in gpu: [1]
Enable running quant models in gpu: [1]
GPU backend: []
Use Hexagon: [0]
Hexagon lib path: [/data/local/tmp]
Hexagon profiling: [0]
Use NNAPI: [0]
Use xnnpack: [1]
Loaded model mobilenet_v2_1.0_224.tflite
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Explicitly applied XNNPACK delegate, and the model graph will be completely executed by the delegate.
The input model file size (MB): 13.9786
Initialized session in 76.25ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=15 first=38931 curr=33399 min=33313 max=38931 avg=33821 std=1368

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=33769 curr=33531 min=33242 max=55763 avg=34107.3 std=3161

Inference timings in us: Init: 76250, First inference: 38931, Warmup (avg): 33821, Inference (avg): 34107.3
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=29.0898 overall=37.2695

 

3 Replies
dennis3
Contributor V

OK, the immediate answer is that we mistakenly used the non-quantized model when testing; I didn't catch that.  The document for the benchmark tool uses the quantized model.  I don't understand the technical details of NNAPI/the NPU well enough to know why the non-quantized model is so much slower than the CPU version.  However, when we used the correct model per the benchmark document, we saw the expected results.

./benchmark_model --graph=mobilenet_v2_1.0_224_quant.tflite --use_nnapi=true                                                                                                                                                        
STARTING!
Log parameter values verbosely: [0]
Graph: [mobilenet_v2_1.0_224_quant.tflite]
Use NNAPI: [1]
NNAPI accelerators available: [vsi-npu,nnapi-reference]
Loaded model mobilenet_v2_1.0_224_quant.tflite
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite delegate for NNAPI.
Explicitly applied NNAPI delegate, and the model graph will be completely executed by the delegate.
The input model file size (MB): 3.57776
Initialized session in 96.515ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=8354555

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=242 first=3761 curr=3990 min=3657 max=8631 avg=4053.44 std=385

Inference timings in us: Init: 96515, First inference: 8354555, Warmup (avg): 8.35456e+06, Inference (avg): 4053.44
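
For anyone else hitting this: a quick way to confirm which file is the quantized one is to check the input tensor type.  This is just a rough sketch and assumes a host machine with the tflite-runtime Python package installed (not something from the benchmark document):

python3 -c "
from tflite_runtime.interpreter import Interpreter
interp = Interpreter(model_path='mobilenet_v2_1.0_224_quant.tflite')
interp.allocate_tensors()
# the quantized MobileNet files report uint8 here; the float v2 model reports float32
print(interp.get_input_details()[0]['dtype'])
"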
Zhiming_Liu
NXP TechSupport

Is this your first time running the NPU test?

The guide notes that the first run is slow.


We also have application note AN12964 about NPU warm-up time:

https://www.nxp.com/webapp/sps/download/preDownload.jsp?render=true
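
If only the first run is the concern, the warm-up phase can also be lengthened so that the averaged numbers exclude it; the values below are just illustrative (the flags correspond to the "Min warmup runs" entries in the verbose parameter dump):

./benchmark_model --graph=mobilenet_v2_1.0_224_quant.tflite --use_nnapi=true --warmup_runs=5 --warmup_min_secs=5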

dennis3
Contributor V

@Zhiming_Liu Thanks for the reply.  No, this isn't our first run.  I've been trying to duplicate the results on several devices for a while now, and each time I run the benchmark several times to make sure I'm getting consistent results.  The only consistency is that the v1 model results match the documented results while the v2 model results are very slow.  As near as I can tell, though, the v2 model is the same model the documented results were obtained with.
