Section 2.2 of the i.MX TensorFlow Lite on Android User's Guide (Android 11_2.2.0) shows the steps to build and run the benchmark utility. I've followed those steps to compile the utility and run the tests, but our NNAPI results are not nearly as fast as expected.
Table 1 in section 2.2 shows roughly 32 ms for 4 CPU threads, which is very close to our results, but our NNAPI numbers are much slower: Table 1 says we should see around 4 ms, while we measure about 385 ms. The MobileNet v1 model does appear to be accelerated, with results similar to those advertised in the document, but the v2 model is not.
Can anyone suggest how to debug why we are not seeing the expected NNAPI results?
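For reference, here is what I'm planning to try next to narrow this down (just a sketch; the flags mirror the parameters echoed in the verbose output below, and I'm assuming they behave the same on this build): pin the run to the vsi-npu accelerator, disable the NNAPI CPU implementation, and require full delegation so the run fails loudly instead of silently falling back to the CPU.

./benchmark_model --graph=mobilenet_v2_1.0_224.tflite --use_nnapi=true --nnapi_accelerator_name=vsi-npu --disable_nnapi_cpu=true --require_full_delegation=true

Full output from the three runs is below.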
# MODEL v1 results are as expected
./benchmark_model --verbose=true --graph=mobilenet_v1_1.0_224_quant.tflite --use_nnapi=true
STARTING!
Log parameter values verbosely: [1]
Min num runs: [50]
Min runs duration (seconds): [1]
Max runs duration (seconds): [150]
Inter-run delay (seconds): [-1]
Num threads: [1]
Use caching: [0]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Graph: [mobilenet_v1_1.0_224_quant.tflite]
Input layers: []
Input shapes: []
Input value ranges: []
Input value files: []
Allow fp16: [0]
Require full delegation: [0]
Enable op profiling: [0]
Max profiling buffer entries: [1024]
CSV File to export profiling data to: []
#threads used for CPU inference: [1]
Max number of delegated partitions: [0]
Min nodes per partition: [0]
External delegate path: []
External delegate options: []
Use gpu: [0]
Allow lower precision in gpu: [1]
Enable running quant models in gpu: [1]
GPU backend: []
Use Hexagon: [0]
Hexagon lib path: [/data/local/tmp]
Hexagon profiling: [0]
Use NNAPI: [1]
NNAPI execution preference: []
Model execution priority in nnapi: []
NNAPI accelerator name: []
NNAPI accelerators available: [vsi-npu,nnapi-reference]
Disable NNAPI cpu: [0]
Allow fp16 in NNAPI: [0]
Use xnnpack: [0]
Loaded model mobilenet_v1_1.0_224_quant.tflite
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite delegate for NNAPI.
Explicitly applied NNAPI delegate, and the model graph will be completely executed by the delegate.
The input model file size (MB): 4.27635
Initialized session in 72.658ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=6667292
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=265 first=3537 curr=3701 min=3472 max=4557 avg=3700.26 std=72
Inference timings in us: Init: 72658, First inference: 6667292, Warmup (avg): 6.66729e+06, Inference (avg): 3700.26
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=3.92969 overall=4.29688
# MODEL v2 with NNAPI is much slower than the advertised speed
./benchmark_model --verbose=true --graph=mobilenet_v2_1.0_224.tflite --use_nnapi=true
STARTING!
Log parameter values verbosely: [1]
Min num runs: [50]
Min runs duration (seconds): [1]
Max runs duration (seconds): [150]
Inter-run delay (seconds): [-1]
Num threads: [1]
Use caching: [0]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Graph: [mobilenet_v2_1.0_224.tflite]
Input layers: []
Input shapes: []
Input value ranges: []
Input value files: []
Allow fp16: [0]
Require full delegation: [0]
Enable op profiling: [0]
Max profiling buffer entries: [1024]
CSV File to export profiling data to: []
#threads used for CPU inference: [1]
Max number of delegated partitions: [0]
Min nodes per partition: [0]
External delegate path: []
External delegate options: []
Use gpu: [0]
Allow lower precision in gpu: [1]
Enable running quant models in gpu: [1]
GPU backend: []
Use Hexagon: [0]
Hexagon lib path: [/data/local/tmp]
Hexagon profiling: [0]
Use NNAPI: [1]
NNAPI execution preference: []
Model execution priority in nnapi: []
NNAPI accelerator name: []
NNAPI accelerators available: [vsi-npu,nnapi-reference]
Disable NNAPI cpu: [0]
Allow fp16 in NNAPI: [0]
Use xnnpack: [0]
Loaded model mobilenet_v2_1.0_224.tflite
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite delegate for NNAPI.
Explicitly applied NNAPI delegate, and the model graph will be completely executed by the delegate.
The input model file size (MB): 13.9786
Initialized session in 134.637ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=1322377
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=385412 curr=385390 min=384878 max=385597 avg=385314 std=132
Inference timings in us: Init: 134637, First inference: 1322377, Warmup (avg): 1.32238e+06, Inference (avg): 385314
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=5.09766 overall=6.85938
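To see where the time is actually going on the v2 model, I'm also going to rerun it with per-op profiling enabled (again, an assumption on my part that --enable_op_profiling behaves on this build the way the verbose parameter dump suggests):

./benchmark_model --graph=mobilenet_v2_1.0_224.tflite --use_nnapi=true --enable_op_profiling=true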
# MODEL v2 on CPU with 4 threads is similar to the advertised results
./benchmark_model --verbose=true --graph=mobilenet_v2_1.0_224.tflite --num_threads=4 --use_xnnpack=true
STARTING!
Log parameter values verbosely: [1]
Min num runs: [50]
Min runs duration (seconds): [1]
Max runs duration (seconds): [150]
Inter-run delay (seconds): [-1]
Num threads: [4]
Use caching: [0]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Graph: [mobilenet_v2_1.0_224.tflite]
Input layers: []
Input shapes: []
Input value ranges: []
Input value files: []
Allow fp16: [0]
Require full delegation: [0]
Enable op profiling: [0]
Max profiling buffer entries: [1024]
CSV File to export profiling data to: []
#threads used for CPU inference: [4]
Max number of delegated partitions: [0]
Min nodes per partition: [0]
External delegate path: []
External delegate options: []
Use gpu: [0]
Allow lower precision in gpu: [1]
Enable running quant models in gpu: [1]
GPU backend: []
Use Hexagon: [0]
Hexagon lib path: [/data/local/tmp]
Hexagon profiling: [0]
Use NNAPI: [0]
Use xnnpack: [1]
Loaded model mobilenet_v2_1.0_224.tflite
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Explicitly applied XNNPACK delegate, and the model graph will be completely executed by the delegate.
The input model file size (MB): 13.9786
Initialized session in 76.25ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=15 first=38931 curr=33399 min=33313 max=38931 avg=33821 std=1368
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=33769 curr=33531 min=33242 max=55763 avg=34107.3 std=3161
Inference timings in us: Init: 76250, First inference: 38931, Warmup (avg): 33821, Inference (avg): 34107.3
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=29.0898 overall=37.2695