We're looking to accelerate the YOLOv5 model with the NPU on Android (i.MX 8M Plus).
On Linux, this is done with the libvx_delegate.so backend.
Example on Linux:
> ./benchmark_model --graph=yolov5n-int8-250.tflite --external_delegate_path=/usr/lib/libvx_delegate.so
<output trimmed>
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=60 first=16580 curr=16497 min=16464 max=16735 avg=16536.7 std=49
On Android, this delegate is not present. However, libovxlib.so is, and it appears to be the backend that the Android HAL layer uses to reach the NPU via NNAPI.
NNAPI, however, cannot accelerate the YOLOv5 model.
Example on Android:
> ./benchmark_model --graph=yolov5n-int8-250.tflite --use_nnapi=true
STARTING!
Log parameter values verbosely: [0]
Graph: [yolov5n-int8-250.tflite]
Use NNAPI: [1]
NNAPI accelerators available: [vsi-npu,nnapi-reference]
Loaded model yolov5n-int8-250.tflite
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite delegate for NNAPI.
NNAPI delegate created.
WARNING: NNAPI SL driver did not implement SL_ANeuralNetworksDiagnostic_registerCallbacks!
VERBOSE: Replacing 273 node(s) with delegate (TfLiteNnapiDelegate) node, yielding 7 partitions.
WARNING: NNAPI SL driver did not implement SL_ANeuralNetworksDiagnostic_registerCallbacks!
WARNING: NNAPI SL driver did not implement SL_ANeuralNetworksDiagnostic_registerCallbacks!
WARNING: NNAPI SL driver did not implement SL_ANeuralNetworksDiagnostic_registerCallbacks!
WARNING: NNAPI SL driver did not implement SL_ANeuralNetworksDiagnostic_registerCallbacks!
Explicitly applied NNAPI delegate, and the model graph will be partially executed by the delegate w/ 4 delegate kernels.
The input model file size (MB): 2.16466
Initialized session in 939.695ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
ERROR: NN API returned error ANEURALNETWORKS_OP_FAILED at line 5140 while running computation.
ERROR: Node number 284 (TfLiteNnapiDelegate) failed to invoke.
count=1 curr=1374449
Benchmarking failed.
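As a side note, the failure can probably be narrowed down before bypassing NNAPI entirely: the benchmark tool can pin a specific accelerator, and Android's NNAPI verbose logging should show which operation the driver rejects. A diagnostic sketch (the `--nnapi_accelerator_name` flag and `debug.nn.vlog` property are documented for the TFLite benchmark tool and Android NNAPI respectively; the `/data/local/tmp` paths are assumptions):

```shell
# Enable NNAPI verbose logging on the device
adb shell setprop debug.nn.vlog 1

# Re-run the benchmark pinned to the NPU driver, excluding the CPU fallback
adb shell /data/local/tmp/benchmark_model \
    --graph=/data/local/tmp/yolov5n-int8-250.tflite \
    --use_nnapi=true \
    --nnapi_accelerator_name=vsi-npu

# Inspect the driver-side log for the op that fails
adb logcat | grep -i -E "nnapi|neuralnetworks"
```

If the run still fails with vsi-npu pinned, the problem is in the vsi-npu NNAPI driver itself rather than in how NNAPI partitions the graph.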
The NPU hardware is clearly capable of running this model, as the Linux tests show. So my question is: how can we compile the vx_delegate that works on Linux for Android, or alternatively use ovxlib directly to bypass NNAPI?
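On the compile question, the VX delegate source is public (github.com/VeriSilicon/tflite-vx-delegate) and builds with CMake, so in principle it can be cross-compiled with the Android NDK toolchain file and loaded through the same `--external_delegate_path` flag used on Linux. A rough, untested sketch; the NDK path, ABI, platform level, and device paths are all assumptions:

```shell
git clone https://github.com/VeriSilicon/tflite-vx-delegate.git
cd tflite-vx-delegate

# Configure with the standard Android NDK toolchain file (paths are assumptions)
cmake -B build \
    -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=arm64-v8a \
    -DANDROID_PLATFORM=android-33
cmake --build build -j

# Push the delegate and load it exactly as on Linux
adb push build/libvx_delegate.so /data/local/tmp/
adb shell /data/local/tmp/benchmark_model \
    --graph=/data/local/tmp/yolov5n-int8-250.tflite \
    --external_delegate_path=/data/local/tmp/libvx_delegate.so
```

One caveat: the delegate depends on TIM-VX and the VeriSilicon OpenVX driver stack underneath, so those would also need Android builds or need to resolve against the libovxlib.so already on the device; I haven't verified that linkage works.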
* Tests done with Android 13 2.0.0 on an imx8mpevk board, using the TensorFlow Lite 2.10.1 benchmark utility