Using the Android 12.1 release on the i.MX 8M Plus SOM, as soon as a TFLite model contains an int8 op, NNAPI falls back to the CPU instead of the NPU.
Is this expected behavior? Newer TFLite versions quantize/convert internal ops to int8 even when the input and output types are UINT8.
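For context, this is the kind of full-integer conversion I'm referring to (a sketch only; "saved_model_dir" and representative_dataset are placeholders for the actual model and calibration data):

```python
import tensorflow as tf

# Sketch of full-integer post-training quantization with uint8 I/O.
# "saved_model_dir" and "representative_dataset" are placeholders.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Even with uint8 inference I/O, current TFLite releases quantize the
# internal weights/activations to int8:
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_model = converter.convert()
```

With these settings the model's input/output tensors are uint8, but the ops in between still end up as signed int8.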
These uint8 models run on the NPU:
https://tfhub.dev/iree/lite-model/mobilenet_v2_100_224/uint8/1
https://tfhub.dev/iree/lite-model/mobilenet_v1_100_224/uint8/1
The second link is the model used in the NXP example:
https://www.nxp.com/docs/en/user-guide/IMX_ANDROID_TENSORFLOWLITE_USERS_GUIDE.pdf
https://github.com/tensorflow/examples/blob/master/lite/examples/image_classification/android/README...
Is such an old model used because NNAPI doesn't support the ops used in newer models?
Enabling verbose NNAPI logging with:
adb shell setprop debug.nn.vlog 1
and then filtering logcat for the relevant tags (adb logcat -s GraphDump ExecutionPlan) results in this execution log:
06-12 13:11:08.721 6953 6953 I GraphDump: digraph {: com.app.app
06-12 13:11:08.721 6953 6953 I GraphDump: d0 [style=filled fillcolor=black fontcolor=white label="0 = input[0]\nTQ8A(1x224x224x3)"]: com.app.app
06-12 13:11:08.721 6953 6953 I GraphDump: d1 [label="1: REF\nTQ8A(32x3x3x3)"]: com.app.app
06-12 13:11:08.721 6953 6953 I GraphDump: d2 [label="2: COPY\nTI32(32)"]: com.app.app
06-12 13:11:08.721 6953 6953 I GraphDump: d3 [label="3: COPY\nI32 = 1"]: com.app.app
06-12 13:11:08.721 6953 6953 I GraphDump: d4 [label="4: COPY\nI32 = 2"]: com.app.app
06-12 13:11:08.721 6953 6953 I GraphDump: d5 [label="5: COPY\nI32 = 2"]: com.app.app
06-12 13:11:08.721 6953 6953 I GraphDump: d6 [label="6: COPY\nI32 = 3"]: com.app.app
06-12 13:11:08.721 6953 6953 I GraphDump: d7 [label="7\nTQ8A(1x112x112x32)"]: com.app.app
06-12 13:11:08.721 6953 6953 I GraphDump: d8 [label="8: REF\nTQ8A(1x3x3x32)"]: com.app.app
06-12 13:11:08.721 6953 6953 I GraphDump: d9 [label="9: COPY\nTI32(32)"]: com.app.app
06-12 13:11:08.721 6953 6953 I GraphDump: d10 [label="10: COPY\nI32 = 1"]: com.app.app
06-12 13:11:08.721 6953 6953 I GraphDump: d11 [label="11: COPY\nI32 = 1"]: com.app.app
06-12 13:11:08.721 6953 6953 I GraphDump: d12 [label="12: COPY\nI32 = 1"]: com.app.app
06-12 13:11:08.721 6953 6953 I GraphDump: d13 [label="13: COPY\nI32 = 1"]: com.app.app
06-12 13:11:08.721 6953 6953 I GraphDump: d14 [label="14: COPY\nI32 = 3"]: com.app.app
06-12 13:11:08.721 6953 6953 I GraphDump: d15 [label="15\nTQ8A(1x112x112x32)"]: com.app.app
06-12 13:11:08.721 6953 6953 I GraphDump: d16 [label="16: REF\nTQ8A(64x1x1x32)"]: com.app.app
06-12 13:11:08.721 6953 6953 I GraphDump: d17 [label="17: REF\nTI32(64)"]: com.app.app
06-12 13:11:08.721 6953 6953 I GraphDump: d18 [label="18: COPY\nI32 = 1"]: com.app.app
06-12 13:11:08.721 6953 6953 I GraphDump: d19 [label="19: COPY\nI32 = 1"]: com.app.app
06-12 13:11:08.721 6953 6953 I GraphDump: d20 [label="20: COPY\nI32 = 1"]: com.app.app
06-12 13:11:08.721 6953 6953 I GraphDump: d21 [label="21: COPY\nI32 = 3"]: com.app.app
06-12 13:11:08.721 6953 6953 I GraphDump: d22 [label="22\nTQ8A(1x112x112x64)"]: com.app.app
06-12 13:11:08.722 6953 6953 I GraphDump: d23 [label="23: REF\nTQ8A(1x3x3x64)"]: com.app.app
06-12 13:11:08.722 6953 6953 I GraphDump: d24 [label="24: REF\nTI32(64)"]: com.app.app
06-12 13:11:08.722 6953 6953 I GraphDump: d25 [label="25: COPY\nI32 = 1"]: com.app.app
...................
06-12 13:11:08.745 414 414 I android.hardware.neuralnetworks@1.3-service-vsi-npu-server: getSupportedOperations_1_3: /vendor/bin/hw/android.hardware.neuralnetworks@1.3-service-vsi-n
06-12 13:11:08.757 414 414 I android.hardware.neuralnetworks@1.3-service-vsi-npu-server: : /vendor/bin/hw/android.hardware.neuralnetworks@1.3-service-vsi-n
06-12 13:11:08.757 414 414 I android.hardware.neuralnetworks@1.3-service-vsi-npu-server: getSupportedOperations_1_3 exit: /vendor/bin/hw/android.hardware.neuralnetworks@1.3-service-vsi-n
06-12 13:11:08.758 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(CONV_2D:0) = 0 (vsi-npu): com.app.app
06-12 13:11:08.758 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(DEPTHWISE_CONV_2D:1) = 0 (vsi-npu): com.app.app
06-12 13:11:08.758 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(CONV_2D:2) = 0 (vsi-npu): com.app.app
06-12 13:11:08.758 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(DEPTHWISE_CONV_2D:3) = 0 (vsi-npu): com.app.app
06-12 13:11:08.758 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(CONV_2D:4) = 0 (vsi-npu): com.app.app
06-12 13:11:08.758 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(DEPTHWISE_CONV_2D:5) = 0 (vsi-npu): com.app.app
06-12 13:11:08.758 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(CONV_2D:6) = 0 (vsi-npu): com.app.app
06-12 13:11:08.758 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(DEPTHWISE_CONV_2D:7) = 0 (vsi-npu): com.app.app
06-12 13:11:08.758 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(CONV_2D:8) = 0 (vsi-npu): com.app.app
06-12 13:11:08.758 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(DEPTHWISE_CONV_2D:9) = 0 (vsi-npu): com.app.app
06-12 13:11:08.758 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(CONV_2D:10) = 0 (vsi-npu): com.app.app
06-12 13:11:08.759 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(DEPTHWISE_CONV_2D:11) = 0 (vsi-npu): com.app.app
06-12 13:11:08.759 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(CONV_2D:12) = 0 (vsi-npu): com.app.app
06-12 13:11:08.759 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(DEPTHWISE_CONV_2D:13) = 0 (vsi-npu): com.app.app
06-12 13:11:08.759 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(CONV_2D:14) = 0 (vsi-npu): com.app.app
06-12 13:11:08.759 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(DEPTHWISE_CONV_2D:15) = 0 (vsi-npu): com.app.app
06-12 13:11:08.759 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(CONV_2D:16) = 0 (vsi-npu): com.app.app
06-12 13:11:08.759 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(DEPTHWISE_CONV_2D:17) = 0 (vsi-npu): com.app.app
06-12 13:11:08.759 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(CONV_2D:18) = 0 (vsi-npu): com.app.app
06-12 13:11:08.759 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(DEPTHWISE_CONV_2D:19) = 0 (vsi-npu): com.app.app
06-12 13:11:08.759 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(CONV_2D:20) = 0 (vsi-npu): com.app.app
06-12 13:11:08.759 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(DEPTHWISE_CONV_2D:21) = 0 (vsi-npu): com.app.app
06-12 13:11:08.759 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(CONV_2D:22) = 0 (vsi-npu): com.app.app
06-12 13:11:08.759 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(DEPTHWISE_CONV_2D:23) = 0 (vsi-npu): com.app.app
06-12 13:11:08.759 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(CONV_2D:24) = 0 (vsi-npu): com.app.app
06-12 13:11:08.759 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(DEPTHWISE_CONV_2D:25) = 0 (vsi-npu): com.app.app
06-12 13:11:08.759 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(CONV_2D:26) = 0 (vsi-npu): com.app.app
06-12 13:11:08.759 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(AVERAGE_POOL_2D:27) = 0 (vsi-npu): com.app.app
06-12 13:11:08.759 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(CONV_2D:28) = 0 (vsi-npu): com.app.app
06-12 13:11:08.759 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(RESHAPE:29) = 0 (vsi-npu): com.app.app
06-12 13:11:08.759 6953 6953 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(SOFTMAX:30) = 0 (vsi-npu): com.app.app
06-12 13:11:08.759 6953 6953 I ExecutionPlan: ModelBuilder::partitionTheWork: only one best device: 0 = vsi-npu: com.app.app
06-12 13:11:08.759 6953 6953 I ExecutionPlan: ExecutionPlan::SimpleBody::finish, compilation: com.app.app
As you can see, every operation is assigned to the vsi-npu.
This model does not run on the NPU:
https://tfhub.dev/google/lite-model/qat/mobilenet_v2_retinanet_256/1
Log:
06-13 10:07:12.543 3477 3477 I GraphDump: digraph {: com.app.app
06-13 10:07:12.543 3477 3477 I GraphDump: d0 [style=filled fillcolor=black fontcolor=white label="0 = input[0]\nTQ8A(1x256x256x3)"]: com.app.app
06-13 10:07:12.543 3477 3477 I GraphDump: d1 [label="1\nTF32(1x256x256x3)"]: com.app.app
06-13 10:07:12.543 3477 3477 I GraphDump: d2 [label="2\nTENSOR_QUANT8_ASYMM_SIGNED(1x256x256x3)"]: com.app.app
06-13 10:07:12.543 3477 3477 I GraphDump: d3 [label="3: REF\nTENSOR_QUANT8_SYMM_PER_CHANNEL(32x3x3x3)"]: com.app.app
06-13 10:07:12.543 3477 3477 I GraphDump: d4 [label="4: COPY\nTI32(32)"]: com.app.app
06-13 10:07:12.543 3477 3477 I GraphDump: d5 [label="5: COPY\nI32 = 1"]: com.app.app
06-13 10:07:12.543 3477 3477 I GraphDump: d6 [label="6: COPY\nI32 = 2"]: com.app.app
06-13 10:07:12.543 3477 3477 I GraphDump: d7 [label="7: COPY\nI32 = 2"]: com.app.app
06-13 10:07:12.543 3477 3477 I GraphDump: d8 [label="8: COPY\nI32 = 3"]: com.app.app
06-13 10:07:12.543 3477 3477 I GraphDump: d9 [label="9\nTENSOR_QUANT8_ASYMM_SIGNED(1x128x128x32)"]: com.app.app
06-13 10:07:12.543 3477 3477 I GraphDump: d10 [label="10: REF\nTENSOR_QUANT8_SYMM_PER_CHANNEL(1x3x3x32)"]: com.app.app
06-13 10:07:12.543 3477 3477 I GraphDump: d11 [label="11: COPY\nTI32(32)"]: com.app.app
06-13 10:07:12.543 3477 3477 I GraphDump: d12 [label="12: COPY\nI32 = 1"]: com.app.app
06-13 10:07:12.543 3477 3477 I GraphDump: d13 [label="13: COPY\nI32 = 1"]: com.app.app
06-13 10:07:12.543 3477 3477 I GraphDump: d14 [label="14: COPY\nI32 = 1"]: com.app.app
06-13 10:07:12.543 3477 3477 I GraphDump: d15 [label="15: COPY\nI32 = 1"]: com.app.app
06-13 10:07:12.544 3477 3477 I GraphDump: d16 [label="16: COPY\nI32 = 3"]: com.app.app
06-13 10:07:12.545 3477 3477 I GraphDump: d17 [label="17\nTENSOR_QUANT8_ASYMM_SIGNED(1x128x128x32)"]: com.app.app
06-13 10:07:12.545 3477 3477 I GraphDump: d18 [label="18: REF\nTENSOR_QUANT8_SYMM_PER_CHANNEL(16x1x1x32)"]: com.app.app
06-13 10:07:12.545 3477 3477 I GraphDump: d19 [label="19: COPY\nTI32(16)"]: com.app.app
06-13 10:07:12.545 3477 3477 I GraphDump: d20 [label="20: COPY\nI32 = 1"]: com.app.app
06-13 10:07:12.545 3477 3477 I GraphDump: d21 [label="21: COPY\nI32 = 1"]: com.app.app
06-13 10:07:12.545 3477 3477 I GraphDump: d22 [label="22: COPY\nI32 = 1"]: com.app.app
06-13 10:07:12.545 3477 3477 I GraphDump: d23 [label="23: COPY\nI32 = 0"]: com.app.app
06-13 10:07:12.546 3477 3477 I GraphDump: d24 [label="24\nTENSOR_QUANT8_ASYMM_SIGNED(1x128x128x16)"]: com.app.app
06-13 10:07:12.546 3477 3477 I GraphDump: d25 [label="25: REF\nTENSOR_QUANT8_SYMM_PER_CHANNEL(96x1x1x16)"]: com.app.app
06-13 10:07:12.546 3477 3477 I GraphDump: d26 [label="26: REF\nTI32(96)"]: com.app.app
06-13 10:07:12.546 3477 3477 I GraphDump: d27 [label="27: COPY\nI32 = 1"]: com.app.app
06-13 10:07:12.546 3477 3477 I GraphDump: d28 [label="28: COPY\nI32 = 1"]: com.app.app
06-13 10:07:12.546 3477 3477 I GraphDump: d29 [label="29: COPY\nI32 = 1"]: com.app.app
06-13 10:07:12.546 3477 3477 I GraphDump: d30 [label="30: COPY\nI32 = 3"]: com.app.app
06-13 10:07:12.546 3477 3477 I GraphDump: d31 [label="31\nTENSOR_QUANT8_ASYMM_SIGNED(1x128x128x96)"]: com.app.app
06-13 10:07:12.546 3477 3477 I GraphDump: d32 [label="32: REF\nTENSOR_QUANT8_SYMM_PER_CHANNEL(1x3x3x96)"]: com.app.app
........
06-13 10:07:13.659 3477 3477 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(DEQUANTIZE:0) = 0 (vsi-npu): com.app.app
06-13 10:07:13.659 3477 3477 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(QUANTIZE:1) = 0 (vsi-npu): com.app.app
06-13 10:07:13.659 3477 3477 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(CONV_2D:2) = 1 (nnapi-reference): com.app.app
06-13 10:07:13.659 3477 3477 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(DEPTHWISE_CONV_2D:3) = 1 (nnapi-reference): com.app.app
06-13 10:07:13.660 3477 3477 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(CONV_2D:4) = 1 (nnapi-reference): com.app.app
06-13 10:07:13.660 3477 3477 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(CONV_2D:5) = 1 (nnapi-reference): com.app.app
06-13 10:07:13.660 3477 3477 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(DEPTHWISE_CONV_2D:6) = 1 (nnapi-reference): com.app.app
06-13 10:07:13.660 3477 3477 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(CONV_2D:7) = 1 (nnapi-reference): com.app.app
06-13 10:07:13.660 3477 3477 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(CONV_2D:8) = 1 (nnapi-reference): com.app.app
06-13 10:07:13.660 3477 3477 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(DEPTHWISE_CONV_2D:9) = 1 (nnapi-reference): com.app.app
06-13 10:07:13.660 3477 3477 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(CONV_2D:10) = 1 (nnapi-reference): com.app.app
06-13 10:07:13.660 3477 3477 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(ADD:11) = 1 (nnapi-reference): com.app.app
06-13 10:07:13.660 3477 3477 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(CONV_2D:12) = 1 (nnapi-reference): com.app.app
06-13 10:07:13.660 3477 3477 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(DEPTHWISE_CONV_2D:13) = 1 (nnapi-reference): com.app.app
06-13 10:07:13.660 3477 3477 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(CONV_2D:14) = 1 (nnapi-reference): com.app.app
06-13 10:07:13.660 3477 3477 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(CONV_2D:15) = 1 (nnapi-reference): com.app.app
06-13 10:07:13.660 3477 3477 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(DEPTHWISE_CONV_2D:16) = 1 (nnapi-reference): com.app.app
06-13 10:07:13.660 3477 3477 I ExecutionPlan: ModelBuilder::findBestDeviceForEachOperation(CONV_2D:17) = 1 (nnapi-reference): com.app.app
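To summarize long ExecutionPlan dumps like this quickly, I use a small helper that counts op assignments per device (just a log-parsing sketch, not part of any NNAPI tooling):

```python
import re
from collections import Counter

def device_summary(logcat_text):
    """Count NNAPI op assignments per device from ExecutionPlan log lines."""
    # Matches e.g. "findBestDeviceForEachOperation(CONV_2D:2) = 1 (nnapi-reference)"
    pattern = re.compile(
        r"findBestDeviceForEachOperation\((\w+):\d+\) = \d+ \(([\w-]+)\)"
    )
    counts = Counter()
    for _op, device in pattern.findall(logcat_text):
        counts[device] += 1
    return dict(counts)

sample = """
06-13 10:07:13.659 ... findBestDeviceForEachOperation(DEQUANTIZE:0) = 0 (vsi-npu): com.app.app
06-13 10:07:13.659 ... findBestDeviceForEachOperation(CONV_2D:2) = 1 (nnapi-reference): com.app.app
"""
print(device_summary(sample))  # {'vsi-npu': 1, 'nnapi-reference': 1}
```

For the signed int8 model above, only the DEQUANTIZE/QUANTIZE ops land on vsi-npu; everything else goes to nnapi-reference.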
I have also tested 224x224 signed int8 models; they likewise run on the CPU.
It seems that only TENSOR_QUANT8_ASYMM (TQ8A) is supported, not TENSOR_QUANT8_ASYMM_SIGNED? Or is there some other reason the model is not running on the NPU? The NXP documentation makes no mention that only UINT8 ops are supported by NNAPI; that seems worth noting somewhere, since newer TFLite releases quantize to INT8, not UINT8.
Any help or insight would be much appreciated. Thank you.