My Environment
Hardware: NXP i.MX8MP EVK A01
Software: Android 10
Model: insightface_quant
  Input: type: uint8[1,112,112,3]
  Output: type: float32[1,512]
I am trying to use NNAPI to load insightface for inference on Android.
When I load the model, the NPU runs VsiPreparedModel::initialize() three times.
Then when I run prediction, the NPU runs compute three times.
So the total cost ends up about the same as using the CPU.
Even with a smaller model, insightface_r32 (34.5 MB), the issue remains.
Please refer to the attached file.
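For context, the model's I/O contract above means the input image has to be quantized to uint8 and the float32[1,512] output is an embedding. A minimal numpy sketch, with hypothetical quantization parameters (the real scale/zero-point come from the TFLite interpreter's input details):

```python
import numpy as np

# Hypothetical quantization parameters for the uint8 input tensor;
# on a real model, read them from interpreter.get_input_details().
scale, zero_point = 1.0 / 128.0, 128

# Quantize a float image in [-1, 1] to the uint8[1, 112, 112, 3] input layout.
img_float = np.zeros((1, 112, 112, 3), dtype=np.float32)
img_uint8 = np.clip(np.round(img_float / scale) + zero_point, 0, 255).astype(np.uint8)

# The model returns a float32[1, 512] embedding; insightface embeddings are
# usually L2-normalized before cosine-similarity comparison.
embedding = np.random.default_rng(0).standard_normal((1, 512)).astype(np.float32)
embedding /= np.linalg.norm(embedding, axis=1, keepdims=True)

print(img_uint8.shape, img_uint8.dtype)  # (1, 112, 112, 3) uint8
print(float(np.linalg.norm(embedding)))  # ≈ 1.0 after normalization
```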
The reason you observed VsiPreparedModel::initialize() running three times is that your model was split into 3 sub-graphs, and those sub-graphs are executed separately by VsiNpu. Could you please use the following commands to enable NPU profiling?
On Target
Tap 10 times on the About Tablet option in Settings to become a developer.
Choose Settings -> Developer Options -> OEM Unlocking to enable OEM unlocking.
In Android terminal (UART terminal) enter the following command:
$ reboot bootloader
On Host
device connected via USB-C:
$ sudo fastboot oem unlock
disable the DM-verity
$ adb root
$ adb disable-verity
$ adb reboot
disable SELinux; run the command below from the U-Boot command line:
# setenv append_bootargs androidboot.selinux=permissive
or
$ setenforce 0
After unlocking Android, run the following to enable the profiling service:
setprop VSI_NN_LOG_LEVEL 5
Hi @Geo ,
How did you obtain the InsightFace model? Can you share it? Did you use the 'benchmark_model' eIQ TFLite app or custom code?
Thanks,
Raluca
Update on the current state of my issue:
benchmark download from https://storage.googleapis.com/tensorflow-nightly-public/prod/tensorflow/release/lite/tools/nightly/...
Attached are reports for runs with and without NNAPI.
insightface_r100_quant_4_1_50_profiling.txt ===>
$ ./android_aarch64_benchmark_model_plus_flex --num_threads=4 --graph=insightface_r100_quant.tflite --warmup_runs=1 --num_runs=50 --enable_op_profiling=true > insightface_r100_quant_4_1_50_profiling.txt
insightface_r100_quant_4_1_50_nnapi_profiling.txt ===>
$ ./android_aarch64_benchmark_model_plus_flex --num_threads=4 --graph=insightface_r100_quant.tflite --warmup_runs=1 --num_runs=50 --use_nnapi=true --enable_op_profiling=true > insightface_r100_quant_4_1_50_nnapi_profiling.txt
The inference time with NNAPI is 491 ms, versus 988 ms without NNAPI.
Is this reasonable? I originally expected the NPU to come in under 400 ms.
Another problem is that even though the benchmark inference time is 491 ms, TensorFlow Lite on Android takes nearly 1000 ms per inference, with a warmup time of 4950 ms.
Please refer to the attached file Android_TensorFlow_Lite_debug.nn.vlog==1.txt.
Is this reasonable? I thought the inference time using NNAPI in TensorFlow Lite should be about 491 ms, as in the benchmark.
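One relevant detail: the benchmark's reported average excludes warmup runs (--warmup_runs=1), while an app's first inference pays one-time costs such as NNAPI graph compilation, which is consistent with the 4950 ms warmup figure. A toy Python sketch (a stand-in with simulated sleeps, not TFLite code) of separating warmup from measured runs the way the benchmark does:

```python
import time

def run_inference():
    # Stand-in for interpreter.invoke(); the first call simulates an
    # expensive one-time setup (e.g. driver-side graph compilation),
    # later calls simulate steady-state inference.
    if not hasattr(run_inference, "warmed"):
        run_inference.warmed = True
        time.sleep(0.05)   # expensive first run
    time.sleep(0.005)      # steady-state compute

def benchmark(fn, warmup_runs=1, num_runs=10):
    for _ in range(warmup_runs):   # warmup runs are not timed
        fn()
    start = time.perf_counter()
    for _ in range(num_runs):
        fn()
    return (time.perf_counter() - start) / num_runs

avg = benchmark(run_inference)
print(f"steady-state average: {avg * 1e3:.1f} ms")  # ~5 ms; warmup excluded
```

Measuring only post-warmup runs in the app should bring its numbers closer to the benchmark's.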
My environment
Python 3.7.0
tensorflow 2.4.0
The model is from https://github.com/deepinsight/insightface; I converted it to pb format with MMdnn and then to tflite with TensorFlow.
I uploaded insightface_r100_quant.tflite to a WeTransfer link: https://we.tl/t-Mdz4PKLYJv
insightface_r100_quant.tflite
Input: name: data, type: uint8[1,112,112,3]
Output: name: output, type: float32[1,512]
The attached file shows insightface_r100_quant.tflite run through the benchmark on the NXP i.MX8MP EVK.
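For reference, the mixed uint8-input / float32-output combination falls out of standard post-training full-integer quantization. A minimal sketch (not the author's actual script; the tiny graph is a stand-in for the real insightface pb model), where setting inference_input_type to uint8 while leaving the output type at its float32 default gives exactly this I/O contract:

```python
import numpy as np
import tensorflow as tf

# Tiny stand-in graph; the real model is the converted insightface pb graph.
rng = np.random.default_rng(0)
w = tf.constant(rng.standard_normal((3, 8)).astype(np.float32))

@tf.function(input_signature=[tf.TensorSpec([1, 112, 112, 3], tf.float32)])
def toy_model(x):
    pooled = tf.reduce_mean(x, axis=[1, 2])  # [1, 3]
    return tf.matmul(pooled, w)              # [1, 8] toy "embedding"

def representative_dataset():
    # Calibration samples drive the quantization ranges.
    for _ in range(8):
        yield [rng.random((1, 112, 112, 3), dtype=np.float32)]

converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [toy_model.get_concrete_function()])
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Quantized uint8 input; leaving inference_output_type at its float32 default
# makes the converter append a dequantize op, so the two dtypes differ.
converter.inference_input_type = tf.uint8
tflite_model = converter.convert()

interp = tf.lite.Interpreter(model_content=tflite_model)
print(interp.get_input_details()[0]["dtype"])   # <class 'numpy.uint8'>
print(interp.get_output_details()[0]["dtype"])  # <class 'numpy.float32'>
```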
Dear @Geo ,
Could you please upload your TFLite file again? The WeTransfer link has now expired.
And would you mind sharing your Python code for converting the InsightFace model to TFLite with uint8 input and float32 output? I have no idea how to convert to a TFLite model whose input and output have different data types.
Thank you so much!
Bao