Hello Community,
I am running AI inference on an i.MX8QM MEK board with BSP 5.4.3_2.0.0. I have a custom TFLite application written in C++ that runs inference on MobileNet / MobileNet-SSD models.
The application does appear to use GPU/CPU NEON acceleration, since GPU-accelerated inference is roughly 4x faster than CPU-only computation.
However, the problem shows up when I compare the "label_image" sample application with my custom application.
Model used: MobileNet 0.25 (128x128), quantized
The GPU-accelerated inference times are as follows:
1. "label_image" sample application - 1.6 ms
2. custom application - 11 ms
The CPU (NEON-accelerated) inference times are as follows:
1. "label_image" sample application - 2.7 ms
2. custom application - 56 ms
I cannot understand where this difference is coming from. One observation when using GPU acceleration: with the "label_image" sample application, the console shows "INFO: Created TensorFlow Lite delegate for NNAPI." followed by "Applied NNAPI delegate." and "invoked". However, with my custom application, it shows "INFO: Created TensorFlow Lite delegate for NNAPI." followed by "NNAPI acceleration is unsupported on this platform."
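For reference, my understanding is that the label_image sample prints those messages because it obtains its delegate from tflite::evaluation::CreateNNAPIDelegate() (tensorflow/lite/tools/evaluation/utils.h) rather than constructing one directly. Below is only a rough sketch of that path from my reading of the TF Lite 1.x sources; the exact code shipped with BSP 5.4.3_2.0.0 may differ:

#include <iostream>
#include <map>
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/tools/evaluation/utils.h"

// Same aliases as used in label_image (and in my snippet further below).
using TfLiteDelegatePtr = tflite::Interpreter::TfLiteDelegatePtr;
using TfLiteDelegatePtrMap = std::map<std::string, TfLiteDelegatePtr>;

// Sketch: label_image asks the evaluation utils for the NNAPI delegate.
// CreateNNAPIDelegate() wraps NnApiDelegate() with a no-op deleter and returns
// a null TfLiteDelegatePtr when NNAPI is not available, which is where the
// "unsupported on this platform" message originates.
TfLiteDelegatePtrMap GetDelegates() {
  TfLiteDelegatePtrMap delegates;
  auto delegate = tflite::evaluation::CreateNNAPIDelegate();
  if (!delegate) {
    std::cout << "NNAPI acceleration is unsupported on this platform.";
  } else {
    delegates.emplace("NNAPI", std::move(delegate));
  }
  return delegates;
}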
The code snippet I am using for this is as below:
// Load the model and build the interpreter.
unique_ptr<tflite::FlatBufferModel> model =
    tflite::FlatBufferModel::BuildFromFile(get_modelPath().c_str());
tflite::ops::builtin::BuiltinOpResolver resolver;
unique_ptr<tflite::Interpreter> interpreter;
tflite::InterpreterBuilder(*model, resolver)(&interpreter);
interpreter->UseNNAPI(true);
interpreter->SetNumThreads(2);

// Build the delegate map. The delegate is constructed from a nullptr with a
// no-op deleter.
TfLiteDelegatePtrMap delegates_;
auto delegate = TfLiteDelegatePtr(nullptr, [](TfLiteDelegate*) {});
if (!delegate) {
  cout << "NNAPI acceleration is unsupported on this platform.";
} else {
  delegates_.emplace("NNAPI", std::move(delegate));
}

// Apply every delegate in the map to the graph.
for (const auto& delegate : delegates_) {
  if (interpreter->ModifyGraphWithDelegate(delegate.second.get()) != kTfLiteOk) {
    cout << "Failed to apply " << delegate.first << " delegate.";
  } else {
    cout << "Applied " << delegate.first << " delegate.";
  }
}

// Allocate tensors, copy the resized input image into the input tensor and
// run inference.
interpreter->AllocateTensors();
memcpy(interpreter->typed_input_tensor<uchar>(0), resized_image.data,
       resized_image.total() * resized_image.elemSize());
interpreter->Invoke();  // the time is measured for this function call
I have tried to find the cause, but no luck so far. Any help is much appreciated. :-)
Best Regards