Hello Community,
I am running AI inference on the i.MX8QM MEK with BSP 5.4.3_2.0.0. I have a custom TFLite application written in C++ that runs inference on MobileNet and MobileNet+SSD models.
The application appears to use GPU/CPU NEON acceleration, as inference is almost 4x faster on the GPU than with CPU-only computation.
However, the problem comes when I compare the "label_image" application with my custom application.
Model used: mobilenet 0.25 (128x128) quantized
The GPU accelerated inference times are as follows:
1. "label_image" sample application - 1.6 ms
2. custom application - 11 ms
The CPU neon accelerated inference times are as follows:
1. "label_image" sample application - 2.7 ms
2. custom application - 56 ms
I cannot understand where this difference comes from. One observation when using GPU acceleration:
with the "label_image" sample application, the console shows "INFO: Created TensorFlow Lite delegate for NNAPI." and "Applied NNAPI delegate.", and the model is invoked. With my custom application, however, it shows "INFO: Created TensorFlow Lite delegate for NNAPI." followed by "NNAPI acceleration is unsupported on this platform."
The code snippet I am using is as follows:
using TfLiteDelegatePtr = tflite::Interpreter::TfLiteDelegatePtr;
using TfLiteDelegatePtrMap = std::map<std::string, TfLiteDelegatePtr>;

unique_ptr<tflite::FlatBufferModel> model =
    tflite::FlatBufferModel::BuildFromFile(get_modelPath().c_str());
tflite::ops::builtin::BuiltinOpResolver resolver;
unique_ptr<tflite::Interpreter> interpreter;
tflite::InterpreterBuilder(*model, resolver)(&interpreter);
interpreter->UseNNAPI(true);
interpreter->SetNumThreads(2);

TfLiteDelegatePtrMap delegates_;
auto delegate = TfLiteDelegatePtr(nullptr, [](TfLiteDelegate*) {});
if (!delegate) {
    cout << "NNAPI acceleration is unsupported on this platform.";
} else {
    delegates_.emplace("NNAPI", std::move(delegate));
}
for (const auto& delegate : delegates_) {
    if (interpreter->ModifyGraphWithDelegate(delegate.second.get()) != kTfLiteOk) {
        cout << "Failed to apply " << delegate.first << " delegate.";
    } else {
        cout << "Applied " << delegate.first << " delegate.";
    }
}
interpreter->AllocateTensors();
memcpy(interpreter->typed_input_tensor<uchar>(0), resized_image.data,
       resized_image.total() * resized_image.elemSize());
interpreter->Invoke(); // time is measured for this call
I have tried to find the cause, but no luck so far. Help is much appreciated :-)
Best Regards
Hi Ullas,
I see you are using the following code:
auto delegate = TfLiteDelegatePtr(nullptr, [](TfLiteDelegate*) {});
That is actually the line that fails: it always produces a null pointer, so the delegate is never applied.
Have you tried allocating a real TfLiteDelegate instead?
Thanks
Marco
Hello Marco,
Thanks for your reply.
You are right. I switched to auto delegate = TfLiteDelegatePtr(tflite::NnApiDelegate(), [](TfLiteDelegate*) {});
However, the inference times (for interpreter->Invoke()) are now drastically worse for non-quantized SSD models than on the previous BSP version (5.4.3_2.0.0), where I just used UseNNAPI(true). So is there something wrong in the code below that creates the delegate? Can you please explain how you enable it?
Here is a summary:
1. BSP version: 5.4.3_2.0.0
Enabling acceleration: interpreter->UseNNAPI(true)
Result: all models run perfectly fine.
2. BSP version: 5.4.24_2.1.0
Enabling acceleration: using the code below
using TfLiteDelegatePtr = tflite::Interpreter::TfLiteDelegatePtr;
using TfLiteDelegatePtrMap = std::map<std::string, TfLiteDelegatePtr>;

TfLiteDelegatePtrMap delegates_;
auto delegate = TfLiteDelegatePtr(tflite::NnApiDelegate(), [](TfLiteDelegate*) {});
if (!delegate) {
    cout << "NNAPI acceleration is unsupported on this platform.";
} else {
    delegates_.emplace("NNAPI", std::move(delegate));
}
for (const auto& delegate : delegates_) {
    if (interpreter->ModifyGraphWithDelegate(delegate.second.get()) != kTfLiteOk) {
        cout << "Failed to apply " << delegate.first << " delegate.";
    } else {
        cout << "Applied " << delegate.first << " delegate.";
    }
}
Result: SSD models, especially non-quantized ones, show drastically longer inference times. For example, SSD MobileNet v2 COCO takes > 500 ms.
Best Regards
Ullas Bharadwaj
Hi Ullas,
in the example mentioned, the delegate is actually used only to figure out which acceleration is available on the platform the example runs on.
We will look into why the code is not running, but for the sake of your test you can avoid the delegate entirely and just do something like:
interpreter->UseNNAPI(get_useGPU());
Please let me know whether this resolves your issue.
Again, we are also trying to figure out whether we need to modify the original code.
Thank you
Best Regards
Marco
Hello Marco,
Thanks for the reply.
Yes, I was doing as you suggested on BSP 5.4.3_2.0.0, and it worked fine. Only on 5.4.24_2.1.0 do I get the error "UseNNAPI is not supported. Use ModifyGraphWithDelegate instead.".
(I am currently not at the target, so I cannot check; the exact wording may be slightly different.)
Best Regards
Ullas Bharadwaj
Hi Ullas,
could you please send me the full log showing the issue with 5.4.24?
Furthermore, have you already checked the up-to-date version of the label_image example?
Thank you
Marco