Hi all,
I have a TFLite model. In Python I specified the external delegate and observe correct behavior: the first warmup run takes ~46 seconds, and each execution thereafter takes ~0.5 seconds.
I then converted the code to C++ and specified the external_delegate options with vx_delegate. The options initialize, but the XNNPACK delegate is still used when the builder and interpreter are constructed.
The C++ behavior shows that vx_delegate is not being called: the first warmup run and all runs thereafter take ~46 seconds, i.e. no hardware acceleration is occurring.
The model has float inputs and outputs. This didn't bother the Python version. In C++ I tried specifying
interpreter->SetAllowFp16PrecisionForFP32(true);
This caused a segmentation fault, so I have removed it.
Python code:
Set up the interpreter:
self.interpreter = tflite.Interpreter(model_path='./model_integer_quant.tflite', experimental_delegates=[tflite.load_delegate('/usr/lib/libvx_delegate.so')])
self.interpreter.allocate_tensors()
self.input_details = self.interpreter.get_input_details()
self.output_details = self.interpreter.get_output_details()
# check the type of the input tensor
self.floating_model = self.input_details[0]['dtype'] == np.float32
print("self.floating_model is ", self.floating_model)
Invoke the interpreter:
def detect(self, image, object=None):
    interpreter = self.interpreter
    input_data = self.preprocess(image)
    interpreter.set_tensor(self.input_details[0]['index'], input_data)
    # start_time = time.time()
    interpreter.invoke()
    # stop_time = time.time()
    output_data = interpreter.get_tensor(self.output_details[0]['index'])
    results = self.postprocess(output_data)
    if object is not None:
        results = [result for result in results if result['cls_name'] == object]
    return results
Preprocessing the inputs:
def preprocess(self, image):
    # load image
    if isinstance(image, str):  # Load from file path
        if not os.path.isfile(image):
            raise ValueError("Input image file path (" + image + ") does not exist.")
        image = cv.imread(image)
    elif isinstance(image, np.ndarray):  # Use given NumPy array
        image = image.copy()
    else:
        raise ValueError("Invalid image input. Only file paths or a NumPy array accepted.")
    self.img_height = image.shape[0]
    self.img_width = image.shape[1]
    # resize and padding
    image = self.letterbox(image)
    # BGR -> RGB
    image = image[:, :, ::-1]
    # image = cv.cvtColor(image, cv.COLOR_BGR2RGB)
    # add N dim
    input_data = np.expand_dims(image, axis=0)
    if self.floating_model:
        input_data = np.float32(input_data) / 255  # input data is np.float32
    else:
        input_data = input_data.astype(np.int8)
    return input_data
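(For the C++ port discussed below, a rough equivalent of the float branch of this preprocessing might look like the following sketch. The FillFloatInput helper name is illustrative, the letterbox/padding step is omitted, and it assumes the interpreter's input tensor 0 is NHWC float32 of size height x width x 3.)

    #include <cstring>
    #include <opencv2/opencv.hpp>
    #include <tensorflow/lite/interpreter.h>

    // Illustrative helper: copy a BGR frame into the interpreter's float input.
    void FillFloatInput(tflite::Interpreter* interpreter, const cv::Mat& bgr,
                        int width, int height) {
        cv::Mat rgb;
        cv::cvtColor(bgr, rgb, cv::COLOR_BGR2RGB);      // BGR -> RGB
        cv::resize(rgb, rgb, cv::Size(width, height));  // (letterbox omitted here)
        rgb.convertTo(rgb, CV_32FC3, 1.0 / 255.0);      // uint8 -> float32 in [0, 1]
        CV_Assert(rgb.isContinuous());
        std::memcpy(interpreter->typed_input_tensor<float>(0),
                    rgb.data, rgb.total() * rgb.elemSize());  // NHWC layout, N == 1
    }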
Python output:
Vx delegate: allowed_builtin_code set to 0.
Vx delegate: error_during_init set to 0.
Vx delegate: error_during_prepare set to 0.
Vx delegate: error_during_invoke set to 0.
self.floating_model is True
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
Processing ./image0frame37373.jpg - time: 47.29401898384094 s
Processing ./image0frame37954.jpg - time: 0.4757547378540039 s
Processing ./image0frame40189.jpg - time: 0.4649374485015869 s
Processing ./image0frame30.jpg - time: 0.46975016593933105 s
Processing ./image0frame937.jpg - time: 0.46828198432922363 s
Processing ./image0frame674.jpg - time: 0.46784138679504395 s
Processing ./image0frame37487.jpg - time: 0.4682295322418213 s
Processing ./image0frame36965.jpg - time: 0.475665807723999 s
Processing ./image0frame40527.jpg - time: 0.4699666500091553 s
Processing ./image0frame852.jpg - time: 0.4759962558746338 s
Processing ./result1.jpg - time: 0.46816182136535645 s
Processing ./image0frame1.jpg - time: 0.4685957431793213 s
Processing ./image0frame40183.jpg - time: 0.46500706672668457 s
Processing ./image0frame36962.jpg - time: 0.47954559326171875 s
Processing ./image0frame842.jpg - time: 0.472883939743042 s
Processing ./image0frame38968.jpg - time: 0.4709169864654541 s
Processing ./result0.jpg - time: 0.46674442291259766 s
Processing ./viper_snapshot.jpg - time: 0.4655272960662842 s
Processing ./image0frame40808.jpg - time: 0.464113712310791 s
Processing ./image0frame40814.jpg - time: 0.4668314456939697 s
Processing ./image0frame668.jpg - time: 0.46632933616638184 s
Processing ./image0frame900.jpg - time: 0.47098541259765625 s
Processing ./image0frame37379.jpg - time: 0.4749300479888916 s
Processing ./image0frame0.jpg - time: 0.46938037872314453 s
Processing ./image0frame420.jpg - time: 0.4686119556427002 s
Processing ./image0frame37956.jpg - time: 0.477811336517334 s
Processing ./image0frame38974.jpg - time: 0.4715125560760498 s
Processing ./image0frame40532.jpg - time: 0.475177526473999 s
Processing ./image0frame37484.jpg - time: 0.4749460220336914 s
The above code is working; I am trying to do the same thing in C++:
// flatten rgb image to input layer.
Hello,
I have figured out the problem. The issue was that the interpreter was being created each time inside the function. I needed to initialize the interpreter once and then pass it into the function, rather than creating it anew on each call.
Once I did this, I got the expected behaviour. Thanks.
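For anyone who hits the same symptom, here is a minimal sketch of that pattern in C++. The MakeNpuInterpreter and Detect names are just illustrative; the TFLite and external-delegate calls are the same ones used in the code later in this thread. The point is that the model load, ModifyGraphWithDelegate and AllocateTensors happen exactly once, and only Invoke runs per image:

    #include <memory>
    #include <tensorflow/lite/interpreter.h>
    #include <tensorflow/lite/kernels/register.h>
    #include <tensorflow/lite/model.h>
    #include <tensorflow/lite/delegates/external/external_delegate.h>

    // Illustrative helper: build the interpreter once, outside the per-frame path.
    std::unique_ptr<tflite::Interpreter> MakeNpuInterpreter(
            const tflite::FlatBufferModel& model, const char* delegate_path) {
        auto options = TfLiteExternalDelegateOptionsDefault(delegate_path);
        TfLiteDelegate* delegate = TfLiteExternalDelegateCreate(&options);
        // (keep this pointer around and call TfLiteExternalDelegateDelete only
        //  after the interpreter has been destroyed)

        tflite::ops::builtin::BuiltinOpResolver resolver;
        tflite::InterpreterBuilder builder(model, resolver);
        std::unique_ptr<tflite::Interpreter> interpreter;
        builder(&interpreter);                           // construct the interpreter
        if (interpreter == nullptr) return nullptr;
        interpreter->ModifyGraphWithDelegate(delegate);  // hand supported ops to vx_delegate
        interpreter->AllocateTensors();                  // triggers the one-off NPU warmup
        return interpreter;
    }

    // Illustrative per-frame function: only set inputs, invoke and read outputs here.
    void Detect(tflite::Interpreter& interpreter /*, preprocessed frame data */) {
        // fill interpreter.typed_input_tensor<float>(0) with the preprocessed image, then:
        interpreter.Invoke();
        // read results from interpreter.typed_output_tensor<float>(0)
    }

With this structure the ~46 second warmup is paid once at startup, and each later call only takes the ~0.5 second inference time reported above.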
I followed your suggestion to set up the model for the NPU and came up with this code (my model is already working in Python on the NPU):
#include <stdio.h>
#include <iostream>
#include <tensorflow/lite/interpreter.h>
#include <tensorflow/lite/kernels/register.h>
#include <ctime>
#include <cstdlib>
#include <vector>
#include <tensorflow/lite/c/common.h>
#include <tensorflow/lite/model.h>
#include <memory>
#include <tensorflow/lite/tools/gen_op_registration.h>
#include <tensorflow/lite/delegates/nnapi/nnapi_delegate.h>
#include <tensorflow/lite/delegates/external/external_delegate.h>
#include <opencv2/opencv.hpp>
#include <chrono>
#include <tensorflow-lite-vx-delegate/vsi_npu_custom_op.h>
#include <tensorflow-lite-vx-delegate/delegate_main.h>

int main() {
    // load image
    std::string image_path = "/usr/src/Screenshot.png";
    cv::Mat img = cv::imread(image_path);
    if (img.empty()) {
        std::cout << "Failed to load image!";
        return -1;
    }
    // transform to required model input
    int new_width = 128;
    int new_height = 128;
    cv::resize(img, img, cv::Size(new_width, new_height));
    img = (img - 127.5) / 127.5 - 1.0;
    // create model for the npu
    const char* filename = "/usr/src/blazeface_npu.tflite";
    const char* delegate_path = "/usr/lib/libvx_delegate.so";
    auto model = tflite::FlatBufferModel::BuildFromFile(filename);
    std::cout << " Ext delegate options " << std::endl;
    auto ext_delegate_option = TfLiteExternalDelegateOptionsDefault(delegate_path);
    std::cout << " Ext delegate options " << std::endl;
    auto ext_delegate_ptr = TfLiteExternalDelegateCreate(&ext_delegate_option);
    std::cout << " Ext delegate pointer " << std::endl;
    if (ext_delegate_ptr == nullptr) {
        std::cout << " Ext delegate is null " << std::endl;
        return 0;
    }
    tflite::ops::builtin::BuiltinOpResolver resolver;
    resolver.AddCustom(kNbgCustomOp, tflite::ops::custom::Register_VSI_NPU_PRECOMPILED());
    std::cout << " Resolver " << std::endl;
    tflite::InterpreterBuilder builder(*model, resolver);
    std::cout << " Builder " << std::endl;
    std::unique_ptr<tflite::Interpreter> interpreter;
    builder(&interpreter);  // actually construct the interpreter before using it
    std::cout << " Interpreter " << std::endl;
    std::cout << " Setup builder and interpreter " << std::endl;
    interpreter->ModifyGraphWithDelegate(ext_delegate_ptr);
    std::cout << " Modifying graph with delegate " << std::endl;
    interpreter->AllocateTensors();
.
.
.
but I always get:
undefined reference to tflite::ops::custom::Register_VSI_NPU_PRECOMPILED()
Do you know where the problem could be? @dwightk, I tried to play around with the namespaces, but I have no clue where the problem is.
Best wishes, Jean
Thank you so much for your quick response! I don't think this is the problem. Indeed, I had this issue initially when I first tried it, and I got exactly what you said: it could not load the header file. But then I included the vx delegate package in my recipe:
If I understand the debug output correctly, you are getting this error at compile time, not runtime.
1) Ensure that the .o object files are being created, which will confirm the compiler found the headers.
2) Double-check the .vscode tasks.json where I reference the libraries. Ensure all of those libraries are available to your compiler. The error may arise from a secondary dependency, not just the primary library.
3) Find all of those libraries on your board (somewhere like the /usr/lib path). You should see all the required .so files.
4) Try commenting out that line or substituting it with a different line; sometimes it will give a different debug output, which can help identify whether a particular library is missing.
Reasons might be: 1) Your SDK or Yocto image isn't the full version (e.g. maybe you created the multimedia SDK), which has some of the OpenCV etc. but not all of the TensorFlow libraries. 2) Your header paths are not being found. 3) You're trying to compile the project directly on the board without the SDK present. I don't have an SDK on the board; I use an arm64 cross compiler on a desktop with the full SDK installed and then drop the binary on the board to run it.
You understood the output correctly; the error was during compilation. Meanwhile, I have solved the problem. The header files loaded correctly, but the compilation of vx-delegate produces two libraries: 1. libvx_delegate.so and 2. libvx_custom_op.a. The problem was that the static library was not included or linked in the library search path of the Yocto project. So I linked the static library from the build folder of the vx-delegate recipe manually into the compiler flags of my recipe, and everything worked fine!
Thanks for your help @dwightk, your post helped me a lot!
best wishes
I ran into the same behaviour. Do you have some code snippets showing how you linked the vx_custom_op library in Yocto?
Sorry for the late response. I solved it in a somewhat hacky way. I would suggest going to the build directory of the tensorflow-lite-vx-delegate recipe. You can find it by initialising your Yocto project (source oe...) and then typing the following command in the terminal:
bitbake -e tensorflow-lite-vx-delegate | grep ^D=
You will get a path. Go there and browse around a bit; in one of the folders you will find the static library vx_custom_op.a. Somehow it is not linked into the filesystem libraries of Yocto. I solved it the brutal way:
I noted where the static library is located and, in the compiler arguments, I pass that path so the linker can reach it.
Best wishes
I think you are missing the header file:
#include "vsi_npu_custom_op.h"
https://github.com/nxp-imx/tflite-vx-delegate-imx/blob/lf-5.15.71_2.2.0/vsi_npu_custom_op.h
I may have created this header file based on the Github repo above.
This might work as is, if you create the file and drop it in the same folder as your program, or you might have to update the "common.h" reference for the correct path to your yocto sdk tensorflow headers.
#include "tensorflow/lite/c/common.h"
You can issue the find command from bash: find / -name "common.h"
Hello,
Which BSP is the customer using? Is the TFLite model fully quantized?
The NPU on i.MX8MP can only support quantized operators. Details can be found in section 16, OVXLIB Operation Support with NPU (https://www.nxp.com.cn/docs/en/user-guide/IMX-MACHINE-LEARNING-UG.pdf). To get the best performance, please quantize the models with integer-only quantization (https://www.tensorflow.org/lite/performance/post_training_quantization?hl=es-419#integer_only). Do not use dynamic range quantization for the models: it converts only the weights to 8-bit integers but retains the activations in fp32, which results in the inference running in fp32 with additional overhead for data conversion. In fact, the inference is even slower compared to an fp32 model, because the conversion is done on the fly.
You can also use benchmark_model, located at "/usr/bin/tensorflow-lite-2.9.1/examples" in the default BSP, to profile the model and see whether all of the ops run on the NPU. For example,
root@imx8mpevk:~# cd /usr/bin/tensorflow-lite-2.9.1/examples/
root@imx8mpevk:/usr/bin/tensorflow-lite-2.9.1/examples# ./benchmark_model --graph=./mobilenet_v1_1.0_224_quant.tflite --external_delegate_path=/usr/lib/libvx_delegate.so --enable_op_profiling=true
Then you will see the profiling results.
The "Vx Delegate" node means the model is running on the NPU. If there are other nodes, it means those nodes fall back to the CPU, which will affect the inference performance of the model.
Regards
Hello,
I am using:
Linux imx8mpevk 6.1.22+g66e442bc7fdc #1 SMP PREEMPT Mon Jun 12 12:31:27 UTC 2023 aarch64 GNU/Linux ---> NXP i.MX Release Distro 6.1-mickledore \n \l
The model is fully quantized but it has floating point inputs and outputs.
Somewhere in the documentation I read that it will default to the XNNPACK delegate if there are floating-point tensors.
I am running the Python code on the same machine with the same model, and it is correctly invoking the NPU, so the problem is not the hardware or the model quantization.
I have tested the benchmark model and it runs fine.
You can see it from the timings and the Python output: it doesn't print the XNNPACK delegate info, and the first run takes longer while every run thereafter takes ~0.5 seconds, which is consistent with the expected behavior under hardware acceleration, whereas on the CPU each run takes the same amount of time. If I don't specify the vx_delegate, Python also prints the XNNPACK delegate info and behaves similarly to the C++.
There is no doubt that Python is correctly using the NPU and the vx_delegate.
The problem is that the C++ behavior is different from the Python behavior. It seems to override the specified vx_delegate and default to XNNPACK. I am looking for the line of code to force it to try the vx_delegate first, as somehow it is deciding to use XNNPACK and not the specified vx_delegate.
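As a side note, one way to confirm from C++ which delegate actually claimed the graph is to check the return status of ModifyGraphWithDelegate and compare the execution plan before and after applying it. The ApplyVxDelegate helper below is only an illustrative sketch (it is not part of the attached code) and assumes the same headers and /usr/lib/libvx_delegate.so path as above:

    #include <iostream>
    #include <tensorflow/lite/interpreter.h>
    #include <tensorflow/lite/delegates/external/external_delegate.h>

    // Illustrative helper: returns true if the external delegate claimed the graph.
    bool ApplyVxDelegate(tflite::Interpreter* interpreter, const char* delegate_path) {
        auto options = TfLiteExternalDelegateOptionsDefault(delegate_path);
        TfLiteDelegate* vx = TfLiteExternalDelegateCreate(&options);

        const size_t nodes_before = interpreter->execution_plan().size();
        const TfLiteStatus status = interpreter->ModifyGraphWithDelegate(vx);
        const size_t nodes_after = interpreter->execution_plan().size();

        // After a successful vx_delegate application the execution plan usually
        // collapses to one (or a few) delegate nodes; if the node count is
        // unchanged, the graph is still on the CPU/XNNPACK path.
        std::cout << "delegate status: " << status
                  << ", nodes before: " << nodes_before
                  << ", nodes after: " << nodes_after << std::endl;
        return status == kTfLiteOk && nodes_after < nodes_before;
    }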
Hello,
Can you clarify for me: do you want to run the model using the NPU/GPU?
If so: the i.MX8MP board runs on the CPU (XNNPACK delegate) by default. To change this, you need to add the delegate ".so" file path within the code, which I don't see in the part of the code you posted. Can you share the client's full code with me so I can test it on my side after adding those delegate lines?
Regards
Hello,
Please find the attached code as requested. There is a python folder and a cpp folder. The cpp folder already contains the precompiled "server" file.
The attachment is too large, so I am including it as Google Drive link below:
https://drive.google.com/drive/folders/1zESQauOTws3VFeLmt7lVwlLBDfykdCEH?usp=sharing
You can invoke the Python on the i.MX8M Plus EVK with:
python3 test_model.py
To compile the C++ file, I am using Visual Studio Code as the IDE.
1) In the command line, run your source command to set up the SDK environment:
e.g. source /opt/fsl-imx-xwayland/5.10-hardknott/environment-setup-cortexa53-crypto-poky-linux
2) Unzip and navigate into the cpp folder from the Google Drive link above.
3) Run the command "code ." from inside the cpp folder.
4) In the tasks.json file in the .vscode folder, update the .so references, e.g. libopencv_core.so.4.5, to the versions that match your BSP.
5) With the test.cpp file selected in the active window, run the command "recompile executable with new file".
6) It will output a server executable in the env folder, which can be copied onto the board; you will also need to copy the .tflite model, which is included, and the .jpg files used for testing.
7) test.cpp has a function called process_4; in this one I am invoking TensorFlow with the vx_delegate.
The Python code with the model runs correctly with hardware acceleration.
The C++ code does not run correctly with the same model, i.e. no hardware acceleration is observed even with vx_delegate specified.
I am looking for the code change required to get the C++ to work exactly like the Python does.