Hi all,
I have a TFLite model. In Python I specified the external delegate and observe correct behavior: the first warmup run takes ~46 seconds, and each execution thereafter takes ~0.5 seconds.
I then converted the code to C++ and specified the external_delegate options with vx_delegate. The options initialize, but the XNNPACK delegate is still used when the builder and interpreter are constructed.
The C++ behavior shows that vx_delegate is not being called: the first warmup run and all runs thereafter take ~46 seconds, i.e. no hardware acceleration is occurring.
The model has float inputs and outputs. This didn't bother the Python version. In C++ I tried specifying
interpreter->SetAllowFp16PrecisionForFP32(true);
This caused a segmentation fault, so I have removed it.
Python code:
Set up the interpreter:
self.interpreter = tflite.Interpreter(model_path='./model_integer_quant.tflite', experimental_delegates=[tflite.load_delegate('/usr/lib/libvx_delegate.so')])
self.interpreter.allocate_tensors()
self.input_details = self.interpreter.get_input_details()
self.output_details = self.interpreter.get_output_details()
# check the type of the input tensor
self.floating_model = self.input_details[0]['dtype'] == np.float32
print("self.floating_model is ", self.floating_model)
Invoke the interpreter:
def detect(self, image, object=None):
    interpreter = self.interpreter
    input_data = self.preprocess(image)
    interpreter.set_tensor(self.input_details[0]['index'], input_data)
    # start_time = time.time()
    interpreter.invoke()
    # stop_time = time.time()
    output_data = interpreter.get_tensor(self.output_details[0]['index'])
    results = self.postprocess(output_data)
    if object is not None:
        results = [result for result in results if result['cls_name'] == object]
    return results
Preprocessing the inputs:
def preprocess(self, image):
    # load image
    if isinstance(image, str):  # Load from file path
        if not os.path.isfile(image):
            raise ValueError("Input image file path (" + image + ") does not exist.")
        image = cv.imread(image)
    elif isinstance(image, np.ndarray):  # Use given NumPy array
        image = image.copy()
    else:
        raise ValueError("Invalid image input. Only file paths or a NumPy array accepted.")
    self.img_height = image.shape[0]
    self.img_width = image.shape[1]
    # resize and padding
    image = self.letterbox(image)
    # BGR -> RGB
    image = image[:, :, ::-1]
    # image = cv.cvtColor(image, cv.COLOR_BGR2RGB)
    # add N dim
    input_data = np.expand_dims(image, axis=0)
    if self.floating_model:
        input_data = np.float32(input_data) / 255  # input data is np.float32
    else:
        input_data = input_data.astype(np.int8)
    return input_data
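(For the C++ port discussed below, a rough equivalent of the float branch of this preprocessing might look like the following sketch. The FillFloatInput helper name is illustrative, the letterbox/padding step is omitted, and it assumes the interpreter's input tensor 0 is NHWC float32 of size height x width x 3.)

    #include <cstring>
    #include <opencv2/opencv.hpp>
    #include <tensorflow/lite/interpreter.h>

    // Illustrative helper: copy a BGR frame into the interpreter's float input.
    void FillFloatInput(tflite::Interpreter* interpreter, const cv::Mat& bgr,
                        int width, int height) {
        cv::Mat rgb;
        cv::cvtColor(bgr, rgb, cv::COLOR_BGR2RGB);      // BGR -> RGB
        cv::resize(rgb, rgb, cv::Size(width, height));  // (letterbox omitted here)
        rgb.convertTo(rgb, CV_32FC3, 1.0 / 255.0);      // uint8 -> float32 in [0, 1]
        CV_Assert(rgb.isContinuous());
        std::memcpy(interpreter->typed_input_tensor<float>(0),
                    rgb.data, rgb.total() * rgb.elemSize());  // NHWC layout, N == 1
    }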
Python output:
Vx delegate: allowed_builtin_code set to 0.
Vx delegate: error_during_init set to 0.
Vx delegate: error_during_prepare set to 0.
Vx delegate: error_during_invoke set to 0.
self.floating_model is True
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 19: default layout inference pass.
W [HandleLayoutInfer:257]Op 18: default layout inference pass.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
W [op_optimize:676]stride slice copy tensor.
Processing ./image0frame37373.jpg - time: 47.29401898384094 s
Processing ./image0frame37954.jpg - time: 0.4757547378540039 s
Processing ./image0frame40189.jpg - time: 0.4649374485015869 s
Processing ./image0frame30.jpg - time: 0.46975016593933105 s
Processing ./image0frame937.jpg - time: 0.46828198432922363 s
Processing ./image0frame674.jpg - time: 0.46784138679504395 s
Processing ./image0frame37487.jpg - time: 0.4682295322418213 s
Processing ./image0frame36965.jpg - time: 0.475665807723999 s
Processing ./image0frame40527.jpg - time: 0.4699666500091553 s
Processing ./image0frame852.jpg - time: 0.4759962558746338 s
Processing ./result1.jpg - time: 0.46816182136535645 s
Processing ./image0frame1.jpg - time: 0.4685957431793213 s
Processing ./image0frame40183.jpg - time: 0.46500706672668457 s
Processing ./image0frame36962.jpg - time: 0.47954559326171875 s
Processing ./image0frame842.jpg - time: 0.472883939743042 s
Processing ./image0frame38968.jpg - time: 0.4709169864654541 s
Processing ./result0.jpg - time: 0.46674442291259766 s
Processing ./viper_snapshot.jpg - time: 0.4655272960662842 s
Processing ./image0frame40808.jpg - time: 0.464113712310791 s
Processing ./image0frame40814.jpg - time: 0.4668314456939697 s
Processing ./image0frame668.jpg - time: 0.46632933616638184 s
Processing ./image0frame900.jpg - time: 0.47098541259765625 s
Processing ./image0frame37379.jpg - time: 0.4749300479888916 s
Processing ./image0frame0.jpg - time: 0.46938037872314453 s
Processing ./image0frame420.jpg - time: 0.4686119556427002 s
Processing ./image0frame37956.jpg - time: 0.477811336517334 s
Processing ./image0frame38974.jpg - time: 0.4715125560760498 s
Processing ./image0frame40532.jpg - time: 0.475177526473999 s
Processing ./image0frame37484.jpg - time: 0.4749460220336914 s
The above code is working; I am trying to do the same thing in C++:
// flatten rgb image to input layer.
Hello,
I have figured out the problem. The issue was that the interpreter was being created each time inside the function. I needed to initialize the interpreter once and then pass it into the function, rather than creating it anew on each call.
Once I did this, I got the expected behaviour. Thanks.
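For anyone who hits the same symptom, here is a minimal sketch of that pattern in C++. The MakeNpuInterpreter and Detect names are just illustrative; the TFLite and external-delegate calls are the same ones used in the code later in this thread. The point is that the model load, ModifyGraphWithDelegate and AllocateTensors happen exactly once, and only Invoke runs per image:

    #include <memory>
    #include <tensorflow/lite/interpreter.h>
    #include <tensorflow/lite/kernels/register.h>
    #include <tensorflow/lite/model.h>
    #include <tensorflow/lite/delegates/external/external_delegate.h>

    // Illustrative helper: build the interpreter once, outside the per-frame path.
    std::unique_ptr<tflite::Interpreter> MakeNpuInterpreter(
            const tflite::FlatBufferModel& model, const char* delegate_path) {
        auto options = TfLiteExternalDelegateOptionsDefault(delegate_path);
        TfLiteDelegate* delegate = TfLiteExternalDelegateCreate(&options);
        // (keep this pointer around and call TfLiteExternalDelegateDelete only
        //  after the interpreter has been destroyed)

        tflite::ops::builtin::BuiltinOpResolver resolver;
        tflite::InterpreterBuilder builder(model, resolver);
        std::unique_ptr<tflite::Interpreter> interpreter;
        builder(&interpreter);                           // construct the interpreter
        if (interpreter == nullptr) return nullptr;
        interpreter->ModifyGraphWithDelegate(delegate);  // hand supported ops to vx_delegate
        interpreter->AllocateTensors();                  // triggers the one-off NPU warmup
        return interpreter;
    }

    // Illustrative per-frame function: only set inputs, invoke and read outputs here.
    void Detect(tflite::Interpreter& interpreter /*, preprocessed frame data */) {
        // fill interpreter.typed_input_tensor<float>(0) with the preprocessed image, then:
        interpreter.Invoke();
        // read results from interpreter.typed_output_tensor<float>(0)
    }

With this structure the ~46 second warmup is paid once at startup, and each later call only takes the ~0.5 second inference time reported above.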
I followed your suggestion to set up the model for the NPU and came up with this code (my model is already working in Python on the NPU):
#include <stdio.h>
#include <iostream>
#include <tensorflow/lite/interpreter.h>
#include <tensorflow/lite/kernels/register.h>
#include <ctime>
#include <cstdlib>
#include <vector>
#include <tensorflow/lite/c/common.h>
#include <tensorflow/lite/model.h>
#include <memory>
#include <tensorflow/lite/tools/gen_op_registration.h>
#include <tensorflow/lite/delegates/nnapi/nnapi_delegate.h>
#include <tensorflow/lite/delegates/external/external_delegate.h>
#include <opencv2/opencv.hpp>
#include <chrono>
#include <tensorflow-lite-vx-delegate/vsi_npu_custom_op.h>
#include <tensorflow-lite-vx-delegate/delegate_main.h>

int main() {
    // load image
    std::string image_path = "/usr/src/Screenshot.png";
    cv::Mat img = cv::imread(image_path);
    if (img.empty()) {
        std::cout << "Failed to load image!";
        return -1;
    }
    // transform to required model input
    int new_width = 128;
    int new_height = 128;
    cv::resize(img, img, cv::Size(new_width, new_height));
    img = (img - 127.5) / 127.5 - 1.0;
    // create model for the npu
    const char* filename = "/usr/src/blazeface_npu.tflite";
    const char* delegate_path = "/usr/lib/libvx_delegate.so";
    auto model = tflite::FlatBufferModel::BuildFromFile(filename);
    std::cout << " Ext delegate options " << std::endl;
    auto ext_delegate_option = TfLiteExternalDelegateOptionsDefault(delegate_path);
    std::cout << " Ext delegate options " << std::endl;
    auto ext_delegate_ptr = TfLiteExternalDelegateCreate(&ext_delegate_option);
    std::cout << " Ext delegate pointer " << std::endl;
    if (ext_delegate_ptr == nullptr) {
        std::cout << " Ext delegate is null " << std::endl;
        return 0;
    }
    tflite::ops::builtin::BuiltinOpResolver resolver;
    resolver.AddCustom(kNbgCustomOp, tflite::ops::custom::Register_VSI_NPU_PRECOMPILED());
    std::cout << " Resolver " << std::endl;
    tflite::InterpreterBuilder builder(*model, resolver);
    std::cout << " Builder " << std::endl;
    std::unique_ptr<tflite::Interpreter> interpreter;
    builder(&interpreter);  // actually construct the interpreter before using it
    std::cout << " Interpreter " << std::endl;
    std::cout << " Setup builder and interpreter " << std::endl;
    interpreter->ModifyGraphWithDelegate(ext_delegate_ptr);
    std::cout << " Modifying graph with delegate " << std::endl;
    interpreter->AllocateTensors();
.
.
.
but I always get:
undefined reference to tflite::ops::custom::Register_VSI_NPU_PRECOMPILED()
Do you know where the problem could be? @dwightk, I tried to play around with the namespaces, but I have no clue where the problem is.
Best wishes, Jean
Thank you so much for your quick response! I don't think this is the problem. Indeed, I had this issue initially when I first tried it, and I got exactly what you said: it could not load the header file. But then I included the vx delegate package in my recipe:
If I understand the debug output correctly, you are getting this error at compile time, not runtime.
1) Ensure that the .o object files are being created, which will confirm the compiler found the headers.
2) Double-check the .vscode tasks.json where I reference the libraries. Ensure all of those libraries are available to your compiler. The error may arise from a secondary dependency, not just the primary library.
3) Find all of those libraries on your board (somewhere like the /usr/lib path). You should see all the required .so files.
4) Try commenting out that line or substituting it with a different line; sometimes it will give a different debug output, which can help identify whether a particular library is missing.
Reasons might be: 1) Your SDK or Yocto image isn't the full version (e.g. maybe you created the multimedia SDK), which has some of the OpenCV etc. but not all of the TensorFlow libraries. 2) Your header paths are not being found. 3) You're trying to compile the project directly on the board without the SDK present. I don't have an SDK on the board; I use an arm64 cross compiler on a desktop with the full SDK installed and then drop the binary on the board to run it.
You understood the output correctly; the error was during compilation. Meanwhile, I have solved the problem. The header files loaded correctly, but the compilation of vx-delegate produces two libraries: 1. libvx_delegate.so and 2. libvx_custom_op.a. The problem was that the static library was not included or linked in the library search path of the Yocto project. So I linked the static library from the build folder of the vx-delegate recipe manually into the compiler flags of my recipe, and everything worked fine!
Thanks for your help @dwightk, your post helped me a lot!
best wishes
I ran into the same behaviour. Do you have some code snippets showing how you linked the vx_custom_op library in Yocto?
Sorry for the late response. I solved it in a somewhat hacky way. I would suggest going to the build directory of the tensorflow-lite-vx-delegate recipe. You can find it by initialising your Yocto project (source oe...) and then typing the following command in the terminal:
bitbake -e tensorflow-lite-vx-delegate | grep ^D=
You will get a path. Go there and browse around a bit; in one of the folders you will find the static library vx_custom_op.a. Somehow it is not linked into the filesystem libraries of Yocto. I solved it the brutal way:
I noted where the static library is located and, in the compiler arguments, I pass that path so the linker can reach it.
Best wishes
I think you are missing the header file:
#include "vsi_npu_custom_op.h"
https://github.com/nxp-imx/tflite-vx-delegate-imx/blob/lf-5.15.71_2.2.0/vsi_npu_custom_op.h
I may have created this header file based on the Github repo above.
This might work as is, if you create the file and drop it in the same folder as your program, or you might have to update the "common.h" reference for the correct path to your yocto sdk tensorflow headers.
#include "tensorflow/lite/c/common.h"
You can issue the find command from bash: find / -name "common.h"
Hello,
Which BSP is the customer using? Is the TFLite model fully quantized?
The NPU on i.MX8MP can only support quantized operators. Details can be found in section 16, OVXLIB Operation Support with NPU (https://www.nxp.com.cn/docs/en/user-guide/IMX-MACHINE-LEARNING-UG.pdf). To get the best performance, please quantize the models with integer-only quantization (https://www.tensorflow.org/lite/performance/post_training_quantization?hl=es-419#integer_only). Do not use dynamic range quantization for the models: it converts only the weights to 8-bit integers but retains the activations in fp32, which results in the inference running in fp32 with additional overhead for data conversion. In fact, the inference is even slower compared to an fp32 model, because the conversion is done on the fly.
You can also use benchmark_model, located at "/usr/bin/tensorflow-lite-2.9.1/examples" in the default BSP, to profile the model and see whether all of the ops run on the NPU. For example,
root@imx8mpevk:~# cd /usr/bin/tensorflow-lite-2.9.1/examples/
root@imx8mpevk:/usr/bin/tensorflow-lite-2.9.1/examples# ./benchmark_model --graph=./mobilenet_v1_1.0_224_quant.tflite --external_delegate_path=/usr/lib/libvx_delegate.so --enable_op_profiling=true
Then you will see the profiling results.
The "Vx Delegate" node means the model is running on the NPU. If there are other nodes, it means those nodes fall back to the CPU, which will affect the inference performance of the model.
Regards
Hello,
I am using:
Linux imx8mpevk 6.1.22+g66e442bc7fdc #1 SMP PREEMPT Mon Jun 12 12:31:27 UTC 2023 aarch64 GNU/Linux ---> NXP i.MX Release Distro 6.1-mickledore \n \l
The model is fully quantized but it has floating point inputs and outputs.
Somewhere in the documentation I read that it will default to the XNNPACK delegate if there are floating-point tensors.
I am running the Python code on the same machine with the same model, and it is correctly invoking the NPU, so the problem is not the hardware or the model quantization.
I have tested the benchmark model and it runs fine.
You can see it from the timings and the Python output: it doesn't print the XNNPACK delegate info, and the first run takes longer while every run thereafter takes ~0.5 seconds, which is consistent with the expected behavior under hardware acceleration, whereas on the CPU each run takes the same amount of time. If I don't specify the vx_delegate, Python also prints the XNNPACK delegate info and behaves similarly to the C++.
There is no doubt that Python is correctly using the NPU and the vx_delegate.
The problem is that the C++ behavior is different from the Python behavior. It seems to override the specified vx_delegate and default to XNNPACK. I am looking for the line of code to force it to try the vx_delegate first, as somehow it is deciding to use XNNPACK and not the specified vx_delegate.
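As a side note, one way to confirm from C++ which delegate actually claimed the graph is to check the return status of ModifyGraphWithDelegate and compare the execution plan before and after applying it. The ApplyVxDelegate helper below is only an illustrative sketch (it is not part of the attached code) and assumes the same headers and /usr/lib/libvx_delegate.so path as above:

    #include <iostream>
    #include <tensorflow/lite/interpreter.h>
    #include <tensorflow/lite/delegates/external/external_delegate.h>

    // Illustrative helper: returns true if the external delegate claimed the graph.
    bool ApplyVxDelegate(tflite::Interpreter* interpreter, const char* delegate_path) {
        auto options = TfLiteExternalDelegateOptionsDefault(delegate_path);
        TfLiteDelegate* vx = TfLiteExternalDelegateCreate(&options);

        const size_t nodes_before = interpreter->execution_plan().size();
        const TfLiteStatus status = interpreter->ModifyGraphWithDelegate(vx);
        const size_t nodes_after = interpreter->execution_plan().size();

        // After a successful vx_delegate application the execution plan usually
        // collapses to one (or a few) delegate nodes; if the node count is
        // unchanged, the graph is still on the CPU/XNNPACK path.
        std::cout << "delegate status: " << status
                  << ", nodes before: " << nodes_before
                  << ", nodes after: " << nodes_after << std::endl;
        return status == kTfLiteOk && nodes_after < nodes_before;
    }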
Hello,
Can you clarify for me: do you want to run the model using the NPU/GPU?
If so: the i.MX8MP board runs on the CPU (XNNPACK delegate) by default. To change this, you need to add the delegate ".so" file path within the code, which I don't see in the part of the code you posted. Can you share the client's full code with me so I can test it on my side after adding those delegate lines?
Regards
Hello,
Please find the attached code as requested. There is a python folder and a cpp folder. The cpp folder already contains the precompiled "server" file.
The attachment is too large, so I am including it as Google Drive link below:
https://drive.google.com/drive/folders/1zESQauOTws3VFeLmt7lVwlLBDfykdCEH?usp=sharing
You can invoke the Python on the i.MX8M Plus EVK with:
python3 test_model.py
To compile the C++ file, I am using Visual Studio Code as the IDE.
1) In the command line, run your source command to set up the SDK environment:
e.g. source /opt/fsl-imx-xwayland/5.10-hardknott/environment-setup-cortexa53-crypto-poky-linux
2) Unzip and navigate into the cpp folder from the Google Drive link above.
3) Run the command "code ." from inside the cpp folder.
4) In the tasks.json file in the .vscode folder, update the .so references, e.g. libopencv_core.so.4.5, to the versions that match your BSP.
5) With the test.cpp file selected in the active window, run the command "recompile executable with new file".
6) It will output a server executable in the env folder, which can be copied onto the board; you will also need to copy the .tflite model, which is included, and the .jpg files used for testing.
7) test.cpp has a function called process_4; in this one I am invoking TensorFlow with the vx_delegate.
The Python code with the model runs correctly with hardware acceleration.
The C++ code does not run correctly with the same model, i.e. no hardware acceleration is observed even with vx_delegate specified.
I am looking for the code change required to get the C++ to work exactly like the Python does.