Hello NPU! Running a TFLite model on i.MX 9

This guide shows how to train a simple model in Pytorch and Tensorflow and deploy it in an application that uses the Arm Ethos-U65 Neural Processing Unit (NPU) of the i.MX93.

After following this guide you will have:

Trained a simple CNN on the MNIST dataset.
Converted the model to tflite, quantized it and compiled it for the i.MX93 NPU (Ethos-U65).
Run a simple application where a digit can be drawn and identified by the model.

Prerequisites

To follow this guide you will need:

A Yocto image. GTKMM3 support is needed for the C++ example; for the Python example a pre-built image can be used.
An i.MX93 board

Running Python example

The application is provided in both Python and C++. For the Python application a pre-built full image can be used: simply copy the Python scripts to the target and execute them as follows:

# Running quantized example on the CPU
./run.py -m cnn_tf_quant.tflite
# Running example on the Ethos NPU
./run.py -m cnn_tf_quant_vela.tflite -d /usr/lib/liblitert_ethosu_delegate.so

Pre-built models are provided in the attachment; the steps and scripts used to train and generate them are also included (see below).

Building image with GTKMM3 support (C++ example only)

The GUI application used for the demonstration is written in GTKMM3 (the C++ wrapper of the GTK library), so an image with GTKMM3 support is needed. Luckily there is already a recipe we can use to easily integrate it into our Yocto image.

To build the image, follow the instructions in the Yocto User's guide. At the time of writing the latest BSP is 6.12.49_2.2.0, so we will use that.

Once you have set up all the requirements on your host and installed repo, you can set up your build environment as follows:

repo init -u https://github.com/nxp-imx/imx-manifest -b imx-linux-walnascar -m imx-6.12.49-2.2.0.xml

repo sync

Depending on your target you can now set up your build directory. We will use Wayland graphics with X11 support and the i.MX93 Freedom board as an example:

DISTRO=fsl-imx-xwayland MACHINE=imx93-11x11-lpddr4x-frdm source imx-setup-release.sh -b 93-frdm-xwayland

Simply select the MACHINE configuration that matches your board.

We are almost ready to start the build, but we still need to add GTKMM3 support to our image. Modify your local.conf file under conf/local.conf and add the following:

IMAGE_INSTALL:append = " gtkmm3"

Make sure the space in front of gtkmm3 is there to avoid build issues. The build is very resource intensive and out-of-memory issues can arise, so it is recommended to also add the following to limit the number of recipes built concurrently:

BB_NUMBER_THREADS="8"

PARALLEL_MAKE="-j8"

BB_PRESSURE_MAX_CPU ?= "50000"

BB_PRESSURE_MAX_IO ?= "100000"

BB_PRESSURE_MAX_MEMORY ?= "25000"

After this your local.conf should look similar to this:

Screenshot from 2025-12-30 13-00-44.png

NOTE: Make sure to have plenty of storage available on your machine since the build requires upwards of 500GB to complete.

The build can now start. If you want to build the GTKMM application from source, an SDK is required; create it as follows:

bitbake imx-image-full -c populate_sdk

And to create the image simply do:

bitbake imx-image-full

We require the full image since it contains all the Tensorflow Lite libraries and different examples.

After the build completes the toolchain can be installed and the image flashed onto the board.

To install the toolchain:

./tmp/deploy/sdk/fsl-imx-xwayland-glibc-x86_64-imx-image-full-armv8a-imx93-11x11-lpddr4x-frdm-toolchain-6.12-walnascar.sh

And afterwards every time you want to use the toolchain:

source /opt/fsl-imx-xwayland/6.12-walnascar-full-gtkmm3/environment-setup-armv8a-poky-linux

To flash the image to an SD card:

zstdcat imx-image-full-imx93-11x11-lpddr4x-frdm.rootfs.wic.zst | sudo dd of=/dev/mmcblk0 bs=1M conv=fsync

And now you are ready to build the application, train some models and deploy them.

Building the GTKMM3 application (C++)

The source for the application can be found here, a prebuilt binary is also provided and attached here.

The application contains a drawing area where a digit can be drawn with the mouse or a touch display, and two buttons: one to clear the drawing area and one to run the model and predict the digit.

To build from scratch, CMake is required as well as a toolchain with GTKMM3 support (see above). The following steps build the project:

sudo apt install cmake

git clone https://github.com/ManRod2982/drawing_window_imx

cd drawing_window_imx/drawing_window_cpp/

source /opt/fsl-imx-xwayland/6.12-walnascar-full-gtkmm3/environment-setup-armv8a-poky-linux

cmake -B build -DCMAKE_TOOLCHAIN_FILE=$OECORE_NATIVE_SYSROOT/usr/share/cmake/OEToolchainConfig.cmake

cmake --build build

After this a binary called window is created under the build directory. It can now be copied to the target SD card.

If using Linux, the SD card filesystem will be mounted, so you can simply copy the binary to the /root directory:

sudo cp build/window /media/user/root/root/

SCP can also be used if a connection to the board is already established:

scp build/window root@192.168.x.x:/root

And after this on the target the application can be started as follows:

./window -m model_path [optional] -d delegate_path [optional] -v

Three parameters are accepted by the application:

Path to the model: -m or --model_path
[Optional] Path to the delegate, if any: -d or --delegate_path. If none is provided, the application attempts to run the model on the CPU using the XNNPACK delegate.
[Optional] Verbosity flag: -v. If present, the application outputs more information.
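
For the Python variant, a command-line interface with the same three parameters could be sketched with argparse as shown below. This is only an illustration: the actual argument handling in run.py is not shown in this guide, and the long option name --verbose and the helper name build_parser are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the -m / -d / -v options described above."""
    parser = argparse.ArgumentParser(description="Run a TFLite digit classifier")
    # Path to the .tflite model (mandatory)
    parser.add_argument("-m", "--model_path", required=True,
                        help="Path to the .tflite model")
    # Optional external delegate; None means "run on the CPU"
    parser.add_argument("-d", "--delegate_path", default=None,
                        help="Path to an external delegate library")
    # Optional verbosity flag (the long name is an assumption)
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Print additional information")
    return parser


# Example: parse an explicit argument list instead of sys.argv
args = build_parser().parse_args(
    ["-m", "cnn_tf_quant_vela.tflite",
     "-d", "/usr/lib/liblitert_ethosu_delegate.so"]
)
print(args.model_path, args.delegate_path, args.verbose)
```

Leaving the delegate at None by default keeps the CPU path as the fallback, which matches the behaviour described for the C++ application.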

Now we need a model to run.

Training a simple CNN model

The Machine Learning User's guide for this release lists the support for the different frameworks with respect to the available compute engines in each device:

Screenshot from 2025-12-30 16-24-57.png

Tensorflow Lite and LiteRT (the successor of Tensorflow Lite and the only one of the two moving forward) are the frameworks supported on most compute engines in the i.MX9 family. This guide uses Tensorflow Lite for the C++ example because the current release of LiteRT only supports Python; the interface and process are otherwise pretty much the same.

Setting up the environment

The example repository contains different python scripts used to train the models and convert them to the tflite format. In order to follow the next steps a python3 installation is necessary.

It is recommended to set up a virtual environment:

python3 -m venv myenv
source myenv/bin/activate
pip install -r requirements.txt

This will install all the required packages for both Tensorflow and Pytorch.

Training a model with Tensorflow

Tensorflow offers a straightforward path to quantize a model and convert it to Tensorflow Lite. Our Convolutional Neural Network (CNN) architecture looks as follows:

model = tf.keras.models.Sequential([
  tf.keras.layers.Input(batch_shape=(1, 28, 28, 1)),
  tf.keras.layers.Conv2D(16, 5, padding='same', activation='relu'),
  tf.keras.layers.Conv2D(32, 3, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.MaxPool2D(2, strides=(2,2)),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(100, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation='softmax')
])

We can train the model by running the script train_tf.py. Training takes around 2 min on a normal laptop and achieves 99.05% accuracy on the test dataset. For details on the framework please refer to the official Tensorflow documentation.

After running the script we can visualize our model using the eIQ toolkit model visualizer or Netron (netron.app):

cnn_tf.keras (1).png

The i.MX93 features an Arm Ethos-U65 NPU, which requires the weights, biases and inputs to be integers, while our current model uses float32. We therefore need to quantize the model. To achieve this we can run tf2quant_tflite.py, which quantizes the model and converts it to tflite:

cnn_tf_quant.tflite.png
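
The contents of tf2quant_tflite.py are not reproduced in this guide. As a rough sketch, full-integer post-training quantization with the Tensorflow Lite converter typically looks like the following. The function name quantize_keras_model, its parameters and the calibration_images argument are assumptions made for illustration, and the script in the repository may differ.

```python
def quantize_keras_model(keras_path, calibration_images, out_path):
    """Convert a float Keras model to a fully int8-quantized .tflite file.

    calibration_images: iterable of 28x28 float images (assumed to be scaled
    the same way as the training data), used to calibrate value ranges.
    """
    # Imported lazily so the sketch can be read without Tensorflow installed
    import numpy as np
    import tensorflow as tf

    model = tf.keras.models.load_model(keras_path)

    def representative_dataset():
        # The converter runs these samples through the model to pick
        # the scale and zero point of every tensor.
        for image in calibration_images:
            sample = np.asarray(image, dtype=np.float32).reshape(1, 28, 28, 1)
            yield [sample]

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    # Restrict the model to int8 kernels and make inputs/outputs int8 too
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8

    tflite_model = converter.convert()
    with open(out_path, "wb") as f:
        f.write(tflite_model)
    return len(tflite_model)  # size of the quantized model in bytes
```

A call such as quantize_keras_model("cnn_tf.keras", x_train[:200], "cnn_tf_quant.tflite") would then produce the quantized file, where x_train stands for whatever calibration images are at hand.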

The converted model now takes integer inputs and outputs, and the weights and biases have also been quantized. The difference in file size is easy to see:

Screenshot from 2025-12-30 17-46-15.png

The quantized model is 555kB whereas the float32 model is 2.2MB, since float32 requires 4 bytes to store each weight and bias, whereas the quantized model requires only one byte.
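
These file sizes can be sanity-checked by counting the parameters of the network shown earlier. The sketch below assumes one byte per quantized weight and four bytes per quantized bias (int32, as is usual for TFLite full-integer quantization), and it ignores file-format overhead, so it only approximates the sizes reported above.

```python
# (weights, biases) per layer, derived from the Keras model definition
layers = {
    "conv2d_5x5": (5 * 5 * 1 * 16, 16),
    "conv2d_3x3": (3 * 3 * 16 * 32, 32),
    "dense_100": (13 * 13 * 32 * 100, 100),
    "dense_10": (100 * 10, 10),
}

weights = sum(w for w, _ in layers.values())
biases = sum(b for _, b in layers.values())
total_params = weights + biases          # 546966 parameters

float32_bytes = total_params * 4         # 4 bytes per float32 value
int8_bytes = weights * 1 + biases * 4    # int8 weights, int32 biases (assumption)

print(f"parameters: {total_params}")
print(f"float32 model: ~{float32_bytes / 1e6:.2f} MB")
print(f"int8 model:    ~{int8_bytes / 1e3:.0f} kB")
```

The estimate lands at roughly 2.19 MB and 547 kB, close to the 2.2MB and 555kB on disk; the remainder is metadata such as quantization parameters and the graph description.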

You now have a model that can be used on the target. As it is right now, it will run on the CPU using the XNNPACK delegate. To run the model simply do:

./window -m cnn_tf_quant.tflite

We can now compile our quantized model for the ARM Ethos NPU. The eIQ toolkit will be used. Open the model through the model tool:

Navigate to the folder with your quantized model (cnn_tf_quant.tflite in this case) and open it. You should be able to visualize the model. Then click on the options menu and select convert:

Select the i.MX93 converter; you will also be prompted to select the destination folder:

After selecting the destination folder, if all goes well the conversion completes and the model optimized for the Ethos NPU can be visualized. Any operations not supported by the NPU are shown and carried out by the CPU; in this simple example all operations are carried out by the NPU:

cnn_tf_quant_vela.tflite.png

And we can now run the model on the target as follows:

./window -m cnn_tf_quant_vela.tflite -d /usr/lib/libethosu_delegate.so

Training a model with Pytorch

The repository contains a sample model using Convolutional Neural Networks to train on the MNIST data set, the model structure is as follows:

NeuralNetwork(
  (cnn): Sequential(
    (0): Conv2d(1, 16, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (1): ReLU()
    (2): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1))
    (3): ReLU()
    (4): Dropout(p=0.2, inplace=False)
    (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Flatten(start_dim=1, end_dim=-1)
    (7): Linear(in_features=5408, out_features=100, bias=True)
    (8): ReLU()
    (9): Dropout(p=0.2, inplace=False)
    (10): Linear(in_features=100, out_features=10, bias=True)
  )
)
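
The in_features=5408 of the first Linear layer follows from the spatial size left after the convolutions and the pooling layer. A minimal sketch of that arithmetic, using the usual output-size formula floor((n + 2p - k) / s) + 1 (the helper name conv_out is made up for this illustration):

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Output width/height of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


side = 28                              # MNIST images are 28x28
side = conv_out(side, 5, padding=2)    # Conv2d 5x5, padding 2 -> 28
side = conv_out(side, 3)               # Conv2d 3x3, no padding -> 26
side = conv_out(side, 2, stride=2)     # MaxPool2d 2x2, stride 2 -> 13

channels = 32                          # output channels of the second Conv2d
in_features = side * side * channels   # 13 * 13 * 32 = 5408
print(side, in_features)
```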

Pytorch models can be easily converted to Tensorflow Lite (without quantization) to be run on the CPU, as well as to the Open Neural Network Exchange (ONNX) format. Executorch, an inference runtime for Pytorch models on embedded devices, has also recently been released, and support for it is currently in the works.

The Pytorch model is defined under pytorch_model.py:

#!/usr/bin/env python3

import torch
from torch import nn

# Define model
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(
            # Input 28x28x1, after padding 32x32x1, output 28x28x16
            nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, padding=2),
            nn.ReLU(),
            # Input 28x28x16, output 26x26x32
            nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3),
            nn.ReLU(),
            nn.Dropout(p=0.2),
            # Input 26x26x32, output 13x13x32
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Flatten(),
            nn.Linear(13*13*32, 100),
            nn.ReLU(),
            nn.Dropout(p=0.2),
            nn.Linear(100, 10)
        )

    def forward(self, x):
        logits = self.cnn(x)
        return logits

Training is carried out by executing train_pytorch.py. It achieves 99.3% accuracy on the test dataset and takes around 7 min to complete on a normal laptop. For details on the framework itself and the training process refer to the official Pytorch documentation.

The Pytorch model is then saved as pytorch_model.pth. However, Pytorch does not save the graph information, only the weights and biases. If we visualize the saved model in Netron or the eIQ toolkit model visualizer, we can observe the disconnected weights and biases:

pytorch_model.pth.png

To better visualize our model we can convert it to the ONNX format using the script pytorch2onnx.py. We can then visualize the graph of our model in Netron:

pytorch_cnn.onnx.png

NOTE: ONNX might also provide a way to quantize the model and convert it to tflite. However, in my tests onnx-tf seemed to be out of sync with the latest Tensorflow framework, and it was easier to create a similar model in Tensorflow and then quantize and export it.

We can now export our model to tflite. Since the model is not quantized it will run on the CPU (XNNPACK delegate). To export it, run pytorch2tflite.py; the exported model can then be visualized:

pytorch_cnn.tflite.png

And we can run this model on the target as follows:

./window -m pytorch_cnn.tflite

Deploying and running the model

We now have an application where we can draw digits and a model capable of detecting them, but the application still needs to execute that model and get the results. This is our next step.

In order to be able to run the model on the target we need to:

Load the model
Create a tflite interpreter
Load external delegates if any
Allocate the tensors

C++ example

A minimal example is provided here, however it doesn't include the loading of the external delegate, which we will need in order to be able to run our model on the NPU.

The required headers are the following:

#include "tensorflow/lite/delegates/external/external_delegate.h"
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/interpreter_builder.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model_builder.h"

We can now load our model as follows using the TFLite API:

std::unique_ptr<tflite::FlatBufferModel> model =
      tflite::FlatBufferModel::BuildFromFile(model_path);

An interpreter needs to be created now, for this an operation resolver is needed as well as our model:

 tflite::ops::builtin::BuiltinOpResolver resolver;
 std::unique_ptr<tflite::Interpreter> interpreter;
 tflite::InterpreterBuilder(*model, resolver)(&interpreter);

If a delegate is required we now need to create it and update our execution graph so that the interpreter knows to call the delegate on the supported operations:

// Create external delegate option and pass the delegate library
TfLiteExternalDelegateOptions external_delegate_options =
        TfLiteExternalDelegateOptionsDefault(delegate_path);

// Create the External Delegate. This will load the delegate.
TfLiteDelegate *external_delegate =
        TfLiteExternalDelegateCreate(&external_delegate_options);

// Add External Delegate into TFLite Interpreter to automatically delegate nodes.
if (interpreter->ModifyGraphWithDelegate(external_delegate) != kTfLiteOk) {
      std::cerr << "Failed to add delegate" << std::endl;
}

We can now allocate the tensors for our model:

  // Allocate tensors for the model
  if (interpreter->AllocateTensors() != kTfLiteOk) {
    std::cerr << "Failed to allocate tensors" << std::endl;
  }

And at this point we are ready to run the inference using our model!

The last step is to fill the input buffers with our data, invoke the interpreter and retrieve the results from the output buffer, in the following example with a float model:

// Fill input buffers
// Note: The buffer of the input tensor with index `i` of type T can
// be accessed with `T* input = interpreter->typed_input_tensor<T>(i);`
float *input_tensor = interpreter->typed_input_tensor<float>(0);
std::memcpy(input_tensor, input.data(), input.size() * sizeof(float));

// Run inference
if (interpreter->Invoke() != kTfLiteOk) {
  std::cerr << "Failed to invoke Interpreter!" << std::endl;
  return {};
}

// Read output buffers
// Note: The buffer of the output tensor with index `i` of type T can
// be accessed with `T* output = interpreter->typed_output_tensor<T>(i);`
float *output_tensor = interpreter->typed_output_tensor<float>(0);
std::memcpy(output.data(), output_tensor, output.size() * sizeof(float));

In our example application the interpreter creation and the inference call are wrapped in a class called NnModel. Its implementation can be seen in the repository; it handles both float and int8 models without any modification.

The class is instantiated inside the main routine and the inference is called every time the predict button is clicked.

  // Create model with parsed parameters
  NnModel nn(model_path, delegate_path, verbose);

void Window::on_predict_clicked() {
  // Save screen to file
  std::cout << "Predict clicked!" << std::endl;
  // Call inference on NnModel depending on the type
  // the model expects
  int number;
  auto data_type = nn_.get_dtype();
  switch (data_type) {
    case kTfLiteFloat32: {
      std::vector<float> drawing =
          mouse_drawing.export_to_vector<float>(28, 28, 255.0);
      std::vector<float> output_vec_f = nn_.infer(drawing);
      number = get_max_index(output_vec_f);
      break;
    }
    case kTfLiteInt8: {
      std::vector<int8_t> drawing =
          mouse_drawing.export_to_vector<int8_t>(28, 28, 255.0);
      std::vector<int8_t> output_vec_int = nn_.infer(drawing);
      number = get_max_index(output_vec_int);
      break;
    }
    default:
      std::cerr << "Cannot handle input type: " << std::to_string(data_type)
                << std::endl;
      break;
  }

  std::string display = "You drew a: " + std::to_string(number);
  std::cout << display << std::endl;
  text_view.set_text(display);
}

Python example

The process for creating an interpreter in Python is pretty similar: we still need to load the model, load a delegate if one is used, and allocate the tensors.

In this example LiteRT is used instead. The API remains the same; the only change needed is where the interpreter is imported from.

The following minimal code can be used to load the model and any external delegates:

from ai_edge_litert.interpreter import Interpreter

# Create interpreter
if delegate_path is not None:
    # attempt to load external delegate if provided (platform specific)
    try:
        from ai_edge_litert.interpreter import load_delegate

        delegate = load_delegate(delegate_path)
        self.interpreter = Interpreter(model_path=model_path, experimental_delegates=[delegate])
    except Exception as e:
        raise RuntimeError(f"Failed to load delegate: {e}")
else:
    self.interpreter = Interpreter(model_path=model_path)

# Allocate tensors in both cases (with or without a delegate)
self.interpreter.allocate_tensors()

We now have an interpreter we can use, we just need to fill the input tensors, invoke the interpreter and retrieve the output tensors with the results from our model:

# Set input
input_details = self.interpreter.get_input_details()[0]
self.interpreter.set_tensor(input_details['index'], input_data)
# Run inference
self.interpreter.invoke()

# Get results
out_details = self.interpreter.get_output_details()[0]
output_data = self.interpreter.get_tensor(out_details['index'])

These steps are contained in a wrapper class under nn_model.py.
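
For an int8 model the input data has to be quantized before it is passed to set_tensor, and the predicted digit is the index of the largest output score. The interpreter reports the (scale, zero_point) pair of each tensor under the 'quantization' key of get_input_details() and get_output_details(). The helpers below are a pure-Python sketch of that arithmetic; their names and the example scale and zero point values are assumptions, and the wrapper class in nn_model.py may handle this differently.

```python
def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map real values to int8: q = round(x / scale) + zero_point, clamped."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map int8 values back to real values: x = scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in q_values]


def argmax(scores):
    """Index of the largest score, i.e. the predicted digit."""
    return max(range(len(scores)), key=scores.__getitem__)


# Example with assumed parameters: pixels in [0, 1], scale 1/255, zero point -128
pixels = [0.0, 1.0]
print(quantize(pixels, scale=1 / 255, zero_point=-128))

# Example int8 output for the ten digit classes (made-up values)
raw_output = [-128, -128, -120, 115, -128, -128, -110, -128, -128, -128]
print("predicted digit:", argmax(raw_output))
print("score:", dequantize([raw_output[3]], scale=1 / 256, zero_point=-128)[0])
```

Because the output scale is positive, taking the argmax of the raw int8 values gives the same digit as taking it after dequantization, so dequantizing is only needed when the actual probability is of interest.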

Benchmarking the models

A prebuilt benchmarking tool, benchmark_model, is provided in the release. It generates random inputs and measures the time it takes to run inference on the model. The command below was used, and the table that follows lists the results of running the different models on the i.MX93:

./benchmark_model --graph=model --num_threads=num_cores

	CPU 1 core	CPU 2 cores	NPU
pytorch_cnn.tflite	1559.61 us	1023.22 us	NA
cnn_tf_quant.tflite	585.37 us	379.69 us	NA
cnn_tf_quant_vela.tflite	NA	NA	221.84 us

This is of course a toy example, but it shows that running on the dedicated hardware provides a significant improvement in inference speed: the NPU run is about 1.7 times faster than the same quantized model on two CPU cores.
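
To put numbers on the improvement, the speedups can be computed directly from the table above. This is only arithmetic on the reported latencies; the dictionary keys are labels chosen for this sketch.

```python
# Latencies in microseconds, copied from the benchmark table
latency_us = {
    "pytorch_cnn.tflite, CPU 1 core": 1559.61,
    "pytorch_cnn.tflite, CPU 2 cores": 1023.22,
    "cnn_tf_quant.tflite, CPU 1 core": 585.37,
    "cnn_tf_quant.tflite, CPU 2 cores": 379.69,
}
npu_us = 221.84  # cnn_tf_quant_vela.tflite on the NPU

# Speedup of the NPU run relative to each CPU configuration
speedup = {name: t / npu_us for name, t in latency_us.items()}

for name, factor in speedup.items():
    print(f"NPU is {factor:.2f}x faster than {name}")
```

The NPU run comes out at roughly 2.6 times faster than the quantized model on one CPU core and about 7 times faster than the float model on one core.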
