Hi @sben,
1) How to ensure XNNPACK does not override the external (Neutron) delegate.
On NXP’s builds, XNNPACK is enabled by default and the runtime will route models through it unless you explicitly disable it. In NXP’s examples this is done with the --use_xnnpack=false switch; the guide states: “Models are executed via the XNNPACK Delegate by default … To run the example on the CPU without using the XNNPACK delegate, use the --use_xnnpack=false switch.” In the same docs, the label_image/benchmark_model examples accept --use_xnnpack=false, and external delegates (like libneutron_delegate.so) are supplied via --external_delegate_path. In your own C++ app you need to mirror that behavior: build the interpreter with tflite::ops::builtin::BuiltinOpResolverWithoutDefaultDelegates (from tensorflow/lite/kernels/register.h) rather than the plain BuiltinOpResolver, so the XNNPACK delegate is never instantiated; if your app goes through a delegate-provider helper, pass it the equivalent “use_xnnpack=false” option before the interpreter is created. If your TFLite build was compiled with XNNPACK forced on, rebuild without it. This is the supported way to keep XNNPACK off the graph; otherwise the runtime will prefer it and collide with the Neutron custom nodes, exactly as your log shows.
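If you want to sanity-check this behavior quickly (for example with the Python bindings on the board) before changing your C++ app, the same two knobs exist there. A minimal sketch, assuming the delegate library lives at /usr/lib/libneutron_delegate.so and a hypothetical model file name:

```python
import numpy as np
import tensorflow as tf

# Load the external Neutron delegate; the library path is an assumption,
# use the one shipped in your BSP image.
neutron = tf.lite.experimental.load_delegate("/usr/lib/libneutron_delegate.so")

# BUILTIN_WITHOUT_DEFAULT_DELEGATES builds the op resolver without the
# default delegates, i.e. XNNPACK is never applied automatically; this is
# the API-level equivalent of --use_xnnpack=false.
interpreter = tf.lite.Interpreter(
    model_path="osnet_int8_neutron.tflite",  # hypothetical file name
    experimental_delegates=[neutron],
    experimental_op_resolver_type=(
        tf.lite.experimental.OpResolverType.BUILTIN_WITHOUT_DEFAULT_DELEGATES),
)
interpreter.allocate_tensors()

# Ops the Neutron delegate does not capture fall back to CPU reference
# kernels automatically; nothing else needs to be configured (see point 2).
inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
```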
2) “If I deactivate the CPU and only run on the NPU, won’t performance be impacted? How can the NPU execute unsupported ops?”
In TFLite the execution model is hybrid by design: each delegate offloads only the partitions it supports, and everything else falls back to CPU (reference or XNNPACK kernels). The NXP guide is explicit: “The delegates are not required to support the full set of operators… unsupported operations fall back to CPU… the computational graph is divided into segments and each segment is executed via the delegate or on the CPU.” You generally do not (and cannot) disable the CPU; you let the Neutron delegate run its partitions and allow CPU kernels to handle the rest. Performance is not “worse because CPU is enabled”—it’s required for correctness. The performance knob is to maximize the fraction of the graph that the Neutron converter/delegate can capture, not to forcefully remove the CPU.
3) Why did YOLO work with both an external and XNNPACK delegate, but OSNet didn’t?
When you run a model that doesn’t contain Neutron custom nodes, XNNPACK can happily optimize CPU portions while the external delegate (if any) either has nothing to capture or coexists without conflict. In your failing case, the model has NeutronGraph custom ops inserted by the neutron‑converter. The Neutron delegate “captures operators and aggregates them as a neutronGraph node… and offloads the work to Neutron‑S”; it “only captures the neutronGraph node” in converted models. XNNPACK does not understand that custom op, and because it is applied after Neutron, it attempts to re‑analyze the graph and hits an unresolved custom op NeutronGraph—the exact error sequence you posted. That’s why YOLO (no NeutronGraph nodes in your converted variant) runs fine with both delegates, while OSNet does not. The fix is to disable XNNPACK when using a Neutron‑converted model so only the Neutron delegate handles its custom partitions and the remaining ops use reference CPU kernels.
4) Review of your INT8 quantization recipe (and why to double‑check it).
Your script uses full‑integer post‑training quantization with a representative dataset, which is the right approach. NXP’s guide emphasizes that the quality and size of the representative dataset strongly affect accuracy and that the converter version matters; it also warns against dynamic‑range (weights‑only) quantization for accelerators, because it leaves activations in fp32 and slows things down. A few checks to run given your symptoms:
- NXP recommends using a converter aligned with the BSP’s TFLite (they cite 2.15.0/2.19.0 in releases) and notes that the MLIR converter may introduce dynamic tensor shapes by default, which are unsupported on accelerators; they even describe a workaround to disable unknown shapes. If your pipeline accidentally produced unknown/dynamic dims, Neutron conversion/partitioning can be affected (i.MX Machine Learning User's Guide, UG10166).
- You set inference_input_type = tf.float32 and inference_output_type = tf.float32 while targeting TFLITE_BUILTINS_INT8. That’s valid, but if your Neutron toolchain expects fully integer I/O for certain patterns, mismatches can surface at conversion or delegation time; see the sketch after this list.
- Re‑validate that your representative_dataset_gen truly matches runtime preprocessing (scales, means, ranges). NXP stresses dataset choice and coverage as a hyperparameter with high impact on int8 quality.
- Confirm NeutronGraph presence and partitioning: run the benchmark tool with --enable_op_profiling=true to see how the graph is partitioned, and verify that Neutron captures the intended segments and that only CPU kernels remain outside.
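To make the I/O and NeutronGraph checks concrete, here is a minimal sketch of the fully integer variant of your recipe, plus a structural check on the converted file. The saved-model path, file names, input shape, and random calibration data are placeholders (OSNet re‑ID inputs are often 256×128, but use your real shape), and whether your Neutron toolchain actually requires int8 I/O is the thing to verify against the guide:

```python
import numpy as np
import tensorflow as tf

def representative_dataset_gen():
    # Must mirror runtime preprocessing exactly (resize, scaling, mean/std).
    # Random data is a placeholder; feed real calibration images here.
    for _ in range(200):
        yield [np.random.rand(1, 256, 128, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("osnet_saved_model")  # hypothetical path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Fully integer I/O, unlike the float32 I/O in your current script:
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
with open("osnet_int8.tflite", "wb") as f:
    f.write(converter.convert())

# After passing the file through the neutron-converter, inspect the result;
# the op listing should show the NeutronGraph custom op(s). Everything
# outside them is what will run on CPU kernels at inference time.
tf.lite.experimental.Analyzer.analyze(model_path="osnet_int8_neutron.tflite")
```

Analyzer.analyze prints the op list to stdout, so you can search it for NeutronGraph (and for any dynamic/unknown dims) before ever deploying to the board.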
i.MX Machine Learning User's Guide