Hi @sben,
1) How to ensure XNNPACK does not override the external (Neutron) delegate.
On NXP’s builds, XNNPACK is enabled by default and the runtime will route models through it unless you explicitly disable it. In NXP’s examples this is done with the --use_xnnpack=false switch; the guide states: “Models are executed via the XNNPACK Delegate by default … To run the example on the CPU without using the XNNPACK delegate, use the --use_xnnpack=false switch.” In the same docs, the label_image/benchmark_model examples accept --use_xnnpack=false, and external delegates (like libneutron_delegate.so) are supplied via --external_delegate_path. In your own C++ app you need to mirror that behavior: build the interpreter with tflite::ops::builtin::BuiltinOpResolverWithoutDefaultDelegates (from tensorflow/lite/kernels/register.h) rather than the plain BuiltinOpResolver, so the XNNPACK delegate is never instantiated; if your app goes through a delegate-provider helper, pass it the equivalent “use_xnnpack=false” option before the interpreter is created. If your TFLite build was compiled with XNNPACK forced on, rebuild without it. This is the supported way to keep XNNPACK off the graph; otherwise the runtime will prefer it and collide with the Neutron custom nodes, exactly as your log shows.
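If you want to sanity-check this behavior quickly (for example with the Python bindings on the board) before changing your C++ app, the same two knobs exist there. A minimal sketch, assuming the delegate library lives at /usr/lib/libneutron_delegate.so and a hypothetical model file name:

```python
import numpy as np
import tensorflow as tf

# Load the external Neutron delegate; the library path is an assumption,
# use the one shipped in your BSP image.
neutron = tf.lite.experimental.load_delegate("/usr/lib/libneutron_delegate.so")

# BUILTIN_WITHOUT_DEFAULT_DELEGATES builds the op resolver without the
# default delegates, i.e. XNNPACK is never applied automatically; this is
# the API-level equivalent of --use_xnnpack=false.
interpreter = tf.lite.Interpreter(
    model_path="osnet_int8_neutron.tflite",  # hypothetical file name
    experimental_delegates=[neutron],
    experimental_op_resolver_type=(
        tf.lite.experimental.OpResolverType.BUILTIN_WITHOUT_DEFAULT_DELEGATES),
)
interpreter.allocate_tensors()

# Ops the Neutron delegate does not capture fall back to CPU reference
# kernels automatically; nothing else needs to be configured (see point 2).
inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
```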
2) “If I deactivate the CPU and only run on the NPU, won’t performance be impacted? How can the NPU execute unsupported ops?”
In TFLite the execution model is hybrid by design: each delegate offloads only the partitions it supports, and everything else falls back to CPU (reference or XNNPACK kernels). The NXP guide is explicit: “The delegates are not required to support the full set of operators… unsupported operations fall back to CPU… the computational graph is divided into segments and each segment is executed via the delegate or on the CPU.” You generally do not (and cannot) disable the CPU; you let the Neutron delegate run its partitions and allow CPU kernels to handle the rest. Performance is not “worse because CPU is enabled”—it’s required for correctness. The performance knob is to maximize the fraction of the graph that the Neutron converter/delegate can capture, not to forcefully remove the CPU.
3) Why did YOLO work with both an external and XNNPACK delegate, but OSNet didn’t?
When you run a model that doesn’t contain Neutron custom nodes, XNNPACK can happily optimize CPU portions while the external delegate (if any) either has nothing to capture or coexists without conflict. In your failing case, the model has NeutronGraph custom ops inserted by the neutron‑converter. The Neutron delegate “captures operators and aggregates them as a neutronGraph node… and offloads the work to Neutron‑S”; it “only captures the neutronGraph node” in converted models. XNNPACK does not understand that custom op, and because it is applied after Neutron, it attempts to re‑analyze the graph and hits an unresolved custom op NeutronGraph—the exact error sequence you posted. That’s why YOLO (no NeutronGraph nodes in your converted variant) runs fine with both delegates, while OSNet does not. The fix is to disable XNNPACK when using a Neutron‑converted model so only the Neutron delegate handles its custom partitions and the remaining ops use reference CPU kernels.
4) Review of your INT8 quantization recipe (and why to double‑check it).
Your script uses full‑integer post‑training quantization with a representative dataset, which is the right approach. NXP’s guide emphasizes that the quality and size of the representative dataset strongly affect accuracy and that the converter version matters; it also warns against dynamic‑range (weights‑only) quantization for accelerators, because it leaves activations in fp32 and slows things down. A few checks to run given your symptoms:
- NXP recommends using a converter aligned with the BSP’s TFLite (they cite 2.15.0/2.19.0 in releases) and notes that the MLIR converter may introduce dynamic tensor shapes by default, which are unsupported on accelerators; they even describe a workaround to disable unknown shapes. If your pipeline accidentally produced unknown/dynamic dims, Neutron conversion/partitioning can be affected (i.MX Machine Learning User's Guide, UG10166).
- You set inference_input_type = tf.float32 and inference_output_type = tf.float32 while targeting TFLITE_BUILTINS_INT8. That’s valid, but if your Neutron toolchain expects fully integer I/O for certain patterns, mismatches can surface at conversion or delegation time; see the sketch after this list.
- Re‑validate that your representative_dataset_gen truly matches runtime preprocessing (scales, means, ranges). NXP stresses dataset choice and coverage as a hyperparameter with high impact on int8 quality.
- Confirm NeutronGraph presence and partitioning: run the benchmark tool with --enable_op_profiling=true to see how the graph is partitioned, and verify that Neutron captures the intended segments and that only CPU kernels remain outside.
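To make the I/O and NeutronGraph checks concrete, here is a minimal sketch of the fully integer variant of your recipe, plus a structural check on the converted file. The saved-model path, file names, input shape, and random calibration data are placeholders (OSNet re‑ID inputs are often 256×128, but use your real shape), and whether your Neutron toolchain actually requires int8 I/O is the thing to verify against the guide:

```python
import numpy as np
import tensorflow as tf

def representative_dataset_gen():
    # Must mirror runtime preprocessing exactly (resize, scaling, mean/std).
    # Random data is a placeholder; feed real calibration images here.
    for _ in range(200):
        yield [np.random.rand(1, 256, 128, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("osnet_saved_model")  # hypothetical path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Fully integer I/O, unlike the float32 I/O in your current script:
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
with open("osnet_int8.tflite", "wb") as f:
    f.write(converter.convert())

# After passing the file through the neutron-converter, inspect the result;
# the op listing should show the NeutronGraph custom op(s). Everything
# outside them is what will run on CPU kernels at inference time.
tf.lite.experimental.Analyzer.analyze(model_path="osnet_int8_neutron.tflite")
```

Analyzer.analyze prints the op list to stdout, so you can search it for NeutronGraph (and for any dynamic/unknown dims) before ever deploying to the board.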
i.MX Machine Learning User's Guide