TFLM Inference Issue: Unexpected Outputs with Large ASR Model in SDRAM (i.MX RT1170 EVK)

priyesh_shahi · 05-19-2026

Hello NXP Support Team,

We are seeing incorrect inference results while running a large Automatic Speech Recognition (ASR) model using TensorFlow Lite for Microcontrollers (TFLM) on the i.MX RT1170.

Because the model and the required tensor arena exceed internal SRAM, both are explicitly placed into external SDRAM. The application runs smoothly without any memory faults or crashes, but the mathematical output of the model is incorrect.

Environment:
MCU: MIMXRT1170 (Cortex-M7)

OS: FreeRTOS

Framework: eIQ / TFLM

Memory: Model and Tensor Arena residing entirely in external SDRAM

Problem Description:
We load the model data into the external SDRAM and configure the MPU.

The TFLM MicroInterpreter is initialized, and AllocateTensors() completes successfully (kTfLiteOk).

We feed pre-computed float features into the input tensor. To ensure DMA/memory coherency, we manually invalidate the DCache for the input buffer prior to inference.

interpreter.Invoke() executes and completes successfully with no crashes or HardFaults.

The Issue: The output logits returned by the interpreter are completely incorrect (or blank) compared to running the exact same model and input data on a desktop PC.

Troubleshooting Steps Taken:
Input Verification: Verified that the byte-for-byte input data fed into the MCU tensor exactly matches the input used in our successful PC Python baseline.

Cache Management: Configured the MPU for the SDRAM region and ensured DCache is invalidated before inference so the Cortex-M7 doesn't read stale memory.

OpResolver: Verified that every TFLite operation required by the model is registered and supported.
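
A note on the cache step above: on the Cortex-M7, a clean writes dirty cache lines out to memory, while an invalidate discards them. If the CPU itself fills the input tensor in a write-back cacheable region, a clean (not an invalidate) is normally the operation that preserves that data. Cache maintenance by address works on whole 32-byte lines, so the range is usually rounded out to line boundaries first. A minimal host-side sketch of that rounding, where AlignToCacheLines and CacheRange are hypothetical names used only for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Cortex-M7 L1 data cache line size in bytes.
constexpr std::uintptr_t kCacheLine = 32;

struct CacheRange {
    std::uintptr_t start;  // rounded down to a line boundary
    std::size_t size;      // whole number of cache lines, in bytes
};

// Expand [addr, addr + bytes) so it covers complete cache lines.
inline CacheRange AlignToCacheLines(std::uintptr_t addr, std::size_t bytes) {
    assert(bytes > 0);
    const std::uintptr_t mask = ~(kCacheLine - 1);
    const std::uintptr_t start = addr & mask;
    const std::uintptr_t end = (addr + bytes + kCacheLine - 1) & mask;
    return {start, static_cast<std::size_t>(end - start)};
}

// On the target (not compiled in this sketch), the aligned range would be
// passed to the CMSIS call SCB_CleanDCache_by_Addr(address, size) after the
// CPU has written the input tensor. The exact pointer type of the first
// argument differs between CMSIS versions.
```

Because the rounded range can cover neighbouring data that shares a line, buffers that need cache maintenance are commonly declared 32-byte aligned and padded to a multiple of 32 bytes.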

Questions for the Team:
Are there known precision loss issues or mathematical discrepancies with NXP's TFLM implementation when heavy matrix operations execute entirely out of external SDRAM?

Does the Cortex-M7 require a specific MPU attribute configuration (e.g., Strongly Ordered vs. Normal memory) for the external SDRAM to ensure TFLM calculates the weights/activations accurately?

Are there specific eIQ hardware-accelerator flags that should be disabled if we are forcing TFLM to run operations outside of the tightly coupled memory (DTCM/ITCM)?

Any guidance on debugging this discrepancy between the MCU output and the PC baseline would be greatly appreciated.
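
One way to narrow down this kind of discrepancy is to compare tensors element by element against the PC baseline, starting with the output logits and then moving to intermediate tensors if they can be dumped. The sketch below is a plain C++ helper for that comparison; CompareTensors and DiffReport are hypothetical names, and the acceptable tolerance depends on the model and is not specified here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>

struct DiffReport {
    float max_abs_diff;       // largest |mcu[i] - ref[i]| found
    std::size_t worst_index;  // index where it occurred
};

// Compare an MCU tensor dump against the PC baseline, element by element.
// A NaN on either side is reported immediately as an infinite difference.
inline DiffReport CompareTensors(const float* mcu, const float* ref,
                                 std::size_t n) {
    assert(mcu != nullptr && ref != nullptr);
    DiffReport report{0.0f, 0};
    for (std::size_t i = 0; i < n; ++i) {
        const float d = std::fabs(mcu[i] - ref[i]);
        if (std::isnan(d)) {
            return {std::numeric_limits<float>::infinity(), i};
        }
        if (d > report.max_abs_diff) {
            report.max_abs_diff = d;
            report.worst_index = i;
        }
    }
    return report;
}
```

Running the same check after each layer shows the first tensor that diverges, which points at the operator responsible.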

Thank you,
Priyesh shahi

EdwinHz · 06-01-2026

Hi,

From the description you shared, the issue is more likely caused by cache coherency than by a numerical precision limitation. Rather than invalidating the DCache for the input buffer prior to inference, I recommend placing shared/DMA-updated buffers in non-cacheable regions. The following application note describes in more detail how to set up and use non-cacheable memory regions: Using NonCached Memory on i.MXRT. Please try this method instead and let me know if it helps.

Addressing your specific questions:

1. We do not have any documented data describing precision loss when heavy matrix operations run entirely from external SDRAM.

2. For external SDRAM that holds ordinary data, the recommendation is to configure the region as Normal memory, which allows it to be marked non-cacheable.

3. There are no specific accelerator flags documented as required for TFLM operations to execute correctly outside the internal RAM. This is another reason why the issue is more likely related to cache handling than to the actual operation of the model.
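
To illustrate the Normal memory attributes mentioned in point 2, here is a small sketch of how the ARMv7-M MPU_RASR register encodes them. The field positions follow the Armv7-M architecture. The 64 MB region size, the full-access permission, and the names EncodeRasr and MemAttr are assumptions made for the example, not values taken from this thread or from the application note.

```cpp
#include <cassert>
#include <cstdint>

// ARMv7-M MPU_RASR bit positions.
constexpr std::uint32_t kXnPos = 28;   // execute never
constexpr std::uint32_t kApPos = 24;   // access permission (3 bits)
constexpr std::uint32_t kTexPos = 19;  // type extension (3 bits)
constexpr std::uint32_t kSPos = 18;    // shareable
constexpr std::uint32_t kCPos = 17;    // cacheable
constexpr std::uint32_t kBPos = 16;    // bufferable
constexpr std::uint32_t kSizePos = 1;  // region size = 2^(SIZE + 1) bytes

struct MemAttr {
    std::uint32_t tex, s, c, b;
};

// Normal memory, non-cacheable: TEX=001, C=0, B=0.
constexpr MemAttr kNormalNonCacheable{1, 0, 0, 0};
// Normal memory, write-back with read and write allocate: TEX=001, C=1, B=1.
constexpr MemAttr kNormalWriteBack{1, 0, 1, 1};

// Build an enabled RASR value with full access (AP=011) for a region of
// 2^size_log2 bytes. Subregions are left enabled (SRD=0).
constexpr std::uint32_t EncodeRasr(MemAttr a, std::uint32_t size_log2,
                                   bool execute_never) {
    return (execute_never ? (1u << kXnPos) : 0u) | (3u << kApPos) |
           (a.tex << kTexPos) | (a.s << kSPos) | (a.c << kCPos) |
           (a.b << kBPos) | ((size_log2 - 1u) << kSizePos) | 1u;
}

// Assumed 64 MB (2^26 bytes) data region, for illustration only.
constexpr std::uint32_t kNonCacheableRasr =
    EncodeRasr(kNormalNonCacheable, 26, true);  // 0x13080033
```

CMSIS ships an ARM_MPU_RASR macro in mpu_armv7.h that takes the same fields, so in a real project that macro would normally be used instead of hand-built constants.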

BR,
Edwin.

priyesh_shahi · 06-08-2026

Hi Edwin,

We have successfully isolated the issue and can confirm it is a mathematical corruption bug within the CMSIS-NN optimized kernels (specifically affecting dilated convolutions, which our ASR model relies on heavily).

Our Proof:
1. We recompiled "libtflm.a" using purely standard C++ Reference Kernels (omitting the "cmsis_nn" directory during compilation) while keeping our SDRAM cache active and using safe DTCM buffers for DMA transfers.
2. With CMSIS-NN bypassed, the model immediately began successfully decoding correct speech tokens. For example, Chunk 7 correctly outputs "[ easy]", and Chunk 10 correctly outputs "[ ch]".
3. When we revert to the precompiled CMSIS-NN library, inference time drops to 537 ms, but the output is corrupted again: every chunk decodes to empty brackets.

This confirms our memory setup, cache configuration, and DMA buffers are completely correct, and that a mathematical error is occurring within the CMSIS-NN optimized convolution/depthwise convolution paths when processing non-unity dilation (dilation rate greater than 1).
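
For readers unfamiliar with dilation: a dilated convolution reads its inputs with a stride of "dilation" elements between kernel taps, so a 3-tap kernel with dilation 2 covers 5 input positions. The sketch below is a deliberately simple 1-D reference with int8 data and int32 accumulators that can serve as a golden model when checking an optimized kernel. It is not CMSIS-NN or TFLM code, it leaves out offsets, bias and requantization, and DilatedConv1D is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 1-D "valid" convolution (cross-correlation, no kernel flip) with dilation:
//   out[i] = sum over k of in[i + k * dilation] * w[k]
// Returns an empty vector if the input is shorter than the receptive field.
std::vector<std::int32_t> DilatedConv1D(const std::vector<std::int8_t>& in,
                                        const std::vector<std::int8_t>& w,
                                        std::size_t dilation) {
    assert(dilation >= 1 && !w.empty());
    // Number of input positions spanned by one output element.
    const std::size_t span = (w.size() - 1) * dilation + 1;
    if (in.size() < span) {
        return {};
    }
    std::vector<std::int32_t> out(in.size() - span + 1, 0);
    for (std::size_t i = 0; i < out.size(); ++i) {
        for (std::size_t k = 0; k < w.size(); ++k) {
            out[i] += static_cast<std::int32_t>(in[i + k * dilation]) *
                      static_cast<std::int32_t>(w[k]);
        }
    }
    return out;
}
```

With input {1, 2, 3, 4, 5, 6} and kernel {1, 1, 1}, dilation 1 gives {6, 9, 12, 15} and dilation 2 gives {9, 12}. A kernel that ignores the dilation parameter would return the first result in both cases, which is one quick way to spot this class of bug.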

Our Questions:
- Which version of the eIQ SDK / CMSIS-NN middleware contains the official bugfix for dilated convolutions (dilation rate greater than 1) in arm_convolve_s8 and arm_depthwise_conv_s8?
- Can you provide us with a patch or an updated libtflm.a with corrected CMSIS-NN kernels, so that we can achieve both full accuracy and the optimized 537 ms inference time?

Regards,
Priyesh shahi