TFLM Inference Issue: Unexpected Outputs with Large ASR Model in SDRAM (i.MX RT1170 EVK)

priyesh_shahi

Hello NXP Support Team,

We are experiencing an issue with inference accuracy/behavior while running a large Automatic Speech Recognition (ASR) model using TensorFlow Lite for Microcontrollers (TFLM) on the i.MX RT1170.

Because the model and the required tensor arena exceed internal SRAM, both are explicitly placed into external SDRAM. The application runs smoothly without any memory faults or crashes, but the mathematical output of the model is incorrect.

Environment:
MCU: MIMXRT1170 (Cortex-M7)

OS: FreeRTOS

Framework: eIQ / TFLM

Memory: Model and Tensor Arena residing entirely in external SDRAM

Problem Description:
We load the model data into the external SDRAM and configure the MPU.

The TFLM MicroInterpreter is initialized, and AllocateTensors() completes successfully (kTfLiteOk).

We feed pre-computed float features into the input tensor. To ensure DMA/memory coherency, we manually invalidate the DCache for the input buffer prior to inference.

interpreter.Invoke() executes and completes successfully with no crashes or HardFaults.

The Issue: The output logits returned by the interpreter are completely incorrect (or blank) compared to running the exact same model and input data on a desktop PC.

Troubleshooting Steps Taken:
Input Verification: Verified that the byte-for-byte input data fed into the MCU tensor exactly matches the input used in our successful PC Python baseline.

Cache Management: Configured the MPU for the SDRAM region and ensured DCache is invalidated before inference so the Cortex-M7 doesn't read stale memory.

OpResolver: Verified that every TFLite operation required by the model is registered and supported.
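
For reference, the byte-for-byte input check can be extended to the model bytes after they land in SDRAM, and to the output logits. The following is a minimal host-checkable sketch, not part of the original project: `Crc32` and `MaxAbsDiff` are hypothetical helper names, and how the reference checksum and reference logits are exported from the PC baseline is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32 (reflected polynomial 0xEDB88320, the variant used by zlib).
// Run it over the model bytes as the MCU reads them from SDRAM and compare
// with a checksum of the same .tflite file computed on the PC.
inline std::uint32_t Crc32(const std::uint8_t* data, std::size_t len) {
  std::uint32_t crc = 0xFFFFFFFFu;
  for (std::size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit) {
      // Shift right; XOR in the polynomial when the bit shifted out was 1.
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

// Largest absolute element-wise difference between MCU logits and PC logits.
// A small value points to float rounding; a large one points to corrupted
// weights, activations, or inputs.
inline float MaxAbsDiff(const float* mcu, const float* pc, std::size_t n) {
  assert(n > 0);
  float worst = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    const float diff = std::fabs(mcu[i] - pc[i]);
    if (diff > worst) worst = diff;
  }
  return worst;
}
```

If the checksum of the model in SDRAM matches the PC file and the inputs match but the logits still differ widely, the remaining suspects are the tensor arena contents and the cache handling around them.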

Questions for the Team:
Are there known precision loss issues or mathematical discrepancies with NXP's TFLM implementation when heavy matrix operations execute entirely out of external SDRAM?

Does the Cortex-M7 require a specific MPU attribute configuration (e.g., Strongly Ordered vs. Normal memory) for the external SDRAM to ensure TFLM calculates the weights/activations accurately?

Are there specific eIQ hardware-accelerator flags that should be disabled if we are forcing TFLM to run operations outside of the tightly coupled memory (DTCM/ITCM)?

Any guidance on debugging this discrepancy between the MCU output and the PC baseline would be greatly appreciated.

Thank you,
Priyesh shahi

EdwinHz

Hi,

From the description you shared, this looks more like a cache coherency problem than a numerical precision limitation. Instead of invalidating the DCache for the input buffer prior to inference, I recommend placing shared/DMA-updated buffers in non-cacheable regions. The following application note describes in more detail how to set up and use non-cacheable memory regions: Using NonCached Memory on i.MXRT. Please try this method and let me know if it helps.
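
As a side note on the cache step itself, here is a hedged sketch rather than a documented NXP procedure. On the Cortex-M7, by-address cache maintenance works on whole 32-byte lines. Invalidating a write-back region right after the CPU has written to it can discard those writes before they reach SDRAM. Cleaning is the operation that pushes CPU writes out to memory, while invalidating is meant for data that another bus master wrote. The helper below only performs the line-alignment arithmetic so it can be checked on a host; `CacheRange` and `AlignToCacheLines` are illustrative names, and the CMSIS-Core calls appear only in comments.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Cortex-M7 L1 data cache line size in bytes.
constexpr std::uintptr_t kDCacheLineSize = 32;

struct CacheRange {
  std::uintptr_t start;  // rounded down to a cache-line boundary
  std::size_t size;      // whole number of cache lines, in bytes
};

// Expand [addr, addr + len) so it covers complete cache lines.
inline CacheRange AlignToCacheLines(std::uintptr_t addr, std::size_t len) {
  assert(len > 0);
  const std::uintptr_t mask = kDCacheLineSize - 1;
  const std::uintptr_t start = addr & ~mask;
  const std::uintptr_t end = (addr + len + mask) & ~mask;
  return {start, static_cast<std::size_t>(end - start)};
}

// On target (CMSIS-Core, not compiled in this sketch), with r = AlignToCacheLines(...):
//
//   CPU filled the input tensor and another bus master will read it:
//     SCB_CleanDCache_by_Addr(reinterpret_cast<uint32_t*>(r.start),
//                             static_cast<int32_t>(r.size));
//
//   A DMA filled a buffer and the CPU will read it next:
//     SCB_InvalidateDCache_by_Addr(reinterpret_cast<uint32_t*>(r.start),
//                                  static_cast<int32_t>(r.size));
//
// Assumption: if only the CPU ever touches the tensors, neither call should
// be needed, and an invalidate after CPU writes is the risky one.
```

For example, a 4-byte buffer at address 0x80000010 expands to the single 32-byte line starting at 0x80000000, so any other data sharing that line is affected by the maintenance operation as well.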

Addressing your specific questions:

1. We do not have any documented data describing precision loss when heavy matrix operations run entirely from external SDRAM.

2. For external SDRAM used for ordinary data, the recommendation is to configure it as Normal memory so that it can also be set as non-cacheable.

3. There are no specific accelerator flags documented as the required way to ensure proper execution of TFLM operations outside the internal RAM. This is another reason why the issue is more likely related to cache handling than to the actual operation of the model.
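
To make the memory-attribute point in item 2 concrete, the sketch below encodes the ARMv7-M `MPU_RASR` attribute bits for a few memory types. It is a host-checkable illustration under assumptions, not the configuration from the application note: the helper names (`MemAttr`, `SizeField`, `EncodeRasr`) are hypothetical, and the 64 MB region size in the usage note is an assumed value that must match the actual SDRAM mapping. The memory type controls ordering and caching behavior; it does not change how the arithmetic is computed.

```cpp
#include <cassert>
#include <cstdint>

// ARMv7-M MPU_RASR bit positions (Cortex-M7).
constexpr unsigned kEnablePos = 0;
constexpr unsigned kSizePos = 1;   // SIZE[5:1], region size = 2^(SIZE+1) bytes
constexpr unsigned kBPos = 16;     // bufferable
constexpr unsigned kCPos = 17;     // cacheable
constexpr unsigned kSPos = 18;     // shareable
constexpr unsigned kTexPos = 19;   // TEX[21:19]
constexpr unsigned kApPos = 24;    // AP[26:24]
constexpr unsigned kXnPos = 28;    // execute never
constexpr std::uint32_t kApFullAccess = 3u;

struct MemAttr {
  std::uint32_t tex;
  bool cacheable;
  bool bufferable;
  bool shareable;
};

const MemAttr kStronglyOrdered{0u, false, false, false};
const MemAttr kNormalNonCacheable{1u, false, false, false};
const MemAttr kNormalWriteBackAlloc{1u, true, true, false};

// SIZE field for a power-of-two region of at least 32 bytes (below 4 GB).
inline std::uint32_t SizeField(std::uint32_t bytes) {
  assert(bytes >= 32u && (bytes & (bytes - 1u)) == 0u);
  std::uint32_t exponent = 0;
  while ((1u << exponent) < bytes) ++exponent;
  return exponent - 1u;
}

// Build an enabled, full-access RASR value for the given attributes.
inline std::uint32_t EncodeRasr(const MemAttr& attr, std::uint32_t region_bytes,
                                bool execute_never) {
  return (static_cast<std::uint32_t>(execute_never) << kXnPos) |
         (kApFullAccess << kApPos) |
         (attr.tex << kTexPos) |
         (static_cast<std::uint32_t>(attr.shareable) << kSPos) |
         (static_cast<std::uint32_t>(attr.cacheable) << kCPos) |
         (static_cast<std::uint32_t>(attr.bufferable) << kBPos) |
         (SizeField(region_bytes) << kSizePos) |
         (1u << kEnablePos);
}

// On target the value would be written to MPU->RASR after selecting the
// region and base address through MPU->RBAR (CMSIS-Core, not compiled here).
```

With these assumptions, a 64 MB Normal non-cacheable, execute-never region encodes to 0x13080033, and the same region as write-back with write allocate and execution allowed encodes to 0x030B0033. Marking the whole tensor arena non-cacheable will likely slow inference, so a smaller non-cacheable region for the shared buffers alone may be the better trade-off.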

BR,
Edwin.