Hi,
From the description you shared, the problem is more likely a cache coherency issue than a numerical precision limitation. Rather than invalidating the DCache for the input buffer prior to inference, I recommend placing shared/DMA-updated buffers in non-cacheable regions. The following application note describes in more detail how to set up and use non-cacheable memory regions: Using NonCached Memory on i.MXRT. Please try this method instead and let me know if it helps.
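To illustrate what a non-cacheable region looks like at the register level, here is a minimal sketch that builds the Cortex-M7 MPU_RASR value for a Normal, non-cacheable, execute-never data region (TEX=0b001, C=0, B=0). The bit positions follow the ARMv7-M MPU register layout; the helper names and the 32 MB size are only assumptions for illustration and are not taken from the application note, so please adapt them to your board and memory map.

```c
#include <assert.h>
#include <stdint.h>

/* ARMv7-M MPU_RASR field positions (Cortex-M7). */
#define RASR_ENABLE     (1u << 0)
#define RASR_SIZE_SHIFT 1u
#define RASR_B          (1u << 16)  /* bufferable, left 0 here */
#define RASR_C          (1u << 17)  /* cacheable, left 0 here  */
#define RASR_TEX_SHIFT  19u
#define RASR_AP_SHIFT   24u
#define RASR_XN         (1u << 28)  /* execute never */

#define AP_FULL_ACCESS  3u          /* read/write, privileged and unprivileged */

/* Hypothetical helper: encode the SIZE field, where region size = 2^(SIZE+1).
 * Returns 0 if the size is not a power of two or is below the 32-byte minimum. */
uint32_t mpu_size_field(uint32_t region_bytes)
{
    if (region_bytes < 32u || (region_bytes & (region_bytes - 1u)) != 0u) {
        return 0u;
    }
    uint32_t log2 = 0u;
    while ((1u << log2) != region_bytes) {
        log2++;
    }
    return log2 - 1u;
}

/* Hypothetical helper: RASR value for a Normal, non-cacheable data region
 * (TEX=0b001, C=0, B=0, S=0). Returns 0 (region disabled) on an invalid size. */
uint32_t mpu_rasr_normal_noncacheable(uint32_t region_bytes)
{
    uint32_t size = mpu_size_field(region_bytes);
    if (size == 0u) {
        return 0u;
    }
    return RASR_XN
         | (AP_FULL_ACCESS << RASR_AP_SHIFT)
         | (1u << RASR_TEX_SHIFT)
         | (size << RASR_SIZE_SHIFT)
         | RASR_ENABLE;
}

/* Host-side sanity check. Assumption: a 32 MB region (2^25 bytes, SIZE=24). */
void mpu_attr_self_check(void)
{
    assert(mpu_rasr_normal_noncacheable(32u * 1024u * 1024u) == 0x13080031u);
}
```

On the target, the resulting value would be written to MPU->RASR for the chosen region (with the region base address in MPU->RBAR) while the MPU is disabled, following the sequence in the application note. The region number and base address depend on your MPU setup and memory map, so I have left them out of the sketch.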
Addressing your specific questions:
1. We do not have any documented data describing precision loss when heavy matrix operations are run entirely from external SDRAM.
2. For external SDRAM used for ordinary data, the recommendation is to configure it as Normal memory in the MPU, which allows the region to be marked as non-cacheable.
3. There aren't any specific accelerator flags documented as required for the TFLM operations to execute correctly outside the internal RAM. This is another reason why the issue is more likely related to cache handling than to the operation of the model itself.
BR,
Edwin.