Hello,
As mentioned in the title, can the NPU in the MCU support transformer-based models? The official website claims that the NPU supports transformer models.
Assuming that, with compression, quantization, and external RAM over FlexSPI, a small transformer-based model can fit on the board, can the NPU actually be used to accelerate inference?
An example would be GitHub - maxbbraun/llama4micro: A "large" language model running on a microcontroller
Instead of running on the CPU, can the model be run on the NPU?
How can such a use case be adapted to MCXN?
Hi @TomC818
Running a full LLM on the NPU isn't a supported path on MCX yet.
LLMs rely on dynamic sequence lengths, KV caches, and similar mechanisms, many of which fall outside the current eIQ Neutron TFLite op set.
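What you can do today is quantize a small, static-shape transformer block to a fully int8 TFLite model and then check with the eIQ tooling which operators can be offloaded and which fall back to the CPU. Below is a minimal sketch of that first conversion step using standard TensorFlow/TFLite APIs; the model size, sequence length, and calibration data are illustrative assumptions, and the subsequent eIQ Neutron converter step is not shown.

```python
# Sketch: post-training int8 quantization of a tiny, fixed-shape transformer
# encoder block. Shapes and calibration data are illustrative assumptions; the
# resulting .tflite would still need the eIQ Neutron conversion step (not shown)
# to see which ops map to the NPU and which stay on the Cortex-M core.
import numpy as np
import tensorflow as tf

SEQ_LEN, D_MODEL, N_HEADS = 32, 64, 4  # small, static shapes only

inputs = tf.keras.Input(shape=(SEQ_LEN, D_MODEL))
attn = tf.keras.layers.MultiHeadAttention(
    num_heads=N_HEADS, key_dim=D_MODEL // N_HEADS)(inputs, inputs)
x = tf.keras.layers.LayerNormalization()(inputs + attn)
ffn = tf.keras.layers.Dense(4 * D_MODEL, activation="relu")(x)
ffn = tf.keras.layers.Dense(D_MODEL)(ffn)
outputs = tf.keras.layers.LayerNormalization()(x + ffn)
model = tf.keras.Model(inputs, outputs)

def representative_data():
    # Calibration samples for int8 quantization (random here, real data in practice).
    for _ in range(100):
        yield [np.random.randn(1, SEQ_LEN, D_MODEL).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("tiny_transformer_int8.tflite", "wb") as f:
    f.write(converter.convert())
```

Any operators the Neutron converter can't map to the NPU will run on the CPU through TFLite Micro, so for attention-heavy models you should expect only partial acceleration.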
BR
Harry