Hi Ari.
The KV58 is based on the Cortex-M7, i.e. it has a Harvard architecture with separate buses:
- CODE bus (optimized for instruction fetch), which gives access to the flash (through the instruction cache) and to the I-TCM (a RAM-like memory)
- System bus (optimized for data access), which gives access to the D0-TCM, D1-TCM, peripherals, etc.
Both buses are cached, i.e. accesses to the slower, cacheable memories (everything except the TCM memories) can be served from a cache. Data is stored in the cache only under certain circumstances.
Depending on how often and how repeatedly the data is accessed, it is possible to lock the content of some cache memories, i.e. to cache specific code/data and then lock that cache region.
It looks like the code whose performance you are monitoring is not executing from the cache but from flash memory (so, for example, branch instructions incur wait states).
The code executes from flash because it runs only once. It may be cached while it runs (and it then stays in the cache until some other event evicts it), but because it is executed only once, it behaves as if it ran purely from flash.
I discussed your issue with the Application team. For optimum use of the M7 core's performance, we suggest that you place critical code in the I-TCM (Instruction Tightly-Coupled Memory, a fast memory intended exclusively for code), static data in the D0-TCM, and dynamic data (the stack) in the D1-TCM. This gives maximum performance: the CODE bus fetches instructions stored in the I-TCM while the System bus accesses data stored in the D0-TCM or D1-TCM.
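As a rough illustration of the placement suggested above, here is a minimal sketch using GCC section attributes. The section names .itcm_text and .dtcm_data are hypothetical: they only work if your linker script defines matching output sections mapped onto the I-TCM and D0-TCM address ranges, and if the startup code copies their contents from flash. The routine fir_step and the table filter_coeffs are made-up examples, not part of your project.

```c
#include <stdint.h>

/* Hypothetical section names: they must match output sections that the
 * project's linker script maps onto the I-TCM and D0-TCM address ranges. */
#if defined(__arm__)
#define ITCM_CODE __attribute__((section(".itcm_text"), noinline))
#define DTCM_DATA __attribute__((section(".dtcm_data")))
#else
/* On a host build the attributes are dropped so the sketch still compiles. */
#define ITCM_CODE
#define DTCM_DATA
#endif

/* Static data intended for the D0-TCM. */
DTCM_DATA static int32_t filter_coeffs[4] = {1, 2, 3, 4};

/* Time-critical routine intended for the I-TCM. */
ITCM_CODE int32_t fir_step(const int32_t *samples)
{
    int32_t acc = 0;
    for (int i = 0; i < 4; i++) {
        acc += filter_coeffs[i] * samples[i];
    }
    return acc;
}
```

The stack placement in the D1-TCM is normally done in the linker script itself (by pointing the stack symbol at that region) rather than in C code.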
If the code is stored in flash memory, the I-CACHE must be enabled, and the code must also be repeated, i.e. executed more than once, for the cache to help. Alternatively, the code can be executed once deliberately, just to get it cached, and the cache then locked, so that later executions do not have to access the flash but run directly from the cache.
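To see the effect of repeated execution in a measurement, one option is to run the routine a few times before timing it, so that the timed pass can hit the cache. The sketch below is only an illustration of that idea: measure_warm, routine_fn and cycle_counter_fn are hypothetical names, and the demo_ functions are stand-ins so the code can be tried off-target. On the KV58, assuming your project uses the CMSIS core header, you would call SCB_EnableICache() once at startup and supply a counter function that returns DWT->CYCCNT.

```c
#include <stdint.h>

/* Hypothetical hook types: the routine under test and a free-running
 * cycle counter (for example a wrapper around DWT->CYCCNT on target). */
typedef void (*routine_fn)(void);
typedef uint32_t (*cycle_counter_fn)(void);

/* Run the routine warmup_runs times untimed, then time one more pass. */
uint32_t measure_warm(routine_fn routine, cycle_counter_fn now,
                      unsigned warmup_runs)
{
    for (unsigned i = 0; i < warmup_runs; i++) {
        routine();              /* untimed passes that fill the cache */
    }
    uint32_t start = now();
    routine();                  /* timed pass */
    return now() - start;       /* unsigned subtraction tolerates one wrap */
}

/* Demo stand-ins so the sketch can be exercised off-target. */
static unsigned demo_calls;
static uint32_t demo_ticks;
static void demo_routine(void) { demo_calls++; demo_ticks += 7; }
static uint32_t demo_counter(void) { return demo_ticks; }
```

Comparing the result with warmup_runs set to 0 against a run with a few warm-up passes should show whether the first execution is paying for flash wait states.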
Also, if data from peripherals is often used directly in the calculation, the D-CACHE needs to be enabled as well (although it may bring no benefit in some cases, for example data read from an ADC usually has to be refreshed anyway).
In case of any issue, please let me know.
Best regards,
Iva