Content originally posted in LPCWare by jesari on Fri Jul 10 14:55:42 MST 2015
Hi,
I found the same issue on an LPC1114 (Cortex-M0), and I think I've got an explanation for this behaviour:
- It seems the flash is read 16 bytes (8 halfwords) at a time. A read takes 3 cycles, but the 128 bits are then latched and can be read back in a single cycle. Older LPCs with the MAM also used 128-bit-wide flash, and newer Cortex-M parts are probably the same...
- The code loop is 2 halfwords long and, if it fits in a single flash chunk, it executes at maximum speed because the flash is read only once. The following iterations fetch the opcodes directly from the latches.
- So why is execution slow when the loop is located at a 0xXXXXXXXA address? That's due to pipelining: when the BNE instruction is executed, the PC is one halfword ahead, reading a potential opcode to be executed next. If this dummy fetch crosses the chunk boundary, it forces a flash read that takes 2 extra cycles and, worse, the loop code held in the latches is lost and has to be read from the flash again on every iteration.
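To make the idea concrete, here is a toy model in C of the behaviour described above. It is only a sketch under assumptions: a single 16-byte latch, 3 cycles for a flash read, 1 cycle for a read from the latches, and one fetch per halfword. The function name `fetch_cycles` and the constants are made up for illustration, and the prefetch distance is left as a parameter (`lookahead_bytes`) because the exact fetch timing is not documented. It counts fetch cycles only, not real execution time.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_BYTES 16u  /* assumed flash line width: 128 bits */
#define MISS_CYCLES 3u   /* assumed cost of a read from the flash array */
#define HIT_CYCLES  1u   /* assumed cost of a read from the latches */

/*
 * Hypothetical model: count fetch cycles for `iterations` passes over a
 * loop of `loop_bytes` bytes starting at `loop_start`, where every pass
 * also fetches `lookahead_bytes` bytes past the end of the loop (the
 * dummy fetch done while the branch executes).
 */
uint32_t fetch_cycles(uint32_t loop_start, uint32_t loop_bytes,
                      uint32_t lookahead_bytes, uint32_t iterations)
{
    /* Thumb opcodes are halfword aligned. */
    assert(loop_start % 2 == 0 && loop_bytes % 2 == 0 &&
           lookahead_bytes % 2 == 0);

    uint32_t latched = UINT32_MAX; /* chunk held in the latches; none yet */
    uint32_t cycles = 0;
    uint32_t end = loop_start + loop_bytes + lookahead_bytes;

    for (uint32_t i = 0; i < iterations; i++) {
        for (uint32_t addr = loop_start; addr < end; addr += 2) {
            uint32_t chunk = addr / CHUNK_BYTES;
            if (chunk == latched) {
                cycles += HIT_CYCLES;   /* served from the latches */
            } else {
                cycles += MISS_CYCLES;  /* new flash read ... */
                latched = chunk;        /* ... which replaces the old chunk */
            }
        }
    }
    return cycles;
}
```

With these assumed numbers and a 4-byte lookahead, `fetch_cycles(0x8, 4, 4, 10)` gives 42 (one flash read, then everything comes from the latches), while `fetch_cycles(0xA, 4, 4, 10)` gives 80, because each pass reloads both chunks. With a lookahead of only 2 bytes the loop at 0xA stays inside one chunk in this model, so the result depends on how far ahead the core really fetches.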
I'm not sure this is really true; maybe some people at NXP know the details that are missing from the User Manuals and can confirm or deny it...