Hi,
I will use simple example only: Let's assume we have very short loop - four 32bit instructions only. This is 128bits in total and it is the width of flash line buffer. Let's also assume that no speculative read is enabled (read 35.5.2 Speculative reads in the reference manual for more details).
If this loop is in aligned to 128bit, whole loop fits one flash line. Once it is loaded to flash line buffer, it can be read in single cycle. So, when the core is fetching instructions again and again in a loop, it fetches them in single cycle.
If we shift the loop, so two instructions are in one 128bit frame and next two instructions are in next 128bit frame, then reading of physical flash array is needed during execution. First two instructions are executed, then there's buffer miss, so next line of flash is loaded, next two instructions are executed and it jumps back but there's also buffer miss... and this is going again and again.
So, the cache memory helps here a lot.
Regards,
Lukas