Hi Peter,
I'm getting an unexpected interrupt at some point in the 600-instruction run, so I haven't been able to duplicate your results. However, I think I've seen enough to get an idea of what is going on...
The way arm_for_loop_flash() is written, each loop size you are testing uses a different physical location for the loop under test. More importantly, the alignment of each loop also moves around because of the case statement checks. To show you what I mean, here are the addresses of the first add instruction in some of the loops (the quick check after the list shows where each one lands within a cache line):
100 instructions: 0x532
200 instructions: 0x612
300 instructions: 0x7BC
400 instructions: 0xA2E
500 instructions: 0xD68
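As a quick way to see the alignment moving around, here is a small sketch (plain C, nothing Kinetis-specific, addresses taken from the list above) that prints which 16-byte cache line each first-add address falls in and how far into that line it sits:

#include <stdint.h>
#include <stdio.h>

/* Print the 16-byte cache line index and the offset into that line for the
 * first add instruction of each test loop (addresses from the list above). */
int main(void)
{
    const uint32_t first_add[] = { 0x532u, 0x612u, 0x7BCu, 0xA2Eu, 0xD68u };
    const unsigned loop_size[] = { 100u, 200u, 300u, 400u, 500u };

    for (unsigned i = 0; i < sizeof first_add / sizeof first_add[0]; i++) {
        printf("%3u instructions: 0x%03lX -> line 0x%02lX, byte %2lu of 16\n",
               loop_size[i],
               (unsigned long)first_add[i],
               (unsigned long)(first_add[i] >> 4),
               (unsigned long)(first_add[i] & 0xFu));
    }
    return 0;
}

The offset into the line comes out different across the loop sizes, which is exactly the alignment drift I mean.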
As I mentioned before, alignment is a factor in how the cache and FMC work. The cache works on 16-byte lines, so some of the code from main gets cached along with your test loop and the case statement. Because the case statement code is pretty short, you're also caching sections of the test loop for the smaller instruction counts. The fact that the larger test loops, where you are starting to run out of cache space, sit at the end of the case statement doesn't help either (you've got more case statement code to work through, plus extra adds near the case statement checks that are also going to find their way into the cache). If you reversed the order of the case statement, I think that might help move the drop-off point for the cache-enabled case.
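Another option, complementary to reordering the case statement, would be to pin each test loop to a 16-byte boundary so every loop size starts at the same alignment no matter how much case-statement code runs before it. A minimal sketch, assuming a GCC/arm-none-eabi style toolchain; the function name and the unrolled-add body are mine, not taken from your code:

#include <stdint.h>

/* Hypothetical sketch: force a test loop onto a 16-byte boundary so its
 * starting alignment no longer depends on the code placed in front of it.
 * GCC-style attributes assumed. */
__attribute__((aligned(16), noinline))
uint32_t test_loop_100(uint32_t x)
{
    /* unrolled adds standing in for the 100-instruction loop body */
    x += 1; x += 1; x += 1; x += 1;
    x += 1; x += 1; x += 1; x += 1;
    /* ... continue to 100 adds ... */
    return x;
}

int main(void)
{
    return (int)test_loop_100(0u);
}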
I suspect that adding the code to enable the cache changes the alignment of the test loops compared to the FMC-only case. I think that is why you are seeing the cache performance line drop below the FMC-only performance once you hit the point where the cache is full. The FMC has a very small cache, so changes in code alignment have a big effect on how well the FMC hides the flash wait states from the core. The FMC on this particular device holds up to 16 entries of 128 bits each, so you can cache up to 256 bytes of code. If you have a control loop in your application that is 248 bytes, that should be no problem, right? But if the first byte of the loop is not at a 128-bit alignment, that throws everything off: you end up with leading code partially filling one line and trailing code spilling into another, so the loop no longer quite fits in the cache. Now you have to swap lines in and out of the cache, and you get wait states.
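To put numbers on that, here is a small back-of-the-envelope program (plain C, nothing device-specific) that counts how many 16-byte FMC lines a 248-byte loop needs as its start is pushed further into a line; once it needs more than 16 lines it can no longer sit entirely in the cache:

#include <stdint.h>
#include <stdio.h>

#define FMC_LINE_BYTES 16u   /* one 128-bit FMC cache line       */
#define FMC_LINES      16u   /* 16 entries, 256 bytes in total   */

/* Number of FMC lines spanned by a block of code starting at 'start'. */
static uint32_t fmc_lines_needed(uint32_t start, uint32_t size)
{
    uint32_t first = start / FMC_LINE_BYTES;
    uint32_t last  = (start + size - 1u) / FMC_LINE_BYTES;
    return last - first + 1u;
}

int main(void)
{
    /* 248-byte control loop, starting at every even offset into a line */
    for (uint32_t offset = 0u; offset < FMC_LINE_BYTES; offset += 2u) {
        uint32_t lines = fmc_lines_needed(offset, 248u);
        printf("start offset %2lu: %2lu lines -> %s\n",
               (unsigned long)offset, (unsigned long)lines,
               (lines <= FMC_LINES) ? "fits" : "spills past 16 lines");
    }
    return 0;
}

Depending on where the first byte lands within a line, the same 248 bytes need either 16 lines (it just fits) or 17 (it doesn't), and in the second case you're back to swapping lines and eating wait states.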
To sum up, there are a few things going on that I think keep your test from being an apples-to-apples comparison, at least across the different modes and loop sizes on Kinetis. I think this explains some of the strange things you are seeing in your Kinetis results.
Hope this extra information helps.
Regards,
Melissa