I did some rigorous tests on the FRDM-K64F, with the K64F core running at 120MHz, the bus and peripheral clocks at 60MHz, and the flash clock at 24MHz with prefetching enabled. I used the PIT to time the following code, inside a loop so that the PIT accesses could be factored out. The clock estimates are from the Cortex-M4 Technical Reference Manual:
.L1: ldr r4,[r3] ; 2 clocks \ overlapped
str r4,[r3] ; 2 clocks / by 1 clock
; repeat 99 more times, for a total of 300 clocks
subs r2,r2,#1
bne .L1
I also timed the following code which runs many more data cycles, leaving more time for the flash prefetch unit to refill:
.L2: ldm r3,{r4,r5,r6,r7,r8,r9,r10,r11} ; 9 clocks
stm r3,{r4,r5,r6,r7,r8,r9,r10,r11} ; 9 clocks
; repeat 99 more times, for a total of 1800 clocks
subs r2,r2,#1
bne .L2
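For reference, the measurement itself looked roughly like this in C (a sketch rather than the exact code I ran: it assumes the CMSIS MK64F12.h device header, and timed_loop is a placeholder for one of the assembly loops above, linked into whichever memory is under test):

#include <stdint.h>
#include "MK64F12.h"                         /* CMSIS device header for the K64F */

/* Placeholder: one of the loops above, assembled separately and linked
 * into flash, SRAM_L or SRAM_U; it sets up r2/r3 as in the listings.   */
extern void timed_loop(uint32_t iterations, uint32_t *data);

static uint32_t clocks_per_loop(uint32_t iterations, uint32_t *data)
{
    SIM->SCGC6 |= SIM_SCGC6_PIT_MASK;        /* gate the PIT clock on       */
    PIT->MCR = 0;                            /* enable the PIT module       */
    PIT->CHANNEL[0].LDVAL = 0xFFFFFFFFu;     /* free-running down-counter   */
    PIT->CHANNEL[0].TCTRL = PIT_TCTRL_TEN_MASK;

    uint32_t start = PIT->CHANNEL[0].CVAL;   /* CVAL counts down at the bus clock */
    timed_loop(iterations, data);
    uint32_t end = PIT->CHANNEL[0].CVAL;

    /* The PIT runs at the 60MHz bus clock and the core at 120MHz, so one
     * PIT tick is two core clocks; with a large iteration count the two
     * CVAL reads add a negligible amount per loop.                       */
    return 2u * (start - end) / iterations;
}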
The clock counts were:
instrs code data clocks overhead
------------------------------------------
ldr/str FLASH SRAM_L 410 110
FLASH SRAM_U 310 10
SRAM_L SRAM_L 402 102
SRAM_L SRAM_U 302 2
SRAM_U SRAM_L 304 4
SRAM_U SRAM_U 405 105
ldm/stm FLASH SRAM_L 1815 15
FLASH SRAM_U 1810 10
SRAM_L SRAM_L 1904 104
SRAM_L SRAM_U 1804 4
SRAM_U SRAM_L 1806 6
SRAM_U SRAM_U 2106 306
The clocks column is the number of clocks per loop iteration. The overhead is the clock count minus the minimum 300 or 1800 clocks for the basic instructions; it includes the few clocks for the subs/bne at the bottom of the loop, plus whatever wait states are incurred during the long ldr/str or ldm/stm sequences. For example, ldr/str with code in flash and data in SRAM_L takes 410 - 300 = 110 extra clocks, i.e. roughly one extra clock per load/store pair once the loop overhead is subtracted.
The first thing this shows is that there is no general 1-clock penalty for code accesses to SRAM_U. The most efficient arrangement is code in SRAM_L and data in SRAM_U, but swapping them is essentially the same, and code in flash with data in SRAM_U is essentially the same too. More interesting is that for ldr/str, which doesn't leave much time for prefetch refilling, code in flash with data in SRAM_L is much slower, comparable to having code and data in the same RAM block. And having code and data both in SRAM_U is the worst case overall: about the same as the other same-RAM cases for ldr/str, and by far the slowest for ldm/stm.
This suggests that if you want most code in flash, you should first put your data in SRAM_U, and if you run out, move non-speed-critical data into SRAM_L. You might be able to squeeze out a little more speed by placing some hot code in SRAM_L, as long as its data is in SRAM_U.
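For what it's worth, with GCC that kind of split can be expressed with section attributes. This is only a sketch: the section names below are made up, and have to match output sections that your linker script actually places in SRAM_U (0x20000000) and SRAM_L (0x1FFF0000), with the startup code copying any RAM-resident code and initialised data out of flash.

#include <stdint.h>

/* ".sram_u_data" and ".sram_l_code" are hypothetical section names;
 * the linker script must define matching output sections located in
 * SRAM_U and SRAM_L, with their load addresses in flash so the
 * startup code can copy them.                                        */
__attribute__((section(".sram_u_data")))
static uint32_t samples[256];                     /* hot data -> SRAM_U */

/* noinline keeps the code in the SRAM_L section instead of being
 * inlined into callers in flash; calls from flash into SRAM_L are out
 * of BL range, so the linker generates a veneer for them.            */
__attribute__((section(".sram_l_code"), noinline))
static void hot_filter(uint32_t *buf, uint32_t n) /* hot code -> SRAM_L */
{
    for (uint32_t i = 0; i < n; i++)
        buf[i] = (buf[i] >> 1) + buf[i];
}

Whether the stock linker script already splits the two SRAM blocks into separate memory regions depends on the toolchain; if not, the regions and output sections have to be added by hand.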
This isn't very intuitive, and doesn't seem to agree at all with the implications of the "Optimizing Performance on Kinetis K-series MCUs" manual. If anyone can see any flaws in my analysis, or has made any similar tests, let me know.