I did some rigorous tests on the FRDM-K64F, with the K64F core running at 120MHz, the bus and peripheral clocks at 60MHz, and the flash clock at 24MHz with prefetching enabled. I used the PIT to time the following code, inside a loop so that the PIT accesses could be factored out. The clock estimates are from the Cortex-M4 Technical Reference Manual:
.L1: ldr r4,[r3] ; 2 clocks \ overlapped
str r4,[r3] ; 2 clocks / by 1 clock
; repeat 99 more times, for a total of 300 clocks
subs r2,r2,#1
bne .L1
I also timed the following code which runs many more data cycles, leaving more time for the flash prefetch unit to refill:
.L2: ldm r3,{r4,r5,r6,r7,r8,r9,r10,r11} ; 9 clocks
stm r3,{r4,r5,r6,r7,r8,r9,r10,r11} ; 9 clocks
; repeat 99 more times, for a total of 1800 clocks
subs r2,r2,#1
bne .L2
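For reference, the measurement itself looked roughly like this in C (a sketch rather than the exact code I ran: it assumes the CMSIS MK64F12.h device header, and timed_loop is a placeholder for one of the assembly loops above, linked into whichever memory is under test):

#include <stdint.h>
#include "MK64F12.h"                         /* CMSIS device header for the K64F */

/* Placeholder: one of the loops above, assembled separately and linked
 * into flash, SRAM_L or SRAM_U; it sets up r2/r3 as in the listings.   */
extern void timed_loop(uint32_t iterations, uint32_t *data);

static uint32_t clocks_per_loop(uint32_t iterations, uint32_t *data)
{
    SIM->SCGC6 |= SIM_SCGC6_PIT_MASK;        /* gate the PIT clock on       */
    PIT->MCR = 0;                            /* enable the PIT module       */
    PIT->CHANNEL[0].LDVAL = 0xFFFFFFFFu;     /* free-running down-counter   */
    PIT->CHANNEL[0].TCTRL = PIT_TCTRL_TEN_MASK;

    uint32_t start = PIT->CHANNEL[0].CVAL;   /* CVAL counts down at the bus clock */
    timed_loop(iterations, data);
    uint32_t end = PIT->CHANNEL[0].CVAL;

    /* The PIT runs at the 60MHz bus clock and the core at 120MHz, so one
     * PIT tick is two core clocks; with a large iteration count the two
     * CVAL reads add a negligible amount per loop.                       */
    return 2u * (start - end) / iterations;
}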
The clock counts were:
instrs code data clocks overhead
------------------------------------------
ldr/str FLASH SRAM_L 410 110
FLASH SRAM_U 310 10
SRAM_L SRAM_L 402 102
SRAM_L SRAM_U 302 2
SRAM_U SRAM_L 304 4
SRAM_U SRAM_U 405 105
ldm/stm FLASH SRAM_L 1815 15
FLASH SRAM_U 1810 10
SRAM_L SRAM_L 1904 104
SRAM_L SRAM_U 1804 4
SRAM_U SRAM_L 1806 6
SRAM_U SRAM_U 2106 306
The clocks column is the number of clocks per loop iteration. The overhead is the clock count minus the minimum 300 or 1800 clocks for the basic instructions; it includes the few clocks for the subs/bne at the bottom of the loop, plus whatever wait states are incurred during the long ldr/str or ldm/stm sequences. For example, ldr/str with code in flash and data in SRAM_L takes 410 - 300 = 110 extra clocks, i.e. roughly one extra clock per load/store pair once the loop overhead is subtracted.
The first thing this shows is that there is no general 1-clock penalty for code accesses to SRAM_U. The most efficient arrangement is code in SRAM_L and data in SRAM_U, but swapping them is essentially the same, and code in flash with data in SRAM_U is essentially the same too. More interesting is that for ldr/str, which doesn't leave much time for prefetch refilling, code in flash with data in SRAM_L is much slower, comparable to having code and data in the same RAM block. And having code and data both in SRAM_U is the worst case overall: about the same as the other same-RAM cases for ldr/str, and by far the slowest for ldm/stm.
This suggests that if you want most code in flash, you should first put your data in SRAM_U, and if you run out, move non-speed-critical data into SRAM_L. You might be able to squeeze out a little more speed by placing some hot code in SRAM_L, as long as its data is in SRAM_U.
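For what it's worth, with GCC that kind of split can be expressed with section attributes. This is only a sketch: the section names below are made up, and have to match output sections that your linker script actually places in SRAM_U (0x20000000) and SRAM_L (0x1FFF0000), with the startup code copying any RAM-resident code and initialised data out of flash.

#include <stdint.h>

/* ".sram_u_data" and ".sram_l_code" are hypothetical section names;
 * the linker script must define matching output sections located in
 * SRAM_U and SRAM_L, with their load addresses in flash so the
 * startup code can copy them.                                        */
__attribute__((section(".sram_u_data")))
static uint32_t samples[256];                     /* hot data -> SRAM_U */

/* noinline keeps the code in the SRAM_L section instead of being
 * inlined into callers in flash; calls from flash into SRAM_L are out
 * of BL range, so the linker generates a veneer for them.            */
__attribute__((section(".sram_l_code"), noinline))
static void hot_filter(uint32_t *buf, uint32_t n) /* hot code -> SRAM_L */
{
    for (uint32_t i = 0; i < n; i++)
        buf[i] = (buf[i] >> 1) + buf[i];
}

Whether the stock linker script already splits the two SRAM blocks into separate memory regions depends on the toolchain; if not, the regions and output sections have to be added by hand.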
This isn't very intuitive, and doesn't seem to agree at all with the implications of the "Optimizing Performance on Kinetis K-series MCUs" manual. If anyone can see any flaws in my analysis, or has made any similar tests, let me know.