Hi,
We are currently benchmarking several relevant controllers to evaluate whether they fit our needs and how they compare.
First, we were really happy to find that the Kinetis K65 has an internal L1 cache to boost its performance.
I did some benchmarking of code execution performance from internal flash to get an idea of the real performance one can expect from plain code execution.
Please find the charts below, which show the results of my benchmarking. The Y axis indicates the execution efficiency, which is reduced by wait states occurring on the internal flash.
The benchmark consists of running through a defined number of instructions while counting CPU cycles. An increasing difference between instructions and CPU cycles indicates a bottleneck in code fetching from the internal flash.
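In case it helps to understand the numbers: here is a minimal sketch of the measurement idea (not the original project code, and assuming the K65 device header name MK65F18.h). The test block with a known instruction count is wrapped with the Cortex-M4 DWT cycle counter:

```c
#include "MK65F18.h"   /* assumed K65 CMSIS device header name */

/* Sketch of the measurement idea: execute a block with a known instruction
 * count and read the DWT cycle counter around it. A cycles/instructions
 * ratio near 1.0 means no fetch stalls; a growing ratio means the core is
 * waiting on flash fetches. */
static uint32_t measure_cycles(void (*code_under_test)(void))
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable trace/DWT */
    DWT->CYCCNT = 0U;                                 /* reset cycle counter */
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             /* start counting */

    code_under_test();                                /* N known instructions */

    return DWT->CYCCNT;                               /* elapsed core cycles */
}
```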
My conclusion: ST has done a good job on their ART cache. How can I get equally good results with a K65 part?
The worst thing is that the K65 L1 cache could possibly limit the performance of the controller :smileysad:
What surprises me: even without prefetching and the cache features enabled, the STM32F429 still outperforms the K65 in some circumstances, even with the K65 caches activated.
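To be specific about what I mean by "prefetching and cache stuff" on the ST side: these are the ART bits in FLASH->ACR. A minimal sketch of enabling them (assuming the standard STM32F4 CMSIS/HAL device header and 5 wait states for 180 MHz / 3.3 V operation) looks roughly like this:

```c
#include "stm32f4xx.h"   /* standard STM32F4 CMSIS/HAL device header */

/* Sketch of enabling the ART accelerator features on the STM32F429:
 * prefetch, instruction cache and data cache, together with the 5 flash
 * wait states required at 180 MHz / 3.3 V. Disabling "prefetching and
 * cache stuff" for the benchmark simply means clearing these bits. */
static void art_enable(void)
{
    FLASH->ACR = FLASH_ACR_LATENCY_5WS
               | FLASH_ACR_PRFTEN    /* prefetch */
               | FLASH_ACR_ICEN      /* instruction cache */
               | FLASH_ACR_DCEN;     /* data cache */
}
```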
Please let me know your thoughts!
Hi Peter,
I'd like to get more information about your setup to try and get a better understanding of the numbers you are seeing. It does make sense that the FMC+L1 is the best option, but the code cache is larger than 2k, so I'd expect the drop-off in performance to be later and it should drop down to the FMC only level instead of being even worse than the FMC only option. So I think that something strange is going on with the tests, but I don't have an immediate idea as to what that would be.
Could you tell me more about the test software you are using? Is the code loop a single instruction repeated the specified number of times or do you have a mix of instructions? Are there data accesses associated with any of the instructions? If so, where is the target data stored? How many times are you running through the loop for your test?
It would also be good to know how the FMC and cache are configured for each of your four test cases. I'm guessing you are using the default region values for the cache, but what value are you using for the LMEM_PCCCR?
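For reference, a typical code cache enable sequence (using the LMEM register and bit-field names from the Kinetis device header; this is just a sketch, not necessarily what you are running) would look like this:

```c
#include "MK65F18.h"   /* assumed K65 CMSIS device header name */

/* Sketch of a typical LMEM code cache enable sequence: invalidate both ways,
 * wait for the command to complete, then set ENCACHE. The region control
 * registers are left at their reset defaults here. */
static void code_cache_enable(void)
{
    LMEM->PCCCR = LMEM_PCCCR_INVW0_MASK | LMEM_PCCCR_INVW1_MASK | LMEM_PCCCR_GO_MASK;
    while ((LMEM->PCCCR & LMEM_PCCCR_GO_MASK) != 0U) { }   /* wait for invalidate */
    LMEM->PCCCR |= LMEM_PCCCR_ENCACHE_MASK;                /* enable the code cache */
}
```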
Thanks,
Melissa
Hi Melissa,
Thank you very much for your attention :smileyhappy:
I will try to explain my approach in the following lines:
Melissa A Hunter:
but the code cache is larger than 2k, so I'd expect the drop-off in performance to be later and it
should drop down to the FMC only level instead of being even worse than the FMC only option.
Melissa A Hunter:
Is the code loop a single instruction repeated the specified number of times or do you have a mix of instructions?
Are there data accesses associated with any of the instructions? If so, where is the target data stored?
How many times are you running through the loop for your test?
Melissa A Hunter:
It would also be good to know how the FMC and cache are configured for each of your four test cases.
I'm guessing you are using the default region values for the cache, but what value are you using for the LMEM_PCCCR?
Melissa A Hunter:
Thanks,
Melissa
thank YOU!
Hope you can use my update to give me some more information.
Happy to hear from you :smileyhappy:
Best regards
Peter
Hi Peter,
Sorry for the delay in getting back to you. I was out on vacation for a few days last week, so I just saw your reply today.
I don't see anything obvious in what you are doing that explains the results. The performance can be affected by very small things though. Just adding/removing lines of code to control the cache can change the code alignment which can impact how efficiently the cache and FMC are able to load instructions. I suspect there might be differences in the code alignment between your FMC and L1 vs. FMC-only test cases that might be accounting for the differences you are seeing with the larger loops.
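If you want to take code placement out of the equation, one option (assuming GCC, as used by KDS) is to give each test loop its own noinline function forced onto a 16-byte boundary, so that every configuration sees the same alignment. A sketch, with illustrative names only:

```c
#include <stdint.h>

/* Sketch only: the function name and the repeated instruction are
 * illustrative, not taken from your project. Forcing a 16-byte alignment
 * puts the loop start on a cache/FMC line boundary in every build
 * configuration, independent of any cache setup code added elsewhere. */
__attribute__((aligned(16), noinline))
static void test_loop_100(void)
{
    __asm volatile (
        ".rept 100          \n"   /* 100 single-cycle add instructions */
        "add r0, r0, #1     \n"
        ".endr              \n"
        ::: "r0", "cc");
}
```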
Could you share the project you are using? If I can duplicate your tests here that might be the easiest way for me to help. From your screenshots it looks like you are using KDS for your testing on Kinetis? Is that right? If so, out of curiosity, what tools are you using with the ST part?
Thanks,
Melissa
Hi Peter,
I'm getting an unexpected interrupt at some point in the 600 instruction run, so I haven't been able to duplicate your results. However, I think I have seen enough to get an idea of what is going on...
The way the arm_for_loop_flash() function is written, each loop size you are testing uses a different physical location for the loop that is tested. More importantly, the alignment of each loop also moves around because of the case statement checking. To show you what I'm talking about, here are the addresses of the first add instruction in some of the loops:
100 instructions: 0x532
200 instructions: 0x612
300 instructions: 0x7BC
400 instructions: 0xA2E
500 instructions: 0xD68
As I mentioned before, the alignment is a factor in how the cache and FMC work. The cache works on 16-byte lines, and some of the code from main would be cached along with your test loop, as well as the case statement. Because the case statement code is pretty short, you're also caching sections of the test loop for the smaller instruction counts. The fact that the larger test loops, where you are starting to run out of cache space, are at the end of the case statement doesn't help either (you've got more case statement code to work through, and the extra adds near the case statement checks are also going to find their way into the cache). If you reversed the order of the case statement, I think that might help to move the drop-off point for the cache-enabled case.
I suspect that when you add in code to enable the cache, it changes the alignment of the test loops as compared to the FMC-only case. I think that is why you are seeing the cache performance line drop below the FMC-only performance when you hit the point where the cache is full. The FMC has a very small cache, so changes to code alignment have a big effect on how efficient the FMC is at hiding flash wait states from the core. The FMC on this particular device holds up to 16 entries of 128 bits each, so it can cache up to 256 bytes of code. If you have a control loop in your application that is 248 bytes, that should be no problem, right? But if the first byte of the loop is not 128-bit aligned, everything is thrown off: the loop starts partway through one line and its trailing code spills into an extra line, so it no longer quite fits in the cache. Now you have to swap lines in and out of the cache and you get wait states.
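To put numbers on it, here is a quick back-of-the-envelope check (a hypothetical helper, just for illustration) showing how a 248-byte loop fits in the FMC's 16 lines when it starts on a line boundary, but spills into a 17th line when it does not:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: how many 16-byte lines a code region touches. */
static unsigned lines_touched(uint32_t start, uint32_t size_bytes)
{
    uint32_t first_line = start / 16U;
    uint32_t last_line  = (start + size_bytes - 1U) / 16U;
    return (unsigned)(last_line - first_line + 1U);
}

int main(void)
{
    /* Line-aligned 248-byte loop: 16 lines, fits in the FMC's 256 bytes. */
    printf("aligned:   %u lines\n", lines_touched(0x0000U, 248U));
    /* Same loop starting 12 bytes into a line: 17 lines, so the FMC has to
     * swap lines in and out and the core sees wait states again. */
    printf("unaligned: %u lines\n", lines_touched(0x000CU, 248U));
    return 0;
}
```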
To sum up, there are some things going on that I think are keeping your test from being an apples-to-apples comparison, at least between the different modes and loop sizes on Kinetis. I think this explains some of the strange things you are seeing in your Kinetis results.
Hope this extra information helps.
Regards,
Melissa
Melissa,
Thank you for taking the time to write this answer.
You have perfectly described why the performance drops when the looped code size is larger than the cached instructions. This is obvious.
In a real-world scenario with lots of legacy code (super loop), I have no chance to influence the way the code is run, so I'm faced with suboptimal conditions in which the performance drops drastically. My concern is how I can avoid this massive drop in performance using a K6x part. ST seems to do things more cleverly at this point, as it seems they don't need extra cycles to allocate instructions in their cache. Please correct me if I'm wrong!
Can you please comment on the behavior of the K6x devices' LMEM cache, in terms of the extra cycles for a cache allocation that occur in addition to the unavoidable flash access wait states?
I'm running the exact same tests on the ST and Kinetis parts, so the numbers can be compared apples to apples.
If I got your post right, you can't identify any misuse of the cache infrastructure, so the results are as good as they can get. In a real-world scenario, where no such tight loops exist, the results might be even worse :smileysad:
To clarify: I enable and disable the LMEM cache via the EmbSysRegView view in Eclipse during the debug session.
Peter
Hi Peter,
I think I've solved at least part of the mystery...
I found an error in your macro definition for your 4k loop. You accidentally put in 5 instances of the 1k loop instead of 4. That explains why you are seeing a decrease in performance across the board at the 4k point. If you make this change, it should push out the point where the cache + FMC performance starts to decrease. I expect the drop in performance to move out to the 8k point, corresponding to the code cache size (it might actually happen slightly earlier because the case statement code exposes the associativity of the cache).
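Just to illustrate the structure I mean (the macro names here are made up, since I don't have your exact definitions in front of me), the 4k loop should expand to exactly four copies of the 1k loop:

```c
/* Illustrative only: the real macro names in your project will differ. */
#define ADD1()     __asm volatile ("add r0, r0, #1" ::: "r0", "cc")
#define ADD10()    do { ADD1(); ADD1(); ADD1(); ADD1(); ADD1(); \
                        ADD1(); ADD1(); ADD1(); ADD1(); ADD1(); } while (0)
#define ADD100()   do { ADD10(); ADD10(); ADD10(); ADD10(); ADD10(); \
                        ADD10(); ADD10(); ADD10(); ADD10(); ADD10(); } while (0)
#define LOOP_1K()  do { ADD100(); ADD100(); ADD100(); ADD100(); ADD100(); \
                        ADD100(); ADD100(); ADD100(); ADD100(); ADD100(); } while (0)
/* Exactly four 1k blocks; an accidental fifth one makes the 4k data point
 * actually run five 1k blocks and shifts the whole curve. */
#define LOOP_4K()  do { LOOP_1K(); LOOP_1K(); LOOP_1K(); LOOP_1K(); } while (0)
```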
I also double checked with some of our designers, and they do expect that when the code goes beyond the cache size, the performance would be worse than what you would see using the FMC-only. This is because the miss penalty is a bit higher for the cache than the FMC.
The bad news for me is that I'm pretty sure you also had this error in the code when you ran on the ST device, so your performance curves should flatten out for the ST device too. I won't pretend to be an expert on ST's ART implementation, but the documentation I've found would indicate it only helps to accelerate core instruction accesses. Our cache is more flexible, because it can also cache data accesses. Plus the FMC acceleration can also help to speed up flash accesses requested by non-core masters in the SoC. So that is something you should take into account when thinking about overall system performance as compared to straight core code execution speed.
Regards,
Melissa
Hi Melissa,
Thanks for writing.
The drop in performance when the looped code size exceeds the cache size is absolutely fine. My concern was the drop BELOW FMC-only performance.
Miss penalty seems to describe it well. And if I'm understanding correctly, I can't do anything to improve it...
This is a real bummer, since we really like the K65 because it has so many pin muxing features!
I don't want to argue with you, since there doesn't seem to be much public information about ST's ART cache and its capability of caching external memory accesses.
Best regards
Peter