Hi,
We are currently benchmarking several relevant controllers to evaluate whether they fit our needs and how they compare.
First, we were really happy to find that the Kinetis K65 has an internal L1 cache to boost its performance.
I did some benchmarking of code execution performance from internal flash to get an idea of the real performance one can expect from plain code execution.
Please find the charts below, which show the results of my benchmarking. The Y axis indicates the execution efficiency, which is reduced by wait states occurring on the internal flash.
The benchmark consists of running through a defined number of instructions while counting CPU cycles. An increasing difference between instructions and CPU cycles indicates a bottleneck in code fetching from the internal flash.
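In case it helps to understand the numbers: here is a minimal sketch of the measurement idea (not the original project code, and assuming the K65 device header name MK65F18.h). The test block with a known instruction count is wrapped with the Cortex-M4 DWT cycle counter:

```c
#include "MK65F18.h"   /* assumed K65 CMSIS device header name */

/* Sketch of the measurement idea: execute a block with a known instruction
 * count and read the DWT cycle counter around it. A cycles/instructions
 * ratio near 1.0 means no fetch stalls; a growing ratio means the core is
 * waiting on flash fetches. */
static uint32_t measure_cycles(void (*code_under_test)(void))
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable trace/DWT */
    DWT->CYCCNT = 0U;                                 /* reset cycle counter */
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             /* start counting */

    code_under_test();                                /* N known instructions */

    return DWT->CYCCNT;                               /* elapsed core cycles */
}
```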
My conclusion: ST has done a good job on their ART cache. How can I get equally good results with a K65 part?
The worst thing is that the K65 L1 cache could possibly limit the performance of the controller :smileysad:
What surprises me: even without prefetching and the cache features enabled, the STM32F429 still outperforms the K65 in some circumstances, even with the K65 caches activated.
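To be specific about what I mean by "prefetching and cache stuff" on the ST side: these are the ART bits in FLASH->ACR. A minimal sketch of enabling them (assuming the standard STM32F4 CMSIS/HAL device header and 5 wait states for 180 MHz / 3.3 V operation) looks roughly like this:

```c
#include "stm32f4xx.h"   /* standard STM32F4 CMSIS/HAL device header */

/* Sketch of enabling the ART accelerator features on the STM32F429:
 * prefetch, instruction cache and data cache, together with the 5 flash
 * wait states required at 180 MHz / 3.3 V. Disabling "prefetching and
 * cache stuff" for the benchmark simply means clearing these bits. */
static void art_enable(void)
{
    FLASH->ACR = FLASH_ACR_LATENCY_5WS
               | FLASH_ACR_PRFTEN    /* prefetch */
               | FLASH_ACR_ICEN      /* instruction cache */
               | FLASH_ACR_DCEN;     /* data cache */
}
```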
Please let me know your thoughts!
Hi Peter,
I'd like to get more information about your setup to try and get a better understanding of the numbers you are seeing. It does make sense that the FMC+L1 is the best option, but the code cache is larger than 2k, so I'd expect the drop-off in performance to be later and it should drop down to the FMC only level instead of being even worse than the FMC only option. So I think that something strange is going on with the tests, but I don't have an immediate idea as to what that would be.
Could you tell me more about the test software you are using? Is the code loop a single instruction repeated the specified number of times or do you have a mix of instructions? Are there data accesses associated with any of the instructions? If so, where is the target data stored? How many times are you running through the loop for your test?
It would also be good to know how the FMC and cache are configured for each of your four test cases. I'm guessing you are using the default region values for the cache, but what value are you using for the LMEM_PCCCR?
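For reference, a typical code cache enable sequence (using the LMEM register and bit-field names from the Kinetis device header; this is just a sketch, not necessarily what you are running) would look like this:

```c
#include "MK65F18.h"   /* assumed K65 CMSIS device header name */

/* Sketch of a typical LMEM code cache enable sequence: invalidate both ways,
 * wait for the command to complete, then set ENCACHE. The region control
 * registers are left at their reset defaults here. */
static void code_cache_enable(void)
{
    LMEM->PCCCR = LMEM_PCCCR_INVW0_MASK | LMEM_PCCCR_INVW1_MASK | LMEM_PCCCR_GO_MASK;
    while ((LMEM->PCCCR & LMEM_PCCCR_GO_MASK) != 0U) { }   /* wait for invalidate */
    LMEM->PCCCR |= LMEM_PCCCR_ENCACHE_MASK;                /* enable the code cache */
}
```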
Thanks,
Melissa
Hi Melissa,
Thank you very much for your attention :smileyhappy:
I will try to explain my approach in the following lines:
Melissa A Hunter:
but the code cache is larger than 2k, so I'd expect the drop-off in performance to be later and it
should drop down to the FMC only level instead of being even worse than the FMC only option.
Melissa A Hunter:
Is the code loop a single instruction repeated the specified number of times or do you have a mix of instructions?
Are there data accesses associated with any of the instructions? If so, where is the target data stored?
How many times are you running through the loop for your test?
Melissa A Hunter:
It would also be good to know how the FMC and cache are configured for each of your four test cases.
I'm guessing you are using the default region values for the cache, but what value are you using for the LMEM_PCCCR?
Melissa A Hunter:
Thanks,
Melissa
thank YOU!
Hope you can use my update to give me some more information.
Happy to hear from you :smileyhappy:
Best regards
Peter
Hi Peter,
Sorry for the delay in getting back to you. I was out on vacation for a few days last week, so I just saw your reply today.
I don't see anything obvious in what you are doing that explains the results. The performance can be affected by very small things though. Just adding/removing lines of code to control the cache can change the code alignment which can impact how efficiently the cache and FMC are able to load instructions. I suspect there might be differences in the code alignment between your FMC and L1 vs. FMC-only test cases that might be accounting for the differences you are seeing with the larger loops.
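If you want to take code placement out of the equation, one option (assuming GCC, as used by KDS) is to give each test loop its own noinline function forced onto a 16-byte boundary, so that every configuration sees the same alignment. A sketch, with illustrative names only:

```c
#include <stdint.h>

/* Sketch only: the function name and the repeated instruction are
 * illustrative, not taken from your project. Forcing a 16-byte alignment
 * puts the loop start on a cache/FMC line boundary in every build
 * configuration, independent of any cache setup code added elsewhere. */
__attribute__((aligned(16), noinline))
static void test_loop_100(void)
{
    __asm volatile (
        ".rept 100          \n"   /* 100 single-cycle add instructions */
        "add r0, r0, #1     \n"
        ".endr              \n"
        ::: "r0", "cc");
}
```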
Could you share the project you are using? If I can duplicate your tests here that might be the easiest way for me to help. From your screenshots it looks like you are using KDS for your testing on Kinetis? Is that right? If so, out of curiosity, what tools are you using with the ST part?
Thanks,
Melissa
Hi Peter,
I'm getting an unexpected interrupt at some point in the 600 instruction run, so I haven't been able to duplicate your results. However, I think I have seen enough to get an idea of what is going on...
The way the arm_for_loop_flash() function is written, each loop size you are testing uses a different physical location for the loop that is tested. More importantly, the alignment of each loop also moves around because of the case statement checking. To show you what I'm talking about, here are the addresses of the first add instruction in some of the loops:
100 instructions: 0x532
200 instructions: 0x612
300 instructions: 0x7BC
400 instructions: 0xA2E
500 instructions: 0xD68
As I mentioned before, the alignment is a factor in how the cache and FMC work. The cache works on 16-byte lines, and some of the code from main would be cached along with your test loop, as well as the case statement. Because the case statement code is pretty short, you're also caching sections of the test loop for the smaller instruction counts. The fact that the larger test loops, where you are starting to run out of cache space, are at the end of the case statement doesn't help either (you've got more case statement code to work through, and the extra adds near the case statement checks are also going to find their way into the cache). If you reversed the order of the case statement, I think that might help to move the drop-off point for the cache-enabled case.
I suspect that when you add in code to enable the cache, it changes the alignment of the test loops as compared to the FMC-only case. I think that is why you are seeing the cache performance line drop below the FMC-only performance when you hit the point where the cache is full. The FMC has a very small cache, so changes to code alignment have a big effect on how efficient the FMC is at hiding flash wait states from the core. The FMC on this particular device holds up to 16 entries of 128 bits each, so it can cache up to 256 bytes of code. If you have a control loop in your application that is 248 bytes, that should be no problem, right? But if the first byte of the loop is not 128-bit aligned, everything is thrown off: the loop starts partway through one line and its trailing code spills into an extra line, so it no longer quite fits in the cache. Now you have to swap lines in and out of the cache and you get wait states.
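To put numbers on it, here is a quick back-of-the-envelope check (a hypothetical helper, just for illustration) showing how a 248-byte loop fits in the FMC's 16 lines when it starts on a line boundary, but spills into a 17th line when it does not:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: how many 16-byte lines a code region touches. */
static unsigned lines_touched(uint32_t start, uint32_t size_bytes)
{
    uint32_t first_line = start / 16U;
    uint32_t last_line  = (start + size_bytes - 1U) / 16U;
    return (unsigned)(last_line - first_line + 1U);
}

int main(void)
{
    /* Line-aligned 248-byte loop: 16 lines, fits in the FMC's 256 bytes. */
    printf("aligned:   %u lines\n", lines_touched(0x0000U, 248U));
    /* Same loop starting 12 bytes into a line: 17 lines, so the FMC has to
     * swap lines in and out and the core sees wait states again. */
    printf("unaligned: %u lines\n", lines_touched(0x000CU, 248U));
    return 0;
}
```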
To sum up, there are some things going on that I think are keeping your test from being an apples-to-apples comparison, at least between the different modes and loop sizes on Kinetis. I think this explains some of the strange things you are seeing in your Kinetis results.
Hope this extra information helps.
Regards,
Melissa
Melissa,
Thank you for taking the time to write this answer.
You have perfectly described why the performance drops when the looped code size is larger than the cached instructions. This is obvious.
In a real-world scenario with lots of legacy code (super loop), I have no chance to influence the way the code is run, so I'm faced with suboptimal conditions in which the performance drops drastically. My concern is how I can avoid this massive drop in performance using a K6x part. ST seems to do things more cleverly at this point, as it seems they don't need extra cycles to allocate instructions in their cache. Please correct me if I'm wrong!
Can you please comment on the behavior of the K6x devices' LMEM cache, in terms of the extra cycles for a cache allocation that occur in addition to the unavoidable flash access wait states?
I'm running the exact same tests on the ST and Kinetis parts, so the numbers can be compared apples to apples.
If I got your post right, you can't identify any misuse of the cache infrastructure, so the results are as good as they can get. In a real-world scenario, where no such tight loops exist, the results might be even worse :smileysad:
To clarify: I enable and disable the LMEM cache via the EmbSysRegView view in Eclipse during the debug session.
Peter
Hi Peter,
I think I've solved at least part of the mystery...
I found an error in your macro definition for your 4k loop. You accidentally put in 5 instances of the 1k loop instead of 4. That explains why you are seeing a decrease in performance across the board at the 4k point. If you make this change, it should push out the point where the cache + FMC performance starts to decrease. I expect the drop in performance to move out to the 8k point, corresponding to the code cache size (it might actually happen slightly earlier because the case statement code exposes the associativity of the cache).
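Just to illustrate the structure I mean (the macro names here are made up, since I don't have your exact definitions in front of me), the 4k loop should expand to exactly four copies of the 1k loop:

```c
/* Illustrative only: the real macro names in your project will differ. */
#define ADD1()     __asm volatile ("add r0, r0, #1" ::: "r0", "cc")
#define ADD10()    do { ADD1(); ADD1(); ADD1(); ADD1(); ADD1(); \
                        ADD1(); ADD1(); ADD1(); ADD1(); ADD1(); } while (0)
#define ADD100()   do { ADD10(); ADD10(); ADD10(); ADD10(); ADD10(); \
                        ADD10(); ADD10(); ADD10(); ADD10(); ADD10(); } while (0)
#define LOOP_1K()  do { ADD100(); ADD100(); ADD100(); ADD100(); ADD100(); \
                        ADD100(); ADD100(); ADD100(); ADD100(); ADD100(); } while (0)
/* Exactly four 1k blocks; an accidental fifth one makes the 4k data point
 * actually run five 1k blocks and shifts the whole curve. */
#define LOOP_4K()  do { LOOP_1K(); LOOP_1K(); LOOP_1K(); LOOP_1K(); } while (0)
```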
I also double checked with some of our designers, and they do expect that when the code goes beyond the cache size, the performance would be worse than what you would see using the FMC-only. This is because the miss penalty is a bit higher for the cache than the FMC.
The bad news for me is that I'm pretty sure you also had this error in the code when you ran on the ST device, so your performance curves should flatten out for the ST device too. I won't pretend to be an expert on ST's ART implementation, but the documentation I've found would indicate it only helps to accelerate core instruction accesses. Our cache is more flexible, because it can also cache data accesses. Plus the FMC acceleration can also help to speed up flash accesses requested by non-core masters in the SoC. So that is something you should take into account when thinking about overall system performance as compared to straight core code execution speed.
Regards,
Melissa
Hi Melissa,
Thanks for writing.
The drop in performance when the looped code size exceeds the cache size is absolutely fine. My concern was the drop BELOW FMC-only performance.
Miss penalty seems to describe it well. And if I'm understanding correctly, I can't do anything to improve it...
This is a real bummer, since we really like the K65 because it has so many pin muxing features!
I don't want to argue with you, since there doesn't seem to be much public information about ST's ART cache and its capability of caching external memory accesses.
Best regards
Peter