Hello,
I'm trying to analyze the execution time of some parts of our software, in particular one specific function:
Hi Dariusz,
The calculation is not so straightforward; there are many more variables. It's not about the core clock only, it also depends on:
- the rest of the system, which is slower than the core clock;
- whether the code is already cached. If not, flash wait states will add some delay because the flash is not that fast;
- the code position. For example, a short piece of code may fit in a single flash line; if it is shifted a little bit, it can be spread over two flash lines, so two physical reads are needed (adding more wait states) - see the alignment sketch below;
- other bus masters (second core, DMA) - the traffic on the crossbar switch and the crossbar switch configuration (priorities);
- e2eECC: because the MPC5777C (as the only one from the MPC57xx family) was supposed to be backward compatible with MPC5676R/MPC5674F, e2eECC slightly affects the performance - every SRAM and DMA transfer initiation on MPC5777C requires 2 additional clock cycles;
- if the code accesses SRAM variables, whether those are cached or not.
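To make the flash-line point concrete, here is a minimal sketch: forcing a short hot function onto a flash-line boundary so it can be fetched with a single physical read. The 32-byte line size and the GCC attribute are my assumptions, please verify the flash read-port width in the reference manual:

/* Hedged sketch: align a short, frequently executed function to a
 * flash-line boundary so it is not spread over two flash lines.
 * The 32-byte line size is an assumption - check the MPC5777C RM. */
__attribute__((aligned(32)))
void short_hot_function(void)
{
    /* short, frequently executed code */
}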
So, because of this huge variability, I usually recommend methods other than counting asm instructions: tracing, or toggling a pin before and after the code under test and checking it with an oscilloscope.
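A minimal sketch of the pin-toggle method could look like this. The device header name and the pad number are placeholders, and I assume the pad has already been configured as a GPIO output through its PCR:

#include "MPC5777C.h"                 /* device header name is an assumption */

extern void function_under_test(void);

#define MEASURE_PAD 123u              /* placeholder: any free pad wired to the scope */

void measure_function(void)
{
    SIU.GPDO[MEASURE_PAD].B.PDO = 1;  /* rising edge marks the start */
    function_under_test();
    SIU.GPDO[MEASURE_PAD].B.PDO = 0;  /* falling edge marks the end */
}

The time between the rising and the falling edge on the oscilloscope is the real execution time, including all cache/flash/crossbar effects.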
Some details about optimizations can be found in this application note:
https://www.nxp.com/docs/en/application-note/AN5191.pdf
Regards,
Lukas
Hi Lukas,
Thank you for your answer. I have already gone through this document and implemented most, if not all, of the optimizations mentioned in it. I made a measurement with a tracing tool and was just trying to understand this huge difference between the theoretical and actual times: the measured time is 2.75-4.13 times longer than the theoretical one. We have enabled the flash optimizations, the branch target buffer, and the instruction and data caches. Our system also uses DMA quite heavily, so we have elevated the DMA priority on the XBAR. Core 1 is disabled in our case, so it shouldn't interfere with core 0. One thing that I haven't implemented yet is moving the stack to cache, but I wouldn't expect the gains to be that big. Do you have any other suggestions that I could investigate to improve performance?
Best regards
Hi Dariusz,
"One thing that i didn't implement yet is moving stack to cache but i wouldn't expect the gains to be this big."
- please try that, see the sketch below. Due to the mentioned e2eECC, this can really make a significant difference: with the stack in SRAM, every stack access pays that 2-clock-cycle penalty, so it is highly recommended to put the stack into the cache. I'm sure this will improve the performance.
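One possible way to do that, and this is my assumption rather than the only approach, is to use the e200z7 cache-locking instruction dcbtls to touch-and-lock every line of the stack region in the data cache at startup, so stack accesses never go out to SRAM. The linker symbol names are assumptions taken from common linker files, adjust them to your project:

extern unsigned char __SP_END[];      /* lowest stack address (assumed linker symbol)  */
extern unsigned char __SP_INIT[];     /* highest stack address (assumed linker symbol) */

#define DCACHE_LINE 32u               /* e200z7 line size, verify in the RM */

static void lock_stack_in_dcache(void)
{
    unsigned char *p;
    for (p = __SP_END; p < __SP_INIT; p += DCACHE_LINE) {
        /* dcbtls CT,RA,RB: data cache block touch and lock set (CT=0: L1) */
        __asm__ volatile ("dcbtls 0, 0, %0" : : "r"(p) : "memory");
    }
    __asm__ volatile ("msync" ::: "memory");  /* make sure locking completed */
}

Call it early in the startup code, before the stack is used heavily.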
Regards,
Lukas
Hi Lukas,
I will do that, thanks.
Is there some estimation that could be done to justify the 2.75-4.13 times longer execution time? Just to confirm that this execution time is reasonable and not caused by some misconfiguration?
Best regards
Because the core usually accesses "external" resources (RAM and peripherals running at lower speed) a lot in a typical program flow, delay and wait states are always inserted. The cache memory helps a lot here. And besides the stack, it sometimes makes sense to force certain very frequently used variables/arrays into the cache. But overall, these sound like reasonable numbers; a rough model is sketched below.
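As a rough illustration only, with made-up numbers rather than datasheet values, a simple weighted-average model already lands in your measured range:

#include <stdio.h>

/* Back-of-envelope model: effective cycles per instruction as a weighted
 * average of access costs, compared to the ideal 1 cycle/instruction.
 * All fractions and cycle counts below are illustrative assumptions. */
int main(void)
{
    double f_cached = 0.75;  /* fraction served from cache, ~1 cycle   */
    double f_flash  = 0.15;  /* flash accesses, assume ~10 cycles each */
    double f_sram   = 0.10;  /* SRAM accesses incl. e2eECC, ~5 cycles  */

    double effective_cpi = f_cached * 1.0 + f_flash * 10.0 + f_sram * 5.0;
    printf("estimated slowdown: %.2fx\n", effective_cpi);  /* prints 2.75x */
    return 0;
}

Shift those fractions only slightly toward flash/SRAM and you quickly get close to 4x, so your 2.75-4.13x range does not point to a misconfiguration.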
Regards,
Lukas