MPC5777C Assembler instruction execution time

darq · ‎04-26-2022

Hello,

I'm trying to analyze the execution time of some parts of our software, in particular one specific function:

    int32_t CheckLimit(int32_t LowerValue, int32_t ActualValue, int32_t UpperValue)
    {
        if (ActualValue < LowerValue)
        {
            return LowerValue;
        }
        else if (ActualValue > UpperValue)
        {
            return UpperValue;
        }
        else
        {
            return ActualValue;
        }
    }

After compilation without optimizations it gets translated to:

Address: Instruction: Mnemonic: Instruction latency: Meaning:
0xA00300CC 182106E0 e_stwu r1,-0x20(r1) ; r1,-32(r1) 3 store word with update (reduce stack pointer by 32)
0xA00300D0 D7F1 se_stw r31,0x1C(r1) ; r31,28(r1) 3 store word (copy stack pointer to stack? r1 = 40051578, r31 = 40051598)
0xA00300D2 011F se_mr r31,r1 1 move register (copy stack pointer value to r31)
0xA00300D4 D23F se_stw r3,0x8(r31) ; r3,8(r31) 3 store word (copy LowerValue to stack)
0xA00300D6 D34F se_stw r4,0x0C(r31) ; r4,12(r31) 3 store word (copy ActualValue to stack)
0xA00300D8 D45F se_stw r5,0x10(r31) ; r5,16(r31) 3 store word (copy UpperValue to stack)
if (ActualValue < LowerValue)
0xA00300DA C36F se_lwz r6,0x0C(r31) ; r6,12(r31) 3 load word and zero (copy ActualValue from stack to r6)
0xA00300DC C27F se_lwz r7,0x8(r31) ; r7,8(r31) 3 load word and zero (copy LowerValue from stack to r7)
0xA00300DE 7F863800 cmp 0x7,0x0,r6,r7 ; 7,0,r6,r7 1 compare word (compare ActualValue and LowerValue and put result in CR7)
0xA00300E2 7CE00026 mfcr r7 1 move from condition register to R7 (get comparison result)
0xA00300E6 74E7E007 e_rlwinm r7,r7,0x1C,0x0,0x3 ; r7,r7,28,0,3 1 rotate left word immediate by 28 bits, then AND with mask bits 0-3 (get the result of comparison into r7?)
0xA00300EA 7CE80120 mtcrf r7,0x80 ; r7,128 2 move to condition register fields (copy contents of r7 to CR0?)
0xA00300EE 7A00000A e_bge 0xA00300F8 6/4/1 branch if greater than or equal (go to next condition check)
{
return LowerValue;
0xA00300F2 C27F se_lwz r7,0x8(r31) ; r7,8(r31) 3 load word and zero (copy LowerValue from stack to r7)
0xA00300F4 78000024 e_b 0xA0030118 6/4/1 branch (go to function exit)
}
else if (ActualValue > UpperValue)
0xA00300F8 C36F se_lwz r6,0x0C(r31) ; r6,12(r31) 3 load word and zero (copy ActualValue from stack to r6)
0xA00300FA C47F se_lwz r7,0x10(r31) ; r7,16(r31) 3 load word and zero (copy UpperValue from stack to r7)
0xA00300FC 7F863800 cmp 0x7,0x0,r6,r7 ; 7,0,r6,r7 1 compare word (compare ActualValue and UpperValue and put result in CR7)
0xA0030100 7CE00026 mfcr r7 1 move from condition register to R7 (get comparison result)
0xA0030104 74E7E007 e_rlwinm r7,r7,0x1C,0x0,0x3 ; r7,r7,28,0,3 1 rotate left word immediate by 28 bits, then AND with mask bits 0-3 (get the result of comparison into r7?)
0xA0030108 7CE80120 mtcrf r7,0x80 ; r7,128 2 move to condition register fields (copy contents of r7 to CR0?)
0xA003010C 7A01000A e_ble 0xA0030116 6/4/1 branch if less than or equal (go to return ActualValue)
{
return UpperValue;
0xA0030110 C47F se_lwz r7,0x10(r31) ; r7,16(r31) 3 load word and zero (copy UpperValue from stack to r7)
0xA0030112 78000006 e_b 0xA0030118 6/4/1 branch (go to function exit)
}
else
{
return ActualValue;
0xA0030116 C37F se_lwz r7,0x0C(r31) ; r7,12(r31) 3 load word and zero (copy ActualValue from stack to r7)
}
0xA0030118 0173 se_mr r3,r7 1 move register (copy r7 (result?) value to r3)
0xA003011A 197F8020 e_addi r11,r31,0x20 ; r11,r31,32 1 add immediate (increase stack pointer by 32)
0xA003011E 53EBFFFC e_lwz r31,-0x4(r11) ; r31,-4(r11) 3 load word and zero (copy new stack pointer to r31)
0xA0030122 0331 se_mfar r1,r11 ?3/1 move from alternate register (copy new stack pointer to r1)
0xA0030124 0004 se_blr 6/4/1 branch to link register (return)
0xA0030126 0000 se_illegal illegal

If I understand correctly, the instruction latency that I found in the core reference manual is the instruction execution time in core clock cycles. With a 264 MHz core clock, one clock cycle should take about 3.79 ns. Counting the shortest and the longest path through this function gives 39 and 73 clock cycles respectively. Multiplying those values by the cycle time gives roughly 147.73 ns for the shortest and 276.52 ns for the longest execution. The times measured with a tracing tool are 405 ns for the shortest and 1140 ns for the longest. What could be the reason for such a big discrepancy? There don't seem to be any interrupts during execution of this function.
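For reference, the 39 and 73 cycle figures can be reproduced by summing the latencies in the listing. The sketch below is only that arithmetic: it assumes the shortest path is the first return with every branch and se_mfar at 1 cycle, and the longest is the second return with every branch at 6 cycles and se_mfar at 3. The function and constant names are made up for illustration.

```c
#include <assert.h>

/* Latencies in core clock cycles, copied from the listing above. */
enum {
    PROLOGUE_CYCLES = 3 + 3 + 1 + 3 + 3 + 3, /* e_stwu, se_stw, se_mr, 3x se_stw */
    COMPARE_CYCLES  = 3 + 3 + 1 + 1 + 1 + 2, /* 2x se_lwz, cmp, mfcr, e_rlwinm, mtcrf */
    LOAD_CYCLES     = 3,                     /* se_lwz of the return value */
    BRANCH_MIN      = 1,                     /* best case of the 6/4/1 branch latency */
    BRANCH_MAX      = 6                      /* worst case of the 6/4/1 branch latency */
};

/* First return path (return LowerValue) with best-case latencies. */
static unsigned shortest_path_cycles(void)
{
    /* se_mr, e_addi, e_lwz, se_mfar (1), se_blr */
    unsigned epilogue = 1 + 1 + 3 + 1 + BRANCH_MIN;
    return PROLOGUE_CYCLES
         + COMPARE_CYCLES + BRANCH_MIN   /* first check, e_bge */
         + LOAD_CYCLES + BRANCH_MIN      /* load result, e_b */
         + epilogue;
}

/* Second return path (return UpperValue) with worst-case latencies. */
static unsigned longest_path_cycles(void)
{
    /* se_mr, e_addi, e_lwz, se_mfar (3), se_blr */
    unsigned epilogue = 1 + 1 + 3 + 3 + BRANCH_MAX;
    return PROLOGUE_CYCLES
         + COMPARE_CYCLES + BRANCH_MAX   /* first check, e_bge */
         + COMPARE_CYCLES + BRANCH_MAX   /* second check, e_ble */
         + LOAD_CYCLES + BRANCH_MAX      /* load result, e_b */
         + epilogue;
}

/* Convert a cycle count to nanoseconds for a given core clock in Hz. */
static double cycles_to_ns(unsigned cycles, double core_clock_hz)
{
    assert(core_clock_hz > 0.0);
    return (double)cycles * 1.0e9 / core_clock_hz;
}
```

Calling cycles_to_ns(shortest_path_cycles(), 264.0e6) and cycles_to_ns(longest_path_cycles(), 264.0e6) gives the two theoretical times quoted above. These sums count core latencies only and include no memory wait states.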

Best regards

lukaszadrapa · ‎04-26-2022

Hi Dariusz,

The calculation is not that straightforward; there are many more variables. It is not only about the core clock: it also depends on the rest of the system, which runs slower than the core. It depends on whether the code is already cached. If it is not, flash wait states add delay because the flash is slower than the core. It also depends on where the code is placed. For example, a short piece of code can fit in a single flash line. If it is shifted slightly, it can be spread over two flash lines, so two physical reads are needed (adding more wait states). It depends on the other bus masters (second core, DMA), that is, on crossbar switch traffic and on the crossbar switch configuration (priorities). Because the MPC5777C (as the only device in the MPC57xx family) was supposed to be backward compatible with the MPC5676R/MPC5674F, e2eECC slightly affects performance: for every SRAM and DMA transfer initiation, e2eECC on the MPC5777C requires 2 additional clock cycles. If the code accesses SRAM variables, it also depends on whether those are cached.
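The point about code position can be illustrated with a small helper that counts how many flash lines a block of code touches. This is only a sketch: the helper name is hypothetical, and the 32-byte line size used in the example is an assumption that should be checked against the reference manual for the flash configuration actually in use.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of flash lines touched by a code block that starts at start_addr
 * and is size_bytes long, for a flash line of line_bytes bytes.
 * line_bytes is a parameter because the real line size is an assumption here.
 */
static uint32_t flash_lines_touched(uint32_t start_addr,
                                    uint32_t size_bytes,
                                    uint32_t line_bytes)
{
    assert(size_bytes > 0u && line_bytes > 0u);

    uint32_t first_line = start_addr / line_bytes;
    uint32_t last_line  = (start_addr + size_bytes - 1u) / line_bytes;

    return last_line - first_line + 1u;
}
```

The function in the question occupies 0xA00300CC to 0xA0030125, which is 90 bytes. With an assumed 32-byte line, flash_lines_touched(0xA00300CC, 90, 32) returns 4, while the same code aligned to 0xA00300C0 would touch only 3 lines.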

So, because of this huge variability, I usually recommend methods other than adding up asm instruction latencies: tracing, or toggling a pin before and after the code of interest and measuring the interval with an oscilloscope.

Some details about optimizations can be found in this application note:

https://www.nxp.com/docs/en/application-note/AN5191.pdf

Regards,

Lukas

darq · ‎04-27-2022

Hi Lukasz,

Thank you for your answer. I have already gone through this document and implemented most, if not all, of the optimizations it mentions. I made a measurement with a tracing tool and was just trying to understand the huge difference between the theoretical and actual times. The measured time is 2.75 to 4.13 times longer than the theoretical one. We have enabled flash optimization, the branch target buffer, and the instruction and data caches. Our system also uses DMA quite heavily, so we have elevated the DMA priority on the XBAR. Core 1 is disabled in our case, so it shouldn't interfere with core 0. One thing that I haven't implemented yet is moving the stack to cache, but I wouldn't expect the gains to be this big. Do you have any other suggestions that I could investigate to improve performance?

Best regards

lukaszadrapa · ‎04-27-2022

Hi Dariusz,

"One thing that I haven't implemented yet is moving the stack to cache, but I wouldn't expect the gains to be this big."

Please try that. Because of the e2eECC mentioned earlier, this can make a significant difference. Putting the stack in cache is highly recommended because of this 2-clock delay. I'm sure this will improve the performance.
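As a rough illustration of the 2-clock delay, the sketch below estimates the e2eECC share alone for the function in the question. It assumes that every stack access in the listing goes out to SRAM and pays the 2 extra cycles exactly once; it ignores SRAM wait states, crossbar arbitration and instruction fetches. The names are hypothetical.

```c
#include <assert.h>

/* Extra cycles per SRAM transfer initiation, as stated earlier in the thread. */
#define E2EECC_EXTRA_CYCLES 2u

/* Extra core clock cycles for a given number of uncached SRAM accesses. */
static unsigned e2eecc_extra_cycles(unsigned sram_accesses)
{
    return sram_accesses * E2EECC_EXTRA_CYCLES;
}

/* The same penalty expressed in nanoseconds for a core clock in Hz. */
static double e2eecc_extra_ns(unsigned sram_accesses, double core_clock_hz)
{
    assert(core_clock_hz > 0.0);
    return (double)e2eecc_extra_cycles(sram_accesses) * 1.0e9 / core_clock_hz;
}
```

Counting loads and stores in the listing gives 9 stack accesses on the first return path and 11 on the other two. Under the assumptions above, 11 accesses cost 22 extra cycles, or about 83 ns at 264 MHz, so this effect on its own would account for only part of the measured gap.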

Regards,

Lukas

darq · ‎04-28-2022

Hi Lukasz,

I will do that, thanks.

Is there some estimate that could justify an execution time 2.75 to 4.13 times longer than the theoretical one? I would just like to be able to say that this execution time is reasonable and not the result of some misconfiguration.

Best regards

lukaszadrapa · ‎04-28-2022

In a typical program flow the core accesses "external" resources a lot (RAM and peripherals, which run at a lower speed), and those accesses always add delays and wait states. The cache memory helps a lot here. Besides the stack, it sometimes makes sense to force certain frequently used variables or arrays into cache. But overall, these sound like reasonable numbers.

Regards,

Lukas