MPC5777C Assembler instruction execution time

darq
Contributor III

Hello,

I'm trying to analyze the execution time of some parts of my software, in particular this function:

int32_t CheckLimit(int32_t LowerValue, int32_t ActualValue, int32_t UpperValue)
{
    if (ActualValue < LowerValue)
    {
        return LowerValue;
    }
    else if (ActualValue > UpperValue)
    {
        return UpperValue;
    }
    else
    {
        return ActualValue;
    }
}
 
After compilation without optimizations it gets translated to:
Address:    Opcode:   Mnemonic:                                   Latency:  Meaning:
0xA00300CC  182106E0  e_stwu r1,-0x20(r1) ; r1,-32(r1)            3         store word with update (reduce stack pointer by 32)
0xA00300D0  D7F1      se_stw r31,0x1C(r1) ; r31,28(r1)            3         store word (copy stack pointer to stack?; r1 = 40051578, r31 = 40051598)
0xA00300D2  011F      se_mr r31,r1                                1         move register (copy stack pointer value to r31)
0xA00300D4  D23F      se_stw r3,0x8(r31) ; r3,8(r31)              3         store word (copy LowerValue to stack)
0xA00300D6  D34F      se_stw r4,0x0C(r31) ; r4,12(r31)            3         store word (copy ActualValue to stack)
0xA00300D8  D45F      se_stw r5,0x10(r31) ; r5,16(r31)            3         store word (copy UpperValue to stack)
if (ActualValue < LowerValue)
0xA00300DA  C36F      se_lwz r6,0x0C(r31) ; r6,12(r31)            3         load word and zero (copy ActualValue from stack to r6)
0xA00300DC  C27F      se_lwz r7,0x8(r31) ; r7,8(r31)              3         load word and zero (copy LowerValue from stack to r7)
0xA00300DE  7F863800  cmp 0x7,0x0,r6,r7 ; 7,0,r6,r7               1         compare word (compare ActualValue and LowerValue and put result in CR7)
0xA00300E2  7CE00026  mfcr r7                                     1         move from condition register to r7 (get comparison result)
0xA00300E6  74E7E007  e_rlwinm r7,r7,0x1C,0x0,0x3 ; r7,r7,28,0,3  1         rotate left 28 bits word immediate then AND with mask bits 0-3 (get the result of comparison into r7?)
0xA00300EA  7CE80120  mtcrf r7,0x80 ; r7,128                      2         move to condition register fields (copy contents of r7 to CR0?)
0xA00300EE  7A00000A  e_bge 0xA00300F8                            6/4/1     branch if greater than or equal (go to next condition check)
{
    return LowerValue;
0xA00300F2  C27F      se_lwz r7,0x8(r31) ; r7,8(r31)              3         load word and zero (copy LowerValue from stack to r7)
0xA00300F4  78000024  e_b 0xA0030118                              6/4/1     branch (go to function exit)
}
else if (ActualValue > UpperValue)
0xA00300F8  C36F      se_lwz r6,0x0C(r31) ; r6,12(r31)            3         load word and zero (copy ActualValue from stack to r6)
0xA00300FA  C47F      se_lwz r7,0x10(r31) ; r7,16(r31)            3         load word and zero (copy UpperValue from stack to r7)
0xA00300FC  7F863800  cmp 0x7,0x0,r6,r7 ; 7,0,r6,r7               1         compare word (compare ActualValue and UpperValue and put result in CR7)
0xA0030100  7CE00026  mfcr r7                                     1         move from condition register to r7 (get comparison result)
0xA0030104  74E7E007  e_rlwinm r7,r7,0x1C,0x0,0x3 ; r7,r7,28,0,3  1         rotate left 28 bits word immediate then AND with mask bits 0-3 (get the result of comparison into r7?)
0xA0030108  7CE80120  mtcrf r7,0x80 ; r7,128                      2         move to condition register fields (copy contents of r7 to CR0?)
0xA003010C  7A01000A  e_ble 0xA0030116                            6/4/1     branch if less than or equal (go to return ActualValue)
{
    return UpperValue;
0xA0030110  C47F      se_lwz r7,0x10(r31) ; r7,16(r31)            3         load word and zero (copy UpperValue from stack to r7)
0xA0030112  78000006  e_b 0xA0030118                              6/4/1     branch (go to function exit)
}
else
{
    return ActualValue;
0xA0030116  C37F      se_lwz r7,0x0C(r31) ; r7,12(r31)            3         load word and zero (copy ActualValue from stack to r7)
}
0xA0030118  0173      se_mr r3,r7                                 1         move register (copy r7 (result?) value to r3)
0xA003011A  197F8020  e_addi r11,r31,0x20 ; r11,r31,32            1         add immediate (increase stack pointer by 32)
0xA003011E  53EBFFFC  e_lwz r31,-0x4(r11) ; r31,-4(r11)           3         load word and zero (copy new stack pointer to r31)
0xA0030122  0331      se_mfar r1,r11                              ?3/1      move from alternate register (copy new stack pointer to r1)
0xA0030124  0004      se_blr                                      6/4/1     branch to link register (return)
0xA0030126  0000      se_illegal                                            illegal
 
If I understand correctly, the instruction latency that I found in the core reference manual is the instruction execution time in core clock cycles. With a 264 MHz core clock, one clock cycle should take about 3.79 ns. Counting the shortest and the longest path through this function gives 39 and 73 clock cycles respectively, which should correspond to roughly 147.73 ns for the shortest and 276.52 ns for the longest execution. The times measured with a tracing tool are 405 ns for the shortest and 1140 ns for the longest execution. What could be the reason for such a big discrepancy? There don't seem to be any interrupts during the execution of this function.
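As a cross-check of the arithmetic, here is a minimal sketch that sums the latencies from the listing. The choice of paths is a reconstruction, not something stated in the post: the shortest path is taken to be "return LowerValue" with each branch and se_mfar counted at its lowest listed latency (1), and the longest path "return UpperValue" with each branch counted at 6 and se_mfar at 3. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Core clock from the post: 264 MHz. */
#define CORE_CLOCK_HZ 264000000.0

/* Sum the per-instruction latencies (core clock cycles) along one path. */
unsigned sum_cycles(const unsigned *latency, size_t count)
{
    unsigned total = 0;
    for (size_t i = 0; i < count; ++i) {
        total += latency[i];
    }
    return total;
}

/* Convert a cycle count to nanoseconds at the given clock frequency. */
double cycles_to_ns(unsigned cycles, double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)cycles * 1e9 / clock_hz;
}

/* Assumed shortest path: "return LowerValue", branches and se_mfar at 1 cycle. */
unsigned shortest_path_cycles(void)
{
    static const unsigned lat[] = {
        3, 3, 1, 3, 3, 3,       /* prologue: e_stwu, se_stw, se_mr, 3 x se_stw */
        3, 3, 1, 1, 1, 2, 1,    /* first comparison, e_bge not taken */
        3, 1,                   /* se_lwz LowerValue, e_b */
        1, 1, 3, 1, 1           /* epilogue: se_mr, e_addi, e_lwz, se_mfar, se_blr */
    };
    return sum_cycles(lat, sizeof lat / sizeof lat[0]);
}

/* Assumed longest path: "return UpperValue", branches at 6 cycles, se_mfar at 3. */
unsigned longest_path_cycles(void)
{
    static const unsigned lat[] = {
        3, 3, 1, 3, 3, 3,       /* prologue */
        3, 3, 1, 1, 1, 2, 6,    /* first comparison, e_bge taken */
        3, 3, 1, 1, 1, 2, 6,    /* second comparison, e_ble */
        3, 6,                   /* se_lwz UpperValue, e_b */
        1, 1, 3, 3, 6           /* epilogue */
    };
    return sum_cycles(lat, sizeof lat / sizeof lat[0]);
}
```

Under these assumptions the sums come out at 39 and 73 cycles, and `cycles_to_ns(39, CORE_CLOCK_HZ)` and `cycles_to_ns(73, CORE_CLOCK_HZ)` give about 147.73 ns and 276.52 ns, matching the figures above.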
 
Best regards
 
 
lukaszadrapa
NXP TechSupport

Hi Dariusz,

The calculation is not so straightforward; there are many more variables. It's not only about the core clock, it also depends on the rest of the system, which is slower than the core. It depends on whether the code is already cached. If not, flash wait states add delay because the flash is not that fast. It also depends on the code position. For example, a short piece of code can fit in a single flash line. If it is shifted a little, it can be spread over two flash lines, so two physical reads are needed (adding more wait states). It depends on other bus masters (second core, DMA), that is, on the traffic on the crossbar switch and on the crossbar switch configuration (priorities). Because the MPC5777C (as the only one from the MPC57xx family) was supposed to be backward compatible with the MPC5676R/MPC5674F, e2eECC slightly affects the performance: for every SRAM and DMA transfer initiation, e2eECC on the MPC5777C requires 2 additional clock cycles. If the code accesses SRAM variables, it depends on whether those are cached or not.
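To illustrate the point about code position, here is a small sketch that counts how many aligned lines a code range touches. The helper name `lines_touched` and the 32-byte line size used in the example are assumptions for illustration only; the actual flash line size has to be taken from the device reference manual.

```c
#include <assert.h>
#include <stdint.h>

/* Number of aligned lines of line_bytes bytes touched by the range [start, end).
   line_bytes is assumed to be a power of two. */
uint32_t lines_touched(uint32_t start, uint32_t end, uint32_t line_bytes)
{
    assert(end > start);
    assert(line_bytes != 0u && (line_bytes & (line_bytes - 1u)) == 0u);

    uint32_t first_line = start / line_bytes;      /* line holding the first byte */
    uint32_t last_line = (end - 1u) / line_bytes;  /* line holding the last byte */
    return last_line - first_line + 1u;
}
```

With the assumed 32-byte lines, the listing above (0xA00300CC up to, but not including, 0xA0030126) gives `lines_touched(0xA00300CCu, 0xA0030126u, 32u) == 4`. The same 90 bytes starting at the aligned address 0xA00300C0 would touch only 3 lines, so a small shift in placement can change the number of physical reads.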

So, because of this huge variability, I usually recommend methods other than adding up asm instruction latencies: tracing, or toggling a pin before and after the code of interest and checking it with an oscilloscope.
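A software variant of the same measure-instead-of-calculate idea (not the oscilloscope method itself) is to read a free-running counter before and after the code and keep the minimum and maximum. The sketch below only shows the bookkeeping; the type and function names are hypothetical, and where the `start` and `stop` timestamps come from (a timer, the time base, a trace tool export) is left open.

```c
#include <assert.h>
#include <stdint.h>

/* Running minimum/maximum of measured durations, in counter ticks. */
typedef struct {
    uint32_t min_ticks;
    uint32_t max_ticks;
    uint32_t samples;
} timing_stats_t;

void timing_init(timing_stats_t *stats)
{
    assert(stats != 0);
    stats->min_ticks = UINT32_MAX;
    stats->max_ticks = 0u;
    stats->samples = 0u;
}

/* Record one measurement. Unsigned subtraction gives the right elapsed value
   even if the 32-bit counter wrapped once between start and stop. */
void timing_record(timing_stats_t *stats, uint32_t start, uint32_t stop)
{
    assert(stats != 0);
    uint32_t elapsed = stop - start;

    if (elapsed < stats->min_ticks) {
        stats->min_ticks = elapsed;
    }
    if (elapsed > stats->max_ticks) {
        stats->max_ticks = elapsed;
    }
    stats->samples++;
}
```

On the target, one would take a timestamp immediately before and after the call to `CheckLimit` and pass both to `timing_record`. The timestamp reads themselves add overhead, which should be measured with an empty section and subtracted.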

Some details about optimizations can be found in this application note:

https://www.nxp.com/docs/en/application-note/AN5191.pdf

Regards,

Lukas

darq
Contributor III

Hi Lukasz,

Thank you for your answer. I have already gone through this document and implemented most, if not all, of the optimizations it describes. I made a measurement with a tracing tool and was just trying to understand the large difference between the theoretical and the actual times. The measured time is roughly 2.7 to 4.1 times longer than the theoretical one. We have enabled flash optimization, the branch target buffer, and the instruction and data caches. Our system also uses DMA quite heavily, so we have elevated the DMA priority on the XBAR. Core 1 is disabled in our case, so it shouldn't interfere with core 0. One thing that I didn't implement yet is moving stack to cache but I wouldn't expect the gains to be this big. Do you have any other suggestions that I could investigate to improve performance?

Best regards

 

lukaszadrapa
NXP TechSupport

Hi Dariusz,

"One thing that I didn't implement yet is moving stack to cache but I wouldn't expect the gains to be this big."

Please try that. Because of the e2eECC overhead mentioned above, this can make a significant difference. Placing the stack in cache is highly recommended because of this 2-clock delay. I'm sure this will improve the performance.
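For a rough feel of what the 2-clock delay could mean for this particular function, here is a deliberately crude sketch. It assumes that every load and store in the listing goes to an uncached stack in SRAM and that each such access costs exactly 2 extra cycles. The function name and the access counts (9 on the "return LowerValue" path, 11 on the "return UpperValue" path) are read off the listing and are assumptions, and the model ignores SRAM wait states, crossbar arbitration and instruction fetch.

```c
#include <assert.h>

/* Extra cycles per SRAM access attributed to e2eECC in the reply above. */
#define E2EECC_EXTRA_CYCLES 2u

/* Crude model: base latency sum plus a fixed penalty per uncached stack access. */
unsigned cycles_with_sram_penalty(unsigned base_cycles,
                                  unsigned sram_accesses,
                                  unsigned extra_per_access)
{
    return base_cycles + sram_accesses * extra_per_access;
}
```

With the 39- and 73-cycle estimates from the original question, `cycles_with_sram_penalty(39u, 9u, E2EECC_EXTRA_CYCLES)` gives 57 cycles (about 216 ns at 264 MHz) and `cycles_with_sram_penalty(73u, 11u, E2EECC_EXTRA_CYCLES)` gives 95 cycles (about 360 ns). Under this simplified model the penalty is noticeable, but other effects would still have to contribute to reach the measured 405 ns and 1140 ns.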

Regards,

Lukas

darq
Contributor III

Hi Lukasz,

I will do that, thanks.

Is there some estimation that could justify an execution time that is about 2.7 to 4.1 times longer? I would just like to be able to say that this execution time is reasonable and not the result of some misconfiguration.

Best regards

lukaszadrapa
NXP TechSupport

Because the core accesses "external" resources (RAM and peripherals, which run at a lower speed) a lot in typical program flow, delays and wait states are always inserted. The cache memory helps a lot here. Besides the stack, it sometimes makes sense to force frequently used variables or arrays into cache. But overall, these sound like reasonable numbers.

Regards,

Lukas
