performance problem with LPCXpresso 5.XX

Content originally posted in LPCWare by micrio on Thu Dec 06 20:40:59 MST 2012
I moved from version 4.1.0_190 to 4.3.0_1023 to 5.0.10_1066 to 5.0.12_1083
over the last year or so.   The move from 4 to 5 seriously degraded the speed
of one of my critical routines.   The disassembled code looks better in V5
than in V4, yet it is slower for some reason.   The number of machine
instructions in my critical loop went from 10 in V4 to 8 in V5, a significant
improvement.   But why is it slower?   It runs at about half the speed
of the older, larger version.

Both versions are logically correct; they produce the correct results.
Both were compiled with -O3 optimization in "Release" mode.
The source is written in C.   The code runs in RAM, not Flash.   The CPU is
an LPC1114 running at 48 MHz.

Here is the C code.   It takes data from an array and writes it out an I/O
port bit by bit.   It is the speed of the loop that is of interest.
The variables not defined in this code are globals; the capitalized ones
are #defines.

void
disp_bit_pat ()
{
    int i_temp;
    uint32_t shift;
    uint32_t font_pat;
    uint32_t mux_addr;

    /*
     * We use the masked I/O feature of the Cortex chips.
     */
    mux_addr = MUX_WHT_BIT;

    shift = MUX_WHT_POS - scan_pos_cnt;

    for (i_temp = 0; i_temp < text_pat_width[txt_row]; i_temp++)
    {
        font_pat = text_pat[txt_row][i_temp];

        MUX_GPIO->MASKED_ACCESS[mux_addr] = (font_pat << shift) | (font_pat >> (32 - shift));
    }
}
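
(For readers who haven't met the two idioms in that loop, here is a rough illustration of what they do, written out in plain C. This is a sketch only, not code from the project, and the rotate helper assumes 0 < shift < 32, since a shift count of 32 would be undefined in C:)

#include <stdint.h>

/* The (x << s) | (x >> (32 - s)) pair in the loop is the usual rotate-left
 * idiom, which gcc folds into a single RORS instruction.
 * Valid only for 0 < s < 32; s == 0 would make the right shift undefined. */
static inline uint32_t rotl32 (uint32_t x, uint32_t s)
{
    return (x << s) | (x >> (32u - s));
}

/* What a masked GPIO write amounts to, spelled out as an ordinary
 * read-modify-write: only the bits selected by 'mask' change.
 * On the LPC111x GPIO block the mask is carried in the address bits
 * (MASKED_ACCESS[mask]), so the hardware does this in a single store. */
static inline void gpio_masked_write (volatile uint32_t *data, uint32_t mask, uint32_t value)
{
    *data = (*data & ~mask) | (value & mask);
}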



This is the disassembled code from version 4.   I have the critical loop
marked as "inner loop".   This is the code that runs fast.

This assembly is copied from the assembly view, not the assembly listing.
The code does not appear in the listing because it resides in a data
segment that is copied into RAM.

 
disp_bit_pat:
0x1000000c: disp_bit_pat+0     push {r4, r5, r6, lr}
0x1000000e: disp_bit_pat+2     ldr r3, [pc, #96] 
0x10000010: disp_bit_pat+4     ldr r5, [pc, #96] 
0x10000012: disp_bit_pat+6     ldr r3, [r3, #0]          
0x10000014: disp_bit_pat+8     ldr r2, [pc, #96]  
0x10000016: disp_bit_pat+10    lsls r1, r3, #2          
0x10000018: disp_bit_pat+12    ldr r4, [r5, #0]            
0x1000001a: disp_bit_pat+14    ldr r5, [r1, r2]            
0x1000001c: disp_bit_pat+16    movs r0, #3              
0x1000001e: disp_bit_pat+18    subs r0, r0, r4
0x10000022: disp_bit_pat+22    ble.n 0x1000006e 
0x10000024: disp_bit_pat+24    lsls r1, r3, #5
0x10000026: disp_bit_pat+26    adds r6, r1, r3
0x10000028: disp_bit_pat+28    lsls r4, r6, #3
0x1000002a: disp_bit_pat+30    ldr r2, [pc, #80]   
0x1000002c: disp_bit_pat+32    adds r1, r4, r3
0x1000002e: disp_bit_pat+34    adds r3, r1, r2
0x10000030: disp_bit_pat+36    ldrb r6, [r3, #0]
0x10000032: disp_bit_pat+38    movs r2, #32
0x10000034: disp_bit_pat+40    subs r0, r2, r0
0x10000036: disp_bit_pat+42    ldr r1, [pc, #72]  
0x10000038: disp_bit_pat+44    subs r2, #31
0x1000003a: disp_bit_pat+46    subs r4, r5, #1
0x1000003c: disp_bit_pat+48    rors r6, r0
0x1000003e: disp_bit_pat+50    ands r4, r2
0x10000040: disp_bit_pat+52    str r6, [r1, #32]
0x10000042: disp_bit_pat+54    adds r3, r3, r2
0x10000044: disp_bit_pat+56    cmp r2, r5
0x10000046: disp_bit_pat+58    beq.n 0x1000006e 
0x10000048: disp_bit_pat+60    cmp r4, #0
0x1000004a: disp_bit_pat+62    beq.n 0x1000005a 
0x1000004c: disp_bit_pat+64    ldrb r2, [r3, #0]
0x1000004e: disp_bit_pat+66    adds r3, #1
0x10000050: disp_bit_pat+68    rors r2, r0
0x10000052: disp_bit_pat+70    str r2, [r1, #32]
0x10000054: disp_bit_pat+72    movs r2, #2
0x10000056: disp_bit_pat+74    cmp r2, r5
0x10000058: disp_bit_pat+76    beq.n 0x1000006e 
 
Inner loop.
0x1000005a: disp_bit_pat+78    ldrb r6, [r3, #0]
0x1000005c: disp_bit_pat+80    adds r2, #2
0x1000005e: disp_bit_pat+82    rors r6, r0
0x10000060: disp_bit_pat+84    str r6, [r1, #32]
0x10000062: disp_bit_pat+86    ldrb r4, [r3, #1]
0x10000064: disp_bit_pat+88    adds r3, #2
0x10000066: disp_bit_pat+90    rors r4, r0
0x10000068: disp_bit_pat+92    str r4, [r1, #32]
0x1000006a: disp_bit_pat+94    cmp r2, r5
0x1000006c: disp_bit_pat+96    bne.n 0x1000005a 
End of inner loop.
 
0x1000006e: disp_bit_pat+98    pop {r4, r5, r6, pc}


This is the disassembled code from version 5.   I have the critical loop
marked as "inner loop".   This is the code that runs slow.

          disp_bit_pat:
10000014:   push {r4, r5, r6, r7, lr}
10000016:   ldr r2, [pc, #56] ;
10000018:   ldr r0, [pc, #56] ; 
1000001a:   ldr r1, [r2, #0] 
1000001c:   ldr r2, [r0, #0]
1000001e:   movs r5, #3 
10000020:   ldr r0, [pc, #52] ; 
10000022:   subs r5, r5, r1 
10000024:   lsls r1, r2, #2
10000026:   ldr r3, [r0, r1] 
10000028:   cmp r3, #0
1000002a:   ble.n 0x1000004e 
1000002c:   lsls r6, r2, #5
1000002e:   adds r4, r6, r2
10000030:   lsls r6, r4, #3
10000032:   movs r4, #32
10000034:   subs r5, r4, r5
10000036:   ldr r7, [pc, #36] ; 
10000038:   ldr r4, [pc, #36] ; 
1000003a:   movs r3, #0
1000003c:   adds r6, r6, r2
 
Inner loop.
1000003e:   adds r2, r7, r6
10000040:   ldrb r2, [r2, r3] 
10000042:   adds r3, #1   
10000044:   rors r2, r5
10000046:   str r2, [r4, #32]  
10000048:   ldr r2, [r0, r1]  
1000004a:   cmp r2, r3
1000004c:   bgt.n 0x1000003e 
End of inner loop.
 
1000004e:   pop {r4, r5, r6, r7, pc}


My normal response to performance problems is to write the code in
assembler, but the compiler output looks good and I probably could not
do any better, especially since the version 5 code looks really good
yet runs really slowly.
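
(If it ever did come to hand-written assembler, one low-effort option would be to wrap just the rotate-and-store pair in gcc inline assembly. A sketch only; the helper name and operand constraints are mine, not from the project, and it assumes the Cortex-M0 Thumb instruction set and gcc's extended asm syntax:)

/* Sketch: rotate 'pat' right by 'ror_amount' bits and store the result to
 * the masked-access word at 'port', as the compiler-generated loop does.
 * Names are illustrative. */
static inline void put_rotated (volatile uint32_t *port, uint32_t pat, uint32_t ror_amount)
{
    uint32_t tmp = pat;
    __asm volatile ("rors %0, %1   \n\t"
                    "str  %0, [%2] \n\t"
                    : "+l" (tmp)
                    : "l" (ror_amount), "l" (port)
                    : "memory", "cc");
}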

And yes, I am absolutely positive that I am not running in "debug" mode!

Any insights would be greatly appreciated,
Pete.
11 Replies

Content originally posted in LPCWare by CodeRedSupport on Sun Dec 09 03:12:36 MST 2012
Please provide an example that demonstrates the problem on an LPC11 LPCXpresso board. Unless we can reproduce it here, we can only speculate as to what you may have done wrong.

Content originally posted in LPCWare by micrio on Sat Dec 08 20:19:36 MST 2012
I repeated my tests running in both RAM and Flash with the two compilers.
I noticed a mistake in the previous results: I incorrectly gave the numbers
for Flash with the V4 compiler and for RAM with the V5 compiler.
The corrected results differ even more between the two compilers.
The numbers are the loop times in microseconds for the loop in the
previously listed code.

V4 compiler using Flash and RAM with the different optimization levels:

Opt    Flash (µs)    RAM (µs)
-O3    0.20          0.16
-O2    0.27          0.20
-O1    0.40          0.31
-O0    2.20          1.30
-Os    0.88          0.55

V5 compiler:

Opt    Flash (µs)    RAM (µs)
-O3    0.42          0.30
-O2    0.42          0.30
-O1    0.44          0.30
-O0    2.20          1.30
-Os    0.88          0.52

I think that this shows a broken compiler optimizer in V5.   The performance
numbers show no improvement at optimization levels above -O1, whereas the
V4 compiler behaves as I would expect, improving with each optimization level.
At -O1 and below the performance is the same for the two versions.

I have not made any attempt to align the code in memory to gain any
caching advantage.   There may be some performance advantage to be had there.
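
(One way to experiment further, as a sketch: gcc can pin an optimization level to a single function, so the routine under test can be held at one level while the rest of the project is built normally. The attribute below is a gcc extension; it should be available in the gcc 4.6 series that LPCXpresso v5 ships, but its interaction with the project-wide flags is worth double-checking:)

/* Sketch: force -O3 (or any other level) for just this routine,
 * independent of the per-project optimization setting. */
__attribute__ ((optimize ("O3")))
void disp_bit_pat ()
{
    /* ... body as listed earlier ... */
}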

Pete.

Content originally posted in LPCWare by CodeRedSupport on Sat Dec 08 12:59:30 MST 2012
I see no sign of any problem in the compiler here (though admittedly you haven't actually provided a compilable example, so this is just from looking at your assembler output). And as you point out, the straightforward cycle counts are better for the gcc 4.6.2 compiler in LPCXpresso v5.

I can't help wondering if you are perhaps seeing a GPIO timing issue, or, perhaps more likely, an issue caused by both running from and making data accesses to RAM in a tight loop.

It would be interesting to see what happens if you run this code from Flash instead.
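
(For anyone wanting to try that comparison: by default the function is linked into Flash, and, as I recall, the cr_section_macros.h header shipped with LPCXpresso provides a macro for relocating an individual function into RAM. Treat the exact macro name as an assumption from memory and check the header in your install:)

#include <cr_section_macros.h>   /* LPCXpresso section-placement macros */

/* RAM copy: as I recall, __RAMFUNC(RAM) places the function in a data
 * section that the startup code copies into RAM; without it the function
 * runs from Flash. Verify the macro against cr_section_macros.h in your
 * LPCXpresso installation. */
__RAMFUNC(RAM) void disp_bit_pat_in_ram (void)
{
    /* ... same body as disp_bit_pat above ... */
}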

Regards,
CodeRedSupport

Content originally posted in LPCWare by micrio on Sat Dec 08 11:35:19 MST 2012
I wanted to look at the effect of the optimizer in the two versions of LPCXpresso.
The results are rather curious!

Here are the measured loop times for the loop in question, for both versions
at all optimizer settings:

LPCXpresso V4 loop time:
-O3     0.2 µs
-O2     0.24 µs
-O1     0.63 µs
-O0     2.2 µs
-Os     0.88 µs


LPCXpresso V5 loop time:
-O3     0.3 µs
-O2     0.3 µs
-O1     0.3 µs
-O0     1.3 µs
-Os     0.52 µs

In V5, the optimizer settings above -O1 have no effect on the performance
of this loop.    Did something break?

Pete.

Content originally posted in LPCWare by micrio on Sat Dec 08 10:47:08 MST 2012
I decided to do some more rigorous timing measurements of the two loops.
The V4 loop cycles at a rate of about 0.2 µs.   That is much faster than its
op-code cycle count of 16 would imply, which would be about 0.33 µs
(the CPU runs at 48 MHz, so one cycle is roughly 0.02 µs).

The V5 loop cycles at about 0.27 µs, which corresponds to its cycle count
of 13.    There seems to be no speed-up.

There must be significant processor pipeline efficiencies in the V4 code
that are not there in the V5 code.

Here is the V4 code with the cycle counts listed:

Inner loop.                                           Cycles
0x1000005a: disp_bit_pat+78    ldrb r6, [r3, #0]        2
0x1000005c: disp_bit_pat+80    adds r2, #2              1
0x1000005e: disp_bit_pat+82    rors r6, r0              1
0x10000060: disp_bit_pat+84    str r6, [r1, #32]        2
0x10000062: disp_bit_pat+86    ldrb r4, [r3, #1]        2
0x10000064: disp_bit_pat+88    adds r3, #2              1
0x10000066: disp_bit_pat+90    rors r4, r0              1
0x10000068: disp_bit_pat+92    str r4, [r1, #32]        2
0x1000006a: disp_bit_pat+94    cmp r2, r5               1
0x1000006c: disp_bit_pat+96    bne.n 0x1000005a         3
End of inner loop.                              Total   16



Here is the V5 code with the cycle counts listed:

Inner loop.                            Cycles
1000003e:   adds r2, r7, r6              1
10000040:   ldrb r2, [r2, r3]            2
10000042:   adds r3, #1                  1
10000044:   rors r2, r5                  1
10000046:   str r2, [r4, #32]            2
10000048:   ldr r2, [r0, r1]             2
1000004a:   cmp r2, r3                   1
1000004c:   bgt.n 0x1000003e             3
End of inner loop.               Total   13
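
(Converting those totals to time at 48 MHz is plain arithmetic; a host-side check only, nothing hardware-specific, just to tie the cycle counts to the measured figures:)

#include <stdio.h>

/* Expected loop periods from the raw cycle counts at a 48 MHz core clock. */
int main (void)
{
    const double f_cpu_hz = 48e6;
    printf ("V4 inner loop: 16 cycles -> %.2f us\n", 16 / f_cpu_hz * 1e6);  /* ~0.33 us */
    printf ("V5 inner loop: 13 cycles -> %.2f us\n", 13 / f_cpu_hz * 1e6);  /* ~0.27 us */
    return 0;
}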

Content originally posted in LPCWare by TheFallGuy on Sat Dec 08 03:05:02 MST 2012
There must be something different about your environment, as the code is obviously 'better'. Clock speed?

Hmm, I just had one thought. Is there a minimum allowed time between GPIO accesses? Could this new code be exceeding it?

Content originally posted in LPCWare by micrio on Fri Dec 07 16:34:15 MST 2012
I did try changing the alignment of both the code and the data using padding, but to no avail.   The array being accessed is a byte array, so its elements fall on a word boundary only one time in four, but that is the way it has always worked, even in V4.

Thanks for your suggestions,
Pete.

Content originally posted in LPCWare by TommyBurger on Fri Dec 07 12:00:40 MST 2012
Check for unaligned data access. If r1 is not pointing to a word address, there could be some delay.

Verify that the clock is running at the same rate on both versions.

I know that when running from flash, instruction alignment affected timing. I had a lot of frustration optimizing assembler until I forced the assembly function to start on a four-word boundary. I don't think that happens when running from RAM, but it is something to look at.
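
(For reference, both kinds of alignment can be forced with gcc attributes. A minimal sketch, assuming the gcc toolchain that LPCXpresso uses; the names and array dimensions are illustrative:)

#include <stdint.h>

/* Force the font data onto a word boundary (illustrative array name). */
uint8_t my_font_pat[4][64] __attribute__ ((aligned (4)));

/* Force the routine itself to start on a 16-byte boundary, i.e. the
 * "four word boundary" suggested above. */
__attribute__ ((aligned (16)))
void disp_bit_pat_aligned (void)
{
    /* ... body ... */
}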

Content originally posted in LPCWare by micrio on Fri Dec 07 09:25:30 MST 2012
I counted the clock cycles for the two examples.   The V5 code uses 13 cycles and the V4 code uses 16 cycles to complete a loop, so it seems that the V5 code should run faster.   (At 48 MHz one cycle is about 21 ns, so 13 cycles is roughly 0.27 µs against 0.33 µs for 16 cycles.)

Another factor to consider is the pipeline.   Some instructions can run in parallel and some must be serialized.   Could something in the V5 code be causing a pipeline flush?

Pete.

Content originally posted in LPCWare by micrio on Fri Dec 07 07:19:13 MST 2012

Quote: TheFallGuy
How do you know that it is running slower? How are you measuring it? Could there be other external factors that are causing the problem - such as a lot of interrupts?



The loop writes the bit pattern to an I/O port.   I can see it on a scope.  

This routine runs with interrupts disabled.
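
(As a cross-check on the scope measurement, the Cortex-M0 SysTick timer can count core clocks around the call. A sketch, assuming the CMSIS register and mask names from the LPC11xx device header:)

#include "LPC11xx.h"   /* CMSIS device header; pulls in the SysTick definitions */

extern void disp_bit_pat ();   /* the routine listed earlier in the thread */

/* Returns the number of core clock cycles spent in disp_bit_pat(),
 * valid for intervals shorter than the 24-bit SysTick range. */
uint32_t time_disp_bit_pat (void)
{
    SysTick->LOAD = 0x00FFFFFF;                 /* maximum 24-bit reload value */
    SysTick->VAL  = 0;                          /* writing clears the counter  */
    SysTick->CTRL = SysTick_CTRL_CLKSOURCE_Msk  /* count the core clock        */
                  | SysTick_CTRL_ENABLE_Msk;

    uint32_t start = SysTick->VAL;
    disp_bit_pat ();
    uint32_t end = SysTick->VAL;                /* SysTick counts down         */

    return (start - end) & 0x00FFFFFF;
}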

Another curious thing: in V4, when I single-step through the routine at the instruction level, the PC advances one op code at a time, as expected.    In V5 it frequently jumps ahead a few op codes as you step along.    This is assembly-level stepping, so it should not be affected by the compiler optimization level.

Pete.

Content originally posted in LPCWare by TheFallGuy on Fri Dec 07 02:07:17 MST 2012
How do you know that it is running slower? How are you measuring it? Could there be other external factors that are causing the problem - such as a lot of interrupts?