performance problem with LPCXpresso 5.XX

Discussion created by lpcware Employee on Jun 15, 2016
Latest reply on Jun 15, 2016 by lpcware
Content originally posted in LPCWare by micrio on Thu Dec 06 20:40:59 MST 2012
I moved from versions 4.1.0_190 to 4.3.0_1023 to 5.0.10_1066 to
5.0.12_1083 over the last year or so.   The move from 4 to 5 seriously
degraded the speed of one of my critical routines.   The disassembled code
looks better in V5 than in V4 but is slower for some reason.   The number of
machine instructions in my critical loop went from 10 in V4 to 8 in V5.   A
significant improvement.   But why is it slower?   It runs at about half the speed
of the older and larger version.

Both versions are logically correct code; they produce the correct results.
They both were compiled with -O3 optimization in "Release" mode.  
The source is written in C.   The code runs in RAM, not Flash.   The CPU is
an LPC1114 running at 48 MHz.

Here is the C code.   The code takes data from an array and writes each
bit out an I/O port bit by bit.   It is the speed of the loop that is of interest.
The variables not defined in this code are globals.   The capitalized ones
are #defines.

void disp_bit_pat (void)
{
  int i_temp;
  uint32_t shift;
  uint32_t font_pat;
  uint32_t mux_addr;

  /* We use the masked I/O feature of the Cortex chips. */
  mux_addr = MUX_WHT_BIT;

  /* Rotation amount, derived from the current scan position. */
  shift = MUX_WHT_POS - scan_pos_cnt;

  for (i_temp = 0; i_temp < text_pat_width[txt_row]; i_temp++)
  {
    font_pat = text_pat[txt_row][i_temp];

    /* Rotate font_pat left by shift bits and write it to the masked port. */
    MUX_GPIO->MASKED_ACCESS[mux_addr] = (font_pat << shift) | (font_pat >> (32 - shift));
  }
}
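
As a side note on the expression written to the port: it is a 32-bit rotate left by shift, which is the same as a rotate right by (32 - shift). That matches the rors instruction and the "32 minus" setup visible in both disassemblies. The sketch below only illustrates that equivalence. The helper names rotl32_post and rotr32 are made up for this example and are not part of the original code. It also assumes shift stays in the range 1 to 31, since a shift of 0 would make the right shift count 32, which C leaves undefined.

```c
#include <stdint.h>

/* Rotate left by n bits, written the same way as the expression in the post.
   Assumption: 0 < n < 32. With n == 0 the right shift count would be 32,
   which is undefined behavior in C. */
static uint32_t rotl32_post(uint32_t value, uint32_t n)
{
    return (value << n) | (value >> (32u - n));
}

/* Rotate right by n bits, the operation a Thumb RORS performs for
   counts 1..31. A count of 0 returns the value unchanged. */
static uint32_t rotr32(uint32_t value, uint32_t n)
{
    n &= 31u;
    if (n == 0u) {
        return value;
    }
    return (value >> n) | (value << (32u - n));
}
```

For any n from 1 to 31, rotl32_post(x, n) and rotr32(x, 32 - n) give the same result, so the compiler only needs to compute 32 - shift once before the loop.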

This is the disassembled code from version 4.   I have the critical loop
marked as "inner loop".   This is the code that runs fast.

This assembly is copied from the assembly view and not the assembly
listing.   This code does not appear in the listing because it resides in
a data segment where it is copied into RAM.

0x1000000c: disp_bit_pat+0     push {r4, r5, r6, lr}
0x1000000e: disp_bit_pat+2     ldr r3, [pc, #96]
0x10000010: disp_bit_pat+4     ldr r5, [pc, #96]
0x10000012: disp_bit_pat+6     ldr r3, [r3, #0]         
0x10000014: disp_bit_pat+8     ldr r2, [pc, #96] 
0x10000016: disp_bit_pat+10    lsls r1, r3, #2         
0x10000018: disp_bit_pat+12    ldr r4, [r5, #0]           
0x1000001a: disp_bit_pat+14    ldr r5, [r1, r2]           
0x1000001c: disp_bit_pat+16    movs r0, #3             
0x1000001e: disp_bit_pat+18    subs r0, r0, r4
0x10000022: disp_bit_pat+22    ble.n 0x1000006e
0x10000024: disp_bit_pat+24    lsls r1, r3, #5
0x10000026: disp_bit_pat+26    adds r6, r1, r3
0x10000028: disp_bit_pat+28    lsls r4, r6, #3
0x1000002a: disp_bit_pat+30    ldr r2, [pc, #80]  
0x1000002c: disp_bit_pat+32    adds r1, r4, r3
0x1000002e: disp_bit_pat+34    adds r3, r1, r2
0x10000030: disp_bit_pat+36    ldrb r6, [r3, #0]
0x10000032: disp_bit_pat+38    movs r2, #32
0x10000034: disp_bit_pat+40    subs r0, r2, r0
0x10000036: disp_bit_pat+42    ldr r1, [pc, #72] 
0x10000038: disp_bit_pat+44    subs r2, #31
0x1000003a: disp_bit_pat+46    subs r4, r5, #1
0x1000003c: disp_bit_pat+48    rors r6, r0
0x1000003e: disp_bit_pat+50    ands r4, r2
0x10000040: disp_bit_pat+52    str r6, [r1, #32]
0x10000042: disp_bit_pat+54    adds r3, r3, r2
0x10000044: disp_bit_pat+56    cmp r2, r5
0x10000046: disp_bit_pat+58    beq.n 0x1000006e
0x10000048: disp_bit_pat+60    cmp r4, #0
0x1000004a: disp_bit_pat+62    beq.n 0x1000005a
0x1000004c: disp_bit_pat+64    ldrb r2, [r3, #0]
0x1000004e: disp_bit_pat+66    adds r3, #1
0x10000050: disp_bit_pat+68    rors r2, r0
0x10000052: disp_bit_pat+70    str r2, [r1, #32]
0x10000054: disp_bit_pat+72    movs r2, #2
0x10000056: disp_bit_pat+74    cmp r2, r5
0x10000058: disp_bit_pat+76    beq.n 0x1000006e

Inner loop.
0x1000005a: disp_bit_pat+78    ldrb r6, [r3, #0]
0x1000005c: disp_bit_pat+80    adds r2, #2
0x1000005e: disp_bit_pat+82    rors r6, r0
0x10000060: disp_bit_pat+84    str r6, [r1, #32]
0x10000062: disp_bit_pat+86    ldrb r4, [r3, #1]
0x10000064: disp_bit_pat+88    adds r3, #2
0x10000066: disp_bit_pat+90    rors r4, r0
0x10000068: disp_bit_pat+92    str r4, [r1, #32]
0x1000006a: disp_bit_pat+94    cmp r2, r5
0x1000006c: disp_bit_pat+96    bne.n 0x1000005a
End of inner loop.

0x1000006e: disp_bit_pat+98    pop {r4, r5, r6, pc}

This is the disassembled code from version 5.   I have the critical loop
marked as "inner loop".   This is the code that runs slow.

10000014:   push {r4, r5, r6, r7, lr}
10000016:   ldr r2, [pc, #56] ;
10000018:   ldr r0, [pc, #56] ;
1000001a:   ldr r1, [r2, #0]
1000001c:   ldr r2, [r0, #0]
1000001e:   movs r5, #3
10000020:   ldr r0, [pc, #52] ;
10000022:   subs r5, r5, r1
10000024:   lsls r1, r2, #2
10000026:   ldr r3, [r0, r1]
10000028:   cmp r3, #0
1000002a:   ble.n 0x1000004e
1000002c:   lsls r6, r2, #5
1000002e:   adds r4, r6, r2
10000030:   lsls r6, r4, #3
10000032:   movs r4, #32
10000034:   subs r5, r4, r5
10000036:   ldr r7, [pc, #36] ;
10000038:   ldr r4, [pc, #36] ;
1000003a:   movs r3, #0
1000003c:   adds r6, r6, r2

Inner loop.
1000003e:   adds r2, r7, r6
10000040:   ldrb r2, [r2, r3]
10000042:   adds r3, #1  
10000044:   rors r2, r5
10000046:   str r2, [r4, #32] 
10000048:   ldr r2, [r0, r1] 
1000004a:   cmp r2, r3
1000004c:   bgt.n 0x1000003e
End of inner loop.

1000004e:   pop {r4, r5, r6, r7, pc}
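
Two differences stand out when the two inner loops are read side by side. The V4 loop contains two ldrb/rors/str groups, so its 10 instructions handle two bytes per pass. The V5 loop has one such group and also reloads the loop bound with ldr r2, [r0, r1] on every pass, so its 8 instructions handle one byte. Whether that accounts for the measured slowdown is not confirmed here. The sketch below is one experiment that could be tried under that assumption: keep the width in a local and unroll by two in the source. The names port_reg_t, rotr32 and write_patterns_unrolled are hypothetical, and the port pointer is a stand-in for the real MUX_GPIO->MASKED_ACCESS[mux_addr] register.

```c
#include <stdint.h>

/* Hypothetical stand-in for the masked GPIO register; on the real part
   this would be &MUX_GPIO->MASKED_ACCESS[mux_addr]. */
typedef volatile uint32_t port_reg_t;

/* Rotate right by n bits; a count of 0 returns the value unchanged. */
static uint32_t rotr32(uint32_t value, uint32_t n)
{
    n &= 31u;
    if (n == 0u) {
        return value;
    }
    return (value >> n) | (value << (32u - n));
}

/* Writes each pattern byte, rotated right by rot bits, to *port.
   width is a by-value parameter, so the loop bound cannot be changed by
   the volatile store and does not need to be reloaded from memory.
   The body is unrolled by two, similar in shape to the V4 inner loop. */
static void write_patterns_unrolled(port_reg_t *port, const uint8_t *pat,
                                    int width, uint32_t rot)
{
    int i = 0;

    /* Two bytes per pass while at least two remain. */
    for (; i + 1 < width; i += 2) {
        *port = rotr32(pat[i], rot);
        *port = rotr32(pat[i + 1], rot);
    }

    /* Odd leftover byte, if any. */
    if (i < width) {
        *port = rotr32(pat[i], rot);
    }
}
```

In the original routine this would be called with a local copy of text_pat_width[txt_row], a pointer to text_pat[txt_row], and 32 - shift as the rotate count. The compiler is still free to restructure the loop at -O3, so the result needs to be checked in the disassembly and timed on the hardware.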

My normal response to performance problems is to write the code in
assembler but the compiler code looks good and I probably could not
do any better.   Especially since the version 5 code looks really good
but runs really slowly.

And yes, I am absolutely positive that I am not running in "debug" mode!

Any insights would be greatly appreciated,