Fastest way to iterate through array on M4

lpcware · ‎06-15-2016

Content originally posted in LPCWare by wlamers on Thu Sep 04 02:32:16 MST 2014
I have a time critical ISR that needs to iterate through an array of size 256 (preferably 1024 but 256 is minimum) and check if a value matches the arrays contents. A bool will be set to true is this is the case. MCU is a LPC4357, cortex M4 core, compiler GCC. I already have combined optimisation level 2 (3 is slower) and placing the function in RAM instead on flash. I also use pointer aritmetic and a for loop which does down counting instead of up (checking if i!=0 is faster than checking if i<256). All in all I end up with a duration of 12.5us which has to be reduced drastically to be feasible. This is the (pseudo) code I use now:

uint32_t i;
uint32_t *array_ptr = &theArray[0];
uint32_t compareVal = 0x1234ABCD;
bool validFlag = false;

for (i=256; i!=0; i--)
{
    if (compareVal == *array_ptr++)
    {
         validFlag = true;
         break;
     }
}

What would be the absolute fastest way to do this? Using inline assembly is allowed. Other 'less elegant' tricks also.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Pacman on Fri Sep 05 06:17:06 MST 2014
This is a real good discussion. :)

I'll just give one small hint as well:
Only sort / optimize the array when you're writing to it.
When there's no change, do nothing. Perhaps set a flag for the interrupt that it should relax.

If you need a value in the interrupt, have a pointer to the last value that you've found.

(Those hints may be completely irrelevant, as I do not know exactly what the job is about, but hopefully you can make use of some of this).

-I can't agree more on using the LDM/LDMIA instructions. You can find the assembly language instructions at the ARM Information Center; get both the Technical Reference Manual and the Generic User's Guide.

You may also want to enable loop-unrolling on your commandline for the specific source-file that contains your interrupt (though I think writing the interrupt in assembly code would be the best solution).

... I wonder... Which compiler version are you using ?

lpcware · ‎06-15-2016

Content originally posted in LPCWare by starblue on Thu Sep 04 05:56:27 MST 2014

Quote: wlamers
btw it seems that there is an 'error' in the manual in figure 10 on page 36 depicting the AHB multilayer matrix. Is says 128kB LOCAL SRAM, but that is for the flashless parts only. The flash parts have 32 kB.

Looks like they had a mix-up with the diagrams. It is basically the same as on the page before, and it is even missing the flash.

Quote: wlamers

I also have several memcpy's in that that could be optimised. In fact I copy a uint64_t to a uint8_t array. Now I use:

memcpy(array_ptr, &theuint64val, 8);

Using

*(array_ptr + 2) = theuint64val;
*(array_ptr + 3) = theuint64val >> 8;
*(array_ptr + 4) = theuint64val >> 16;
*(array_ptr + 5) = theuint64val >> 24;
*(array_ptr + 6) = theuint64val >> 32;
*(array_ptr + 7) = theuint64val >> 40;
*(array_ptr + 8) = theuint64val >> 48;
*(array_ptr + 9) = theuint64val >> 56;

is slightly faster (8%).

But I think the cortex M4 has several instructions to copy multiple bytes at a time. Do you have a clue how to write this in assembly?

For aligned words the best is probably to load several registers with LDMIA and store them with STMIA. But I don't know what's best for unaligned bytes.

Jürgen

lpcware · ‎06-15-2016

Content originally posted in LPCWare by wlamers on Thu Sep 04 05:15:44 MST 2014

Quote: starblue

No, it has three separate buses, one for instructions (I-code) and two for data (D-code and System). In the user manual see the figure "AHB multilayer matrix connections".

I checked into this and (just by good luck) I have put the array in section 0x2000 0000 (AHB SRAM 32 kB section) which is only connected to the system bus and not the D-code bus and the pointer and test value in section 0x1000 0000 (local SRAM 32 kB section) which is connected to the D-code (and I-code) bus. Hence the data can be accessed simultanious. EDIT, Ohw btw it seems that there is an 'error' in the manual in figure 10 on page 36 depicting the AHB multilayer matrix. Is says 128kB LOCAL SRAM, but that is for the flashless parts only. The flash parts have 32 kB.

Quote: starblue

Given that the array is fixed I would go for the binary search approach.

Did that, total ISR time down to 4.24us. Hooray! And still there is potential to cut another 1us from that since I also have several memcpy's in that that could be optimised. In fact I copy a uint64_t to a uint8_t array. Now I use:

memcpy(array_ptr, &theuint64val, 8);

Using

*(array_ptr + 2) = theuint64val;
*(array_ptr + 3) = theuint64val >> 8;
*(array_ptr + 4) = theuint64val >> 16;
*(array_ptr + 5) = theuint64val >> 24;
*(array_ptr + 6) = theuint64val >> 32;
*(array_ptr + 7) = theuint64val >> 40;
*(array_ptr + 8) = theuint64val >> 48;
*(array_ptr + 9) = theuint64val >> 56;

is slightly faster (8%).

But I think the cortex M4 has several instructions to copy multiple bytes at a time. Do you have a clue how to write this in assembly?

Thing to note is that the starting address of the location in the array where to put the uint64_t is dynamic and therfore not word aligned (at least I can not guantee that).

If we could speed this up also I would be really happy with the result.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by starblue on Thu Sep 04 04:30:50 MST 2014

Quote: wlamers

Quote: starblue
Are your program and data in different RAMs, so that they can be accessed in parallel?

Both are at least in a different section.

That's probably not enough, you need to tell the linker to put the code section in a different RAM area.

Quote: wlamers
But I think that I remember from the "Definitive guide' book that the M4 can only access flash and all internal RAM in parallel.

No, it has three separate buses, one for instructions (I-code) and two for data (D-code and System). In the user manual see the figure "AHB multilayer matrix connections".

Quote: wlamers
Strange thing is that when I place the function in RAM it is slightly faster (5-10%).

In my experience code in RAM is about three times faster than in flash (with 9 wait states at 200MHz). But for small pieces of code like this the flash may be cached most of the time, so it might already be quite fast.

Given that the array is fixed I would go for the binary search approach.

Jürgen

lpcware · ‎06-15-2016

Content originally posted in LPCWare by wlamers on Thu Sep 04 04:05:15 MST 2014

Quote: wmues
I would sort the array, and use binary search. You will need 8 checks for a 256er array.

That should be possible I guess, since I could sort the array on forehand. This should take only 10 iterations for the 1024 case. Good point, thanks.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by wlamers on Thu Sep 04 04:03:42 MST 2014

Quote: starblue
In assembler you could unroll the loop a bit and use LDMIA to load 4 or 8 values into registers before you compare them.

Are your program and data in different RAMs, so that they can be accessed in parallel?

Both are at least in a different section. But I think that I remember from the "Definitive guide' book that the M4 can only access flash and all internal RAM in parrallel. Strange thing is that when I place the function in RAM it is slighly faster (5-10%).

I could unroll the loop in e.g. 8 checks but I have no clue in how to set this up is assembly. Can you hint me in the right direction?

lpcware · ‎06-15-2016

Content originally posted in LPCWare by wlamers on Thu Sep 04 04:00:49 MST 2014

Quote: TheFallGuy
Most instructions are single-cycle, BUT
- a read from memory may take several cycles
- a branch will cause a pipeline flush

I suggest you step through the code and look at the "cycle" and "cycledelta" registers - these tell you how many cycles have been executed by the core.

Note - your *fastest* code may be to unroll the loop completely (i.e. don't have a loop at all)

Thanks for the tip. This is the result:

1000033c:   ldr.w   r0, [r2, #4]!  2 cycles
10000340:   ldr     r1, [sp, #12]  2 cycles
10000342:   cmp     r0, r1  1 cycle
10000344:   beq.w   0x10000756  1 cycle
10000348:   subs    r3, #1  1 cycle
1000034a:   bne.n   0x1000033c  1cycle

Total of 8 cycles.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by wmues on Thu Sep 04 03:42:02 MST 2014
I would sort the array, and use binary search. You will need 8 checks for a 256er array.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by TheFallGuy on Thu Sep 04 03:37:52 MST 2014
Most instructions are single-cycle, BUT
- a read from memory may take several cycles
- a branch will cause a pipeline flush

I suggest you step through the code and look at the "cycle" and "cycledelta" registers - these tell you how many cycles have been executed by the core.

Note - your *fastest* code may be to unroll the loop completely (i.e. don't have a loop at all)

lpcware · ‎06-15-2016

Content originally posted in LPCWare by starblue on Thu Sep 04 03:24:10 MST 2014
In assembler you could unroll the loop a bit and use LDMIA to load 4 or 8 values into registers before you compare them.

Are your program and data in different RAMs, so that they can be accessed in parallel?

lpcware · ‎06-15-2016

Content originally posted in LPCWare by wlamers on Thu Sep 04 03:19:13 MST 2014
Good idea to think of if that way.

12.4us is with some other overhead, but the for loop itself should not take more than 6-8 us to be feasible. Clock is 192MHz. Than 192e6*6e-6=1152/256 = 4 instructions.

One down count
One check id i!=0
One check if val==*ptr
One increasing of the ptr

Should be possible I guess. Problem is I have very limited assembly knowledge to set this up.

This is the unoptimized disassembly:

                   
1a004698:   ldr.w   r3, [r7, #144]  ; 0x90
1a00469c:   adds    r2, r3, #4
1a00469e:   str.w   r2, [r7, #144]  ; 0x90
1a0046a2:   ldr     r2, [r3, #0]
1a0046a4:   ldr     r3, [r7, #100]  ; 0x64
1a0046a6:   cmp     r2, r3
1a0046a8:   bne.n   0x1a0046b2
1a0046aa:   movs    r3, #1
1a0046ac:   strb.w  r3, [r7, #150]  ; 0x96
1a0046b0:   b.n     0x1a0046c4
1a0046b2:   ldr.w   r3, [r7, #152]  ; 0x98
1a0046b6:   subs    r3, #1
1a0046b8:   str.w   r3, [r7, #152]  ; 0x98
1a0046bc:   ldr.w   r3, [r7, #152]  ; 0x98
1a0046c0:   cmp     r3, #0
1a0046c2:   bne.n   0x1a004698

This is the loop of the optimized disassembly:

1000033c:   ldr.w   r0, [r2, #4]!
10000340:   ldr     r1, [sp, #12]
10000342:   cmp     r0, r1
10000344:   beq.w   0x10000756
10000348:   subs    r3, #1
1000034a:   bne.n   0x1000033c

The optimized code is just 5 instructions but it takes much longer than that. Probably since some instructions take more than one clock cycle? Any clues?

lpcware · ‎06-15-2016

Content originally posted in LPCWare by TheFallGuy on Thu Sep 04 02:46:54 MST 2014
How about turning this around, to work out if it is even possible?

What are you constraints? You say that 12.5uS has to be reduced dramatically - to what? How quickly must this be done?
What is your clock speed?

Then, using simple math, you can work out how many instructions you have to play with. Divide by 256 (or 1024) and you will know how many instructions you have, per iteration. If it is less than one, you have a problem...

Fastest way to iterate through array on M4

Fastest way to iterate through array on M4

LPC43xx