Question on delay loops.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by micrio on Wed Jun 16 08:55:40 MST 2010
I have some code that spins in a delay loop. It is written as inline assembly. I am getting erratic results for the amount of time delayed. The difference is that one version of the loop runs 50% slower.

This code runs fast:

000005b2 <dly2_490>:
5b2: 3b01 subs r3, #1
5b4: d1fd bne.n 5b2 <dly2_490>

This code runs 50% slower:

000005be <dly2_499>:
5be: 3b01 subs r3, #1
5c0: d1fd bne.n 5be <dly2_499>

The difference seems to be the alignment. Both pieces of code work correctly; the only difference is the delay in the loop. I can easily switch between the two situations by putting an extraneous line of code earlier in the routine.

I am thinking that this might have something to do with the alignment of the instructions in the 32-bit-wide memory. Could there be an instruction queue issue here?

As I make minor changes in the routine, the delay effect switches back and forth. What I want is for the loop to spin at the fastest rate so that the timing is produced with the highest resolution.

I can put an aligned attribute on the function to ensure that it stays in one alignment position. I could then tune the opcodes to produce the desired alignment of the loop. I would hope that this would insulate me from changes that I make elsewhere in the program.
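A minimal sketch of the aligned-function idea, assuming GCC or Clang. The names `delay_loop`, `delay_loop_offset` and `check_delay_alignment`, and the choice of 16 bytes, are illustrative assumptions and not taken from the original code. The loop body is a plain C placeholder standing in for the inline-assembly subs/bne loop. The attribute only pins where the function starts; where the loop itself lands still depends on the instructions ahead of it, so padding opcodes (or a `.balign` directive inside the assembly) would still be needed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical delay routine. aligned(16) asks the compiler to start the
 * function on a 16-byte boundary, so code added elsewhere in the program
 * cannot shift it relative to that boundary. */
__attribute__((aligned(16), noinline))
void delay_loop(uint32_t count)
{
    /* Placeholder body: the original uses an inline-assembly subs/bne loop.
     * volatile keeps the compiler from deleting the empty loop. */
    for (volatile uint32_t i = count; i != 0; i--) {
    }
}

/* Offset of the function entry within a 16-byte block.
 * Thumb function pointers carry bit 0 set, so it is masked off first. */
unsigned delay_loop_offset(void)
{
    uintptr_t addr = (uintptr_t)&delay_loop & ~(uintptr_t)1;
    return (unsigned)(addr & 0xFu);
}

/* Call once at start-up to catch a toolchain that ignores the attribute. */
void check_delay_alignment(void)
{
    assert(delay_loop_offset() == 0u);
}
```

Checking the disassembly after each build remains the only sure way to see where the loop actually ended up.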

Can anyone shed some light on this mystery?

Thanks,
Pete.

Content originally posted in LPCWare by jesari on Sat Jul 11 02:22:10 MST 2015
A pipeline graph...

Content originally posted in LPCWare by jesari on Fri Jul 10 14:55:42 MST 2015
Hi,
I found the same issue on an LPC1114 (Cortex-M0), and I think I've got an explanation for this behaviour:

- It seems the flash is read 16 bytes (8 halfwords) at a time. A read takes 3 cycles, but the 128 bits are then latched and can be read back in a single cycle. Older LPCs with the MAM also used 128-bit-wide flash, and newer Cortex-M parts are probably the same...

- The code loop is 2 halfwords long and, if it fits in one flash chunk, it executes at maximum speed because the flash is read only once. Subsequent iterations fetch the opcodes directly from the latches.

- So why is execution slow if the loop is located at a 0xXXXXXXXA address? That's due to pipelining: when the BNE instruction is executed, the PC is one halfword ahead, reading a potential opcode to be executed. If this dummy fetch crosses the chunk boundary, it forces a flash read that takes 2 extra cycles, and worse: the latched loop code is lost and has to be read from the flash again on every iteration.

I'm not sure if this is really true. Maybe some people at NXP know the details that are missing from the User Manuals and can confirm or deny it...
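The hypothesis above can be written down as a toy model and compared with the measurements posted in this thread. This is only a sketch under stated assumptions: the function names are made up, and the lookahead is modelled as a word-aligned 32-bit fetch one word past the BNE rather than one halfword, because that choice reproduces the reported results (loops at 0x570 to 0x578 fast, 0x57a to 0x57e slow). None of this is confirmed by NXP documentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FLASH_LINE_BYTES 16u  /* 128-bit flash read width, as described above */
#define FETCH_WORD_BYTES 4u   /* assumption: 32-bit, word-aligned fetches */

/* Index of the 16-byte flash line that holds a byte address. */
static uint32_t flash_line(uint32_t addr)
{
    return addr / FLASH_LINE_BYTES;
}

/* Toy model: true if a two-instruction subs/bne loop starting at loop_addr
 * is predicted to keep running from the latched flash line.
 * Assumption: while the bne executes, the core has already requested the
 * word that follows the word holding the bne. */
bool loop_predicted_fast(uint32_t loop_addr)
{
    assert((loop_addr & 1u) == 0u);          /* Thumb code is halfword-aligned */
    uint32_t bne_addr   = loop_addr + 2u;    /* subs is one halfword long */
    uint32_t bne_word   = bne_addr & ~(FETCH_WORD_BYTES - 1u);
    uint32_t ahead_word = bne_word + FETCH_WORD_BYTES;
    return flash_line(ahead_word) == flash_line(loop_addr);
}
```

Under these assumptions, 5 of the 8 halfword positions in each 16-byte line come out fast and 3 slow, which matches the 5:3 pattern reported for the LPC1114.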

Content originally posted in LPCWare by rkiryanov on Mon Jun 28 07:38:19 MST 2010

Quote: micrio
FLASHCFG only helps if you want to run the CPU at a slower speed. If the CPU is slowed then it can require fewer (slowed) cycles to access the flash.

Quote:
3 system clocks flash access time (for system clock frequencies of up to 50 MHz).

Quote: micrio
The prefetch unit caused the odd delay behavior that was discussed earlier.

What is the "odd access"?

Quote:
ARM Cortex-M0:
Thumb instruction: A halfword that specifies an operation for an ARM processor in Thumb state to perform. Thumb instructions must be halfword-aligned.

There are only halfword instructions in your listing. Your delays are caused by prefetch unit misses.

Content originally posted in LPCWare by micrio on Mon Jun 28 07:28:40 MST 2010
FLASHCFG only helps if you want to run the CPU at a slower speed. If the CPU is slowed then it can require fewer (slowed) cycles to access the flash.
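To make the point concrete, here is a small sketch of how a flash wait-state setting could be chosen from the system clock. The "3 system clocks up to 50 MHz" step is the one quoted in this thread; the 20 MHz and 40 MHz steps, the field position in bits 1:0, and all function names are assumptions that should be checked against the FLASHCFG section of the user manual for your part.

```c
#include <assert.h>
#include <stdint.h>

#define FLASHTIM_MASK 0x3u  /* assumption: FLASHTIM occupies bits 1:0 of FLASHCFG */

/* Returns the FLASHTIM field value for a given system clock in Hz. */
uint32_t flashtim_for_clock(uint32_t sysclk_hz)
{
    assert(sysclk_hz <= 50000000u);            /* beyond the quoted 50 MHz limit */
    if (sysclk_hz <= 20000000u) return 0u;     /* assumed: 1 system clock  */
    if (sysclk_hz <= 40000000u) return 1u;     /* assumed: 2 system clocks */
    return 2u;                                 /* quoted:  3 system clocks */
}

/* Merges a FLASHTIM value into a FLASHCFG register value while leaving
 * all other bits untouched. */
uint32_t flashcfg_with_flashtim(uint32_t flashcfg, uint32_t flashtim)
{
    return (flashcfg & ~FLASHTIM_MASK) | (flashtim & FLASHTIM_MASK);
}
```

At 48 MHz this returns the 3-clock setting, so there is no faster option to pick; a shorter access time only becomes available once the CPU clock is lowered.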

Perhaps I used the wrong term, I said cache but should have said prefetch unit.

My question was does the prefetch unit operate on access to RAM? The prefetch unit caused the odd delay behavior that was discussed earlier. The explanation made perfect sense. I want to know if I can expect the same behavior from RAM access?

Thanks,
Pete.

Content originally posted in LPCWare by rkiryanov on Mon Jun 28 06:55:01 MST 2010

Quote: micrio
Another question concerning the delay associated with caching

There is no cache in ARMv6-M (the Cortex-M0 architecture).

Quote: micrio
Does RAM run at CPU speed (48 MHz) in a LPC111X family chip?

Yes.

Quote: micrio
Would I expect a similar delay when indexing through an array that is resident in RAM?

You can expect delays caused by fetching instructions from flash memory. FLASHCFG does not help?

Content originally posted in LPCWare by micrio on Mon Jun 28 06:45:03 MST 2010
Another question concerning the delay associated with caching; Would I expect a similar delay when indexing through an array that is resident in RAM? In other words, would I expect to see the same mix of fast and slow access times as I stepped through RAM reading words sequentially?

Does RAM run at CPU speed (48 MHz) in a LPC111X family chip?

Thanks,
Pete.

Content originally posted in LPCWare by rkiryanov on Tue Jun 22 08:08:00 MST 2010

Quote: NXP_Europe
Now when we de-align the instruction by 5 bytes, the instruction will be divided over two addresses, so a double fetch is needed to get the instruction into the ALU

ARM Cortex-M3:

A 3-word entry Prefetch Unit (PFU). One word is fetched at a time. This can be two Thumb instructions, one word-aligned Thumb 32-bit instruction, or the upper/lower halfword of a halfword-aligned Thumb 32-bit instruction with one Thumb instruction, or the lower/upper halfword of another halfword-aligned Thumb 32-bit instruction. All fetch addresses from the core are word aligned. If a Thumb 32-bit instruction is halfword aligned, two fetches are necessary to fetch the Thumb 32-bit instruction. However, the 3-entry prefetch buffer ensures that a stall cycle is only necessary for the first halfword Thumb 32-bit instruction fetched.
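The fetch rule in that paragraph comes down to counting how many word-aligned 32-bit fetches cover one instruction. A minimal sketch, with a made-up function name, assuming only what the quoted text states (word-aligned fetches, halfword-aligned instructions of 2 or 4 bytes):

```c
#include <assert.h>
#include <stdint.h>

/* Number of word-aligned 32-bit fetches that cover one Thumb instruction.
 * addr:       halfword-aligned address of the instruction
 * size_bytes: 2 for a 16-bit instruction, 4 for a 32-bit instruction */
unsigned fetches_for_instruction(uint32_t addr, unsigned size_bytes)
{
    assert((addr & 1u) == 0u);
    assert(size_bytes == 2u || size_bytes == 4u);
    uint32_t first_word = addr / 4u;
    uint32_t last_word  = (addr + size_bytes - 1u) / 4u;
    return (unsigned)(last_word - first_word + 1u);
}
```

A 16-bit instruction always needs exactly one fetch, wherever it sits. Only a 32-bit instruction at a halfword-aligned (not word-aligned) address needs two, so a loop built from 16-bit subs and bne instructions cannot be split across fetch words.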

ARM Cortex-M0:

Thumb instruction: A halfword that specifies an operation for an ARM processor in Thumb state to perform. Thumb instructions must be halfword-aligned.

Content originally posted in LPCWare by NXP_Europe on Tue Jun 22 07:52:57 MST 2010
Gentlemen,
First of all, I need to emphasize that the M3 and M0 differ in a way that matters here. The M3 has a Harvard architecture, with separate data and instruction buses, whereas the M0 has a von Neumann architecture, with a combined data and instruction bus. So don't confuse the two.

The 5:3 ratio you see between fast and slow operation is easily explained. Most M3/M0 instructions are 16 bits wide, and the instruction bus is 32 bits wide, right?
So if we now de-align the instruction in memory by 1 byte, the instruction will stay in the same address, so with only one fetch the instruction is loaded into the ALU.
Now when we de-align the instruction by 5 bytes, the instruction will be divided over two addresses, so a double fetch is needed to get the instruction into the ALU.
See the attached picture; it might help to explain this phenomenon.

Kind regards,

Content originally posted in LPCWare by micrio on Sat Jun 19 15:54:01 MST 2010
I found this in an NXP app note:

Quote:
The Cortex-M3 provides a separate bus for instruction access (ICode) and data access (DCode) in the memory space. The flash accelerator includes an array of eight 128-bit buffers to store both instructions and data. During linear code execution, the next four 32-bit instructions are stored in the code instructions buffer. Other buffers are used for storing instructions and data at code branches. The replacement strategy of the flash accelerator attempts to maximize the chances that potentially reusable information is retained until it is needed again.

This explains the issues that I am running into. However, the Cortex-M0 documents do not mention this feature. It would have saved some banging of my head on the wall.

Pete.

Content originally posted in LPCWare by rkiryanov on Thu Jun 17 11:00:49 MST 2010
17.10 Flash memory access, FLASHCFG

On LPC2000 it was MAM :)

Content originally posted in LPCWare by micrio on Thu Jun 17 10:51:36 MST 2010
Your suggestion that the MAM system may cause the problem seems logical.
However I don't see any mention of a MAM in the Cortex M0 chips.
Could it be there but not listed in the spec?

Pete.

Content originally posted in LPCWare by rkiryanov on Wed Jun 16 21:43:34 MST 2010
1. See MAM (Memory Accelerator Module) settings.
2. Use a RAM function (run the loop from RAM).

Content originally posted in LPCWare by fastmapper on Wed Jun 16 19:28:41 MST 2010
I have done some of my own timing analysis on an LPC1343 running at 72 MHz. Here is a table of the clock cycles per loop that I observed, depending on the address at the top of the loop:

Address  Cycles Per Loop
-------  ---------------
  xx0           3
  xx2           3
  xx4           3
  xx6           3
  xx8           7
  xxA           7
  xxC           6
  xxE           7

It looks like my results are similar, but not the same as yours.

I expect that different loop timings occur due to the operation of instruction prefetch and the varying states that it will be in with the different code addresses.
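To put the table in terms of delay time, the cycle counts can be converted with delay = iterations × cycles per loop ÷ clock frequency. A minimal sketch; the function name is made up for illustration, and the cycle counts fed into it would be the measured ones from the table above.

```c
#include <assert.h>
#include <stdint.h>

/* Delay in nanoseconds for `iterations` passes through the loop.
 * The arithmetic is done in 64 bits; the result is truncated to whole ns.
 * Assumes iterations * cycles_per_loop stays below about 1.8e10 so that
 * the multiplication by 1e9 does not overflow. */
uint64_t loop_delay_ns(uint32_t iterations, uint32_t cycles_per_loop,
                       uint32_t cpu_hz)
{
    assert(cpu_hz != 0u);
    uint64_t cycles = (uint64_t)iterations * cycles_per_loop;
    return (cycles * 1000000000u) / cpu_hz;
}
```

At 72 MHz, 1000 iterations take about 41.7 µs at 3 cycles per loop and about 97.2 µs at 7 cycles per loop, so the same count can give more than twice the delay depending on where the loop sits.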

Content originally posted in LPCWare by micrio on Wed Jun 16 12:12:11 MST 2010
It is an LPC1114/301.

Content originally posted in LPCWare by CodeRedSupport on Wed Jun 16 11:10:11 MST 2010
What MCU are you using?

Content originally posted in LPCWare by micrio on Wed Jun 16 09:58:18 MST 2010
I pushed the loop up through memory by putting a NOP opcode before it. Here are the results:

This is the loop in question.

0000056c <dly_449>:
56c: 3801 subs r0, #1
56e: d1fd bne.n 56c <dly_449>

The position of the loop in memory and the speed of execution:
56c slow
56e slow
570 fast
572 fast
574 fast
576 fast
578 fast
57a slow
57c slow
57e slow
580 fast
582 fast

There seems to be a memory position issue here. The repeating cycle of 16 bytes makes some sense. I can't explain the 5:3 ratio.
Any thoughts?

Pete.