Timing issue (flash, GPIO, wait states)

lpcware · ‎06-15-2016

Content originally posted in LPCWare by jsl123 on Thu Oct 10 03:31:08 MST 2013
Hi!

First please have a look at the first picture. It shows the status on a GPIO pin for a 30MHz configured LPC810.
(Just added the "raw" files since the resolution was too bad...)

This is the result of the following code:

while (1) {
    LPC_GPIO_PORT->SET0 = _BV (LED_LOCATION);
    LPC_GPIO_PORT->CLR0 = _BV (LED_LOCATION);
    LPC_GPIO_PORT->SET0 = _BV (LED_LOCATION);
    LPC_GPIO_PORT->CLR0 = _BV (LED_LOCATION);
    LPC_GPIO_PORT->SET0 = _BV (LED_LOCATION);
    LPC_GPIO_PORT->CLR0 = _BV (LED_LOCATION);
    LPC_GPIO_PORT->SET0 = _BV (LED_LOCATION);
    LPC_GPIO_PORT->CLR0 = _BV (LED_LOCATION);
}

in complete absence of interrupts and/or timers
which results in the following assembler:

  24:   22a0            movs    r2, #160        ; 0xa0
  26:   0612            lsls    r2, r2, #24
  28:   2088            movs    r0, #136        ; 0x88
  2a:   0180            lsls    r0, r0, #6
  2c:   2304            movs    r3, #4
  2e:   218a            movs    r1, #138        ; 0x8a
  30:   0189            lsls    r1, r1, #6
  32:   5013            str     r3, [r2, r0]
  34:   5053            str     r3, [r2, r1]
  36:   5013            str     r3, [r2, r0]
  38:   5053            str     r3, [r2, r1]
  3a:   5013            str     r3, [r2, r0]
  3c:   5053            str     r3, [r2, r1]
  3e:   5013            str     r3, [r2, r0]
  40:   5053            str     r3, [r2, r1]
  42:   e7f6            b.n     32 <main+0x32>

That looks reasonable...

You can see the groups of four pulses, with the branch in between.
But what puzzles me is the difference between the "set time" (70ns) and the "clear time" (30ns)! (It's not 33.3ns because of the limited resolution of my logic analyzer.)

If it were caching or other internal artefacts, I would perhaps assume a delay on the *first* "set" but not on later ones. But here every "set" is "long", i.e. 2 cycles, and every "clear" is one cycle.

Only to be complete. The second image shows the same for the internal PLL and the dividers setup to deliver 24MHz.
So it's the same here...

Does anyone have any explanation for this?

Thanks for reading! Salut,
Jo"rg

lpcware · ‎06-15-2016

Content originally posted in LPCWare by cpldcpu on Fri Oct 11 20:03:19 MST 2013
Btw, wouldn't this be a good opportunity to use the integrated System tick timer in the Cortex M0+? Has anybody tried that?
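
A minimal sketch of how cycle counting with the SysTick timer could work, assuming the counter is free-running with its reload value set to 0xFFFFFF and that it wraps at most once between the two readings. The helper name systick_elapsed is hypothetical; on hardware the two values would come from reading the SysTick current-value register before and after the code under test.

```c
#include <assert.h>
#include <stdint.h>

#define SYSTICK_MASK 0x00FFFFFFu   /* SysTick is a 24-bit down-counter */

/* Cycles elapsed between two counter readings.
 * Assumption: reload value is 0xFFFFFF and at most one wrap occurred,
 * so plain modulo-2^24 arithmetic gives the right answer. */
static uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    assert(start <= SYSTICK_MASK && end <= SYSTICK_MASK);
    return (start - end) & SYSTICK_MASK;   /* counter counts down */
}
```

The measured count includes the overhead of reading the counter itself, so it is worth timing an empty section first and subtracting that.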

lpcware · ‎06-15-2016

Content originally posted in LPCWare by cpldcpu on Fri Oct 11 20:01:47 MST 2013
Could the observed issue also be a sampling artifact from the logic analyzer? I think it would be more accurate to time a larger number of instructions in a loop.

When I worked on cycle-accurate code (see the WS2812 thread in this forum), the instruction timing behaved as expected once I set the waitstates to zero. Although I have to admit that I probably would not have noticed if only one or two cycles were added to my loop.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by rocketdawg on Fri Oct 11 07:29:15 MST 2013

Quote: cpldcpu

According to the LPC81X block diagram, the GPIO is not connected through the AHB bus. Single cycle port toggling should be possible both from flash and sram...

I see that. Thanks for pointing that out.
So it has to be the core stalling for a clock. I wonder if starblue may be correct that it is an instruction fetch.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by jsl123 on Fri Oct 11 05:56:20 MST 2013
Hi cpldcpu, hi All!

Quote: cpldcpu

Quote:

0x0 -> 1 system clock flash access time

This corresponds to zero waitstates.

OK. But then again please have a look at the different pictures (@24MHz) above. There is one running from RAM where you can see no additional cycles: one (Thumb) instruction, one cycle.
And then there is one running from flash, which shows two instructions where one of them seems to run for an additional cycle!

If you say that this is the mentioned "1 system clock" for flash access then that would mean that running from ram takes _no_ time to fetch the instruction?
Hmm.
According to rocketdawg this is a 2-stage pipeline system. So running from RAM can keep the pipeline completely filled, while running from flash there must be "gaps" in the pipeline while the core fetches from the flash (with "1 system clock"). So I'd call that a wait state compared to the zero wait states from RAM? :-)
Or am I missing something?

Salut, Jo"rg

lpcware · ‎06-15-2016

Content originally posted in LPCWare by cpldcpu on Fri Oct 11 05:40:33 MST 2013

Quote: rocketdawg
Yea, remember the CM0+ is a 2 stage pipeline Von Neumann machine.
The AHB bus must switch between read and write and that should take one cycle unless it has been improved with CM0+.
so instructions are AHB reads.
write to a port means the bus must switch to write, then move the data.
then read instructions again
or something like that.

but like starblue, it is just a guess.

According to the LPC81X block diagram, the GPIO is not connected through the AHB bus. Single cycle port toggling should be possible both from flash and sram...

lpcware · ‎06-15-2016

Content originally posted in LPCWare by cpldcpu on Fri Oct 11 05:37:31 MST 2013

Quote:

0x0 -> 1 system clock flash access time

This corresponds to zero waitstates.

Well, with zero waitstates you will run both the flash and the core out of spec above 30 MHz. With one waitstate it's only the core. :)

Btw, today I noticed in the errata that the built-in ROM functions do not work with zero-waitstate access. So be careful...

lpcware · ‎06-15-2016

Content originally posted in LPCWare by jsl123 on Fri Oct 11 05:32:14 MST 2013
Hi cpldcpu!

Can you elaborate a little on how you'd set the flash controller to zero wait states?

If I look into the "LPC800 User manual" (UM10601, rev. 1.2 from 14 March 2013) I can see on page 248:
In section 20.4.1 (Flash configuration register) FLASHCFG
bit 1:0 FLASHTIM:
0x0 -> 1 system clock flash access time
0x1 -> 2 system clock flash access time
0x2 and 0x3 -> reserved
bit 31:2 reserved
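
A minimal sketch of how the FLASHTIM field listed above could be updated. It assumes the reserved bits 31:2 should be written back exactly as read (check the user manual for your device), and the helper name flashcfg_set_flashtim is hypothetical. The actual register read and write are left out because the register's name depends on the device header in use.

```c
#include <assert.h>
#include <stdint.h>

#define FLASHTIM_MASK 0x3u   /* FLASHCFG bits 1:0, per the listing above */

/* Compute the value to write back to FLASHCFG.
 * current:  value just read from FLASHCFG
 * flashtim: 0x0 = 1 system clock access time, 0x1 = 2 system clocks
 * Reserved bits 31:2 are preserved (assumption: write back as read). */
static uint32_t flashcfg_set_flashtim(uint32_t current, uint32_t flashtim)
{
    assert(flashtim <= 0x1u);   /* 0x2 and 0x3 are reserved */
    return (current & ~(uint32_t)FLASHTIM_MASK) | flashtim;
}
```

On the target this would be used as a read-modify-write: read FLASHCFG, pass the value through the helper, and write the result back.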

So how can you set zero wait states, and why do you mention "up to 30 MHz" in your post when the LPC8xx family is only supposed to be clocked up to 30 MHz anyway? :-)
Can it be that your info is from another device family?

Salut, Jo"rg

lpcware · ‎06-15-2016

Content originally posted in LPCWare by cpldcpu on Fri Oct 11 05:15:01 MST 2013
You can actually set the flash in the LPC810 to zero waitstates up to 30 MHz, so there should be no need to execute code from the SRAM to attain maximum speed.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by jsl123 on Fri Oct 11 03:42:23 MST 2013
Hi rocketdawg!

Thanks for your answer, but after having a look into the NXP datasheet and the ARM documentation I think "starblue" was right.

BTW. The CM0+ has this feature called "single-cycle I/O port" which does the trick but only if the code runs from zero wait cycle memories (i.e. RAM).
The instruction fetches from flash still give you a wait cycle every second thumb instruction though. :-)

Salut, Jo"rg

lpcware · ‎06-15-2016

Content originally posted in LPCWare by jsl123 on Fri Oct 11 03:38:02 MST 2013
Hi starblue!

Yes. I had the same idea a few hours later, and inserting an __asm("nop") confirmed this explanation.
It's really the one wait cycle while fetching a word (32 bits) from flash and then issuing *two* (16-bit) Thumb instructions.

So, thank you! Salut, Jo"rg

lpcware · ‎06-15-2016

Content originally posted in LPCWare by rocketdawg on Thu Oct 10 14:05:58 MST 2013
Yea, remember the CM0+ is a 2 stage pipeline Von Neumann machine.
The AHB bus must switch between read and write and that should take one cycle unless it has been improved with CM0+.
so instructions are AHB reads.
write to a port means the bus must switch to write, then move the data.
then read instructions again
or something like that.

but like starblue, it is just a guess.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by starblue on Thu Oct 10 09:57:12 MST 2013
Because data is fetched from flash 32 bits at a time, and each thumb instruction has 16 bits (just my educated guess).

You could try whether changing the alignment moves the delay to the other transition.
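
The guess above can be sketched as a toy timing model. This is a deliberately simplified assumption, not a description of the real pipeline: every Thumb instruction takes one cycle, plus one stall cycle whenever it is the first instruction of a new 32-bit flash word. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles from one store taking effect until the next one does.
 * Assumed model: 1 cycle per instruction, plus 1 stall cycle if the
 * next instruction starts a new 32-bit flash word (address % 4 == 0). */
static unsigned cycles_until(uint32_t next_addr)
{
    return 1u + ((next_addr % 4u == 0u) ? 1u : 0u);
}

/* For a SET store at set_addr, followed by a CLR store and another SET,
 * report how many cycles the pin stays high and low. */
static void pin_times(uint32_t set_addr, unsigned *high, unsigned *low)
{
    assert(set_addr % 2u == 0u);          /* Thumb code is halfword aligned */
    *high = cycles_until(set_addr + 2u);  /* high until the CLR executes */
    *low  = cycles_until(set_addr + 4u);  /* low until the next SET executes */
}
```

With the first SET at 0x32, as in the disassembly, the model gives a high time of two cycles and a low time of one. Moving the SET to a word-aligned address such as 0x30 swaps the two, which is the alignment experiment suggested above.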

lpcware · ‎06-15-2016

Content originally posted in LPCWare by jsl123 on Thu Oct 10 06:03:42 MST 2013
Hi!

More on this issue... :-)

If the same code runs from RAM the picture from this attachment results.
So this is clearly w/o wait states. The cursors measure exactly 3 on/off cycles, which are 6 instructions, and they take 250ns. That works out to 24MHz, i.e. one instruction per clock!
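
The arithmetic behind that figure, as a small sketch using integer math only. The function name instr_rate_mhz is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Instruction rate in MHz from an instruction count and a duration in ns.
 * instructions per nanosecond times 1000 equals millions per second. */
static uint32_t instr_rate_mhz(uint32_t instructions, uint32_t duration_ns)
{
    assert(duration_ns != 0u);
    return (instructions * 1000u) / duration_ns;
}
```

For the RAM measurement, instr_rate_mhz(6, 250) returns 24, which matches the 24MHz system clock.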

But now the question remains: Why does the "set" command need one wait state when run from flash while the "clear" needs no wait state?

Hmm.....

Salut, Jo"rg