Code alignment interfering with instruction timing. Is it a bug in the chip or is it normal?

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Danjovic on Sat Jul 27 16:33:10 MST 2013
Hi

I am writing a video-generating routine for the LPC1111/201 that uses the SysTick interrupt to generate the Hsync timing.

For the pulse timing I am writing a delay loop in assembly, so that the timing is predictable.

But I have noticed strange behavior on the chip: the execution time of the delay loops changes depending on the length of the generated code. Is this normal? Let me explain:

This is the structure of my code. For simplicity I show only the HSYNC routine, but I have two other routines for generating the VSYNC and horizontal equalization pulses.

main () {
  SystemInit();
  LPC_IOCON->PIO0_7 = (0x0) + (0<<3) + (0<<5);
  LPC_GPIO0->MASKED_ACCESS[1<<7] = 0;
  LPC_GPIO0->DIR = LPC_GPIO0->DIR | (1<<7);
  SysTick_Config(H_LINE_TICKS); // number of ticks to interrupt at 63,5us
  for (;;);
}

void SysTick_Handler(void)
{
H_Sync();
}

__inline void H_Sync(void)
{
  // H_sync pulse 4.7us
  LPC_GPIO0->DATA &= ~(1<<7);
  asm("push {r1}\n\t"
      "mov r1, #22\n\t"
      "Loophs1:\n\t"
      "sub r1, r1, #1\n\t"
      "cmp r1, #0\n\t"
      "bne Loophs1\n\t"
      "pop {r1}\n\t"
  );
  LPC_GPIO0->DATA |= (1<<7);
}

The code above generates an Hsync pulse with a duration of 4.7us on each SysTick interrupt.

The problem began to show when I added code to change a second pin at the beginning of 'main()'.

main () {
  SystemInit();
  LPC_IOCON->PIO0_7 = (0x0) + (0<<3) + (0<<5);
  LPC_GPIO0->MASKED_ACCESS[1<<7] = 0;
  LPC_GPIO0->DIR = LPC_GPIO0->DIR | (1<<7);

  LPC_IOCON->PIO0_9 = (0x0) + (0<<3) + (0<<5);
  LPC_GPIO0->MASKED_ACCESS[1<<9] = 0;
  LPC_GPIO0->DIR = LPC_GPIO0->DIR | (1<<9);


  SysTick_Config(H_LINE_TICKS); // number of ticks to interrupt at 63,5us
  for (;;);
}

When I did that, the pulse width changed from 4.7us to 2.8us!!! After some time investigating, I found that if I pad main() with a pair of NOPs, the delay loops return to normal.

main () {
  SystemInit();
  LPC_IOCON->PIO0_7 = (0x0) + (0<<3) + (0<<5);
  LPC_GPIO0->MASKED_ACCESS[1<<7] = 0;
  LPC_GPIO0->DIR = LPC_GPIO0->DIR | (1<<7);

  LPC_IOCON->PIO0_9 = (0x0) + (0<<3) + (0<<5);
  LPC_GPIO0->MASKED_ACCESS[1<<9] = 0;
  LPC_GPIO0->DIR = LPC_GPIO0->DIR | (1<<9);

  asm("nop");
  asm("nop");

  SysTick_Config(H_LINE_TICKS); // number of ticks to interrupt at 63,5us
  for (;;);
}

With only 1 NOP the timing is still wrong.
With 2 to 5 NOPs the timing is correct.
With 6 to 9 NOPs the timing is wrong again.
With 10 NOPs the timing is correct again.
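
One way to read the observations above: the pattern appears to repeat every 8 NOPs. A Thumb NOP is a 16-bit instruction, so that would correspond to shifting the following code by 16 bytes. The sketch below only checks that periodicity against the data points reported in this post (including the 0-NOP case, which gave 2.8us). The array and function names are made up for illustration, and eleven samples are too few to prove the period.

```c
/* Observations from the post, indexed by NOP count 0..10.
   1 = pulse width correct (4.7us), 0 = wrong (2.8us). */
const int timing_ok[11] = { 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1 };

/* Hypothetical helper: smallest shift p (in NOPs) for which every pair of
   samples p apart agrees. Returns 0 if no such shift exists in the data. */
int smallest_period(const int *obs, int first, int last)
{
    for (int p = 1; p <= last - first; ++p) {
        int consistent = 1;
        for (int n = first; n + p <= last; ++n) {
            if (obs[n] != obs[n + p]) {
                consistent = 0;
                break;
            }
        }
        if (consistent) {
            return p;
        }
    }
    return 0;
}

/* A Thumb NOP occupies 2 bytes, so a period in NOPs maps to bytes like this. */
int nops_to_bytes(int nops)
{
    return nops * 2;
}
```

Calling smallest_period(timing_ok, 0, 10) and passing the result to nops_to_bytes() gives the code shift, in bytes, after which the behavior seems to repeat.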

Has anybody run into a situation like this? What could be causing this behavior? Is it normal for the LPC1111 or other Cortex-M0 chips? Is there any way to avoid it, maybe a configuration option?

I think it is worth mentioning that the SysTick operation is normal (its timing does not change with the NOPs).

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Danjovic on Sat Aug 03 15:39:29 MST 2013
Thanks!!
The post was very enlightening.

Best Regards

lpcware · ‎06-15-2016

Content originally posted in LPCWare by lpcxpresso-support on Wed Jul 31 00:57:42 MST 2013
Just for information, this subject has come up a few times previously in the old LPCXpresso forum. For example...

http://www.lpcware.com/content/forum/question-on-delay-loops
[LPCWare import version]

http://knowledgebase.nxp.com/showthread.php?t=460
[Original forum version]

Regards,
LPCXpresso Support

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Danjovic on Tue Jul 30 21:40:34 MST 2013
Just sharing the first video screen produced by my hardware (an LPC1111/201 with the aid of a couple of diodes and resistors)

Video Image

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Danjovic on Mon Jul 29 21:41:46 MST 2013
I've moved the delay function to RAM and it seems to have solved the problem. The timing is now stable.

In the end it took only 4 words of RAM. I can live with that.
I've created a function based on the study of a post from the STM32 forum (thanks Sam). The timing parameter is passed straight through r0.

void test_func(int vezes) __attribute__ ((__section__(".data.ramfunc")));

void test_func(int vezes){
  // 'vezes' (the iteration count) arrives in r0, so the loop uses r0 directly
  asm(//"ldr r0,=40\n\t"
      "L1:\n\t"
      "sub r0, r0, #1\n\t"
      "cmp r0, #0\n\t"
      "bne L1\n\t"
  );
}

Now the waveform generating code looks like this:

// Horizontal Equalizing Pulse
__inline void H_Eq(void)
{
    // H_equ pulse 2.3us
    Sync_ON; //LPC_GPIO0->DATA &= ~(1<<7);
    test_func(14);
    Sync_OFF; //LPC_GPIO0->DATA |= (1<<7);
    // wait half line - 2.3us (29.45 us)
    test_func(250);
    // Second H_equ pulse 2.3us
    Sync_ON; //LPC_GPIO0->DATA &= ~(1<<7);
    test_func(14);
    Sync_OFF; //LPC_GPIO0->DATA |= (1<<7);
}

Thank you wmues, and sorry for not considering your idea at first glance; thanks also to Sam Gibson from the embedded group on LinkedIn.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by wmues on Mon Jul 29 02:28:20 MST 2013
I don't think this is a problem of the CPU core: it's a problem of the flash interface.
Buffering and caching are common techniques these days, and your code should not count on their absence. Because flash reads are slow, the internal flash has a wide word (64 bits?) and a prefetch engine.

You should isolate the time-critical code in a subroutine and use the linker script file to place this subroutine at a fixed address. Then the execution timing will not depend on the location of the code, only on its content.
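
A minimal sketch of the alignment part of this suggestion, assuming a GCC-compatible toolchain. The `aligned` function attribute pins the start of the routine to a 16-byte boundary, so code added elsewhere can no longer shift the loop relative to that boundary. Placing it at an absolute address would additionally need a linker script entry, which is not shown. The 16-byte figure and the names `sync_delay` and `sync_delay_is_aligned` are assumptions for illustration, and the plain C loop stands in for the assembly loop so that the sketch also compiles on a host.

```c
#include <stdint.h>

/* ASSUMPTION: 16 bytes matches the flash line width of the target. */
#define DELAY_ALIGN 16

/* noinline keeps a single copy at a single address;
   aligned fixes that address modulo DELAY_ALIGN. */
__attribute__((noinline, aligned(DELAY_ALIGN)))
void sync_delay(volatile uint32_t count)
{
    while (count != 0) {
        count--;
    }
}

/* Returns 1 when the routine starts on a DELAY_ALIGN boundary.
   Bit 0 is masked off because Thumb function pointers carry the
   Thumb state bit there; on other targets the mask changes nothing. */
int sync_delay_is_aligned(void)
{
    uintptr_t addr = (uintptr_t)&sync_delay & ~(uintptr_t)1;
    return (addr % DELAY_ALIGN) == 0;
}
```

With the entry point fixed, the offset of the loop inside the routine is constant, so its position relative to any 16-byte line no longer changes when unrelated code grows or shrinks.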

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Danjovic on Sun Jul 28 19:44:23 MST 2013
Thanks for the suggestion, Wolfgang, but unfortunately I cannot afford to spare RAM for running instructions, since my project is a video generator and I need almost all of the RAM.

The NOPs were only a controlled way of shifting the instructions along the memory. I think the problem is not related to a large memory boundary, but to regions of 4 instructions. So I assume the problem is that the branches take different amounts of time to execute.

I still can't believe that such a thing could happen. On every other microcontroller I have worked with, instruction execution time is well documented in terms of CPU clock cycles. So, for the LPC1111, I was expecting each instruction to take exactly the number of clock cycles specified in the data sheet.
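
For reference, the cycle count that expectation implies can be sketched as follows. The per-instruction figures are the documented Cortex-M0 core timings (1 cycle for sub and cmp, 3 cycles for a taken conditional branch, 1 cycle when it falls through), and they hold only when instruction fetches complete without wait states. The clock frequency is not stated in this thread, so the conversion takes it as a caller-supplied assumption. The function names are made up for illustration.

```c
/* Documented Cortex-M0 core timings, valid only when instruction
   fetches complete without wait states. */
enum {
    CYCLES_SUB = 1,           /* sub r1, r1, #1 */
    CYCLES_CMP = 1,           /* cmp r1, #0 */
    CYCLES_BNE_TAKEN = 3,     /* bne, branch taken (pipeline refill) */
    CYCLES_BNE_NOT_TAKEN = 1  /* bne, falling through on the last pass */
};

/* Core cycles spent in the sub/cmp/bne loop for an iteration count >= 1. */
unsigned long delay_loop_cycles(unsigned long iterations)
{
    unsigned long taken = iterations - 1; /* every pass but the last branches back */
    return taken * (CYCLES_SUB + CYCLES_CMP + CYCLES_BNE_TAKEN)
         + (CYCLES_SUB + CYCLES_CMP + CYCLES_BNE_NOT_TAKEN);
}

/* Convert cycles to nanoseconds. clock_mhz is an ASSUMPTION supplied by
   the caller; the thread does not say what the core clock is. */
unsigned long cycles_to_ns(unsigned long cycles, unsigned long clock_mhz)
{
    return cycles * 1000UL / clock_mhz;
}
```

For the 22-iteration loop this gives 108 core cycles for the loop alone, before the push, mov, pop and GPIO writes. Any extra time measured on the pin would have to come from somewhere other than the core's cycle table.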

That's why I think this might be a bug in the chip. Do you know if any other M0 behaves the same way?

Regards

lpcware · ‎06-15-2016

Content originally posted in LPCWare by wmues on Sun Jul 28 13:13:48 MST 2013
I think this problem is related not to the NOPs themselves, but to execution addresses.

The internal flash has some sort of read buffer, and if the whole loop is inside the address range of the read buffer, code execution is fast because the buffer already has the code.
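
As a rough illustration of that idea, the check below tells whether a block of code lies entirely inside one aligned buffer line. The three-instruction Thumb loop from the first post occupies 6 bytes. The line width is a parameter because it is an assumption here, not a datasheet value, and the real flash interface is likely more complex than this model, so it is not claimed to reproduce the exact NOP pattern reported. The function name is made up for illustration.

```c
#include <stdint.h>

/* Returns 1 if the byte range [start, start + length) lies in a single
   aligned line of line_bytes bytes, 0 if it crosses a line boundary.
   line_bytes is an ASSUMPTION about the flash read buffer. */
int loop_fits_in_line(uint32_t start, uint32_t length, uint32_t line_bytes)
{
    uint32_t first_line = start / line_bytes;
    uint32_t last_line = (start + length - 1) / line_bytes;
    return first_line == last_line;
}
```

With an assumed 16-byte line, a 6-byte loop starting at 0x100 or 0x10A fits in one line, while the same loop starting at 0x10C spills into the next one.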

You should copy your code into internal RAM to avoid this problem.

best regards

Wolfgang

LPC11xx