Running from RAM

lpcware · 06-15-2016

Content originally posted in LPCWare by micrio on Wed Jun 22 08:54:02 MST 2011
I am trying to run a small routine from RAM. The code works perfectly when running in flash. It has a loop in it that I want to run as fast as possible.
I mark the function with;
__attribute__((section(".data")))
and the generated code looks OK.

There is a XXX_veneer function created that contains;
2030: e51ff004 ldr pc, [pc, #-4] ; 2034 <__disp_bit_pat_veneer+0x4>
which I assume is OK.

My function seems to be in RAM as expected.

I get a HardFault at the call to the veneer. The actual call is;
disp_bit_pat ();
1856: f000 ebec blx 2030 <__disp_bit_pat_veneer>

I thought that the BLX opcode was an indirect call through a register and not
an immediate value. Could this be the problem? Do I need to code this in assembly?

I must compile and debug with full optimization, anything less is too slow
in my application. It just won't work.

This is pretty simple and I am, no doubt, doing something wrong.
This is not moving the vector table, it is not a boot loader.

Thanks,
pete.

Content originally posted in LPCWare by micrio on Thu Jun 23 09:40:50 MST 2011
I should have mentioned my processor in the initial post, it is a LPC1114/301.
That old thread mentioned was also my posting. In that case I was able to
get top performance in a tight delay loop by adjusting the code alignment.
There was a great explanation of the odd behavior that I was seeing.

In this case the loop was a little too big to fit in the 3 cache prefetch registers
so moving to RAM was the solution to the problem.

Pete.

Content originally posted in LPCWare by igorsk on Thu Jun 23 03:43:37 MST 2011

Quote: micrio
A little more information;
I am seeing almost a 2X speedup with the code in RAM vs. the same code in flash. This is a pretty tight loop so I don't know how general these results
would be. Does the prefetch cache operate when running in RAM or only
when running in flash?

What is your processor? If it has a flash accelerator aka MAM (not sure if the LPC11xx parts have it), you might get a speedup by aligning your loop to 16-byte boundary (MAM cache line size) and making sure it does not access flash (i.e. doesn't use LDR for constants). See here.
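
The alignment suggestion above can be tried from C before touching the linker script. This is a minimal sketch, assuming GCC or a GCC-compatible compiler. The names delay_loop and entry_offset_in_line are made up for illustration, and the 16-byte figure is simply the line size quoted in the post, not something checked against the LPC11xx manual. Note that the attribute aligns the function entry, not the loop inside it.

```c
#include <stdint.h>

#define LINE_SIZE 16u  /* assumed cache line size, taken from the post above */

/* Hypothetical tight loop; aligned(16) asks the compiler to place the
   function entry on a 16-byte boundary, noinline keeps it a real call. */
__attribute__((aligned(16), noinline))
void delay_loop(volatile uint32_t *counter, uint32_t n)
{
    while (n--) {
        (*counter)++;
    }
}

/* Offset of the function entry within a line; 0 means it is aligned.
   Bit 0 is masked off because a Thumb function address has it set. */
uint32_t entry_offset_in_line(void)
{
    uintptr_t addr = (uintptr_t)&delay_loop & ~(uintptr_t)1;
    return (uint32_t)(addr % LINE_SIZE);
}
```

To align the loop body itself rather than the entry point, something like GCC's -falign-loops option or a .balign directive in hand-written assembly would probably be needed. Either way, check the result in the disassembly.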

Content originally posted in LPCWare by CodeRedSupport on Thu Jun 23 01:26:57 MST 2011
For future reference, we have now created an FAQ on this subject, with attached example:

http://support.code-red-tech.com/CodeRedWiki/CodeInRam

Regards,
CodeRedSupport

Content originally posted in LPCWare by CodeRedSupport on Thu Jun 23 00:26:17 MST 2011

Quote: igorsk
BLX has two forms. The register form is indeed an indirect call which uses the low bit of the value as the Thumb mode bit. The memory address form you see here switches between ARM and Thumb mode. Since Cortex-M supports only Thumb mode, switching to ARM mode causes the hard fault. What I suspect happens is that the linker does not mark the veneer as Thumb code (or creates an ARM veneer), and so the branch gets converted to BLX as if jumping to ARM code.

I found a similar issue described here but it's pretty old so probably should have been fixed by now.

This is indeed the issue I was referring to. An updated linker which fixes this issue is included in Red Suite 4 and (the forthcoming) LPCXpresso4.

Even in the updated tools, there is actually still an advantage to using the function pointer mechanism rather than relying on the linker created veneer. This veneer will not have any source level debug information associated with it, so that if you try to step (at the source level) into a call to your ram code, typically the debugger will step over it.

You can work around this by single stepping at the instruction level, setting a breakpoint in your RAM code, or by changing the function call from a direct one to a call via a function pointer.

Quote: micrio

I get an assembler error;
cc0Z4dnn.s: Assembler messages:
cc0Z4dnn.s:118: Warning: ignoring changed section attributes for .data

This is because you are placing your code into a data section, which by default does not expect to have the attributes associated with code. As this is what you intend, you can ignore the warning. If the warning is a concern, then you will have to avoid the trick of placing the code into a data section, and instead modify your linker script and initialisation code to cope with placing and copying the code into RAM yourself.
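
The "copy it yourself" route mentioned above boils down to a word copy from the load address in flash to the run address in RAM, done before the function is first called. Below is a minimal sketch of that copy step only. The linker symbol names in the comment (__ramfunc_load, __ramfunc_start, __ramfunc_end) are hypothetical and would have to match whatever your own linker script defines. This is not taken from the Code Red startup code.

```c
#include <stdint.h>

/* Copy a section word by word from its load address (flash) to its
   run address (RAM). run_end points one word past the last word. */
void copy_section(const uint32_t *load, uint32_t *run_start,
                  const uint32_t *run_end)
{
    uint32_t *dst = run_start;
    while (dst < run_end) {
        *dst++ = *load++;
    }
}

/* On the target this might be called early in the reset handler, with
   symbols exported by the linker script (names are assumptions):

       extern uint32_t __ramfunc_load, __ramfunc_start, __ramfunc_end;
       copy_section(&__ramfunc_load, &__ramfunc_start, &__ramfunc_end);
*/
```

The section would also need its own output section in the linker script with a RAM run address and a flash load address. That part is toolchain specific and is not shown here.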

Regards,
CodeRedSupport

Content originally posted in LPCWare by micrio on Wed Jun 22 15:44:59 MST 2011
I mis-spoke in my last post. The speed improvement was 54% not 2X (100%)
as I stated. But that was the speed up I was looking for.

Pete.

Content originally posted in LPCWare by micrio on Wed Jun 22 13:14:07 MST 2011
A little more information;

I am seeing almost a 2X speedup with the code in RAM vs. the same code in flash. This is a pretty tight loop so I don't know how general these results
would be. Does the prefetch cache operate when running in RAM or only
when running in flash?

I get an assembler error;
cc0Z4dnn.s: Assembler messages:
cc0Z4dnn.s:118: Warning: ignoring changed section attributes for .data

In spite of what the error says, the "section(".data")" attribute is being processed.
The section does end up in RAM. If I remove the attribute then the code ends up
in flash.

Content originally posted in LPCWare by micrio on Wed Jun 22 12:20:10 MST 2011
Yes, the indirect call works. I did see a considerable speed up in my loop. Also, there was no veneer code generated.

Thanks,
Pete.

Content originally posted in LPCWare by igorsk on Wed Jun 22 11:55:06 MST 2011

Quote: micrio

I get a HardFault at the call to the veneer. The actual call is;
disp_bit_pat ();
1856: f000 ebec blx 2030 <__disp_bit_pat_veneer>

I though that the BLX opcode was an indirect call through a register and not
an immediate value. Could this be the problem? Do I need to code this in assembly?

BLX has two forms. The register form is indeed an indirect call which uses the low bit of the value as the Thumb mode bit. The memory address form you see here switches between ARM and Thumb mode. Since Cortex-M supports only Thumb mode, switching to ARM mode causes the hard fault. What I suspect happens is that the linker does not mark the veneer as Thumb code (or creates an ARM veneer), and so the branch gets converted to BLX as if jumping to ARM code.

I found a similar issue described here but it's pretty old so probably should have been fixed by now.
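
The difference between the two call encodings can be checked by decoding the halfwords from the disassembly. The sketch below assumes the usual Thumb 32-bit BL/BLX immediate layout as I understand it from the ARM architecture manuals: bit 12 of the second halfword is 1 for BL and 0 for BLX. The names classify_branch and branch_target are made up for this illustration.

```c
#include <stdint.h>

typedef enum { ENC_OTHER, ENC_BL, ENC_BLX_IMM } branch_kind;

/* Classify a 32-bit Thumb encoding given as its two halfwords. */
branch_kind classify_branch(uint16_t hw1, uint16_t hw2)
{
    if ((hw1 & 0xF800u) != 0xF000u || (hw2 & 0xC000u) != 0xC000u) {
        return ENC_OTHER;
    }
    if (hw2 & 0x1000u) {
        return ENC_BL;                 /* stays in Thumb state */
    }
    /* BLX (immediate) requires bit 0 of the second halfword to be 0 */
    return (hw2 & 1u) ? ENC_OTHER : ENC_BLX_IMM;
}

/* Compute the branch target of a BL or BLX (immediate) at insn_addr. */
uint32_t branch_target(uint32_t insn_addr, uint16_t hw1, uint16_t hw2)
{
    uint32_t s  = (hw1 >> 10) & 1u;
    uint32_t i1 = !(((hw2 >> 13) & 1u) ^ s);   /* I1 = NOT(J1 XOR S) */
    uint32_t i2 = !(((hw2 >> 11) & 1u) ^ s);   /* I2 = NOT(J2 XOR S) */
    uint32_t off = (s << 24) | (i1 << 23) | (i2 << 22)
                 | ((uint32_t)(hw1 & 0x3FFu) << 12)
                 | ((uint32_t)(hw2 & 0x7FFu) << 1);
    if (s) {
        off |= 0xFE000000u;            /* sign-extend from bit 24 */
    }
    uint32_t pc = insn_addr + 4u;      /* PC reads as instruction + 4 */
    if (classify_branch(hw1, hw2) == ENC_BLX_IMM) {
        pc &= ~3u;                     /* BLX uses the word-aligned PC */
    }
    return pc + off;
}
```

Feeding in the call from the original post, classify_branch(0xF000, 0xEBEC) reports the BLX immediate form and branch_target(0x1856, 0xF000, 0xEBEC) gives 0x2030, the veneer address. The veneer word e51ff004 is itself an ARM-state encoding of ldr pc, [pc, #-4], which fits the explanation above.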

Content originally posted in LPCWare by CodeRedSupport on Wed Jun 22 09:33:43 MST 2011
Try changing your function call to be via a function pointer to remove the need for the linker long branch veneer. Thus for example, change the call from:

 foo();

to:

  static void (*ramfunc)(void) = &foo;
  (*ramfunc)();

or similar.

From memory this sounds like a known issue with the Cortex-M0 long branch veneer support in the version of the linker used in LPCXpresso3. Normally this is not a problem, as typically you run all code out of flash (and no long branch veneers are required). I'll post some more information later when I've had chance to confirm.
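
Putting the section attribute and the function pointer call together, a fuller sketch might look like the following. It assumes GCC. The names ram_sum and call_ram_sum are hypothetical stand-ins for the real routine, and the volatile qualifier on the pointer is my own addition: at high optimization levels the compiler may otherwise see that the pointer never changes and turn the call back into a direct one. On a non-ARM host the section attribute is left out, since data pages are normally not executable there.

```c
#include <stdint.h>

/* Assumption: on the ARM target the startup code copies .data to RAM,
   so a function placed there runs from RAM. */
#if defined(__arm__)
#define RAMFUNC __attribute__((section(".data"), noinline))
#else
#define RAMFUNC __attribute__((noinline))
#endif

/* Hypothetical stand-in for the time-critical routine. */
RAMFUNC uint32_t ram_sum(uint32_t n)
{
    uint32_t total = 0;
    for (uint32_t i = 1; i <= n; i++) {
        total += i;
    }
    return total;
}

/* volatile forces the pointer to be loaded at each call, so the call
   stays indirect and no long branch veneer is needed. */
static uint32_t (*volatile ram_sum_ptr)(uint32_t) = ram_sum;

uint32_t call_ram_sum(uint32_t n)
{
    return ram_sum_ptr(n);
}
```

Callers in flash would use call_ram_sum(), or the pointer directly. It is worth confirming in the disassembly that the call site uses the register form of BLX and that no veneer is generated.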

Regards,
CodeRedSupport

Content originally posted in LPCWare by micrio on Wed Jun 22 09:26:06 MST 2011
There is a long thread on this subject concerning the operation of the cache and flash.
http://knowledgebase.nxp.com/showthread.php?t=460

I don't know for sure that it will be faster but I would like to try.

In that previous thread I was able to get maximum speed by aligning the code so that it fit in the 3 cache registers. This is an LPC1114/301, which I forgot to mention. This new routine has loops that are too big to fit in the cache.

Pete.

Content originally posted in LPCWare by TheFallGuy on Wed Jun 22 09:17:10 MST 2011
What makes you think it will run faster from RAM? The Flash has prefetch buffers to enable it to run most of the time at zero wait-state, so I don't see what benefit you can gain.