ISR running from RAM?

lpcware · 06-15-2016

Content originally posted in LPCWare by Brinkand on Tue May 18 23:37:30 MST 2010
I am using the ADC in the LPC1111 to sample one channel at up to 264 ksps. I use an interrupt service routine to receive the data, do a few calculations, and store the results in a buffer. The code currently runs at 40 ksps, but I suppose I will need some performance optimization to reach 264 ksps.

I am considering moving the ISR to RAM to avoid the wait states associated with flash. But will the processor add wait states when executing from RAM? Section 3.10 in the user manual suggests that wait states only apply to flash, but I guess it depends on the implementation.

To run from RAM, I suppose I need to do this:
1: Locate the ISR in the flash
2: Copy the ISR instruction code to RAM
3: Point the interrupt vector for the ADC to RAM.

I am not sure whether I will need to modify the compiler/linker in order to make the code executable from a different location - maybe something like relative addressing?

Comments are highly appreciated.

Content originally posted in LPCWare by NXP_USA on Thu Jun 03 13:52:18 MST 2010

Quote: domen
To put a function in RAM you could just mark it __attribute__((section(".data"))) (gcc specific). And that's it, nothing else needs to be done.

Haven't tested on LPC1343, but on an STM32F103 (also Cortex-M3), running from RAM was slower than running from flash (the flash has a prefetcher, and there are separate buses for RAM and flash). Be sure to time it; don't just assume RAM is faster.

Good suggestion for a way to move the ISR itself.

To copy the vector table from Flash to RAM, you could use something like the code below. It demonstrates copying the vector table, modifying it, and updating the location that is used for the vector table.

Although moving code to RAM may not always increase performance, it is a useful technique when implementing bootloaders.

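#include <stdint.h>
#include <string.h>
#include "LPC11xx.h"   /* CMSIS device header (LPC13xx.h on an LPC1343) */

volatile uint32_t msTicks;   /* incremented by the SysTick handler */

void SysTick_Handler(void)
{
      msTicks++;
}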

/* Alternative handler: counts only one tick in every four */
void Slow_SysTick_Handler(void)
{
      static unsigned int slow;

      slow++;
      if((slow%4) == 3)
      {
            msTicks++;
      }
}

/* pFunc_t: pointer to a function that takes no parameters and returns nothing */
typedef void (*pFunc_t)(void);
/* Treat the start of RAM (0x10000000) as an array of function pointers */
#define pfRAMVectors ((pFunc_t *)0x10000000)

int main (void) {

...

// Step 1: modify the linker script to leave the first 0x200 bytes of RAM free
// In LPCXpresso this is done in the projectname_Debug_mem.ld file
// MEMORY
// {
//   /* Define each memory region */
//   MFlash32 (rx) : ORIGIN = 0x0, LENGTH = 0x8000 /* 32k */
//   RamLoc8 (rwx) : ORIGIN = 0x10000200, LENGTH = 0x1E00 /* 8k minus the 0x200 reserved bytes */
// }

// Step 2: copy the current vectors to RAM
// memcpy( destination, source, length in bytes )
memcpy( (void *)0x10000000, (void *)0x00000000, 0x200);
// Step 3: modify the RAM vector table: redirect the SysTick exception (number 15)
pfRAMVectors[15] = Slow_SysTick_Handler;
// Step 4: switch over to the modified vector table (1 = user RAM mode)
LPC_SYSCON->SYSMEMREMAP = 1;

...
}
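Steps 2 to 4 above can be checked off-target with a host-side simulation like the one below. It is a sketch, not target code: the arrays `flash_vectors` and `ram_vectors` are hypothetical stand-ins for the tables at 0x00000000 and 0x10000000, `active_vectors` plays the role of SYSMEMREMAP, and the hypothetical `take_exception` mimics the core fetching a handler from whichever table is currently mapped at address 0.

```c
#include <stdint.h>
#include <string.h>

typedef void (*pFunc_t)(void);

#define VECTOR_ENTRIES 128u   /* 0x200 bytes / 4 bytes per vector on a 32-bit target */
#define SYSTICK_EXCEPTION 15u /* SysTick is exception number 15 on Cortex-M */

static volatile uint32_t msTicks;

static void SysTick_Handler(void) { msTicks++; }

static void Slow_SysTick_Handler(void)
{
    static unsigned int slow;
    slow++;
    if ((slow % 4u) == 3u) {   /* count one tick out of every four */
        msTicks++;
    }
}

/* Hypothetical stand-ins for flash at 0x00000000 and reserved RAM at 0x10000000. */
static pFunc_t flash_vectors[VECTOR_ENTRIES];
static pFunc_t ram_vectors[VECTOR_ENTRIES];

/* Models SYSMEMREMAP: the table the core fetches vectors from. */
static pFunc_t *active_vectors = flash_vectors;

static void relocate_vectors(void)
{
    /* Step 2: copy the current vectors to RAM. */
    memcpy(ram_vectors, flash_vectors, sizeof ram_vectors);
    /* Step 3: redirect the SysTick entry in the RAM copy only. */
    ram_vectors[SYSTICK_EXCEPTION] = Slow_SysTick_Handler;
    /* Step 4: switch over (SYSMEMREMAP = 1 on the real part). */
    active_vectors = ram_vectors;
}

/* Simulated exception entry: fetch the handler from the active table and run it. */
static void take_exception(unsigned int n)
{
    active_vectors[n]();
}
```

Because only the RAM copy is patched, the flash table keeps the original handler.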

Content originally posted in LPCWare by domen on Wed Jun 02 15:53:36 MST 2010
To put a function in RAM you could just mark it __attribute__((section(".data"))) (gcc specific). And that's it, nothing else needs to be done.

Haven't tested on LPC1343, but on an STM32F103 (also Cortex-M3), running from RAM was slower than running from flash (the flash has a prefetcher, and there are separate buses for RAM and flash). Be sure to time it; don't just assume RAM is faster.
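The Cortex-M0 has no DWT cycle counter (CYCCNT), so one way to time the ISR body on the LPC1111 is to read the SysTick current-value register before and after it. SysTick counts down and reloads from LOAD, so the difference needs wrap handling. The helper `systick_elapsed` is hypothetical and assumes the counter wrapped at most once between the two reads; the on-target usage in the comment uses the CMSIS SysTick names, and `adc_work` is a hypothetical function under test.

```c
#include <stdint.h>

/* Cycles elapsed between two reads of the down-counting SysTick VAL register,
 * assuming at most one wrap and a reload value of `load`.
 *
 * On target (CMSIS names), for example:
 *   SysTick->LOAD = 0xFFFFFF; SysTick->VAL = 0;
 *   SysTick->CTRL = SysTick_CTRL_CLKSOURCE_Msk | SysTick_CTRL_ENABLE_Msk;
 *   uint32_t t0 = SysTick->VAL; adc_work(); uint32_t t1 = SysTick->VAL;
 *   uint32_t cycles = systick_elapsed(t0, t1, SysTick->LOAD);
 */
static uint32_t systick_elapsed(uint32_t start, uint32_t end, uint32_t load)
{
    if (start >= end) {
        return start - end;             /* no wrap between the reads */
    }
    /* counted start -> 0, reloaded (1 tick), then LOAD -> end */
    return start + (load + 1u) - end;
}
```

If SysTick is already used for msTicks, pass its existing LOAD value and keep each measured section shorter than one tick period.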

Content originally posted in LPCWare by curtvm on Wed Jun 02 04:20:24 MST 2010

Quote:
How do You move the function to RAM?

By creating a function with the section attribute as I did, the function is placed in the .data section, which gets initialized by the C startup code just like 'normal' initialized data (copied from flash to RAM).
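The startup copy that makes this work is, in essence, a word-by-word loop from the .data load image in flash to its run address in RAM; a function placed in .data is copied along with the variables. The sketch below assumes word-aligned boundaries. On target the three pointers come from linker-script symbols whose names vary between toolchains (`_etext`, `_data` and `_edata` are typical of GNU scripts, but treat them as hypothetical here), so the hypothetical helper `init_data_section` takes them as parameters.

```c
#include <stdint.h>

/* Copy the .data load image (in flash) to its run address range [start, end) in RAM.
 * On target, for example: init_data_section(&_etext, &_data, &_edata);
 * (hypothetical symbol names; check your linker script). */
static void init_data_section(const uint32_t *load, uint32_t *start, uint32_t *end)
{
    while (start < end) {
        *start++ = *load++;
    }
}
```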

I don't think I would wait for an answer. I would put the ISR in RAM and time it. Everything else is theory.

My uneducated guess says you will not see much gain.

Content originally posted in LPCWare by Brinkand on Wed Jun 02 01:46:44 MST 2010
Thank you all for your comments.

HTH: I would expect wait states to be eliminated when running from RAM. What would make running from RAM slower than running from flash? According to user manual section 3.10, 3 wait states are used when running from flash at maximum CPU frequency. This would normally only cause delays when branching, but I assume code can be read from RAM with no delay?

Curtvm: Thank you very much for this input. It would have taken me weeks to find this by myself. How do you move the function to RAM? Will the linker generate code to copy it, or do you do that as part of your initialization code?

I will wait for comment from NXP about execution speed from RAM - until then, my ISR stays in flash.

Content originally posted in LPCWare by curtvm on Thu May 20 10:14:40 MST 2010
It appears to me that if you try creating RAM functions in C for the M0, you will run into a problem with the linker (I think this is the source of the problem), which appears to use BLX incorrectly for the M0.

For example, placing a function in ram like-
void test_func(void) __attribute__((section(".data.test")));
void test_func(void){
//some code here
}

It seems this places the function correctly in RAM (.data), but it also creates 'veneer' code that loads the PC with the address of the RAM function. The problem is that BLX is used incorrectly to reach the 'veneer' code when the RAM function is called, and you end up with a hard fault (presumably because the M0 supports only the register form, BLX Rm, while the immediate form switches to ARM state, which the M0 does not have).

It seems that any call from flash to RAM, or from RAM to flash, ends up using this mechanism.

An alternative may be to use something like this to call a ram function (and call any flash functions from ram)-
( (void(*)(void)) &test_func + 1 )();
changing the typecast and parameters as needed (a macro would make it easier). The +1 sets the LSB of the target address and also prevents the use of the 'veneer' code. (I'm assuming the linker aligns the RAM function correctly.)
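A macro along the lines suggested might look like the sketch below. The names `vfunc_t`, `THUMB_PTR` and `RAM_CALL0` are hypothetical, and it has not been tried on an M0 here. It ORs in bit 0 instead of adding 1, so an address that already has its Thumb bit set is left unchanged rather than pushed past the first instruction. Because the call goes through a pointer, the compiler emits an indirect call instead of a direct BL, so no veneer is needed.

```c
#include <stdint.h>

typedef void (*vfunc_t)(void);

/* Convert fn to a pointer of the given type with the Thumb bit (bit 0) set.
 * `| 1u` is idempotent, unlike `+ 1`. */
#define THUMB_PTR(type, fn) ((type)((uintptr_t)(fn) | 1u))

/* Call a void(void) function, e.g. one placed in RAM, through such a pointer. */
#define RAM_CALL0(fn) (THUMB_PTR(vfunc_t, (fn))())

/* Stand-in for curtvm's example; on target it would also carry
 * __attribute__((section(".data.test"))). */
void test_func(void) { }
```

On the M0, `RAM_CALL0(test_func);` would replace the direct call `test_func();`.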

I have been trying to run a function from RAM (just to test) and have been stuck on the BLX/linker problem, but it looks like the alternative may work (I'm away from the hardware now, so I can only compile and view the code).

To answer the original question of this thread, it may simply be necessary to try it both ways and compare the results.

update-
I have placed a function in RAM and 'called' it via the methods described above. It works, but the debugger does not like debugging functions in RAM (get anywhere near that function and the debugger will have a problem). So in the end, until the linker(?) is corrected for the M0 and the debugger can deal with RAM functions, this may all be more effort than it's worth (at least for speed gains).

Is there some setting I'm missing that lets the debugger debug RAM functions?
Is the incorrect use of BLX for the M0 a linker problem?
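One workaround for the veneer problem, assuming arm-none-eabi-gcc, is GCC's ARM-specific `long_call` function attribute, which makes callers reach the function through a register (an indirect `BLX Rm`, which the M0 does support) instead of a `BL` that needs a veneer. In the sketch below the macro `RAMFUNC`, the section name `.data.ramfunc` and the function `adc_process` are hypothetical, and it assumes the linker script collects `.data*` input sections into `.data`, as curtvm's `.data.test` example relied on. On a host build the macro expands to nothing so the logic can still be tested.

```c
#include <stdint.h>

/* Bare-metal ARM build: place the function in RAM and force register-indirect calls. */
#if defined(__arm__) && !defined(__linux__)
#define RAMFUNC __attribute__((section(".data.ramfunc"), long_call, noinline))
#else
#define RAMFUNC /* host build: an ordinary function, for testing the logic only */
#endif

static volatile uint32_t sample_sum;

/* Hypothetical per-sample work placed in RAM. */
RAMFUNC void adc_process(uint16_t sample)
{
    sample_sum += sample;
}
```

The attribute has to appear on the declaration that callers see (for example in a header). Calls from RAM back into flash need `long_call` on those declarations too, or `-mlong-calls` for the file. This does not address the debugger question.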

Content originally posted in LPCWare by CodeRedSupport on Wed May 19 23:59:10 MST 2010
We have to wait for somebody from NXP to comment on whether running from RAM will be any faster. However, I am pretty sure that the core runs at full speed from flash and that running from RAM will be no faster (and may actually be slower).

Some quick calculations: if the core is running at 50 MHz and you want 264 ksps, that gives 189 cycles per sample for your ISR. After the interrupt response time and the function prologue/epilogue, that leaves you about 170 cycles to do the actual work. That assumes 100% utilization, so you will need to work out whether it is enough.
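The arithmetic above can be reproduced as below; `cycles_per_sample` and `work_cycles` are hypothetical helpers. The overhead figure is an assumption: the Cortex-M0 takes 16 cycles to enter an interrupt with zero-wait-state memory, and exception return plus the prologue/epilogue add more, so 19 cycles is simply the value that reproduces the "about 170" estimate.

```c
#include <stdint.h>

/* Whole CPU cycles available per sample (integer division rounds down). */
static uint32_t cycles_per_sample(uint32_t core_hz, uint32_t sample_hz)
{
    return core_hz / sample_hz;
}

/* Cycles left for useful work after a fixed per-interrupt overhead. */
static uint32_t work_cycles(uint32_t core_hz, uint32_t sample_hz, uint32_t overhead)
{
    uint32_t budget = cycles_per_sample(core_hz, sample_hz);
    return budget > overhead ? budget - overhead : 0u;
}
```

For comparison, the current 40 ksps leaves 1250 cycles per sample at 50 MHz, and at 48 MHz the 264 ksps budget drops to 181 cycles.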

Some things you could do:

[LIST]
[*]Make sure the code in the ISR is optimised for speed (-O3). There are also many other optimisation tweaks you can make - see the GCC docs for details of all the optimization flags available. You may want to put it into its own module so you can perform higher levels of optimization on this module. (If you are using a debug build, then this could explain why the ISR is not running as quickly as you expect).
[/LIST]
[INDENT][B]Note[/B]: You can set the build settings for an individual module by right-clicking on it and selecting Properties from the context menu. If you change options for a module, its icon will be overlaid with a blue "<>" symbol. You can remove the per-module settings with "Reset to Defaults" on the Settings page.
[/INDENT]
[LIST]
[*]Try optimizing the C code yourself. If you have loops or branches, see if you can eliminate them (the compiler can only do so much). Any branch is relatively expensive in cycles, because any prefetched code has to be discarded. Give the compiler as many hints as possible, for example by using "const", "static" or "register" as appropriate on your variables; reserve "volatile" for data shared with hardware or other contexts, since it forces every access to go to memory.
[/LIST]

[LIST]
[*]If the above doesn't work, write it in assembler yourself. You can see what code the compiler generates and start from there. I'd only recommend doing this for VERY time critical modules.
[/LIST]
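As an illustration of removing a loop from the ISR, suppose (hypothetically; the thread does not say what the calculations are) that the ISR computes a moving average over the last N samples. Summing the window on every interrupt costs a loop of N iterations. Keeping a running sum costs one add and one subtract, and a power-of-two N turns both the index wrap and the division into a mask and a shift. The names `moving_average`, `window` and `AVG_LOG2` are illustrative only.

```c
#include <stdint.h>

#define AVG_LOG2 3u                  /* window of 8 samples (power of two) */
#define AVG_N    (1u << AVG_LOG2)
#define AVG_MASK (AVG_N - 1u)

static uint16_t window[AVG_N];       /* last N samples, initially zero */
static uint32_t running_sum;         /* sum of everything in window[] */
static uint32_t head;                /* index of the oldest sample */

/* Constant-time update: no loop over the window and no branch for the wrap. */
static uint16_t moving_average(uint16_t sample)
{
    running_sum += sample;
    running_sum -= window[head];     /* drop the oldest sample */
    window[head] = sample;
    head = (head + 1u) & AVG_MASK;   /* wrap without an if() */
    return (uint16_t)(running_sum >> AVG_LOG2);  /* divide by N with a shift */
}
```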
HTH

Content originally posted in LPCWare by Brinkand on Wed May 19 14:20:20 MST 2010
The problem is not increasing the sampling frequency but processing the data afterwards. I can easily modify the timer to get the desired sampling frequency, but the amount of calculation per sample is constant, so I am trying to find ways to reduce the processing overhead in the ISR.

One way would be to make the slimmest possible ISR to reduce the pushing and popping of registers. The other alternative is to do a slightly larger share of the processing in the ISR, which requires more pushing and popping, but to speed up instruction fetching by placing this frequently run code in RAM.
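The "slimmest possible ISR" option usually means the ISR only stores the raw sample and the main loop does the calculations. A minimal sketch of that split, assuming a single producer (the ADC ISR) and a single consumer (the main loop) on a single-core part, is the ring buffer below; `rb_put`, `rb_get` and `RB_SIZE` are hypothetical names, and reading the ADC result register is left out. The indices run freely and are masked on use, which works because the power-of-two size divides 2^32.

```c
#include <stdbool.h>
#include <stdint.h>

#define RB_SIZE 64u                      /* must be a power of two */
#define RB_MASK (RB_SIZE - 1u)

/* rb_head is written only by the ISR, rb_tail only by the main loop. */
static volatile uint16_t rb_data[RB_SIZE];
static volatile uint32_t rb_head;
static volatile uint32_t rb_tail;

/* Called from the ADC ISR: store the raw sample; false if the buffer is full. */
static bool rb_put(uint16_t sample)
{
    uint32_t head = rb_head;
    if (head - rb_tail == RB_SIZE) {
        return false;                    /* full: sample dropped */
    }
    rb_data[head & RB_MASK] = sample;
    rb_head = head + 1u;                 /* publish only after the data is written */
    return true;
}

/* Called from the main loop: fetch the oldest sample; false if empty. */
static bool rb_get(uint16_t *sample)
{
    uint32_t tail = rb_tail;
    if (rb_head == tail) {
        return false;
    }
    *sample = rb_data[tail & RB_MASK];
    rb_tail = tail + 1u;
    return true;
}
```

The trade-off is that the main loop must keep up on average; the buffer only absorbs short bursts.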

The key question is really: How much faster will code run from RAM?

Content originally posted in LPCWare by CodeRedSupport on Wed May 19 07:48:05 MST 2010
So why can't you make the timer interrupt you more frequently?

Content originally posted in LPCWare by Brinkand on Wed May 19 07:30:53 MST 2010
The ADC is triggered from a timer; I need precise timing on this one.
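For precise timing it helps to know which rate the timer can actually produce. Assuming the timer counts the 50 MHz clock with no prescaling and resets on match (so one period is match + 1 ticks), the hypothetical helpers below compute the match value and the achieved rate. 264 kHz does not divide 50 MHz evenly: a match value of 188 gives 264,550 Hz, about 0.2% fast. This also assumes one conversion per timer period; if the ADC starts on one edge of a match output that toggles, a conversion happens only on every second match and the match period must be halved.

```c
#include <stdint.h>

/* Match value for a timer counting timer_hz that resets on match:
 * the period is (match + 1) ticks, rounded to the nearest whole tick. */
static uint32_t match_for_rate(uint32_t timer_hz, uint32_t rate_hz)
{
    uint32_t ticks = (timer_hz + rate_hz / 2u) / rate_hz;
    return ticks - 1u;
}

/* Rate actually produced by a given match value (rounded down to whole Hz). */
static uint32_t actual_rate(uint32_t timer_hz, uint32_t match)
{
    return timer_hz / (match + 1u);
}
```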

Content originally posted in LPCWare by TheFallGuy on Wed May 19 00:49:06 MST 2010
Are you sure it will run faster from RAM? I think you'll find that the flash is optimized so that it runs at (an effective) zero wait state. Perhaps somebody from NXP could comment.

How are you triggering the sampling to run at 40k? Have you got your clock configuration set up so that it runs at maximum speed?
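To check whether the clock really is at maximum speed, the system PLL arithmetic can be worked through as below. The formulas and limits in the comment are my reading of the LPC111x user manual's system PLL description and should be checked against your copy; `pll_ctrl_for` is a hypothetical helper. With the 12 MHz internal RC oscillator as input, 48 MHz is the fastest integer multiple within the 50 MHz limit (SYSPLLCTRL = 0x23); exactly 50 MHz needs a different input clock, such as an external crystal.

```c
#include <stdbool.h>
#include <stdint.h>

/* Find a SYSPLLCTRL value for fout_hz from fin_hz (assumed LPC111x rules):
 *   FCLKOUT = M * FCLKIN,      M = MSEL + 1, 1..32
 *   FCCO    = 2 * P * FCLKOUT, P = 2^PSEL (1, 2, 4 or 8)
 *   156 MHz <= FCCO <= 320 MHz
 * SYSPLLCTRL holds MSEL in bits 4:0 and PSEL in bits 6:5. */
static bool pll_ctrl_for(uint32_t fin_hz, uint32_t fout_hz, uint32_t *ctrl)
{
    if (fin_hz == 0u || fout_hz % fin_hz != 0u) {
        return false;                      /* M must be a whole number */
    }
    uint32_t m = fout_hz / fin_hz;
    if (m < 1u || m > 32u) {
        return false;                      /* MSEL is a 5-bit field */
    }
    for (uint32_t psel = 0u; psel < 4u; psel++) {
        uint64_t fcco = 2ull * (1ull << psel) * fout_hz;
        if (fcco >= 156000000ull && fcco <= 320000000ull) {
            *ctrl = (m - 1u) | (psel << 5);
            return true;                   /* first P that puts FCCO in range */
        }
    }
    return false;
}
```

The main clock must also be switched to the PLL output, with the flash access time configured for the new frequency, before the setting has any effect.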