How to split M0 code in multicore project between external SDRAM and internal SRAM using linker scripts

manliozanotti · ‎07-27-2016

Hi All,

I'm working on a device based on the LPC4350, using a multicore M4F + M0 setup.

I have an external SPIFI flash where all the firmware is stored (M0+M4) and an external 16MB SDRAM.

I was able to use linker scripts to place some M4 data in external SDRAM and some (the most used) in the internal 128KB SRAM. Some code is left on SPIFI.

Currently the M0 code/data is placed entirely in the 72KB internal SRAM, but I have only a few bytes left, so I need to free some space by moving less-used functions/data (such as initialization functions) out of internal SRAM.

In MCU settings for M0 I have this memory mapping:

So by default all the code is placed in RamLoc72, but I want to move some functions/data to SDRAM.

Here I ran into big trouble, because the M0 axf.o file is then included in the M4 axf, so there is no direct control over where the final M0 code is stored.

I tried hard to find a good, clean way to do that using scripts (main_data.ld, text.ldt, etc.), but there doesn't seem to be a workable solution.

I tried the below:

1) The first thing I tried was to put __RAMFUNC(RAM2) in front of functions. Looking at the generated map file, I can see these functions correctly placed in SDRAM. However, this doesn't solve the problem, since the code ends up in two places: in SDRAM (as wanted) and also in RAM (where all code is placed by default), and the M0 startup code just copies the function from RAM to SDRAM.

2) I swapped SDRAM and RamLoc72 in the memory configuration, so that the default location where LPCXpresso places the M0 code/data is SDRAM. Then I moved almost all the code to SRAM (using the attached scripts), apart from the startup and main code.

It worked! At least it seemed so... but after some changes to the firmware I saw strange malfunctions, and I concluded that some memory initialization fails with this method.
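For reference, the __RAMFUNC(RAM2) macro from point 1 comes from cr_section_macros.h and, as far as can be told from the map file behaviour described, boils down to a GCC section attribute naming an input section that the managed linker script maps to the chosen bank. The sketch below is only an illustration: PLACE_IN_RAM2_CODE is a hypothetical stand-in macro, the section name ".ramfunc.$RAM2" is an assumption, and sensor_init_defaults is a made-up function. The attribute is dropped on non-ARM hosts so the code builds anywhere.

```c
#include <stdint.h>

/* Hypothetical stand-in for __RAMFUNC(RAM2). The section name is an
 * assumption; on non-ARM hosts the attribute is dropped so the sketch
 * still builds. */
#if defined(__arm__) && defined(__GNUC__)
#define PLACE_IN_RAM2_CODE __attribute__((section(".ramfunc.$RAM2")))
#else
#define PLACE_IN_RAM2_CODE
#endif

/* Rarely used initialisation routine: a candidate for the slower bank. */
PLACE_IN_RAM2_CODE
uint32_t sensor_init_defaults(uint32_t *table, uint32_t count)
{
    uint32_t i;
    for (i = 0; i < count; i++) {
        table[i] = 0xFFu;   /* default value, arbitrary for this sketch */
    }
    return count;
}
```

A function placed this way runs from RAM2 but still needs a load image somewhere, which the startup code copies to the run address. That load image is the duplicate described in point 1.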

So my question is: how can I split the M0 firmware's code/data between internal SRAM and SDRAM in a clean way?

Thank you for any help!!

Original Attachment has been moved to: main_text_section.ldt.zip

Original Attachment has been moved to: main_text.ldt.zip

Original Attachment has been moved to: main_rodata.ldt.zip

Original Attachment has been moved to: main_data.ldt.zip

Original Attachment has been moved to: bss.ldt.zip

Original Attachment has been moved to: global_section_table.ldt.zip

Original Attachment has been moved to: data_section.ldt.zip

Original Attachment has been moved to: bss_section.ldt.zip

Original Attachment has been moved to: main_bss_section.ldt.zip

lpcxpresso_supp · ‎07-29-2016

To be honest, this is the first time I have heard of someone running out of RAM to put their slave MCU application in. This sort of memory layout was never envisaged in the design of the multicore project mechanism within LPCXpresso IDE.

The application running on the slave MCU (i.e. the Cortex-M0) is basically assumed to occupy a single RAM bank, for code, data, heap and stack. The simplest way to use more than one RAM bank would be to move the heap and stack into another bank...

Heap allocation/checking in Redlib

You could probably move specific variables into another RAM bank using the macros we provide. But avoid initialised data - as this will still consume memory in the original RAM bank for its initial value before that is copied by the M0's init code in the second bank. So double check everything is BSS (look in your map file if unsure).
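A rough sketch of the difference between the two kinds of variable, assuming the bank placement macros in cr_section_macros.h behave like the section attributes below. RAM2_BSS and RAM2_DATA are hypothetical stand-in names, the section names are assumptions, and the two helper functions exist only to make the cost visible. The attributes are disabled on non-ARM hosts so the example compiles anywhere.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-ins for the bank placement macros. */
#if defined(__arm__) && defined(__GNUC__)
#define RAM2_BSS  __attribute__((section(".bss.$RAM2")))
#define RAM2_DATA __attribute__((section(".data.$RAM2")))
#else
#define RAM2_BSS
#define RAM2_DATA
#endif

/* No initialiser: only occupies RAM2, zeroed by the startup code. */
RAM2_BSS uint8_t rx_buffer[512];

/* Non-zero initialiser: occupies RAM2 at run time and the same number of
 * bytes in the load image, which for the M0 sits in its original bank. */
RAM2_DATA uint16_t retry_limits[4] = { 3, 3, 5, 10 };

/* Bytes of load-image space each object costs. */
size_t load_image_cost_bss(void)  { return 0; }
size_t load_image_cost_data(void) { return sizeof retry_limits; }
```

The map file remains the authority: anything that shows up under a .data output section is paying for its initial values in the original bank.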

Placing data into different RAM blocks

Theoretically you could use Freemarker linker script templates to place all of your data into a second RAM bank, and just have your code in the first. However this would require quite a lot of changes to the default templates to achieve. But it is something we could consider looking into providing in a more out of the box manner for a future tools release (though not short term).

But as far as M0 code goes, splitting this into multiple RAM banks just isn't going to work unless you bypass the whole of the multicore project structure and do everything yourself manually - which will definitely not be straightforward!!

Maybe you could do something to reduce your code size. For instance if building for Debug, consider turning the level of optimisation up for certain portions of your code:

Compiler Optimization

Application Flash / RAM size

and make sure that you are building your M0 application with the "None" variant of the C library if possible:

What are none, nohost and semihost libraries?

Switching the selected C library

Regards,

LPCXpresso Support


massimomanca · ‎08-04-2016

Hi Manlio,

To be clearer, I will copy parts of your last post and add my suggestions after each block.

"I'll try now to rearrange the code to fit as you explained but I have a doubt. Some 3rd party code (FreeRTOS) uses global variables and I don't want to edit that code. I have to modify linkscripts to force all data to be placed in another internal SRAM bank but for what I understood while working with linkscripts, if data is not marked as UNINITIALIZED, that will be also stored in main memory and then copied by startup code to 2nd bank."

By default the compiler puts uninitialized variables in the bss segment, which is cleared at reset, while initialized variables go in the data segment; to set them to their initial values, all the initialization values are stored alongside the code. So if you don't initialize variables, they will be allocated in bss. Note that whether a variable explicitly initialized to 0 lands in bss or in data depends on the compiler and its options (GCC places it in bss by default, and in data with -fno-zero-initialized-in-bss), so check the map file. For FreeRTOS, I would compile it with a minimal main and study the map file and the build output. If you see the data segment populated, you have to find the initialized variables and put them in a specific segment that you define in the linker script. You can design your memory map so that non-contiguous RAM areas are treated as a single bank from the point of view of the GCC linker.

Example:

....

	/* Table of regions to zero at startup: one ADDR/SIZEOF pair per section */
	__bss_section_table = .;
	LONG(	ADDR(.bss));
	LONG( SIZEOF(.bss));
	LONG(	ADDR(.bss_RAM2));
	LONG( SIZEOF(.bss_RAM2));
	__bss_section_table_end = .;
	__section_table_end = . ;

	/* Zero-initialised data placed in the USB AHB bank */
	.bss_RAM2(NOLOAD) : ALIGN(4)
	{
	*(.bss.$RAM2*)
	*(.bss.$RamAHB32*)
	*(AHBSRAM0)
	. = ALIGN(4) ;
	} > RamAHB_USB

	/* Not listed in __bss_section_table, so table-driven startup code will not clear it */
	.bss_RAM3(NOLOAD) : ALIGN(4)
	{
	*(AHBSRAM1)
	. = ALIGN(4) ;
	} > RamAHB_Eth
	/* MAIN BSS SECTION */
	.bss(NOLOAD) : ALIGN(4)
	{
	_bss = .;
	*(.bss*)
	*(COMMON)
	. = ALIGN(4) ;
	_ebss = .;
	PROVIDE(end = .);
	__end__ = .;
	} > RamLoc32
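To show why the section table matters: table-driven startup code typically walks __bss_section_table and zeroes each listed region, and does an equivalent copy for initialised data. The following is a host-runnable sketch of those two loops, not the actual LPCXpresso startup code. The struct layout mirrors the ADDR/SIZEOF pairs in the script, but the type and function names are invented for illustration, and the "banks" are whatever buffers the caller passes in.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* One entry per LONG(ADDR(x)); LONG(SIZEOF(x)); pair in the script. */
typedef struct {
    uint8_t *start;   /* run address of the section */
    size_t   len;     /* size in bytes */
} bss_entry_t;

/* Like bss_entry_t plus the load address the initial values come from. */
typedef struct {
    const uint8_t *load;
    uint8_t       *start;
    size_t         len;
} data_entry_t;

/* Zero every region listed in the bss table. */
void bss_init_all(const bss_entry_t *table, size_t entries)
{
    size_t i;
    for (i = 0; i < entries; i++) {
        memset(table[i].start, 0, table[i].len);
    }
}

/* Copy initial values from the load image to each run address. */
void data_init_all(const data_entry_t *table, size_t entries)
{
    size_t i;
    for (i = 0; i < entries; i++) {
        memcpy(table[i].start, table[i].load, table[i].len);
    }
}
```

The data loop is where the duplication comes from: every initialised byte exists once in the load image and once at its run address, while bss entries cost nothing in the image.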

"Another question that wasn't clear to me in documentation (or I didn't find it): how to move heap and stack to some memory in a clean way without using symbols in Tool Settings->Symbols?

Currently I have placed there the definitions STACK_SIZE=768 and HEAP_SIZE=0x4000-STACK_SIZE, but if I need to allocate other data in the same bank, I could easily make mistakes with overlapping data.

How can I leave all such tasks to the compiler?"

To have maximum freedom, which is what you need in this situation, the best thing to do is to edit a linker file (.ld) and customize it for your needs. Personally I prefer to edit the ld file by hand, because it keeps working even after years and IDE updates. A good starting point would be the startup.cpp and .ld files of the mbed project: there are ports for the LPC4088 and LPC4330, which are not so different from the LPC4350. If you have trouble finding them I will send you a copy (my email: massimo.manca@micronengineering.it). Also, new versions of LPCXpresso have a user-friendly interface for adding and splitting memory spaces, which may be a little more comfortable for you; give it a try, it should work.
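Whichever route is taken, a compile-time check can catch the overlap mistake mentioned in the question. A minimal sketch, assuming a 16KB bank (the 0x4000 from the question) that holds the stack, the heap and possibly other data. STACK_SIZE and HEAP_SIZE are the symbols from the question; BANK_SIZE, OTHER_DATA_SIZE and the two offset helpers are made-up names for illustration.

```c
#include <stddef.h>

#define BANK_SIZE        0x4000u   /* 16KB bank holding stack and heap */
#define STACK_SIZE       768u
#define OTHER_DATA_SIZE  0u        /* hypothetical: extra data in the same bank */
#define HEAP_SIZE        (BANK_SIZE - STACK_SIZE - OTHER_DATA_SIZE)

/* Fail the build instead of silently overlapping regions (C11). */
_Static_assert(STACK_SIZE + OTHER_DATA_SIZE < BANK_SIZE,
               "stack plus data leave no room for the heap");
_Static_assert(STACK_SIZE % 8u == 0u, "keep the stack 8-byte aligned");

/* Offsets of each region inside the bank: data, then heap, stack on top. */
size_t heap_offset(void)  { return OTHER_DATA_SIZE; }
size_t stack_offset(void) { return OTHER_DATA_SIZE + HEAP_SIZE; }
```

Deriving HEAP_SIZE from the other sizes means that adding data to the bank shrinks the heap automatically, and the build breaks once nothing is left.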

"I looked at you code but you didn't use relocation of data/code.

Yes, I use a heartbeat from M0 to M4 (for now a simple shared global memory location): the M0 writes a predefined 32-bit PATTERN every second, and the M4 reads it and clears it if the content matches the PATTERN. If so, the M4 also resets the watchdog. In this way, if either of the two cores stops working as expected, the watchdog will reset the device."

You could compile the code in several ways. NXP gave me the first Hitex development board, and some complications are handled inside the Keil project and specific files. You can also look at the startup and linker files of that project, but take care: the syntax is different from GNU/GCC.

About the watchdog: in my opinion it is difficult to design a properly protected system with 2 or more cores with only one watchdog. Anyway, that is the situation on the LPC4350 unless you add an external watchdog.

I mean that an LPC4350 shouldn't reset just because the M0 coprocessor is not working as expected. In your application you should reset only the M0; there is no reason to reset the M4 as well, or at least you should write your application so that a problem on the M0 side never requires resetting the M4. This is another reason to separate their memory spaces.

The same goes for the RTOS: you should design your system so that it does not reset the microcontroller when a single task hangs, and so that it prevents deadlocks between tasks or tasks that never get executed.

One of the big faults I find in applications where I work as a consultant to recover a project in trouble is watchdog management when an RTOS or an LPC43/LPC54 multicore MCU is used.

Because you can't manage 2 cores with one watchdog, you should design the simplest mechanism that works in your application so that the M4 can decide to reset the M0 more or less safely, and design the watchdog system to work only on the M4 side.

Considering that the M4 should implement the main logic of the application, when you are not able to restart just the offending task you should make sure the watchdog fires, but that is too much if only the M0 is not working properly. To reset the M0, search for M0APP_RST; it should be reset #55 or #56 of the LPC4350.
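As an illustration of how a numbered reset line maps to a register bit, here is a host-runnable sketch. It assumes the reset lines are grouped 32 per control register and that M0APP_RST is number 56; both should be checked against the user manual, since the post itself is unsure of the number. The reset_ctrl array is a stand-in for the real registers, and all function names are invented.

```c
#include <stdint.h>

#define M0APP_RST_NUM 56u   /* assumption: the post says #55 or #56 */

/* Stand-in for two reset control registers, 32 reset lines each. */
static volatile uint32_t reset_ctrl[2];

/* Which register and which bit a reset line maps to. */
uint32_t reset_reg_index(uint32_t reset_num) { return reset_num >> 5; }
uint32_t reset_bit_mask(uint32_t reset_num)  { return 1u << (reset_num & 31u); }

/* Assert the M0 reset line by setting its bit in the matching register. */
void m0_hold_in_reset(void)
{
    reset_ctrl[reset_reg_index(M0APP_RST_NUM)] = reset_bit_mask(M0APP_RST_NUM);
}

/* Read back a stand-in register (for inspection in this sketch only). */
uint32_t reset_ctrl_read(uint32_t idx) { return reset_ctrl[idx]; }
```

With these assumptions line 56 lands in the second register at bit 24. On the real part the M4 would also have to release the reset again after reloading the M0 image.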

So, as I said at the beginning of my post, some aspects of the application have to be designed from the beginning with the LPC43 architecture in mind; that is the most unusual and sometimes the most difficult point to understand and implement. But if you do it correctly you will be using the most powerful M4 MCU you can currently find, and you will be able to push its performance high enough to exceed many or all of the single-core M7 MCUs available now.

manliozanotti · ‎08-04-2016

Hi Massimo,

thank you again for your great support and clear explanation!

I agree with you that program should be separated from data in memory since, as you said, the M0 has single-bank access, so everything would be delayed. SPIFI is not used anymore for code/data after the M4 has completed initialization. I just store non-volatile settings there (in a different SPIFI bank, to avoid accidentally corrupting the stored firmware). No other data is read/written and no code is executed from there during normal execution.

I'll now try to rearrange the code to fit as you explained, but I have a doubt. Some third-party code (FreeRTOS) uses global variables and I don't want to edit that code. I have to modify the linker scripts to force all data into another internal SRAM bank, but from what I understood while working with linker scripts, if data is not marked as UNINITIALIZED it will also be stored in main memory and then copied by the startup code to the second bank. This is what I wasn't able to do: avoid using the main internal memory as mere temporary storage for data/bss. I would like the M4/M0 startup code to copy data/bss to the target memory bank WITHOUT also storing a duplicate in the default memory bank.

Another question that wasn't clear to me from the documentation (or I didn't find it): how to move heap and stack to some memory in a clean way without using symbols in Tool Settings->Symbols?

Currently I have placed there the definitions STACK_SIZE=768 and HEAP_SIZE=0x4000-STACK_SIZE, but if I need to allocate other data in the same bank, I could easily make mistakes with overlapping data.

How can I leave all such tasks to the compiler?

I looked at your code but you didn't use relocation of data/code.

Yes, I use a heartbeat from M0 to M4 (for now a simple shared global memory location): the M0 writes a predefined 32-bit PATTERN every second, and the M4 reads it and clears it if the content matches the PATTERN. If so, the M4 also resets the watchdog. In this way, if either of the two cores stops working as expected, the watchdog will reset the device.
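The heartbeat described above can be sketched roughly as follows. This is a host-runnable illustration only: the pattern value is arbitrary, heartbeat_word stands in for the fixed shared memory location, and the watchdog feed is replaced by a counter. All names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

#define HEARTBEAT_PATTERN 0xA5A55A5Au   /* hypothetical value */

/* On target this would live at a fixed address in shared RAM. */
static volatile uint32_t heartbeat_word;
static unsigned watchdog_feeds;          /* stand-in for the real watchdog feed */

/* M0 side: called once per second. */
void m0_heartbeat_tick(void)
{
    heartbeat_word = HEARTBEAT_PATTERN;
}

/* M4 side: feed the watchdog only if the M0 has checked in. */
bool m4_heartbeat_poll(void)
{
    if (heartbeat_word != HEARTBEAT_PATTERN) {
        return false;                    /* M0 silent: let the watchdog expire */
    }
    heartbeat_word = 0;                  /* clear so a stale pattern is not reused */
    watchdog_feeds++;
    return true;
}

/* Number of times the stand-in watchdog has been fed. */
unsigned watchdog_feed_count(void) { return watchdog_feeds; }
```

Clearing the word after each successful check is what makes a stopped M0 visible: without it, the last pattern written would keep the watchdog fed forever.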

massimomanca · ‎08-03-2016

Hello Manlio,

I was involved in the LPC4350 Expert program, where I was in charge of testing how to divide applications between the 2 cores. I received a report about your problem, with a request to give you some help and suggestions. I am Italian, so we could write in Italian, but I suggest we continue in English so that everyone can benefit from this thread.

What I can say is that it is right to have the cores accessing different memory spaces, both for program and data.
Take care that the M4 core starts at reset, so its program must be present in one of the memory spaces managed by the internal bootloader. If you have 2 flash banks, or one internal flash bank and an external flash, you may decide to have the M4 application on one and the M0 application on the other. The 4350 has no internal flash, only 264KB of RAM. If your M0 application has to be loaded into RAM, the M4 has to put it there for you; this is why in an LPCXpresso project you will find the M0's axf file inside the M4 project as well, because the M4 has to write it to RAM before starting the M0 core.

If your application is not too big, it is normally possible to put everything inside RAM, but in your case that doesn't seem possible.
From the point of view of speed, I agree that the M0 application should stay in a single RAM bank for both program and data, but I don't agree that this should be the preferred way, because of security problems and unpredictable behavior in case of stack overflow or stack corruption; if you use the heap the situation is worse.

I don't know your application, but designing a first application on a 4350 effectively needs more architecture design than on single-core MCUs. The most important thing is to separate the data and program spaces of the 2 MCUs as much as possible, to avoid any bus conflicts, and to use a message-passing protocol between the 2 cores (or 3 cores in the 4370) to exchange data. I tried both a register file approach on both sides and a message-passing mechanism. They both work quite well; the register approach is a little more user friendly but more complex, because it still needs 4 queues (2 on each side).

So my simplest suggestion is to move all the M0 data to a different internal SRAM bank and reserve the 72KB bank for the code.
But I strongly suggest that, if you want, you describe in more detail how your application is built, what mechanism you use to share data between the cores, and how you divided the application spaces between the 2 cores, so that I can help you better.

What I need in more detail is the memory partitioning and the actual memory used versus planned for each MCU and bank.

Kind regards,

Massimo

manliozanotti · ‎08-03-2016

Thank you all for your answers.

Hello Massimo, nice to meet you, and thank you very much for your support. Of course, let's continue in English in case someone else is interested in using the LPC43xx in such a "strange" configuration.

I'll try to explain better how I placed the code and data for both CPUs.

FreeRTOS runs on both CPUs, and I have 3 different memory spaces: 1) very fast internal 264KB SRAM, 2) fast external 32MB SDRAM, 3) slow external 4MB SPIFI.

I tried to map code and data so that the most used functions and data are in SRAM, less intensively used code and big buffers are in external SDRAM, and constant data and initialization data/code are in external SPIFI. Of course, at reset all the M4+M0 code is stored in external SPIFI and is then copied by the startup code to the other memory spaces.

The device acts as the head of an anti-intrusion alarm system and has two 868MHz transceivers (RTX), 2 RS485 buses, analog I/O, actuators, USB, a UMTS module and an Ethernet port.

I decided to split the work between the 2 cores based on function.

M0 core:

Here I put all the code required to implement the low-level drivers (radio communication, RS485 bus communication, hardware I/O and analog readings) plus the main device logic. It acts as a complete, standalone central alarm unit: it checks the connected sensors for alarms and generates notifications to the M4 and to other output devices (e.g. the siren).

The behavior of the M0 logic is based on a memory buffer shared between the M0 and M4 cores, where the whole system configuration is stored.

I decided for M0 this memory map:

Internal 72KB SRAM: code and data (except big buffers)

External SDRAM: big buffers

Internal AHB16 16KB: stack and heap

M4 core:

Here I put all the code required for Ethernet/internet communication (with apps or HTML pages), where the settings for the M0 logic are created and modified. The code to send emails and SMS, make voice calls, do text-to-speech synthesis and other higher-level tasks is also placed here.

I decided for M4 this memory map:

Internal 128KB SRAM: most used code and data

External SDRAM: less used and big code/data, slow heap

External SPIFI: initialization code and storing/reading SETTINGS

Internal AHB32 32KB: stack and fast heap

SHARED

Communication between the two cores is based on MESSAGE QUEUES (to send notifications from M0 to M4 or commands from M4 to M0) and shared memory buffers.
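For readers who want a picture of such a queue, one common shape is a single-producer, single-consumer ring buffer in shared RAM, with one queue per direction. The sketch below is not the implementation used in this project: the type, the function names and the slot count are all invented, and on real hardware a memory barrier may be needed between writing the slot and publishing the index.

```c
#include <stdint.h>
#include <stdbool.h>

#define IPC_QUEUE_SLOTS 8u               /* must be a power of two */

typedef struct {
    volatile uint32_t head;              /* written only by the producer */
    volatile uint32_t tail;              /* written only by the consumer */
    volatile uint32_t msg[IPC_QUEUE_SLOTS];
} ipc_queue_t;

/* Producer side (for example the M0 posting a notification). */
bool ipc_push(ipc_queue_t *q, uint32_t value)
{
    uint32_t head = q->head;
    if (head - q->tail == IPC_QUEUE_SLOTS) {
        return false;                    /* queue full */
    }
    q->msg[head & (IPC_QUEUE_SLOTS - 1u)] = value;
    q->head = head + 1u;                 /* publish after the slot is written */
    return true;
}

/* Consumer side (for example the M4 reading notifications). */
bool ipc_pop(ipc_queue_t *q, uint32_t *value)
{
    uint32_t tail = q->tail;
    if (tail == q->head) {
        return false;                    /* queue empty */
    }
    *value = q->msg[tail & (IPC_QUEUE_SLOTS - 1u)];
    q->tail = tail + 1u;                 /* release the slot */
    return true;
}
```

Because each index has exactly one writer, this form needs no lock between the cores; the second direction simply uses a second queue with the roles swapped.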

Shared buffers are:

SDRAM: SETTINGS and STATUS structures

AHB_IPC: IPC QUEUE structures and other hardwired absolute memory locations used by both cores to take control of the memory space.

I implemented a semaphore method to gain access to this shared memory area using this simple code:

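#include <stdint.h>
#include <stdbool.h>
#include "FreeRTOS.h"
#include "task.h"

/* Absolute addresses in shared RAM holding the lock state for both cores */
#define SHARED_SETTINGS_ACC_FLG0 0x2000C8f0	/* M4 wants the lock */
#define SHARED_SETTINGS_ACC_FLG1 0x2000C8f1	/* M0 wants the lock */
#define SHARED_SETTINGS_ACC_TURN 0x2000C8f2	/* 1: M4 yields, 0: M0 yields */

#define SETTINGS_LOCK_FLAG0 (*((volatile uint8_t*)SHARED_SETTINGS_ACC_FLG0))
#define SETTINGS_LOCK_FLAG1 (*((volatile uint8_t*)SHARED_SETTINGS_ACC_FLG1))
#define SETTINGS_LOCK_TURN (*((volatile uint8_t*)SHARED_SETTINGS_ACC_TURN))

/* Two-party lock (Peterson style): raise own flag, give the turn away,
   then wait while the other core wants the lock and it is our turn to yield. */
static inline void SETTINGS_LOCK_ENTER(void)
{
#ifdef CORE_M4
	ASSERT(SETTINGS_LOCK_FLAG0 == false);	/* project-specific assert: no nested locking */
	SETTINGS_LOCK_FLAG0 = true;
	SETTINGS_LOCK_TURN = 1;
	while (SETTINGS_LOCK_FLAG1 && SETTINGS_LOCK_TURN == 1)
		vTaskDelay(1);
#else
	ASSERT(SETTINGS_LOCK_FLAG1 == false);
	SETTINGS_LOCK_FLAG1 = true;
	SETTINGS_LOCK_TURN = 0;
	while (SETTINGS_LOCK_FLAG0 && SETTINGS_LOCK_TURN == 0)
		vTaskDelay(1);
#endif
}

/* Release the lock by dropping this core's flag */
static inline void SETTINGS_LOCK_EXIT(void)
{
#ifdef CORE_M4
	SETTINGS_LOCK_FLAG0 = false;
#else
	SETTINGS_LOCK_FLAG1 = false;
#endif
}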

In addition, each core has a FreeRTOS semaphore to serialize accesses from its own tasks.

I hope this makes clearer how I split the job on the LPC4350 and why the M0 code ran out of SRAM.

I could swap the 128KB M4 memory with the 72KB M0 memory to get some more space on the M0, since on the M4 I didn't have a single problem moving data/code to any memory.

As I said before, I was surprised that this didn't work the same way on the M0: I was able to split functions/data as I wanted, but there is some bug there and DATA gets initialized wrongly when I do so.

Massimo, if you need a more detailed description of the code or memory configuration, please tell me.

Thank you in advance

massimomanca · ‎08-03-2016

Manlio,

I understand your point of view, but I suggest you change how you partitioned the application.

First of all, the application should be able to start without any external memory (apart from the SPIFI flash, which is needed with the LPC4350). So at least the M4 should have its program and stack area in internal RAM, although it would be a lot better to have the heap, data and bss segments there as well. The advantage is that this way the M4 will always reset and start up.

What I really don't like in a real application that must always work is using the same memory bank or the same memory chip, if it has a linear addressing scheme, for different purposes.

So I wouldn't use the external SPIFI to store both the program and read/write setup data. I have seen the worst things happen: you could lose your application code because of any problem while writing setup data. The same goes for RAM: when we have several RAM banks it is better to reserve them for specific uses. Remember also that when writing to single-bank flash chips you can do almost nothing until the write completes, and that takes quite a long time.

You should also use the MPU to protect memory segments; I think that with 2 RTOS instances running you would need more than 8 MPU regions.

In your situation, because you are also using Ethernet networking, I would suggest reserving at least a 16KB bank just for it. This way you can also use DMA without bus contention problems.

Then you should reserve the 72KB and 128KB banks for the M4 and M0 program space (assign them whichever way fits better) and the other banks for data/stack/heap, using whichever bank fits best. If your data can't all fit in internal RAM, I would reserve external RAM for bss and data and possibly the heap, but at least stack and heap should be in internal RAM for both cores.

I wouldn't split the program space between external and internal RAM, because you have to assume the external part may stop working for some "impossible" reason. The advantage of working with microcontrollers is that our applications can start as long as the MCU is working, without much help from external peripherals.

In practice, when working in C/C++ you should avoid global variables, so that variables live on the stack or in the heap; for this reason, allocating the remaining data segment in external RAM doesn't penalize your application much.

Last but not least, I would move the main application logic to the M4, because it is the core that starts first and it can stop the M0 and hold it waiting, whereas the M0 cannot do that to the M4.

You should think of the M0 as a sort of coprocessor. The M0 could fail to start for a lot of reasons (mainly bugs, corrupted content on the SPIFI and so on), and it could be difficult to work out the real reason.

The last suggestion is to use some heartbeat message from the M0 to the M4 to be sure it is running.

My original example application was the basis for the IPC Application Note. It was designed for Keil uVision, but there aren't many problems porting it to LPCXpresso. It was intentionally kept simple, without an RTOS, and is quite well documented considering it was completed in a couple of months of spare time while debugging a new microcontroller with a new board, a new compiler and no library. You can find both the code and a PDF here:

Blue River Parking System | www.LPCware.com

I think it may be useful.

