Slow execution speed in Parallel EMC Nor flash

priyankb · ‎10-03-2018

Hi,
I am using an LPC4088-based custom board with Nor flash (SST39VF3202C), SRAM (CY62157EV30LL) and an FPGA, all connected to the EMC. The CPU, peripheral and EMC clocks all run at 72 MHz.
A secondary custom USB bootloader resides in internal flash, and the user code is stored in external Nor flash. The wait states are the same as in the "norflash_prog" LPCOpen example code, and I enable buffer mode just before jumping to the user application (which helped to improve execution speed).

I am able to store the generated S19 file in external flash using the USB bootloader, and it runs, but it executes very slowly. The whole program seems to lag even without any delays.
Because of this I cannot generate a timer interrupt with a period below 100 us (in internal flash I can get down to about 2.5 us), and even then it is not precise.
The Nor flash does not support page read mode, so I will try the S29GL064S, which has page read mode and is also pin compatible.

In many discussions I read that SPIFI-based flash is faster than parallel Nor flash. How?

What else can I do? Any suggestions would be helpful.

Thank you
Priyank.

priyankb · ‎10-21-2018

Hi,

I recently measured the number of instruction cycles used in the bootloader and in the user application with the DWT registers provided in the Cortex-M4. A "nop" instruction takes 42 instruction cycles in the bootloader code (internal flash), whereas it takes around 3500 instruction cycles in the user application.

I also learned that if I decrease the clock frequency, the instruction cycle count actually decreases. I calculated that 1 instruction cycle = 1 machine cycle = 1 clock cycle, and the minimum time I get is the same as instruction cycles / CPU clock frequency.

CPU clock (MHz)  EMC clock (MHz)  Peripheral clock (MHz)  Bootloader ins. cycles  User app ins. cycles
24               24               24                      42                      1400-1500
48               48               48                      42                      3200
72               72               72                      42                      10000
96               48               48                      42                      37259
120              60               60                      42                      44000
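
The relation time = cycles / CPU clock frequency can be checked against the table with a small helper. This is only a sketch; the function names are illustrative and not part of LPCOpen.

```c
#include <assert.h>

/* Convert a cycle count to microseconds for a core clock given in MHz.
 * 1 MHz is one cycle per microsecond, so cycles / MHz yields microseconds. */
static double cycles_to_us(unsigned long cycles, double cclk_mhz)
{
    assert(cclk_mhz > 0.0);
    return (double)cycles / cclk_mhz;
}

/* Ratio between the cycle count measured in external flash and the one
 * measured in internal flash for the same code sequence. */
static double slowdown_factor(unsigned long ext_cycles, unsigned long int_cycles)
{
    assert(int_cycles > 0);
    return (double)ext_cycles / (double)int_cycles;
}
```

For the 72 MHz row, cycles_to_us(10000, 72.0) is about 138.9 us while cycles_to_us(42, 72.0) is about 0.58 us, so the same measurement window is roughly 238 times longer when the code runs from the external flash.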

Anyone having any idea as to what I should do to improve my execution speed?

Thank you.

Priyank.

jeremyzhou · ‎10-22-2018

Hi Priyank Bhatt,

Thanks for your reply.
I'm curious about the test procedure you used, because according to the Cortex-M4 Technical Reference Manual a nop instruction costs only 1 cycle. Could you share more details about the testing procedure?

1) Our application is very big (~3-4MB), would it be feasible to copy it to SRAM & run from there?
-- I'm afraid not.
2) When I jump from the bootloader to the user app in Nor flash, does the code execute from flash as Execute-in-Place, or is it copied to internal SRAM and then executed?
-- It executes in place from the flash.
3) I have ordered a new flash from cypress with page read mode, since the current flash does not allow page read mode. Would that make a difference in the execution of code?
-- I think it can make a difference.
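
A rough way to see why page read mode can help for sequential instruction fetches is to model the first access in a page as slow and the following accesses in the same page as fast. The sketch below is a simplified model; the cycle counts used in the example are illustrative assumptions, not values from the flash datasheets or the LPC4088 user manual.

```c
#include <assert.h>

/* Average EMC clock cycles per read for `page_len` sequential reads.
 * first_cycles: cycles for the first (random) access in a page
 * page_cycles:  cycles for each following access within the same page
 * Simplified model: ignores bus turnaround, AHB latency and branches. */
static double avg_cycles_per_read(unsigned first_cycles,
                                  unsigned page_cycles,
                                  unsigned page_len)
{
    assert(page_len > 0);
    return ((double)first_cycles
            + (double)(page_len - 1u) * (double)page_cycles) / (double)page_len;
}
```

With the assumed numbers of 6 cycles for a first access, 2 cycles for an in-page access and 4 reads per page, avg_cycles_per_read(6, 2, 4) gives 3.0 cycles per read, compared with 6.0 when every read is a first access (page_len = 1). Real code with frequent branches would benefit less.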

In addition, you could consider using SDRAM instead of the Nor flash, since its speed definitely exceeds that of Nor flash. You may then need an additional SPIFI flash to store the application, as it is fairly large (> 3 MB).
Have a great day,
TIC

-----------------------------------------------------------------------------------------------------------------------
Note: If this post answers your question, please click the Correct Answer button. Thank you!
-----------------------------------------------------------------------------------------------------------------------

priyankb · ‎10-22-2018

Hi jeremyzhou,

I found some code for the Cortex-M3/M4 to get the instruction cycle count. The code I used is:

volatile unsigned int *DWT_CYCCNT  ; // DWT cycle counter
volatile unsigned int *DWT_CONTROL ; // DWT control register
volatile unsigned int *SCB_DEMCR   ; // debug exception and monitor control register

void reset_timer(void)
{
    DWT_CYCCNT   = (volatile unsigned int *)0xE0001004; // address of the register
    DWT_CONTROL  = (volatile unsigned int *)0xE0001000; // address of the register
    SCB_DEMCR    = (volatile unsigned int *)0xE000EDFC; // address of the register
    *SCB_DEMCR   = *SCB_DEMCR | 0x01000000; // set TRCENA to enable the DWT unit
    *DWT_CYCCNT  = 0; // reset the counter
    *DWT_CONTROL = 0; // counter stopped until start_timer() is called
}

void start_timer(void){
    *DWT_CONTROL = *DWT_CONTROL | 1u ; // set CYCCNTENA to enable the counter
}

void stop_timer(void){
    *DWT_CONTROL = *DWT_CONTROL & ~1u ; // clear CYCCNTENA to disable the counter
}

unsigned int getCycles(void){
    return *DWT_CYCCNT;
}


Inside main():-

reset_timer();
start_timer();
asm("nop");
stop_timer();
DEBUGOUT("Cycles: %u\r\n", getCycles());

I am using a 24 MHz oscillator, with the CPU clock, EMC clock and peripheral clock configured to 72 MHz. Please try this and let me know what you find.
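
Note that the window between start_timer() and stop_timer() also contains the call and return of those helpers and the loads of the register pointers, so a large part of the count reported for a single nop is probably measurement overhead. One way to separate the two, sketched here with hypothetical helper names, is to sample DWT_CYCCNT before and after the code under test and subtract the count obtained from an empty window.

```c
#include <stdint.h>

/* Elapsed cycles between two DWT_CYCCNT samples.
 * Unsigned subtraction stays correct across one 32-bit counter wrap. */
static uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Subtract the cost of an empty measurement window (baseline) from a raw
 * measurement. Clamps to 0 if jitter makes the raw value smaller. */
static uint32_t net_cycles(uint32_t raw, uint32_t baseline)
{
    return (raw > baseline) ? (raw - baseline) : 0u;
}
```

On the target, the baseline would be measured once by calling start_timer() and stop_timer() back to back, and the same baseline would then be subtracted from each later measurement taken in the same memory region.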

I also came across a strange thing: when I used a 12 MHz crystal (not an oscillator) instead of the 24 MHz crystal that I was using, the instruction cycles in the user app came down to almost 1/3 of the previous value with the same frequency configuration. I don't understand this, though the LPC4088 UM does recommend a 12 MHz crystal.

A few more questions, based on what I read in the ARM PrimeCell MultiPort Memory Controller documentation:

If the memory bus is multiplexed externally, for example by using an EBI, the worst-case transfer latency is affected because the external bus is shared by multiple devices.
AHB port 0 is the highest priority port, AHB port 3 is the lowest priority port.
Lower priority AHB memory ports can be locked out indefinitely if a higher priority AHB memory port continually performs memory requests.
The memory controller AHB memory ports are prioritized. If a master connected to a HIGH priority port performs continuous transactions, lower priority ports are not able to access the bus until the higher priority port has completed its transactions.

Now I have connected the FPGA to CS0, the SRAM to CS1 and the Nor flash to CS3. Can you help me understand the above statements with reference to my hardware connections? After reading this I was planning to swap the FPGA and the Nor flash and see what happens.

Regards

Priyank.

jeremyzhou · ‎10-22-2018

Hi Priyank Bhatt,

Thanks for your reply.
1) I also came across a strange thing: when I used a 12 MHz crystal (not an oscillator) instead of the 24 MHz crystal that I was using, the instruction cycles in the user app came down to almost 1/3 of the previous value with the same frequency configuration. I don't understand this, though the LPC4088 UM does recommend a 12 MHz crystal.
-- Yes, it seems a bit weird.
2) Now I have connected FPGA to CS0, SRAM to CS1, & Nor Flash to CS3. Can you help me explain the above statements in reference to my hardware connections? After reading this I was planning to swap FPGA & Nor-Flash & see what happens.
-- For the EMC module, I don't think the CSx lines are prioritized; however, you can give it a try.
Have a great day,
TIC


jeremyzhou · ‎10-07-2018

Hi Priyank Bhatt,

Thank you for your interest in NXP Semiconductor products and for the opportunity to serve you.
When working with the same clock, in general, the parallel Nor flash has a greater execution speed than the SPIFI Nor flash.
To improve the performance, I'd like to suggest you increase the speed of the CPU and peripheral clock, or copy the application code to SRAM to run, etc.
Have a great day,
TIC


priyankb · ‎10-07-2018

Hi jeremyzhou,

Thank you for replying.

When working with the same clock, in general, the parallel Nor flash has a greater execution speed than the SPIFI Nor flash.

I also thought parallel memory must be faster than serial, despite the fact that SPIFI has a dedicated cache to improve speed. That is why we chose parallel flash over serial. I have kept the clock at 72 MHz because that is the closest I can get to the EMC's limit of 80 MHz.

Now I have tested the code with every combination of clock configurations, and the result is weird: when I increase the clock, execution actually becomes slower, which logically should not happen. That is why I seriously doubt the initialization of either the EMC flash or the clock. These are my observations:

CCLK (MHz)  PCLK (MHz)  EMC_CLK (MHz)  Timer_Int (us)  delay(loop)  nop
24          24          24                             35 - 45
48          48          48             60 - 90         50 - 60      30 - 55
72          72          72             120 - 140       120 - 140    80 - 90
96          48          96             330 - 350       260 - 280    220 - 230
96          96          96             280 - 350       250 - 280    240 - 260
120         60          120            330 - 350       250 - 270    200 - 220
120         60          60             330 - 350       280 - 320    230 - 250

To improve the performance, I'd like to suggest you increase the speed of the CPU and peripheral clock, or copy the application code to SRAM to run, etc.

Our application is very big (~3-4MB), would it be feasible to copy it to SRAM & run from there?

When I jump from the bootloader to the user app in Nor flash, does the code execute from flash as Execute-in-Place, or is it copied to internal SRAM and then executed?

I also tried compiler optimization options, which increased execution speed, but then our user app does not get past the splash screen (we are using emWin for the LCD display).

I have ordered a new flash from cypress with page read mode, since the current flash does not allow page read mode. Would that make a difference in the execution of code?

Regards

Priyank.
