Slow execution speed in Parallel EMC Nor flash

1,457 Views
priyankb
Contributor III

Hi,
I am using an LPC4088-based custom board with NOR flash (SST39VF3202C), SRAM (CY62157EV30LL) and an FPGA, all connected to the EMC. The CPU, peripheral and EMC clocks all run at 72 MHz.
A secondary custom USB bootloader resides in internal flash. The user code is stored in external NOR flash. The wait states are the same as in the "norflash_prog" LPCOpen example code, and I also enable buffer mode just before jumping to the user application (which helped to improve execution speed).
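
As a rough sanity check on the wait-state settings mentioned here, the read wait time has to cover the flash access time at the chosen EMC clock. The sketch below only computes the number of whole EMC clock cycles needed for a given access time. The 70 ns figure used in the note after it is an assumption for illustration (check the speed grade in the SST39VF3202C datasheet), and the helper name `emc_cycles_for_access` is hypothetical. How that cycle count maps onto the EMC wait-state register fields is defined by the LPC408x user manual and is not modelled here.

```c
#include <assert.h>
#include <stdint.h>

/* Number of whole EMC clock cycles needed to cover a memory access time,
 * i.e. ceil(t_access / t_clk).
 *   t_access_ns : device read access time in nanoseconds (datasheet value)
 *   emc_clk_hz  : EMC clock frequency in Hz
 * Hypothetical helper, not part of LPCOpen. */
static uint32_t emc_cycles_for_access(uint32_t t_access_ns, uint32_t emc_clk_hz)
{
    assert(emc_clk_hz > 0);
    /* cycles = ceil(t_ns * f_hz / 1e9), done in 64-bit integer arithmetic */
    uint64_t num = (uint64_t)t_access_ns * emc_clk_hz;
    return (uint32_t)((num + 999999999ULL) / 1000000000ULL);
}
```

With the assumed 70 ns access time, `emc_cycles_for_access(70, 72000000)` gives 6 cycles per 16-bit read and `emc_cycles_for_access(70, 24000000)` gives 2, so every access on the external bus costs several CPU clocks even when the wait states are set as tightly as the device allows.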

I am able to store the generated S19 file in external flash using the USB bootloader, and it runs, but it executes very slowly. The whole program lags even without any delay calls.
Because of this I cannot generate a timer interrupt with a period below 100 us (from internal flash I can get down to about 2.5 us), and even then the timing is not precise.
This NOR flash does not support page read mode, so I will try the S29GL064S, which has page read mode and is also pin compatible.

In many discussions I have read that SPIFI-based flash is faster than parallel NOR flash. How is that possible?

What else can I do? Any suggestions would be helpful.

Thank you
Priyank.

0 Kudos
6 Replies

1,092 Views
priyankb
Contributor III

Hi,

I recently measured the number of cycles used in the bootloader and in the user application with the DWT registers provided in the Cortex-M4. A single "nop" instruction measures 42 cycles in the bootloader code (internal flash), whereas it measures around 3500 cycles in the user application.

I also found that if I decrease the clock frequency, the cycle count actually decreases. I calculated that 1 instruction cycle = 1 machine cycle = 1 clock cycle, and the minimum time I get equals the cycle count divided by the CPU clock frequency.

CPU clock (MHz)  EMC clock (MHz)  Peripheral clock (MHz)  Bootloader ins cycles  User app ins cycles
24               24               24                      42                     1400-1500
48               48               48                      42                     3200
72               72               72                      42                     10000
96               48               48                      42                     37259
120              60               60                      42                     44000
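
To put the cycle counts in the table above into time units, the relation stated earlier (time = cycles / CPU clock frequency) can be sketched as a small helper. The function name `cycles_to_us` is hypothetical; only the formula comes from the post.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a cycle count to microseconds: t = cycles / f_cpu.
 * Hypothetical helper used only to illustrate the arithmetic. */
static double cycles_to_us(uint32_t cycles, uint32_t cpu_clk_hz)
{
    assert(cpu_clk_hz > 0);
    return (double)cycles * 1e6 / (double)cpu_clk_hz;
}
```

For example, 10000 cycles at 72 MHz is roughly 139 us, while the 42 cycles measured in the bootloader at the same clock is under 0.6 us.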

Does anyone have any idea what I should do to improve my execution speed?

Thank you.

Priyank.

0 Kudos

1,092 Views
jeremyzhou
NXP Employee

Hi Priyank Bhatt

Thanks for your reply.
I'm curious about the test flow you used: according to the Cortex-M4 Technical Reference Manual, a nop instruction costs only 1 cycle. Could you share more details about the test procedure?

1) Our application is very big (~3-4 MB); would it be feasible to copy it to SRAM and run it from there?
-- I'm afraid not.
2) When I jump from the bootloader to the user app in NOR flash, does the code execute from flash as execute-in-place, or is it copied to internal SRAM and then executed?
-- It would execute from the flash.
3) I have ordered a new flash from Cypress with page read mode, since the current flash does not support page read mode. Would that make a difference in the execution of the code?
-- I think it can make a difference.
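
To illustrate answer 2: with execute-in-place, the bootloader does not copy the image anywhere. It typically reads the first two words of the application's vector table in NOR flash (initial stack pointer and reset handler address) and branches there, after which the CPU fetches every instruction over the EMC. The sketch below shows only the validation part, written so it can run on a host. The names `app_entry_t` and `read_app_entry` are hypothetical, and the region base and size are parameters because the actual chip-select address range must be taken from the LPC408x user manual.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t initial_sp;    /* word 0 of the vector table */
    uint32_t reset_handler; /* word 1 of the vector table */
} app_entry_t;

/* Read and sanity-check the application entry point from its vector table.
 * Returns true if the reset handler is a Thumb address inside the given
 * external flash region. Hypothetical helper, not an LPCOpen API. */
static bool read_app_entry(const uint32_t *vector_table,
                           uint32_t region_base, uint32_t region_size,
                           app_entry_t *out)
{
    assert(vector_table != NULL && out != NULL);
    uint32_t sp = vector_table[0];
    uint32_t pc = vector_table[1];

    /* Cortex-M runs Thumb code only, so bit 0 of the handler must be set. */
    if ((pc & 1u) == 0u) {
        return false;
    }
    /* The handler must lie inside the external flash region (XIP). */
    uint32_t addr = pc & ~1u;
    if (addr < region_base || addr - region_base >= region_size) {
        return false;
    }
    out->initial_sp = sp;
    out->reset_handler = pc;
    return true;
}
```

On the target, a successful check would be followed by relocating the vector table (VTOR), loading the main stack pointer and branching to the reset handler. Those steps are target-specific and are left out of this sketch.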

Furthermore, you could consider using SDRAM instead of NOR flash, since it is definitely faster. You would then need an additional SPIFI flash to store the application, as its size is fairly large (> 3 MB).
Have a great day,
TIC

-----------------------------------------------------------------------------------------------------------------------
Note: If this post answers your question, please click the Correct Answer button. Thank you!
-----------------------------------------------------------------------------------------------------------------------

0 Kudos

1,092 Views
priyankb
Contributor III

Hi jeremyzhou‌,

I found some code for Cortex-M3/M4 to read the cycle count. The code I used is:

volatile unsigned int *DWT_CYCCNT  ;
volatile unsigned int *DWT_CONTROL ;
volatile unsigned int *SCB_DEMCR   ;

void reset_timer(void)
{
    DWT_CYCCNT   = (volatile unsigned int *)0xE0001004; // DWT cycle count register
    DWT_CONTROL  = (volatile unsigned int *)0xE0001000; // DWT control register
    SCB_DEMCR    = (volatile unsigned int *)0xE000EDFC; // debug exception and monitor control register
    *SCB_DEMCR   = *SCB_DEMCR | 0x01000000; // set TRCENA to enable the DWT unit
    *DWT_CYCCNT  = 0; // reset the counter
    *DWT_CONTROL = 0; // counter stays disabled until start_timer()
}

void start_timer(void){
    *DWT_CONTROL = *DWT_CONTROL | 1 ; // set CYCCNTENA: enable the counter
}

void stop_timer(void){
    // clear CYCCNTENA: disable the counter (OR-ing with 0, as before, changed nothing)
    *DWT_CONTROL = *DWT_CONTROL & ~1u ;
}

unsigned int getCycles(void){
    return *DWT_CYCCNT;
}


Inside main():-

reset_timer();
start_timer();
asm("nop");
stop_timer();
DEBUGOUT("Cycles: %u\r\n", getCycles());

I am using a 24 MHz oscillator, configured for a 72 MHz CPU clock, EMC clock and peripheral clock. Please try this and let me know what you get.
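
One thing worth checking in the measurement above: the count includes the cost of the start and stop calls themselves, not just the nop, and in the user application those calls are also fetched from external flash. A common way to separate the two is to measure an empty interval first and subtract it. The sketch below shows that idea in a host-runnable form. The counter is passed in as a function pointer, and `sim_read` and `sim_work` are a simulated counter and workload invented for the example; on the target the reader would return the DWT cycle count instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counter source; on the target this would return the DWT cycle count. */
typedef uint32_t (*cycle_reader_t)(void);

/* Cycles that elapse between two back-to-back reads (pure overhead). */
static uint32_t measure_overhead(cycle_reader_t read_cycles)
{
    assert(read_cycles != NULL);
    uint32_t start = read_cycles();
    uint32_t end = read_cycles();
    return end - start; /* unsigned subtraction survives counter wrap-around */
}

/* Cycles spent in work() with the measurement overhead removed. */
static uint32_t measure_net(cycle_reader_t read_cycles, void (*work)(void),
                            uint32_t overhead)
{
    assert(read_cycles != NULL && work != NULL);
    uint32_t start = read_cycles();
    work();
    uint32_t raw = read_cycles() - start;
    return raw > overhead ? raw - overhead : 0u;
}

/* Simulated counter for host testing: every read "costs" 42 cycles. */
static uint32_t sim_counter;
static uint32_t sim_read(void) { sim_counter += 42u; return sim_counter; }
/* Simulated workload that takes exactly 1 cycle. */
static void sim_work(void) { sim_counter += 1u; }
```

In the simulation the raw measurement of the 1-cycle workload is 43 cycles and the net result is 1, which is the kind of correction that would be needed before comparing a nop in internal flash against a nop in external flash.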

I also came across a strange thing: when I used a 12 MHz crystal (not an oscillator) instead of the 24 MHz crystal I was using, the cycle count in the user app came down to almost one third of the previous value with the same frequency configuration. I don't understand this, though the LPC4088 UM does recommend a 12 MHz crystal.

A few more questions about statements I read in the ARM PrimeCell MultiPort Memory Controller documentation:

  • If the memory bus is multiplexed externally, for example by using an EBI, the worst-case transfer latency is affected because the external bus is shared by multiple devices.
  • AHB port 0 is the highest priority port, AHB port 3 is the lowest priority port.
  • Lower priority AHB memory ports can be locked out indefinitely if a higher priority AHB memory port continually performs memory requests.
  • The memory controller AHB memory ports are prioritized. If a master connected to a HIGH priority port performs continuous transactions, lower priority ports are not able to access the bus until the higher priority port has completed its transactions.

Now I have connected the FPGA to CS0, SRAM to CS1, and NOR flash to CS3. Can you help me understand the above statements with reference to my hardware connections? After reading this I was planning to swap the FPGA and the NOR flash and see what happens.

Regards

Priyank.

0 Kudos

1,092 Views
jeremyzhou
NXP Employee

Hi Priyank Bhatt

Thanks for your reply.
1) I also came across a strange thing: when I used a 12 MHz crystal (not an oscillator) instead of the 24 MHz crystal I was using, the cycle count in the user app came down to almost one third of the previous value with the same frequency configuration. I don't understand this, though the LPC4088 UM does recommend a 12 MHz crystal.
-- Yes, it seems a bit weird.
2) Now I have connected FPGA to CS0, SRAM to CS1, & Nor Flash to CS3. Can you help me explain the above statements in reference to my hardware connections? After reading this I was planning to swap FPGA & Nor-Flash & see what happens.
-- For the EMC module, I don't think the CSx lines are prioritized; however, you can give it a try.
Have a great day,
TIC

0 Kudos

1,092 Views
jeremyzhou
NXP Employee

Hi Priyank Bhatt ,

Thank you for your interest in NXP Semiconductor products and for the opportunity to serve you.
When working with the same clock, in general, the parallel Nor flash has a greater execution speed than the SPIFI Nor flash.
To improve the performance, I'd like to suggest you increase the speed of the CPU and peripheral clock, or copy the application code to SRAM to run, etc.
Have a great day,
TIC

0 Kudos

1,092 Views
priyankb
Contributor III

Hi jeremyzhou‌,

Thank you for replying.

When working with the same clock, in general, the parallel Nor flash has a greater execution speed than the SPIFI Nor flash.

I also thought parallel memory must be faster than serial, even though SPIFI has a dedicated cache to improve speed. That is why we chose parallel flash over serial. I have kept the clock at 72 MHz because that is the closest I can get to the EMC's limit of 80 MHz.
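
A back-of-the-envelope way to compare the two interfaces is peak sequential throughput: bits moved per transfer, times clock, divided by the clocks each transfer takes. The sketch below is only that arithmetic. The helper name `peak_bytes_per_sec` is hypothetical, and the example figures in the note after it (6 EMC clocks per 16-bit read without page mode, a quad serial interface moving 4 bits per clock at the same 72 MHz) are assumptions for illustration, not measured or datasheet values. It also ignores command and address overhead on non-sequential serial reads and any caching.

```c
#include <assert.h>
#include <stdint.h>

/* Peak sequential throughput in bytes per second.
 *   bits_per_transfer   : data bits moved by one bus transfer
 *   clk_hz              : interface clock in Hz
 *   clocks_per_transfer : clocks consumed by one transfer (incl. wait states)
 * Hypothetical helper used only to illustrate the arithmetic. */
static uint32_t peak_bytes_per_sec(uint32_t bits_per_transfer,
                                   uint32_t clk_hz,
                                   uint32_t clocks_per_transfer)
{
    assert(clocks_per_transfer > 0);
    uint64_t bits_per_sec =
        (uint64_t)bits_per_transfer * clk_hz / clocks_per_transfer;
    return (uint32_t)(bits_per_sec / 8u);
}
```

Under those assumptions, `peak_bytes_per_sec(16, 72000000, 6)` is 24 MB/s for the 16-bit parallel flash, `peak_bytes_per_sec(4, 72000000, 1)` is 36 MB/s for the quad serial case, and a core fetching one 32-bit word per clock would want `peak_bytes_per_sec(32, 72000000, 1)`, or 288 MB/s. A wide bus with many wait states per access is therefore not automatically faster than a narrow one that streams on every clock.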

I have now tested the code with every combination of clock configurations and got a weird result: when I increase the clock, execution actually becomes slower, which logically should not happen. That is why I seriously suspect the initialization of either the EMC flash or the clock. These are my observations:

CCLK  PCLK  EMC_CLK  Timer_Int(us)  delay(loop)  nop
24    24    24       35 - 45
48    48    48       60 - 90        50 - 60      30 - 55
72    72    72       120 - 140      120 - 140    80 - 90
96    48    96       330 - 350      260 - 280    220 - 230
96    96    96       280 - 350      250 - 280    240 - 260
120   60    120      330 - 350      250 - 270    200 - 220
120   60    60       330 - 350      280 - 320    230 - 250

To improve the performance, I'd like to suggest you increase the speed of the CPU and peripheral clock, or copy the application code to SRAM to run, etc.

Our application is very big (~3-4 MB); would it be feasible to copy it to SRAM and run it from there?

When I jump from the bootloader to the user app in NOR flash, does the code execute from flash as execute-in-place, or is it copied to internal SRAM and then executed?

I also tried compiler optimization options, which increased execution speed, but then our user app doesn't get past the splash screen (we are using emWin for the LCD display).

I have ordered a new flash from Cypress with page read mode, since the current flash does not support page read mode. Would that make a difference in the execution of the code?

Regards

Priyank.

0 Kudos