
LPC4370 FFT Performance.  I have some numbers...

Discussion created by lpcware Employee on Jun 15, 2016
Content originally posted in LPCWare by emh203 on Wed Jul 09 07:27:13 MST 2014
Just posting this as it may be helpful to others.  

See the project log:

http://hackaday.io/project/1620-The-Human-Connection-%3A-1st-Impression

---------------------------------------------------------------------------------------------------------------------------

A core part of the algorithms we will use is a complex-input FFT (a+jb). Before going too far, I wanted to evaluate the FFT performance of the LPC4370 M4 core. Now, an FPGA would rule the roost for FFT processing horsepower, BUT I am trying to keep this as low cost as possible. The 4370 on the LPC-Link2 is a place to start. FPGAs are great once you have everything worked out, but HDL can be unforgiving... (and FPGAs are high cost!)

So, here are some assumptions:

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

LPC4370 - Code running on the M4 core. Clock rate at 204 MHz. Execution from RAMLoc128 (0x10000000 - 0x10020000)

ARM CMSIS DSP libraries V 4.0.1.   In particular I am looking at the function arm_cfft_radix4_q15

I am using fixed point processing.

Input data is a 4096 q15_t array in RAM.   (Note all processing is done in place... source data must be in RAM)

Optimizations are not turned on.    I used the version of GCC included with LPCXpresso 7.2

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
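To make the fixed-point setup a bit more concrete, here is a minimal host-side sketch of preparing Q15 data. Q15 stores values in the range [-1, 1) as 16-bit integers with 15 fractional bits, and the CMSIS DSP complex FFT functions expect real and imaginary parts interleaved in a single buffer. The helper names `float_to_q15` and `pack_complex_q15` are made up for illustration and are not part of CMSIS DSP; on the target you would include `arm_math.h` (which provides `q15_t`) instead of the local typedef.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for the CMSIS DSP q15_t type (assumption: host build
 * without arm_math.h). */
typedef int16_t q15_t;

/* Hypothetical helper: convert a float in [-1, 1) to Q15, saturating
 * values that fall outside the representable range. */
q15_t float_to_q15(float x)
{
    float scaled = x * 32768.0f;
    if (scaled >= 32767.0f) {
        return 32767;
    }
    if (scaled <= -32768.0f) {
        return -32768;
    }
    return (q15_t)scaled; /* truncates toward zero */
}

/* Hypothetical helper: interleave n real and n imaginary samples as
 * re0, im0, re1, im1, ... so 'out' must hold 2 * n values. */
void pack_complex_q15(const q15_t *re, const q15_t *im, q15_t *out, uint32_t n)
{
    assert(re && im && out);
    for (uint32_t i = 0; i < n; i++) {
        out[2u * i] = re[i];
        out[2u * i + 1u] = im[i];
    }
}
```

Since the transform runs in place, the interleaved buffer is overwritten with the result, which is why it has to live in RAM.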

Now, I am targeting a 200 kHz system sample rate with a 4096 block size. (This matches the max radix-4 block size allowed by CMSIS DSP.) This means I have a window of 20.48 ms (4096 samples / 200 kHz) to get all my processing done. In the background, new ADC data will be DMA'd into a buffer and data will be DMA'd from an output buffer to a DAC.
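The processing window is just the block size divided by the sample rate. A small sketch of that arithmetic, using integer microseconds to avoid floating point (the function name `block_period_us` is only for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Time available to process one block, in microseconds:
 * block_size samples arriving at sample_rate_hz samples per second. */
uint64_t block_period_us(uint32_t block_size, uint32_t sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    return ((uint64_t)block_size * 1000000u) / sample_rate_hz;
}
```

With the numbers above, `block_period_us(4096, 200000)` gives 20480 us, i.e. 20.48 ms.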

So.... drum roll. The algorithm arm_cfft_radix4_q15 takes 2.4 ms. So, I have roughly a factor of 8.5 margin (20.48 ms / 2.4 ms). Now, this will quickly get eaten up. I have to do a minimum of 2 FFTs (forward and reverse transform), plus the magical scaling algorithms. Either way, this gives me a good amount of headroom. I always have 2 other cores ready to go :-)

I also profiled arm_cfft_radix2_q15. It is a bit slower at 2.9 ms.
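If you want to compare against your own measurements, it can help to express the timings as core clock cycles and to see how many FFT calls fit in one block period. The sketch below does only that arithmetic; the function names are hypothetical and the inputs are the figures quoted in this post (204 MHz clock, 20.48 ms window, 2.4 ms and 2.9 ms per FFT).

```c
#include <assert.h>
#include <stdint.h>

/* Convert a measured duration in microseconds to core clock cycles. */
uint64_t us_to_cycles(uint32_t duration_us, uint32_t core_clock_hz)
{
    return ((uint64_t)duration_us * core_clock_hz) / 1000000u;
}

/* How many back-to-back runs of a routine fit in one block period
 * (integer division, so a partial run does not count). */
uint32_t runs_per_block(uint32_t period_us, uint32_t routine_us)
{
    assert(routine_us > 0);
    return period_us / routine_us;
}
```

At 204 MHz, 2.4 ms works out to 489,600 cycles and 2.9 ms to 591,600 cycles. Eight radix-4 transforms, or seven radix-2 transforms, fit in the 20.48 ms window before any other processing is counted.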

Code is in the hc-1 Github repository.

Last notes:

The board support library sometimes crashes in Board_SystemInit() at bootup when running from RAM. I think a delay is needed when setting up the clock dividers or the crystal. If I single step through the code, it works... Also, using the internal osc and PLLing up to 204 MHz is fine.

These numbers would certainly get awful if running from SPIFI flash. (The LPC4370 is ROM-less. You have to bootload from SPIFI flash into RAM or execute from SPIFI...) Maybe I can do that some other day.
