KISS FFT Performance on MSC8144

zpyatt · ‎02-11-2012

Hi,

Does 2.45 milliseconds for a single 512-point FFT (32-bit fixed-point, executing on a single core from M3 memory) seem outrageous for the MSC8144ADS? I've seen some benchmarks for TI processors indicating ~150 usec.

Thanks,

zpyatt

edit: I forgot to mention that this is with optimization level 4 for execution speed, and the result seems to be independent of whether I use SmartDSP and of C vs. C++.

zpyatt · ‎02-13-2012

I realized I didn't have FIXED_POINT defined as 32, so it was defaulting to 16-bit integers. Going to 32 bits improved performance to 2.04 ms per 512-point FFT in M3; in M2 it takes ~1.63 ms. That is still nowhere near the 200 usec I really need.

J2MEJediMaster · ‎02-14-2012

Have you taken care of memory alignment and tried unrolling any loops yet? There is an application note, AN3991, which describes how to optimize the code of a complex FIR function. The material is specific to a part with an SC3850 core (the MSC8144 has an SC3400 core), but some of the information in it might still be relevant to what you are trying to do. Unfortunately, you have to sign a non-disclosure agreement to see that document, so I cannot post it here. If you wish to pursue this, contact your local Freescale representative.
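To make the loop-unrolling suggestion concrete, here is a minimal sketch in portable C. It is not taken from AN3991 or from KISS FFT; the function names `mac_simple` and `mac_unrolled4` are hypothetical, and the unrolled version assumes the length is a multiple of four. Alignment is toolchain-specific (pragma or attribute), so it is only noted in a comment rather than shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Straightforward multiply-accumulate over 16-bit samples. */
static int32_t mac_simple(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)x[i] * h[i];
    return acc;
}

/*
 * Same computation, unrolled by four with independent accumulators so the
 * compiler has more freedom to schedule multiplies in parallel.
 * Assumption: n is a multiple of 4. On a real target the arrays would
 * typically also be aligned with the compiler's own alignment directive.
 */
static int32_t mac_unrolled4(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        a0 += (int32_t)x[i]     * h[i];
        a1 += (int32_t)x[i + 1] * h[i + 1];
        a2 += (int32_t)x[i + 2] * h[i + 2];
        a3 += (int32_t)x[i + 3] * h[i + 3];
    }
    return a0 + a1 + a2 + a3;
}
```

Both functions return the same sum for the same inputs; whether the unrolled form is actually faster depends on the compiler and core, so it is worth timing both.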

---Tom

zpyatt · ‎02-15-2012

Thanks for the response. I haven't messed with memory alignment or loop unrolling yet. I need to look into that, as well as using intrinsics. What sort of times would you expect to see from a 512-pt FFT on the StarCore if you made use of all of the above?

I can't seem to find AN3991. How do you get documents that require NDAs? I was hoping to also get the one on cache optimization.

Thanks,

zpyatt

J2MEJediMaster · ‎02-16-2012

Yes, you definitely need to use the three techniques that you mentioned to accelerate the FFT code. AN3991 demonstrates using those techniques on a "vanilla" complex FIR function written in C and boosts the performance of the function by 5x.

You cannot find the app note because it is an internal document. The only way you can obtain it is by contacting your local Freescale representative and signing a non-disclosure agreement. You might also check with them to see about obtaining any app notes on writing FFTs.

---Tom

zpyatt · ‎02-22-2012

Tom, thanks for the input, I really appreciate it.

So I made two massive discoveries.

1.) I was setting the compiler directive FIXED_POINT for KISS FFT to 32. I thought this was what I wanted because I wanted a 32-bit number. However, KISS FFT uses 64-bit intermediates to guard against overflow when you do this; setting FIXED_POINT=1 decreased my time to ~800 usec per 512-pt FFT.

2.) Moving my arrays from the shared DDR, M3, and M2 regions to the shared cacheable DDR, M3, and M2 regions gets me to ~83 usec per 512-pt FFT. In fact, when using the cacheable areas of memory, DDR vs. M2 makes no noticeable difference.
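For anyone wondering why the first change matters so much, here is a rough sketch of the arithmetic involved. It is not KISS FFT's actual code; `q15_mul` and `q31_mul` are hypothetical helper names. It only shows that a 16-bit fixed-point multiply needs a 32-bit product, while a 32-bit one needs a 64-bit product, which is usually far more expensive on a core without native 64-bit arithmetic.

```c
#include <stdint.h>

/* Q15 multiply: 16-bit operands, 32-bit intermediate product.
 * Note: right-shifting a negative value is implementation-defined in C;
 * most compilers perform an arithmetic shift. */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t prod = (int32_t)a * b;
    return (int16_t)((prod + (1 << 14)) >> 15);   /* round, drop 15 fraction bits */
}

/* Q31 multiply: 32-bit operands, 64-bit intermediate product. */
static int32_t q31_mul(int32_t a, int32_t b)
{
    int64_t prod = (int64_t)a * b;
    return (int32_t)((prod + ((int64_t)1 << 30)) >> 31);  /* round, drop 31 bits */
}
```

As a sanity check, 0.5 * 0.5 comes out as 0.25 in both formats: `q15_mul(16384, 16384)` gives 8192 and `q31_mul(1073741824, 1073741824)` gives 536870912. The trade-off is precision: the 16-bit path is cheaper per butterfly but carries fewer fraction bits.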

/zpyatt

J2MEJediMaster · ‎02-23-2012

Thanks for reporting back with this information. I would suggest that you flag your report as a solution so that others facing similar optimization problems can locate your findings quickly.

---Tom