How to improve long RFFT execution time on i.MX RT1024?

Felix_ar · ‎11-30-2023

Hi All

My customer tried the ARM CMSIS DSP library and found that the RFFT function takes much longer to execute than on another vendor's Cortex-M4 MCU.

The other vendor's Cortex-M4 at 80MHz: execution time is around 207us.

i.MX RT1024 Cortex-M7 at 500MHz: execution time is 850us to 900us.

How can the execution time on the i.MX RT1024 be improved?

The target execution time is 33us or less.

The example code and oscilloscope result are below.

Thanks.

mjbcswitzerland · ‎11-30-2023

Hi

I don't have a comparison for the RFFT but I do have one for a floating point FFT, where the same code runs on a 120MHz Cortex-M4 (K64) and on the i.MX RT 1024 at 500MHz.
The values are for different FFT lengths; each table entry gives the times for three steps in the process, in this order:
- converting samples to a floating point buffer
- performing the in-place complex FFT (arm_cfft_f32())
- calculating the magnitude of the output vectors

FFT Length	120MHz K64 (M4), SW floating point (us)	120MHz K64 (M4), HW FPU (us)	500MHz i.MX RT 1024 (M7), HW FPU, code in ITC (us)
16	2.6/104/53	3.2/10.8/28	0.3/1.4/1.2
32	4.6/277/106	5.5/22.4/58	0.47/3.0/2.2
64	8.6/641/211	10.7/40.4/112	0.85/5.5/4.3
128	17/1712/421	22/108/218	1.6/13.9/10.3
256	33/4057/841	43/236/433	3.1/33.4/16.8
512	65/10540/1683	88/427/865	5.3/64/34.6
1024	131/21170/3360	173/1073/1730	10.3/155/66
2048	260/43110/-	345/2439/-	26.4/335/34.5
4096	516/63330/-	686/4340/-	-

This should give an idea of the performance improvement to expect. Using the FPU is important: the SW implementation on the 120MHz K64 is some 20x slower than when using its single-precision FPU. The i.MX RT 1024 (at 500MHz) is about 13x faster again when performing a 1024-point transformation.
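
The 13x figure appears to match the 1024-point row when the three step times are summed for each device. A small sketch of that arithmetic follows; the helper names total_us() and projected_us() are made up for illustration, and applying the ratio to a measurement from a different M4 part is a rough projection rather than a guarantee.

```c
#include <assert.h>

/* Sum of the three per-step times (convert / FFT / magnitude), in us. */
static double total_us(double convert, double fft, double magnitude)
{
    return convert + fft + magnitude;
}

/* Scale a time measured on the slower part by an observed speed-up. */
static double projected_us(double measured_us, double speedup)
{
    assert(speedup > 0.0);
    return measured_us / speedup;
}
```

With the 1024-point values, total_us(173, 1073, 1730) is 2976us on the K64 with HW FPU and total_us(10.3, 155, 66) is 231.3us on the i.MX RT 1024, a ratio of roughly 12.9. Feeding the 207us measurement from the question into projected_us() with that ratio gives roughly 16.1us.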

Based on this, I would expect that you can get your operation (207us on 80MHz M4) down to about 16us on the 500MHz i.MX RT 1024.
Make sure it is using its FPU and running the code in ITC (with data in DTC) for optimal efficiency. See this video as a guide: https://www.youtube.com/watch?v=fnfLQ-nbscI
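
One quick way to check the FPU part is to ask the compiler how the file was built. The sketch below assumes a GCC-style ARM toolchain: __SOFTFP__ is the GCC macro for a soft-float build and __ARM_FP is the ACLE macro whose bit 2 (0x4) indicates single-precision hardware support. The function name fpu_build_mode() is hypothetical, and other toolchains may need different checks.

```c
/* Report how this translation unit was compiled with respect to
 * floating point, based on predefined compiler macros. */
static const char *fpu_build_mode(void)
{
#if defined(__SOFTFP__)
    /* GCC soft-float ABI: float math is done by library routines. */
    return "soft-float: FP math runs in library code";
#elif defined(__ARM_FP) && (__ARM_FP & 0x4)
    /* ACLE: bit 2 of __ARM_FP means single-precision HW support. */
    return "hardware single-precision FP enabled";
#elif defined(__ARM_FP)
    return "hardware FP enabled, but not single precision";
#else
    return "no ARM hardware FP reported (or not an ARM target)";
#endif
}
```

Printing this string once at start-up (or inspecting it in the debugger) shows whether the DSP code is being compiled for the FPU. A soft-float result would be consistent with the large gap between the SW and HW FPU columns in the table.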

Regards

Mark

P.S.: I also have a comparison for some FFT lengths when run on the 48MHz KL27 (Cortex-M0+), which has no FPU:

FFT Length	48MHz KL27 (M0+), SW floating point (us)
16	11.3/510/236
32	20.9/1355/471
64	78.6/3202/942
128	152/8341/1883

For our discounted i.MX and Kinetis stock availability see https://www.utasker.com/Shop/semi.html

Felix_ar · ‎12-06-2023

Hi Mark

Thanks for your reply.
