How to reduce the long RFFT execution time on i.MX RT1024?

Felix_ar
Contributor III

Hi All

My customer tried to use the ARM CMSIS DSP library and found that the RFFT function takes much longer to execute than on another vendor's Cortex-M4 MCU.

The other vendor's Cortex-M4 MCU at 80MHz has an execution time of around 207us.

The i.MX RT1024 Cortex-M7 at 500MHz has an execution time of 850us~900us.

How can the execution time on the i.MX RT1024 be improved?

The target execution time is 33us or less.

Below are the example code and the oscilloscope result.

code1.png

scope.png

 Thanks.

mjbcswitzerland
Specialist V

Hi

I don't have a comparison for the RFFT but I do have one for a floating point FFT, where the same code runs on a 120MHz Cortex-M4 (K64) and on the i.MX RT 1024 at 500MHz.
The values are for different FFT lengths and the times are for three steps in the process:
- converting the samples to a floating point buffer
- performing the in-place complex FFT (arm_cfft_f32())
- calculating the magnitude of the output vectors

FFT processing time in us (sample conversion / complex FFT / magnitude calculation):

FFT Length  120MHz K64 (m4), SW floating point (us)   120MHz K64 (m4), HW FPU (us)    500MHz i.MX RT 1024 (m7), HW FPU, code in ITC (us)
16          2.6/104/53                                3.2/10.8/28                     0.3/1.4/1.2
32          4.6/277/106                               5.5/22.4/58                     0.47/3.0/2.2
64          8.6/641/211                               10.7/40.4/112                   0.85/5.5/4.3
128         17/1712/421                               22/108/218                      1.6/13.9/10.3
256         33/4057/841                               43/236/433                      3.1/33.4/16.8
512         65/10540/1683                             88/427/865                      5.3/64/34.6
1024        131/21170/3360                            173/1073/1730                   10.3/155/66
2048        260/43110/-                               345/2439/-                      26.4/335/34.5
4096        516/63330/-                               686/4340/-                      -

This should give an idea of the performance improvement to expect. Using the FPU is important: the SW floating point implementation on the 120MHz K64 is some 20x slower than the same code using its single-precision FPU. The i.MX RT 1024 (at 500MHz) is about 13x faster again when performing a 1024 point transformation.
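As a rough cross-check of these ratios, the 1024-point values can be plugged into a few lines of C. The struct and function names are illustrative only; the numbers are copied from the 1024-point row. The result suggests that the 20x figure matches the FFT step alone, while the 13x figure appears to match the sum of all three steps (the FFT step alone comes out closer to 7x).

```c
/* Per-step times in microseconds: conversion / FFT / magnitude. */
typedef struct {
    double convert_us;
    double fft_us;
    double magnitude_us;
} fft_times;

/* 1024-point row of the comparison table. */
static const fft_times k64_sw  = {131.0, 21170.0, 3360.0}; /* K64, SW floating point */
static const fft_times k64_fpu = {173.0, 1073.0, 1730.0};  /* K64, HW FPU */
static const fft_times rt1024  = {10.3, 155.0, 66.0};      /* i.MX RT 1024, HW FPU */

/* Sum of the three steps. */
static double total_us(fft_times t)
{
    return t.convert_us + t.fft_us + t.magnitude_us;
}

/* FFT step only: SW floating point versus HW FPU on the K64 (about 19.7). */
static double k64_fft_sw_over_fpu(void)
{
    return k64_sw.fft_us / k64_fpu.fft_us;
}

/* All three steps: K64 with FPU versus i.MX RT 1024 (about 12.9). */
static double k64_over_rt1024_total(void)
{
    return total_us(k64_fpu) / total_us(rt1024);
}

/* FFT step only: K64 with FPU versus i.MX RT 1024 (about 6.9). */
static double k64_over_rt1024_fft(void)
{
    return k64_fpu.fft_us / rt1024.fft_us;
}
```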

Based on this, I would expect that you can get your operation (207us on the 80MHz M4) down to about 16us on the 500MHz i.MX RT 1024.
Make sure that it is using its FPU and running the code in ITC (with the data in DTC) for optimal efficiency. See this video as a guide: https://www.youtube.com/watch?v=fnfLQ-nbscI
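The 16us estimate is simply the measured 207us divided by the roughly 13x speed-up, which can be written out as below. The helper names are hypothetical, and the calculation ignores that the reference M4 ran at 80MHz rather than the 120MHz of the K64, so it is an estimate only and not a measurement.

```c
/* Scale a measured time by an expected speed-up factor. */
static double estimate_time_us(double measured_us, double speedup)
{
    return measured_us / speedup;
}

/* Returns 1 if the estimate meets the target, otherwise 0. */
static int meets_target(double estimate_us, double target_us)
{
    return estimate_us <= target_us;
}
```

With the values from this thread, estimate_time_us(207.0, 13.0) gives about 15.9us, which is within the 33us target.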

Regards

Mark

P.S.: I also have a comparison for some FFT lengths when run on the 48MHz KL27 (Cortex-m0+), which has no FPU

FFT Length  48MHz KL27 (m0+), FFT processing time (us)
16          11.3/510/236
32          20.9/1355/471
64          78.6/3202/942
128         152/8341/1883




For our discounted i.MX and Kinetis stock availability see https://www.utasker.com/Shop/semi.html

Felix_ar
Contributor III

Hi Mark

Thanks for your reply.
