Hi, I'm using the TWR-K60F120M Tower module. I have the FPU enabled ('FPU with hard vfp passing') and the 'c9x' library model. I have code that performs 100,000 floating-point adds in a loop. The results are as follows:
Kinetis FPU enabled ~120ms
Kinetis No FPU (software) ~500ms
Coldfire MCF52259CAG80 ~150ms
Do these numbers seem reasonable? I guess I was expecting the Kinetis to be an order of magnitude better than the Coldfire (which does not even have an FPU). Can the Kinetis with FPU really be only marginally better than the Coldfire's software libraries, or do I have something configured incorrectly?
The ARM website claims that, like most ARM instructions, most floating-point operations require 1 clock (subject to a number of caveats, of course, relative to access modes, pipeline, etc.), except divide/sqrt at 14 clocks and some 'dual operation' instructions at 3. I assume you are running at 120 MHz, so your 1.2us/loop indicates 144 clocks per loop. I would be curious what the assembly code of this loop looks like! Certainly there should be a VADD in there, which takes 1 clock. Note, of course, that the Cortex-M4F has only a single-precision instruction set.
Here is the assembly listing....
for (i=0; i<100000; i++)
   a: f04f 0300   mov.w    r3, #0
   e: 603b        str      r3, [r7, #0]
  10: e00b        b.n      2a <main+0x2a>
f1 = f1 + 1.76f;
  12: ed97 7a01   vldr     s14, [r7, #4]
  16: eddf 7a09   vldr     s15, [pc, #36]  ; 3c <main+0x3c>
  1a: ee77 7a27   vadd.f32 s15, s14, s15
  1e: edc7 7a01   vstr     s15, [r7, #4]
I don't see the 'end of loop' but I assume it is right there. I have to agree that this is the proper list of single-precision floating-point instructions, which, counting clocks per ARM, should come out to about 10 per loop -- so 100K iterations should take only about a million clocks, roughly 1/120th of a second (8, maybe 11 with overhead, milliseconds)! I am baffled -- we should certainly see the order-of-magnitude performance increase you were expecting! Unfortunately, I don't have any 'F' Kinetis CPUs myself to play with...
Found some hardware and did my own little test: summing an array of 10,000 single-precision floats, using RAM code and IAR tools. First, an optimized software loop took 1.7ms. Then, after manually enabling the FPU (what's up with THAT?), the same add of all elements in the 10,000-word array dropped to 0.5ms. Not the 'factor of 10' you would dream of, but in such a loop the floating-point operations are now a smaller percentage of the overall instruction count -- even with the IAR optimizer doing some unrolling that makes the loop look like this, where only about a quarter of the instructions are VADDs (8 per outer loop):
for( uint32_t i=sizeof(Farray)/sizeof(float);i>0;i--)
0x1fff1958: 0x4638 MOV R0, R7
0x1fff195a: 0xf240 0x41e2 MOVW R1, #1250 ; 0x4e2
accum += Farray[i];
??main_2:
0x1fff195e: 0x19aa ADDS R2, R5, R6
0x1fff1960: 0xed92 0x0a00 VLDR S0, [R2]
0x1fff1964: 0xedd0 0x0a00 VLDR S1, [R0]
0x1fff1968: 0xee30 0x0a20 VADD.F32 S0, S0, S1
0x1fff196c: 0x1f02 SUBS R2, R0, #4
0x1fff196e: 0xedd2 0x0a00 VLDR S1, [R2]
0x1fff1972: 0xf1a0 0x0208 SUB.W R2, R0, #8
0x1fff1976: 0xee30 0x0a20 VADD.F32 S0, S0, S1
0x1fff197a: 0xedd2 0x0a00 VLDR S1, [R2]
0x1fff197e: 0xf1a0 0x020c SUB.W R2, R0, #12 ; 0xc
0x1fff1982: 0xee30 0x0a20 VADD.F32 S0, S0, S1
0x1fff1986: 0xedd2 0x0a00 VLDR S1, [R2]
0x1fff198a: 0xf1a0 0x0210 SUB.W R2, R0, #16 ; 0x10
0x1fff198e: 0xee30 0x0a20 VADD.F32 S0, S0, S1
0x1fff1992: 0xedd2 0x0a00 VLDR S1, [R2]
0x1fff1996: 0xf1a0 0x0214 SUB.W R2, R0, #20 ; 0x14
0x1fff199a: 0xee30 0x0a20 VADD.F32 S0, S0, S1
0x1fff199e: 0xedd2 0x0a00 VLDR S1, [R2]
0x1fff19a2: 0xf1a0 0x0218 SUB.W R2, R0, #24 ; 0x18
0x1fff19a6: 0xee30 0x0a20 VADD.F32 S0, S0, S1
0x1fff19aa: 0xedd2 0x0a00 VLDR S1, [R2]
0x1fff19ae: 0xf1a0 0x021c SUB.W R2, R0, #28 ; 0x1c
for( uint32_t i=sizeof(Farray)/sizeof(float);i>0;i--)
0x1fff19b2: 0x3820 SUBS R0, R0, #32 ; 0x20
0x1fff19b4: 0xee30 0x0a20 VADD.F32 S0, S0, S1
0x1fff19b8: 0xedd2 0x0a00 VLDR S1, [R2]
0x1fff19bc: 0x19aa ADDS R2, R5, R6
0x1fff19be: 0x1e49 SUBS R1, R1, #1
0x1fff19c0: 0xee30 0x0a20 VADD.F32 S0, S0, S1
0x1fff19c4: 0xed82 0x0a00 VSTR S0, [R2, #0]
for( uint32_t i=sizeof(Farray)/sizeof(float);i>0;i--)
0x1fff19c8: 0xd1c9 BNE.N ??main_2 ; 0x1fff195e
The total instruction count in the loop is 30, and 1250 iterations is 37,500 total instructions. At 120MHz, that would 'ideally' have taken 0.3ms assuming 1 clock each, so I suppose we can 'write off' a 40% 'clock overhead' in RAM access and pipeline stalls.
So the 'bottom line' is that a factor of four may indeed be about the improvement you can expect in an overall compute-intensive sequence -- and what THAT means is that the software library is actually pretty darn good(!) -- something like 10 to 15 clocks for a single-precision add.
Some other benchmarks with the same loop:
32-bit integers: 0.9ms (???) -- must be RAM code-fetch getting in the way. From ROM = 0.3ms. SP float from ROM = 2.2ms.
And, not surprisingly, double-precision float (double) takes 5.5ms with or without the FPU, so apparently a single-precision FPU is of no help in double-precision math. Double from ROM takes 3.2ms.
> after manually enabling the FPU (what's up with THAT?)
Try adding this define to your compiler preprocessor defines:
__VFPV4__
I was having problems compiling <cmath> and other issues, and doing that seemed to solve them... and it looks like __fp_init() is defined under that symbol in __arm_eabi_init.c.