Found some hardware and did my own little 'test': summing an array of 10,000 single-precision floats, using RAM code and the IAR tools. First, an 'optimized' software-library loop took 1.7 ms. Then, after manually enabling the FPU (what's up with THAT?), the same add of all 10,000 elements dropped to 0.5 ms. Not the 'factor of 10' you might dream of, but in such a loop the floating-point operations are now a 'smaller percentage' of the overall instruction count -- even with IAR's unrolling optimization, which makes the loop look like this, where only about 1/4 of the instructions are VADD (8 per outer iteration):
for( uint32_t i=sizeof(Farray)/sizeof(float);i>0;i--)
0x1fff1958: 0x4638 MOV R0, R7
0x1fff195a: 0xf240 0x41e2 MOVW R1, #1250 ; 0x4e2
accum += Farray[i];
??main_2:
0x1fff195e: 0x19aa ADDS R2, R5, R6
0x1fff1960: 0xed92 0x0a00 VLDR S0, [R2]
0x1fff1964: 0xedd0 0x0a00 VLDR S1, [R0]
0x1fff1968: 0xee30 0x0a20 VADD.F32 S0, S0, S1
0x1fff196c: 0x1f02 SUBS R2, R0, #4
0x1fff196e: 0xedd2 0x0a00 VLDR S1, [R2]
0x1fff1972: 0xf1a0 0x0208 SUB.W R2, R0, #8
0x1fff1976: 0xee30 0x0a20 VADD.F32 S0, S0, S1
0x1fff197a: 0xedd2 0x0a00 VLDR S1, [R2]
0x1fff197e: 0xf1a0 0x020c SUB.W R2, R0, #12 ; 0xc
0x1fff1982: 0xee30 0x0a20 VADD.F32 S0, S0, S1
0x1fff1986: 0xedd2 0x0a00 VLDR S1, [R2]
0x1fff198a: 0xf1a0 0x0210 SUB.W R2, R0, #16 ; 0x10
0x1fff198e: 0xee30 0x0a20 VADD.F32 S0, S0, S1
0x1fff1992: 0xedd2 0x0a00 VLDR S1, [R2]
0x1fff1996: 0xf1a0 0x0214 SUB.W R2, R0, #20 ; 0x14
0x1fff199a: 0xee30 0x0a20 VADD.F32 S0, S0, S1
0x1fff199e: 0xedd2 0x0a00 VLDR S1, [R2]
0x1fff19a2: 0xf1a0 0x0218 SUB.W R2, R0, #24 ; 0x18
0x1fff19a6: 0xee30 0x0a20 VADD.F32 S0, S0, S1
0x1fff19aa: 0xedd2 0x0a00 VLDR S1, [R2]
0x1fff19ae: 0xf1a0 0x021c SUB.W R2, R0, #28 ; 0x1c
for( uint32_t i=sizeof(Farray)/sizeof(float);i>0;i--)
0x1fff19b2: 0x3820 SUBS R0, R0, #32 ; 0x20
0x1fff19b4: 0xee30 0x0a20 VADD.F32 S0, S0, S1
0x1fff19b8: 0xedd2 0x0a00 VLDR S1, [R2]
0x1fff19bc: 0x19aa ADDS R2, R5, R6
0x1fff19be: 0x1e49 SUBS R1, R1, #1
0x1fff19c0: 0xee30 0x0a20 VADD.F32 S0, S0, S1
0x1fff19c4: 0xed82 0x0a00 VSTR S0, [R2, #0]
for( uint32_t i=sizeof(Farray)/sizeof(float);i>0;i--)
0x1fff19c8: 0xd1c9 BNE.N ??main_2 ; 0x1fff195e
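For anyone wondering about the 'manually enabling the FPU' part: on a Cortex-M4F the FPU is disabled out of reset, and you turn it on by granting access to coprocessors CP10/CP11 in CPACR. Below is a minimal sketch of that plus the kind of timing harness I mean, assuming a CMSIS device header; the header name, FARRAY_LEN, and the DWT-based timing are my illustration, not a copy of the actual test code.

#include <stdint.h>
#include "device.h"               /* placeholder: your CMSIS device header */

#define FARRAY_LEN 10000u
static float Farray[FARRAY_LEN];

static void fpu_enable(void)
{
    /* Grant full access to CP10/CP11 (the single-precision FPU), then make
       sure the change takes effect before any FPU instruction executes. */
    SCB->CPACR |= (0xFu << 20);
    __DSB();
    __ISB();
}

static uint32_t sum_cycles(float *accum_out)
{
    float accum = 0.0f;

    /* DWT cycle counter: cycles / 120 MHz gives the times quoted above. */
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    DWT->CYCCNT = 0u;
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;

    uint32_t start = DWT->CYCCNT;
    for (uint32_t i = 0u; i < FARRAY_LEN; i++)
        accum += Farray[i];
    uint32_t cycles = DWT->CYCCNT - start;

    *accum_out = accum;           /* keep the result live so the loop isn't optimized away */
    return cycles;
}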
The loop body is 30 instructions, and 1250 iterations makes 37,500 instructions in total. At 120 MHz that would 'ideally' take about 0.31 ms at one clock each, so roughly 40% of the measured 0.5 ms gets 'written off' as clock overhead from RAM code fetch and pipeline stalls.
So the 'bottom line' is that a factor of three-to-four (1.7 ms down to 0.5 ms here) is about the improvement you can expect over an overall compute-intensive sequence -- and what THAT means is that the 'software library' is actually pretty darn good(!) -- something like 10 to 15 clocks for a single-precision add.
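Showing my arithmetic for the 40% and the 10-to-15-clock figures -- nothing but the measured numbers from above, runnable anywhere:

#include <stdio.h>

int main(void)
{
    const double f_clk   = 120e6;                /* core clock, Hz */
    const double t_ideal = 37500.0 / f_clk;      /* 30 instr * 1250 iters at 1 clock each */
    const double t_fpu   = 0.5e-3;               /* measured FPU sum, RAM code */
    const double t_soft  = 1.7e-3;               /* measured software-library sum, RAM code */
    const double n_adds  = 10000.0;

    printf("ideal loop time      = %.2f ms\n", t_ideal * 1e3);      /* ~0.31 ms */
    printf("overhead fraction    = %.0f %%\n",
           (t_fpu - t_ideal) / t_fpu * 100.0);                      /* ~38 %, the '40%' above */
    printf("extra clocks per add = %.1f\n",
           (t_soft - t_fpu) * f_clk / n_adds);                      /* ~14 clocks per software add */
    return 0;
}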
Some other benchmarks with the same loop:
32-bit integers: 0.9 ms (???) -- must be the RAM code fetch getting in the way??? From ROM = 0.3 ms. SP float from ROM = 2.2 ms.
And not surprisingly, double-precision float (double) takes 5.5 ms with or without the FPU, so apparently an SP-only FPU is of 'no help' in double-float math. Double from ROM takes 3.2 ms.
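One practical note to go with that (my own aside, not something I measured here): to actually get the single-precision FPU used, the C source has to stay in float end-to-end; unsuffixed literals and the double-flavored math functions silently promote to double and land you right back in the software routines. A hypothetical example of the kind of thing to watch for:

#include <math.h>
#include <stdint.h>

/* Hypothetical helper, just to illustrate keeping everything single precision. */
float rms_f32(const float *x, uint32_t n)
{
    float sum = 0.0f;                 /* '0.0f', not '0.0' -- a bare 0.0 is a double literal */
    for (uint32_t i = 0u; i < n; i++)
        sum += x[i] * x[i];           /* all-float expression, so the SP FPU can do the work */
    return sqrtf(sum / (float)n);     /* sqrtf(), not sqrt() -- sqrt() works in double */
}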