Kinetis K60 FPU Benchmark


1,437 Views
mon3al
Contributor I

Hi, I'm using the TWR-K60F120M Tower module. I have the FPU enabled ('FPU with hard vfp passing') and the 'c9x' model selected. I have code that performs 100,000 floating-point adds in a loop. The results are as follows:


Kinetis, FPU enabled          ~120 ms
Kinetis, no FPU (software)    ~500 ms
Coldfire MCF52259CAG80        ~150 ms


Do these numbers seem reasonable? I guess I was expecting the Kinetis to be an order of magnitude better than the Coldfire (which does not even have an FPU). Can the Kinetis with FPU be only marginally better than the Coldfire libraries, or do I have something configured incorrectly?

5 Replies

632 Views
egoodii
Senior Contributor III

The ARM website claims that most floating-point operations, like MOST ARM instructions, take 1 clock (subject to a number of caveats, of course, relating to access modes, pipeline, etc.), with some exceptions: divide/sqrt at 14, some 'dual operation' instructions at 3, and so on. I assume you are running at 120 MHz, so your 1.2 us/loop indicates 144 clocks per loop. I would be curious what the assembly code of this loop looks like! Certainly there should be a 'VADD' in there that takes 1 clock. Note, of course, that the Cortex-M4F has only a single-precision floating-point instruction set.
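The 144-clock figure follows directly from the numbers in the first post. A minimal sketch of that arithmetic in C, assuming the 120 MHz core clock guessed at above (the function name is made up for illustration):

```c
#include <assert.h>

/* Average core clocks spent per loop iteration.
 *   total_ms   : measured time for the whole loop, in milliseconds
 *   iterations : number of loop iterations
 *   clock_mhz  : core clock in MHz (120 is an assumption, not a measurement)
 */
double clocks_per_iteration(double total_ms, double iterations, double clock_mhz)
{
    assert(iterations > 0.0);
    double us_per_iteration = total_ms * 1000.0 / iterations; /* ms -> us */
    return us_per_iteration * clock_mhz;                      /* us * MHz = clocks */
}
```

Calling clocks_per_iteration(120.0, 100000.0, 120.0) gives about 144 for the FPU run; the same call with 500.0 ms gives about 600 for the software run.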


632 Views
mon3al
Contributor I

Here is the assembly listing....

for (i=0; i<100000; i++)
   a:    f04f 0300     mov.w    r3, #0
   e:    603b          str      r3, [r7, #0]
  10:    e00b          b.n      2a <main+0x2a>
f1 = f1 + 1.76f;
  12:    ed97 7a01     vldr     s14, [r7, #4]
  16:    eddf 7a09     vldr     s15, [pc, #36]    ; 3c <main+0x3c>
  1a:    ee77 7a27     vadd.f32 s15, s14, s15
  1e:    edc7 7a01     vstr     s15, [r7, #4]


632 Views
egoodii
Senior Contributor III

I don't see the 'end of loop' but I assume it is 'right there'. I have to agree that I see the 'proper' list of single-precision floating-point instructions there. Counting clocks per ARM's figures, those instructions should come out to about 10 per loop, so 100K iterations should take only a million clocks, or about 1/120th of a second here (8 milliseconds, maybe 11 with overhead)! I am baffled. We should certainly see the 'order of magnitude' performance increase you were expecting! Unfortunately, I don't have any 'F' Kinetis CPUs myself to play with...
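To put a number on the gap, here is a small C sketch that turns the 10-clocks-per-loop estimate into an expected run time and compares it with the ~120 ms measurement. The 120 MHz clock and the 10-clock figure are both estimates from this thread, and the function names are hypothetical:

```c
#include <assert.h>

/* Expected run time in milliseconds for `iterations` loop passes that
 * cost `clocks_per_iter` core clocks each, at `clock_mhz` MHz. */
double expected_ms(double iterations, double clocks_per_iter, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    double total_clocks = iterations * clocks_per_iter;
    return total_clocks / (clock_mhz * 1000.0); /* MHz * 1000 = clocks per ms */
}

/* How many times slower the measurement is than the estimate. */
double slowdown(double measured_ms, double estimated_ms)
{
    assert(estimated_ms > 0.0);
    return measured_ms / estimated_ms;
}
```

With these inputs, expected_ms(100000.0, 10.0, 120.0) is about 8.3 ms, and slowdown(120.0, 8.3) is roughly 14, which is the size of the discrepancy being puzzled over here.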


632 Views
egoodii
Senior Contributor III

Found some hardware and did my own little 'test', summing an array of 10,000 single-precision floats using RAM code and IAR tools. First, an 'optimized' software loop took 1.7 ms. Then, after manually enabling the FPU (what's up with THAT?), the same add of all elements in the 10,000-word array dropped to 0.5 ms. Not the 'factor of 10' you would dream of, but in such a loop the floating-point operations are now a 'smaller percentage' of the overall instruction count. That holds even with the IAR optimization doing some unrolling, which makes the loop look like this, where only about 1/4 of the instructions are 'VADD' (8 per outer loop):

    for( uint32_t i=sizeof(Farray)/sizeof(float);i>0;i--)
   0x1fff1958: 0x4638         MOV       R0, R7
   0x1fff195a: 0xf240 0x41e2  MOVW      R1, #1250               ; 0x4e2
        accum += Farray[i];
??main_2:
   0x1fff195e: 0x19aa         ADDS      R2, R5, R6
   0x1fff1960: 0xed92 0x0a00  VLDR      S0, [R2]
   0x1fff1964: 0xedd0 0x0a00  VLDR      S1, [R0]
   0x1fff1968: 0xee30 0x0a20  VADD.F32  S0, S0, S1
   0x1fff196c: 0x1f02         SUBS      R2, R0, #4
   0x1fff196e: 0xedd2 0x0a00  VLDR      S1, [R2]
   0x1fff1972: 0xf1a0 0x0208  SUB.W     R2, R0, #8
   0x1fff1976: 0xee30 0x0a20  VADD.F32  S0, S0, S1
   0x1fff197a: 0xedd2 0x0a00  VLDR      S1, [R2]
   0x1fff197e: 0xf1a0 0x020c  SUB.W     R2, R0, #12             ; 0xc
   0x1fff1982: 0xee30 0x0a20  VADD.F32  S0, S0, S1
   0x1fff1986: 0xedd2 0x0a00  VLDR      S1, [R2]
   0x1fff198a: 0xf1a0 0x0210  SUB.W     R2, R0, #16             ; 0x10
   0x1fff198e: 0xee30 0x0a20  VADD.F32  S0, S0, S1
   0x1fff1992: 0xedd2 0x0a00  VLDR      S1, [R2]
   0x1fff1996: 0xf1a0 0x0214  SUB.W     R2, R0, #20             ; 0x14
   0x1fff199a: 0xee30 0x0a20  VADD.F32  S0, S0, S1
   0x1fff199e: 0xedd2 0x0a00  VLDR      S1, [R2]
   0x1fff19a2: 0xf1a0 0x0218  SUB.W     R2, R0, #24             ; 0x18
   0x1fff19a6: 0xee30 0x0a20  VADD.F32  S0, S0, S1
   0x1fff19aa: 0xedd2 0x0a00  VLDR      S1, [R2]
   0x1fff19ae: 0xf1a0 0x021c  SUB.W     R2, R0, #28             ; 0x1c
    for( uint32_t i=sizeof(Farray)/sizeof(float);i>0;i--)
   0x1fff19b2: 0x3820         SUBS      R0, R0, #32             ; 0x20
   0x1fff19b4: 0xee30 0x0a20  VADD.F32  S0, S0, S1
   0x1fff19b8: 0xedd2 0x0a00  VLDR      S1, [R2]
   0x1fff19bc: 0x19aa         ADDS      R2, R5, R6
   0x1fff19be: 0x1e49         SUBS      R1, R1, #1
   0x1fff19c0: 0xee30 0x0a20  VADD.F32  S0, S0, S1
   0x1fff19c4: 0xed82 0x0a00  VSTR      S0, [R2, #0]
    for( uint32_t i=sizeof(Farray)/sizeof(float);i>0;i--)
   0x1fff19c8: 0xd1c9         BNE.N     ??main_2                ; 0x1fff195e

The total instruction count in the loop is 30, and 1250 iterations is 37,500 total instructions. At 120 MHz, that would 'ideally' have taken about 0.3 ms assuming 1 clock each, so I suppose we can 'write off' roughly 40% of the measured 0.5 ms as 'clock overhead' in RAM access and pipeline stalls.

So the 'bottom line' is that a factor of four may indeed be about the improvement you can expect in an overall compute-intensive sequence. What THAT means is that the 'software library' is actually pretty darn good(!), at something like 10 to 15 clocks for the single-precision add.
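The figures in the last two paragraphs can be checked with a few lines of C. This is only a back-of-the-envelope sketch: it assumes a 120 MHz clock, one clock per instruction for the 'ideal' case, and that the whole difference between the software and FPU runs is spent in the 10,000 adds. The function names are invented for illustration.

```c
#include <assert.h>

/* Ideal run time in ms if every instruction took exactly one clock. */
double ideal_ms(double instructions, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return instructions / (clock_mhz * 1000.0);
}

/* Fraction of the measured time not explained by the ideal time. */
double overhead_fraction(double measured_ms, double ideal)
{
    assert(measured_ms > 0.0);
    return (measured_ms - ideal) / measured_ms;
}

/* Extra clocks per add in the software run, charging the whole
 * software-vs-FPU time difference to the adds (a rough assumption). */
double extra_clocks_per_add(double soft_ms, double fpu_ms,
                            double adds, double clock_mhz)
{
    assert(adds > 0.0);
    double extra_us = (soft_ms - fpu_ms) * 1000.0 / adds;
    return extra_us * clock_mhz;
}
```

With 30 x 1250 = 37,500 instructions, ideal_ms gives about 0.31 ms, overhead_fraction(0.5, 0.31) is a little under 0.4, and extra_clocks_per_add(1.7, 0.5, 10000.0, 120.0) comes out near 14, which sits inside the 10 to 15 clock range quoted above.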

Some other benchmarks with the same loop:

32-bit integers: 0.9 ms (???). Must be RAM code-fetch getting in the way??? From ROM: 0.3 ms. SP float from ROM: 2.2 ms.

And, not surprisingly, double-precision float (double) takes 5.5 ms with or without the FPU, so apparently an SP FPU is of 'no help' in double-float math. Double from ROM takes 3.2 ms.
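For anyone wanting to repeat the comparison, here is a sketch of the three summation loops in plain C. Only the array name Farray and the 10,000-element count come from the posts above; the other names, the fill values, and the function layout are assumptions. Note that the sketch indexes with i - 1: with i counting down from the element count to 1, the source line `accum += Farray[i]` shown in the listing would read one element past the end of the array and never touch element 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_ELEMS 10000u  /* element count taken from the post above */

static float   Farray[N_ELEMS];
static double  Darray[N_ELEMS];  /* hypothetical double-precision twin */
static int32_t Iarray[N_ELEMS];  /* hypothetical 32-bit integer twin */

/* Fill the arrays with values whose sums are exactly representable. */
void fill_arrays(void)
{
    for (size_t i = 0; i < N_ELEMS; i++) {
        Farray[i] = 0.5f;
        Darray[i] = 0.5;
        Iarray[i] = 1;
    }
}

/* Count-down loops in the style of the listing, kept in bounds. */
float sum_float(void)
{
    float accum = 0.0f;
    for (size_t i = N_ELEMS; i > 0; i--)
        accum += Farray[i - 1];
    return accum;
}

double sum_double(void)
{
    double accum = 0.0;
    for (size_t i = N_ELEMS; i > 0; i--)
        accum += Darray[i - 1];
    return accum;
}

int32_t sum_int32(void)
{
    int32_t accum = 0;
    for (size_t i = N_ELEMS; i > 0; i--)
        accum += Iarray[i - 1];
    return accum;
}
```

On the target, each call would be bracketed by reads of a hardware timer or cycle counter, and the code and data placement (RAM versus ROM/flash) would need to be controlled, since the numbers above suggest it matters as much as the FPU does.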


632 Views
bowerymarc
Contributor V

> after manually enabling the FPU (what's up with THAT?)

Try adding this define to your compiler preprocessor defines:

__VFPV4__

I was having problems compiling <cmath> and other issues, and doing that seemed to solve it... and it looks like __fp_init() is defined with that symbol in __arm_eabi_init.c
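Whatever the toolchain's startup code does, 'manually enabling the FPU' on a Cortex-M4F comes down to granting access to coprocessors CP10 and CP11 in the CPACR register before the first floating-point instruction runs. A minimal sketch in C follows. The function names are hypothetical, and this is not a claim about what __fp_init() actually contains. The register is passed in as a pointer so the bit manipulation can be tried off-target.

```c
#include <assert.h>
#include <stdint.h>

/* CPACR (Coprocessor Access Control Register) address on ARMv7-M. */
#define CPACR_ADDR            0xE000ED88u
/* CP10 and CP11 access fields, bits [23:20]; 0b11 in each = full access. */
#define CPACR_CP10_CP11_FULL  (0xFu << 20)

/* Grant full access to the FPU by setting CP10/CP11 in the given CPACR.
 * On hardware, pass (volatile uint32_t *)CPACR_ADDR. */
void fpu_enable(volatile uint32_t *cpacr)
{
    assert(cpacr != 0);
    *cpacr |= CPACR_CP10_CP11_FULL;
    /* On target, issue DSB and ISB barriers here before any FP instruction. */
}

/* Returns nonzero if both CP10 and CP11 are set to full access. */
int fpu_is_enabled(const volatile uint32_t *cpacr)
{
    return (*cpacr & CPACR_CP10_CP11_FULL) == CPACR_CP10_CP11_FULL;
}
```

If code built for the hard-float ABI executes a VFP instruction while these bits are still clear, the core faults instead of running it, so a check like fpu_is_enabled() early in main() is a cheap sanity test when benchmark numbers look odd.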
