Found some hardware and did my own little 'test': summing an array of 10,000 single-precision floats, using RAM code and the IAR tools. First, an 'optimized' software-library loop took 1.7 ms. Then, after manually enabling the FPU (what's up with THAT?), the same add of all 10,000 elements dropped to 0.5 ms. Not the 'factor of 10' you might dream of, but in such a loop the floating-point operations are now a 'smaller percentage' of the overall instruction count -- even with IAR's unrolling optimization, which makes the loop look like this, where only about 1/4 of the instructions are VADD (8 per outer iteration):
for( uint32_t i=sizeof(Farray)/sizeof(float);i>0;i--)
0x1fff1958: 0x4638 MOV R0, R7
0x1fff195a: 0xf240 0x41e2 MOVW R1, #1250 ; 0x4e2
accum += Farray[i];
??main_2:
0x1fff195e: 0x19aa ADDS R2, R5, R6
0x1fff1960: 0xed92 0x0a00 VLDR S0, [R2]
0x1fff1964: 0xedd0 0x0a00 VLDR S1, [R0]
0x1fff1968: 0xee30 0x0a20 VADD.F32 S0, S0, S1
0x1fff196c: 0x1f02 SUBS R2, R0, #4
0x1fff196e: 0xedd2 0x0a00 VLDR S1, [R2]
0x1fff1972: 0xf1a0 0x0208 SUB.W R2, R0, #8
0x1fff1976: 0xee30 0x0a20 VADD.F32 S0, S0, S1
0x1fff197a: 0xedd2 0x0a00 VLDR S1, [R2]
0x1fff197e: 0xf1a0 0x020c SUB.W R2, R0, #12 ; 0xc
0x1fff1982: 0xee30 0x0a20 VADD.F32 S0, S0, S1
0x1fff1986: 0xedd2 0x0a00 VLDR S1, [R2]
0x1fff198a: 0xf1a0 0x0210 SUB.W R2, R0, #16 ; 0x10
0x1fff198e: 0xee30 0x0a20 VADD.F32 S0, S0, S1
0x1fff1992: 0xedd2 0x0a00 VLDR S1, [R2]
0x1fff1996: 0xf1a0 0x0214 SUB.W R2, R0, #20 ; 0x14
0x1fff199a: 0xee30 0x0a20 VADD.F32 S0, S0, S1
0x1fff199e: 0xedd2 0x0a00 VLDR S1, [R2]
0x1fff19a2: 0xf1a0 0x0218 SUB.W R2, R0, #24 ; 0x18
0x1fff19a6: 0xee30 0x0a20 VADD.F32 S0, S0, S1
0x1fff19aa: 0xedd2 0x0a00 VLDR S1, [R2]
0x1fff19ae: 0xf1a0 0x021c SUB.W R2, R0, #28 ; 0x1c
for( uint32_t i=sizeof(Farray)/sizeof(float);i>0;i--)
0x1fff19b2: 0x3820 SUBS R0, R0, #32 ; 0x20
0x1fff19b4: 0xee30 0x0a20 VADD.F32 S0, S0, S1
0x1fff19b8: 0xedd2 0x0a00 VLDR S1, [R2]
0x1fff19bc: 0x19aa ADDS R2, R5, R6
0x1fff19be: 0x1e49 SUBS R1, R1, #1
0x1fff19c0: 0xee30 0x0a20 VADD.F32 S0, S0, S1
0x1fff19c4: 0xed82 0x0a00 VSTR S0, [R2, #0]
for( uint32_t i=sizeof(Farray)/sizeof(float);i>0;i--)
0x1fff19c8: 0xd1c9 BNE.N ??main_2 ; 0x1fff195e
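For anyone wondering about the 'manually enabling the FPU' part: on a Cortex-M4F the FPU is disabled out of reset, and you turn it on by granting access to coprocessors CP10/CP11 in CPACR. Below is a minimal sketch of that plus the kind of timing harness I mean, assuming a CMSIS device header; the header name, FARRAY_LEN, and the DWT-based timing are my illustration, not a copy of the actual test code.

#include <stdint.h>
#include "device.h"               /* placeholder: your CMSIS device header */

#define FARRAY_LEN 10000u
static float Farray[FARRAY_LEN];

static void fpu_enable(void)
{
    /* Grant full access to CP10/CP11 (the single-precision FPU), then make
       sure the change takes effect before any FPU instruction executes. */
    SCB->CPACR |= (0xFu << 20);
    __DSB();
    __ISB();
}

static uint32_t sum_cycles(float *accum_out)
{
    float accum = 0.0f;

    /* DWT cycle counter: cycles / 120 MHz gives the times quoted above. */
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    DWT->CYCCNT = 0u;
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;

    uint32_t start = DWT->CYCCNT;
    for (uint32_t i = 0u; i < FARRAY_LEN; i++)
        accum += Farray[i];
    uint32_t cycles = DWT->CYCCNT - start;

    *accum_out = accum;           /* keep the result live so the loop isn't optimized away */
    return cycles;
}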
The loop body is 30 instructions, and 1250 iterations makes 37,500 instructions in total. At 120 MHz that would 'ideally' take about 0.31 ms at one clock each, so roughly 40% of the measured 0.5 ms gets 'written off' as clock overhead from RAM code fetch and pipeline stalls.
So the 'bottom line' is that a factor of three-to-four (1.7 ms down to 0.5 ms here) is about the improvement you can expect over an overall compute-intensive sequence -- and what THAT means is that the 'software library' is actually pretty darn good(!) -- something like 10 to 15 clocks for a single-precision add.
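Showing my arithmetic for the 40% and the 10-to-15-clock figures -- nothing but the measured numbers from above, runnable anywhere:

#include <stdio.h>

int main(void)
{
    const double f_clk   = 120e6;                /* core clock, Hz */
    const double t_ideal = 37500.0 / f_clk;      /* 30 instr * 1250 iters at 1 clock each */
    const double t_fpu   = 0.5e-3;               /* measured FPU sum, RAM code */
    const double t_soft  = 1.7e-3;               /* measured software-library sum, RAM code */
    const double n_adds  = 10000.0;

    printf("ideal loop time      = %.2f ms\n", t_ideal * 1e3);      /* ~0.31 ms */
    printf("overhead fraction    = %.0f %%\n",
           (t_fpu - t_ideal) / t_fpu * 100.0);                      /* ~38 %, the '40%' above */
    printf("extra clocks per add = %.1f\n",
           (t_soft - t_fpu) * f_clk / n_adds);                      /* ~14 clocks per software add */
    return 0;
}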
Some other benchmarks with the same loop:
32-bit integers: 0.9 ms (???) -- must be the RAM code fetch getting in the way??? From ROM = 0.3 ms. SP float from ROM = 2.2 ms.
And not surprisingly, double-precision float (double) takes 5.5 ms with or without the FPU, so apparently an SP-only FPU is of 'no help' in double-float math. Double from ROM takes 3.2 ms.
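One practical note to go with that (my own aside, not something I measured here): to actually get the single-precision FPU used, the C source has to stay in float end-to-end; unsuffixed literals and the double-flavored math functions silently promote to double and land you right back in the software routines. A hypothetical example of the kind of thing to watch for:

#include <math.h>
#include <stdint.h>

/* Hypothetical helper, just to illustrate keeping everything single precision. */
float rms_f32(const float *x, uint32_t n)
{
    float sum = 0.0f;                 /* '0.0f', not '0.0' -- a bare 0.0 is a double literal */
    for (uint32_t i = 0u; i < n; i++)
        sum += x[i] * x[i];           /* all-float expression, so the SP FPU can do the work */
    return sqrtf(sum / (float)n);     /* sqrtf(), not sqrt() -- sqrt() works in double */
}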