There are two very slow cases for floating-point add/sub: 1) when one addend is about 2^(mantissa_bits-1) times larger than the other; 2) when you subtract two very close numbers, so that the result is about 2^(mantissa_bits-1) times smaller than the addends, but not zero. Case 1 wastes cycles denormalizing (shifting right) the smaller addend. Case 2 wastes cycles renormalizing (shifting left) the result. For example, in single precision 1.0 + 0.0000001 should be almost the slowest case. 1.0 - 0.9999999 should also be very slow. 1.0 + 1.0 should be almost the fastest case. Adding zero or a very small number (more than 2^(mantissa_bits-1) times smaller than the other addend) should also be fast, and could be even faster than 1.0 + 1.0.
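To pick test operands for the two slow cases, it can help to estimate how many bit positions the add/sub routine has to shift. The following is only a rough sketch: it assumes `float` is IEEE 754 single precision and that all values are normal (non-zero, not subnormal), and the helper names `biased_exponent`, `alignment_shift` and `normalization_shift` are made up for illustration. How a particular FP library actually shifts is not something this code can tell you.

```c
#include <stdint.h>
#include <string.h>

/* Biased exponent field (bits 30..23) of an IEEE 754 single-precision value.
   Assumption: float is a 32-bit IEEE 754 type. */
static int biased_exponent(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret the bits without aliasing problems */
    return (int)((bits >> 23) & 0xFFu);
}

/* Case 1: right shift needed to align the smaller addend with the larger one. */
static int alignment_shift(float a, float b)
{
    int d = biased_exponent(a) - biased_exponent(b);
    return d < 0 ? -d : d;
}

/* Case 2: left shift needed to renormalize the result of a - b.
   Returns 0 when the result is exactly zero. */
static int normalization_shift(float a, float b)
{
    float r = a - b;
    int ea = biased_exponent(a);
    int eb = biased_exponent(b);
    int emax = ea > eb ? ea : eb;

    if (r == 0.0f) {
        return 0;
    }
    return emax - biased_exponent(r);
}
```

With these helpers, `alignment_shift(1.0f, 0.0000001f)` gives 24, `alignment_shift(1.0f, 1.0f)` gives 0, and `normalization_shift(1.0f, 0.9999999f)` gives 23, which matches the slow and fast examples above.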
The cycles taken by FP mul should depend very little on the arguments. Zero operands and overflow are special cases and should be faster than usual.
With the above in mind, simply use a hardware timer ticking at the bus clock rate or at a prescaled bus clock. Read the timer counter before the FP add/mul and again after it, and take the difference of the two readings. That gives you the number of bus clock cycles spent, and lets you characterise not only FP add/mul but also your other custom routines. Good luck.
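The timer-based measurement described above could look roughly like the sketch below. `read_timer()` is a hypothetical placeholder: on a real MCU it would return the free-running timer count register, whose name depends on the part. Here it is backed by the standard `clock()` only so the code compiles on a PC. The 16-bit counter width and the names `elapsed_ticks` and `time_fp_add` are likewise assumptions.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical timer read. On the target, replace the body with a read of
   the free-running timer counter register; clock() is a host stand-in. */
static uint16_t read_timer(void)
{
    return (uint16_t)clock();
}

/* Ticks between two readings of a 16-bit up-counter. The cast back to
   uint16_t makes the result correct across a single counter wrap-around. */
static uint16_t elapsed_ticks(uint16_t start, uint16_t stop)
{
    return (uint16_t)(stop - start);
}

/* volatile keeps the compiler from folding the addition at compile time
   or moving it outside the two timer reads. */
static volatile float fp_a, fp_b, fp_result;

/* Time one floating-point addition, in timer ticks. */
static uint16_t time_fp_add(float a, float b)
{
    uint16_t start, stop;

    fp_a = a;
    fp_b = b;

    start = read_timer();
    fp_result = fp_a + fp_b;   /* operation under test */
    stop = read_timer();

    return elapsed_ticks(start, stop);
}
```

Calling `time_fp_add(1.0f, 1.0f)` and `time_fp_add(1.0f, 0.0000001f)` and comparing the returned ticks should show the fast and slow cases. The reading also includes the cost of the timer reads and of the loads and stores around the operation, so it may be worth timing an empty start/stop pair once and subtracting that. Swapping `+` for `*`, or for a call to your own routine, measures those the same way.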