Consider the simple calculation:

result = (A/B)*scaleFactor

where all quantities are unsigned, 32-bit integers.

A/B is strictly less than 1 (so the integer division yields 0), *but*

A*scaleFactor could be larger than 2^32.

Therefore, it seems I have two options:

result = ((float)(A)/B)*scaleFactor

*or*

result = ((uint64_t)(A)*scaleFactor)/B

Which sequence of operations is faster on the K64?

My guess was that your first version, with the single-precision float operation, will be faster: the ARM Cortex-M4F can use its FPU for this, while your 64-bit version requires a runtime library routine (for the 64-bit division), which is very likely slower.

It all depends to some extent on the compiler optimizations and the values of the data.
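For reference, the two variants can be wrapped in functions along these lines. This is only a minimal sketch: the names `scale_float` and `scale_u64` are hypothetical and not taken from any test project. Keep in mind that a single-precision float carries a 24-bit significand, so for large inputs the float variant can differ from the exact integer result.

```c
#include <stdint.h>

/* Option 1: divide in single precision, then scale.
 * All operands are float, so a Cortex-M4F can keep this in the FPU. */
static inline uint32_t scale_float(uint32_t a, uint32_t b, uint32_t scale)
{
    return (uint32_t)(((float)a / (float)b) * (float)scale);
}

/* Option 2: widen to 64 bits so a * scale cannot overflow,
 * then do a 64-bit integer division. */
static inline uint32_t scale_u64(uint32_t a, uint32_t b, uint32_t scale)
{
    return (uint32_t)(((uint64_t)a * scale) / b);
}
```

With A = 3000000000, B = 4000000000 and scaleFactor = 4, the product A*scaleFactor does not fit in 32 bits, yet both functions return 3.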

To verify this, I ran a quick test:

The MCUXpresso IDE has a nice feature to count the CPU cycles spent (see "Measuring ARM Cortex-M CPU Cycles Spent with the MCUXpresso Eclipse Registers View" on MCU on Eclipse), based on the ARM cycle counter (see "Cycle Counting on ARM Cortex-M with DWT" on MCU on Eclipse).

So I quickly measured the calls to the two functions (including the call overhead) and got the following cycle counts:
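The DWT cycle counter can also be read from code. Below is a minimal sketch, assuming direct register access on an ARMv7-M core such as the Cortex-M4. The function names are hypothetical. The register addresses are the architectural ones, which the CMSIS headers expose as `CoreDebug->DEMCR`, `DWT->CTRL` and `DWT->CYCCNT`.

```c
#include <stdint.h>

/* ARMv7-M debug registers; addresses are fixed by the architecture. */
#define DEMCR       (*(volatile uint32_t *)(uintptr_t)0xE000EDFCu)
#define DWT_CTRL    (*(volatile uint32_t *)(uintptr_t)0xE0001000u)
#define DWT_CYCCNT  (*(volatile uint32_t *)(uintptr_t)0xE0001004u)

#define DEMCR_TRCENA        (1u << 24)  /* enables the DWT unit */
#define DWT_CTRL_CYCCNTENA  (1u << 0)   /* enables the cycle counter */

/* Turn on the DWT unit, reset the counter and start counting. */
static inline void cycle_counter_enable(void)
{
    DEMCR |= DEMCR_TRCENA;
    DWT_CYCCNT = 0u;
    DWT_CTRL |= DWT_CTRL_CYCCNTENA;
}

/* Read the current 32-bit cycle count. */
static inline uint32_t cycle_counter_read(void)
{
    return DWT_CYCCNT;
}

/* Unsigned subtraction stays correct across one counter wrap-around. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Cycles spent in fn(), including call and counter-read overhead. */
static inline uint32_t measure_cycles(void (*fn)(void))
{
    uint32_t start = cycle_counter_read();
    fn();
    return cycles_elapsed(start, cycle_counter_read());
}
```

On the target, call `cycle_counter_enable()` once, then pass a small wrapper around the code under test to `measure_cycles()`.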

Test1: 0x47 (71) cycles

Test2: 0xCC (204) cycles

I hope this helps,

Erich