RESTART: Which is faster: floating point division or casting to 64-bit int?

myke_predko · ‎08-24-2021

I'm requesting that this thread be restarted because every post to it results in "This widget could not be displayed."

Could you please repost the original question and replies?

To be honest, this is more for intellectual curiosity than anything else.

Thanx!

ErichStyger · ‎08-24-2021

I'm not able to access the original post and answers, and I'm getting the same error message.

As for the original question: that would depend on the core used, and on whether 'floating point' is double or single precision. If an FPU is available, my take is that using the FPU will be the fastest method anyway, but to be sure you would have to look at the generated instructions and measure them, for example with the cycle counter (https://mcuoneclipse.com/2018/06/28/measuring-arm-cortex-m-cpu-cycles-spent-with-the-mcuxpresso-ecli... ).
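
One common way to count cycles on a Cortex-M3/M4 is the DWT cycle counter, sketched below. This is not necessarily the method described in the linked article. The register addresses are the architectural ones for ARMv7-M; `cycles_init`, `cycles_now`, `divide_f32` and `measure_divide` are hypothetical names, and the host fallback only exists so the sketch builds off-target (it returns `clock()` ticks, not CPU cycles).

```c
#include <stdint.h>

#if defined(__arm__) && defined(__ARM_ARCH_7EM__)
/* ARMv7-M debug registers (Cortex-M4/M7). */
#define DEMCR      (*(volatile uint32_t *)0xE000EDFCu)
#define DWT_CTRL   (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004u)

void cycles_init(void)
{
    DEMCR     |= (1u << 24);  /* TRCENA: enable the DWT unit */
    DWT_CYCCNT = 0u;          /* reset the counter */
    DWT_CTRL  |= 1u;          /* CYCCNTENA: start counting */
}

uint32_t cycles_now(void) { return DWT_CYCCNT; }
#else
/* Host fallback (assumption): clock() ticks instead of CPU cycles. */
#include <time.h>

void cycles_init(void) { }

uint32_t cycles_now(void) { return (uint32_t)clock(); }
#endif

/* Hypothetical operation under test: one single-precision division. */
float divide_f32(float a, float b) { return a / b; }

/* Returns the counter difference around one call; the result is written
   through a volatile pointer so the compiler cannot drop the division. */
uint32_t measure_divide(float a, float b, volatile float *out)
{
    uint32_t start = cycles_now();
    *out = divide_f32(a, b);
    return cycles_now() - start;  /* unsigned subtraction survives one wrap */
}
```

Call `cycles_init()` once at startup, then wrap each candidate (the division, the cast) in the same way. The count includes the overhead of the call and the two counter reads, so it is best compared between candidates rather than read as an absolute figure.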

myke_predko · ‎08-25-2021

Hi @ErichStyger

The reason I asked is that (according to the subject) it was comparing floating point division to 64-bit integer division.

I'm curious to see what the comments were.

danielchen · ‎08-25-2021

The original post is:

Regards

Daniel

myke_predko · ‎08-25-2021

Thank you @danielchen

These questions are interesting because they're multi-dimensional and dependent on how much control you want over the process.

I just did some research and the answer isn't obvious - I would probably recommend building a test application and trying out different methods to find out which method is fastest as well as the most accurate (see below).

The approach *I* would try, after characterizing (timing) the two examples you listed, would be to break "scaleFactor" into high and low 32-bit parts, do the floating point multiplication on each part and, finally, add the products together after they're converted from floats to 64-bit integers.

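/* Split the 64-bit scale factor into 32-bit halves and convert each to float. */
float    scaleFactorHigh = (float)(scaleFactor >> 32);
float    scaleFactorLow  = (float)(scaleFactor & 0xFFFFFFFF);  /* low 32 bits */
/* Convert before dividing so that integer A and B are not divided as integers. */
float    abRatio         = (float)A / (float)B;
/* Scale each half, convert to 64-bit integers and recombine. */
uint64_t result          = ((uint64_t)(scaleFactorHigh * abRatio) << 32) +
                            (uint64_t)(scaleFactorLow  * abRatio);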

The big issue that I can see with this approach is which version of the M4 "VCVT" instruction (convert float to int) the compiler uses. Plain "VCVT" truncates the product (rounds toward zero), while "VCVTR" rounds it using the rounding mode set in the FPSCR (which is what you want in this case). Truncation could introduce an error. I have made all types float until calculating "result" because there seems to be an inefficiency in multiplying floats and integers together.
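
The truncate-versus-round difference can be seen in plain C, independent of which instruction the compiler picks: a C cast from float to an integer type always discards the fraction. The sketch below contrasts the cast with a simple add-one-half rounding. The function names are hypothetical, and the rounding helper is only an illustration for non-negative values that fit in the target type; it is not a fully correct round-to-nearest for every input.

```c
#include <stdint.h>

/* A C cast from float to integer discards the fractional part. */
uint64_t to_u64_truncate(float x)
{
    return (uint64_t)x;
}

/* Simple rounding for non-negative x: add one half, then truncate.
   Assumption: x >= 0 and the result fits in 64 bits. */
uint64_t to_u64_round(float x)
{
    return (uint64_t)(x + 0.5f);
}
```

For example, 2.75f becomes 2 with the cast and 3 with the rounding helper. Because the cast is defined to truncate, getting rounded results from C means asking for them explicitly, either as above or with the standard `lrintf`/`llroundf` family from `<math.h>`.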

If speed were of the absolute essence, along with absolutely accurate values, I would write the above statements in assembler, making sure that the errors that occur in the conversion from integers to floats are minimized and that no instructions that are costly in clock cycles are used.
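
The accuracy half of such a test application could start from something like the sketch below, which wraps the split approach in a function and compares it against a double-precision reference. All three function names are hypothetical. It assumes A/B is at most 1 so that neither product overflows, and the reference is itself only exact while the result stays below 2^53.

```c
#include <stdint.h>

/* Split approach from the post: scale each 32-bit half in single precision. */
uint64_t scale_split_f32(uint64_t scaleFactor, float a, float b)
{
    float high  = (float)(scaleFactor >> 32);
    float low   = (float)(scaleFactor & 0xFFFFFFFFu);
    float ratio = a / b;
    return ((uint64_t)(high * ratio) << 32) + (uint64_t)(low * ratio);
}

/* Reference (assumption): the same scaling done in double precision. */
uint64_t scale_ref_f64(uint64_t scaleFactor, double a, double b)
{
    return (uint64_t)((double)scaleFactor * (a / b));
}

/* Absolute difference between the two results. */
uint64_t abs_diff_u64(uint64_t x, uint64_t y)
{
    return (x > y) ? (x - y) : (y - x);
}
```

With scaleFactor = 0x200000000 and A/B = 0.5 both functions return 4294967296. With scaleFactor = 0x100000000 and the same ratio, the high product is 0.5, which the cast truncates to 0 before the shift, so the split version returns 0 while the reference returns 2147483648. Any fraction left in the high product is lost in this way, which is worth checking before timing the method.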