
Coldfire DIVU, DIVS, REMU, REMS Execution Time

Question asked by TomE on Apr 5, 2017
Latest reply on Jul 13, 2017 by TomE

This is a question that should have been asked and answered a decade or more ago. If I'm lucky, someone has already measured this properly.


With embedded processors it is always a good idea to avoid the "divide" instructions if you need performance. They're necessarily slow [1].


I was trying to speed some code up on an MCF5235, and noticed a whole lot of divide operations. So I optimised the code to reduce the number of divides - and the code got SLOWER as a result. So maybe they're not as slow as I thought.


Time to hit the Reference Manuals. But like almost every time I do this, it seems the answer I find depends on WHICH manual I read. I don't know if the differences I'm finding are due to actual differences in the CPU implementation, or whether these are cut-and-paste problems with the manuals.


I've found three slightly different answers to these questions:


"Version 3 Coldfire Core User's Manual", "MCF5485RM.pdf"

DIVS.W, DIVU.W: 20 clocks. DIVS.L, DIVU.L, REMS.L, REMU.L: 35 clocks.


"MCF5329RM.pdf", "MCF54455RM.pdf", "MCF52277RM.pdf"

DIVS.W, DIVU.W: 20 clocks. DIVS.L, DIVU.L, REMS.L, REMU.L: <=35 clocks.


"MCF5235RM.pdf", and ditto for MCF5271, 13, 70, 71, 75, 82

DIVS.W[1], DIVU.W[1]: 20 clocks. DIVS.L[1], DIVU.L[1], REMS.L[1], REMU.L[1]: <=35 clocks.

Note 1: For divide and remainder instructions the times listed represent the worst-case timing.
Depending on the operand values, the actual execution time may be less.


The last one should have "<=20" as well as "<=35" to be consistent, which makes me think the above differences are just due to edits in the different versions of the manuals and not in the actual chips. Or not. I don't know.


Does anyone know the dependency on the "operand values"?
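
To make the question concrete, here is a toy model of one way an operand dependency could arise. This is purely an assumption for illustration: it is a plain restoring (shift-subtract) divider that skips the leading zero bits of the dividend, and it is NOT taken from any ColdFire manual. The names div_model_u32 and div_model_result are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Result of the toy divider: quotient, remainder, and how many
 * shift-subtract steps the model needed. */
typedef struct {
    uint32_t quotient;
    uint32_t remainder;
    unsigned steps;
} div_model_result;

/* Hypothetical model, NOT the documented ColdFire algorithm:
 * a restoring divider that skips the leading zero bits of the
 * dividend, so small dividends finish in fewer steps. */
div_model_result div_model_u32(uint32_t dividend, uint32_t divisor)
{
    assert(divisor != 0);  /* the model does not handle divide-by-zero */

    div_model_result r = {0, 0, 0};

    /* Count the significant bits in the dividend. */
    unsigned bits = 0;
    for (uint32_t t = dividend; t != 0; t >>= 1)
        bits++;

    /* One step per significant dividend bit, most significant first.
     * The partial remainder is 64-bit so the shift cannot overflow. */
    uint64_t rem = 0;
    for (unsigned i = bits; i-- > 0; ) {
        rem = (rem << 1) | ((dividend >> i) & 1u);
        r.quotient <<= 1;
        if (rem >= divisor) {
            rem -= divisor;
            r.quotient |= 1u;
        }
        r.steps++;
    }
    r.remainder = (uint32_t)rem;
    return r;
}
```

In this model "1 / 1" takes a single step and 0xFFFFFFFF / 3 takes 32, which is the kind of spread a "<=35 clocks" figure would allow for. Whether the real divider behaves anything like this is exactly what I don't know.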


Has anyone actually MEASURED the execution times of these instructions with different "operand values" to see which ones cause any "early exit"? I'm guessing "1 / 1" might run faster than one that has lots more bits, but which ones?
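
For anyone wanting to try, this is roughly the kind of measurement I have in mind: a minimal sketch, not something I have run on the MCF5235. It times a loop of divides and a matching loop without the divide, so the loop overhead can be subtracted. read_ticks(), time_divides() and div_timing are names invented for this sketch; read_ticks() uses clock() so it builds on a host, and on the target it would have to be replaced by a read of a free-running hardware timer.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical tick source. On a host this wraps clock(); on the
 * target it would read a free-running hardware timer instead. */
static uint32_t read_ticks(void)
{
    return (uint32_t)clock();
}

/* Raw timings for one operand pair. */
typedef struct {
    uint32_t with_divide;   /* ticks for the loop containing the divide */
    uint32_t baseline;      /* ticks for the same loop without a divide */
    uint32_t last_quotient; /* sanity check that the divide really ran  */
} div_timing;

/* Time `iterations` unsigned divides of num/den.
 * volatile stops the compiler folding or hoisting the divide. */
div_timing time_divides(uint32_t num, uint32_t den, uint32_t iterations)
{
    assert(den != 0 && iterations > 0);

    volatile uint32_t n = num, d = den, sink = 0;
    div_timing r;

    uint32_t t0 = read_ticks();
    for (uint32_t i = 0; i < iterations; i++)
        sink = n / d;            /* loop with the divide */
    uint32_t t1 = read_ticks();
    r.last_quotient = sink;

    for (uint32_t i = 0; i < iterations; i++)
        sink = n ^ d;            /* same loads and store, no divide */
    uint32_t t2 = read_ticks();

    /* Unsigned subtraction copes with a single timer wrap. */
    r.with_divide = t1 - t0;
    r.baseline = t2 - t1;
    return r;
}
```

The per-divide cost would then be (with_divide - baseline) / iterations, converted from timer ticks to CPU clocks, and repeated for operand pairs such as 1 / 1 and 0xFFFFFFFF / 3. Interrupts and cache state would need to be controlled for the numbers to mean anything.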


Note [1]: Those of us who grew up with the MC68000 were particularly sensitive to this. The MC68000, running between 8 MHz and a maximum of 20 MHz, took UP TO 158 clock cycles to calculate DIVS. MUL took up to 70 clocks, so we (and good compilers) avoided that as well. The Coldfire is way faster than that (fewer cycles and a faster clock).
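
To put those clock counts into wall-clock terms, here is a small helper (cycles_to_us is just a name made up for this sketch) that converts a cycle count at a given clock rate into microseconds.

```c
#include <assert.h>

/* Convert a cycle count at a given clock frequency into microseconds.
 * A clock of N MHz delivers N cycles per microsecond. */
double cycles_to_us(unsigned cycles, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return (double)cycles / clock_mhz;
}
```

With the figures above, cycles_to_us(158, 8.0) gives 19.75 us for a worst-case DIVS on an 8 MHz MC68000, and cycles_to_us(158, 20.0) gives 7.9 us at 20 MHz. For the Coldfire, cycles_to_us(35, 150.0) comes to roughly 0.23 us, where the 150 MHz is an assumed example clock and not a number taken from the manuals quoted here.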


The above code is interesting. It shows how complicated the timing can be depending on the operands. It says the Divide operations take 10 clocks for overflow and 76-136 for DIVU, and 18 for overflow and 122-156 for DIVS. That's interesting because it flatly contradicts the MC68000 Reference Manual's assertion that "The divide algorithm used by the MC68000 provides less than 10% difference between the best- and worst-case timings".