Coldfire DIVU, DIVS, REMU, REMS Execution Time


TomE
Specialist II

This is a question that should have been asked and answered a decade or more ago. If I'm lucky, someone has already measured this properly.

With embedded processors it is always a good idea to avoid the "divide" instructions if you need performance. They're necessarily slow [1].

I was trying to speed some code up on an MCF5235, and noticed a whole lot of divide operations. So I optimised the code to reduce the number - and the code got SLOWER as a result. So maybe they're not as slow as I thought.

Time to hit the Reference Manuals. But like almost every time I do this, it seems the answer I find depends on WHICH manual I read. I don't know if the differences I'm finding are due to actual differences in the CPU implementation, or whether these are cut-and-paste problems with the manuals.

I've found three slightly different answers to these questions:

"Version 3 Coldfire Core User's Manual", "MCF5485RM.pdf"

DIVS.W, DIVU.W: 20 clocks. DIVS.L, DIVU.L, REMS.L, REMU.L: 35 clocks.

"MCF5329RM.pdf", "MCF54455RM.pdf", "MCF52277RM.pdf"

DIVS.W, DIVU.W: 20 clocks. DIVS.L, DIVU.L, REMS.L, REMU.L: <=35 clocks.

"MCF5235RM.pdf", and ditto for MCF5271, 13, 70, 71, 75, 82

DIVS.W[1], DIVU.W[1]: 20 clocks. DIVS.L[1], DIVU.L[1], REMS.L[1], REMU.L[1]: <=35 clocks.

Note 1: For divide and remainder instructions the times listed represent the worst-case timing.
Depending on the operand values, the actual execution time may be less.

The latter one should have "<=20" as well as "<=35" to be consistent, which makes me think the above differences are just due to edits in the different versions of the manuals and not in the actual chips. Or not. I don't know.

Does anyone know the dependency on the "operand values"?

Has anyone actually MEASURED the execution times of these instructions with different "operand values" to see which ones cause any "early exit"? I'm guessing "1 / 1" might run faster than one that has lots more bits, but which ones?

[1] Those of us who grew up with the MC68000 were particularly sensitive to this. The MC68000, running between 8MHz and a maximum of 20MHz, took UP TO 158 clock cycles to calculate DIVS. MUL took up to 70 clocks, so we (and good compilers) avoided that as well. The Coldfire is way faster than that (fewer cycles and a faster clock).

http://www.tomshardware.com/forum/16223-13-68000-divu-divs-cycle-accurate-timing

The above code is interesting. It shows how complicated the timing can be depending on the operands. It says DIVU takes 10 clocks on overflow and 76-136 otherwise, and DIVS takes 18 clocks on overflow and 122-156 otherwise. That's interesting because it flatly contradicts the MC68000 Reference Manual's assertion that "The divide algorithm used by the MC68000 provides less than 10% difference between the best- and worst-case timings".

Tom

3 Replies

fangli
NXP Employee

Hi, Tom

I am sorry that I am not familiar with the MC68xxx family. From what I know the DIV.l execution timing for ColdFire parts takes up to 35 or 38 clocks depending on the addressing mode. The divide instruction is a special case where it was quoted the worst case timing.


TomE
Specialist II

> I am sorry that I am not familiar with the MC68xxx family.

I was not asking about the 68000. I was referring to it as an example of where the Reference Manual was a lot better written than the ColdFire ones. Firstly it consistently and correctly documents THAT the instructions have "worst case execution timings". Secondly it DETAILS the exact timing for the Multiply instruction. What you learn about the ColdFire depends on WHICH manual you read!  If you read the 5485 manual you'd believe the timings are fixed. For the 5329 you'd believe the "W" ones are fixed and the "L" ones are variable. The 5235 manual is the only one that tells you the "truth". None of them give details like the older manuals used to.

> The divide instruction is a special case where it was quoted the worst case timing.

That is exactly what I said in my post.

NXP must know the actual data-dependent execution times, like they documented for the 68000 Multiply. This shouldn't be a "secret".

For anybody interested in CPU architecture, I would recommend reading "Table 8-4. Standard Instruction Execution Times" in "MC68000UM.pdf" as it gives very good detail on how long the Multiply instructions take: "38 + 2N Clocks", where "N" is the number of "ones" in the source operand for "MULU" and the number of "01" or "10" pairs for "MULS". That's the sort of detail that is useful, and makes you think about what the hardware is doing.
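As a rough illustration of the "38 + 2N" rule quoted above, here is a minimal C sketch that estimates the MC68000 multiply time from the 16-bit source operand. The function names are made up for this example. For MULS it assumes the "01"/"10" pairs are counted over the source with a zero appended below the least significant bit (17 bits in total), which is how the manual's footnote is usually read; check that against Table 8-4 before relying on it. Effective-address time is not included.

```c
#include <stdint.h>

/* Count set bits in a 32-bit value (portable, no compiler builtins). */
static unsigned count_ones(uint32_t v)
{
    unsigned n = 0;
    while (v) {
        v &= v - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* MULU estimate: 38 + 2N clocks, N = number of ones in the source. */
static unsigned mulu_clocks(uint16_t src)
{
    return 38u + 2u * count_ones(src);
}

/* MULS estimate: 38 + 2N clocks, N = number of "01" or "10" pairs.
 * ASSUMPTION: pairs are counted in the 17-bit value formed by
 * appending a zero below the LSB of the source. */
static unsigned muls_clocks(uint16_t src)
{
    uint32_t v = (uint32_t)src << 1;                  /* 17 bits, LSB = 0 */
    uint32_t transitions = (v ^ (v >> 1)) & 0xFFFFu;  /* adjacent bits differ */
    return 38u + 2u * count_ones(transitions);
}
```

Under these assumptions both estimates range from 38 to 70 clocks, which agrees with the "up to 70 clocks" figure mentioned earlier in the thread.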

The link to the "cycle accurate timing" code gives an interesting insight into how complicated something like a Multiply or Divide circuit can be. It also (as stated) says the MC68000 User Manual is lying with its statement on timing, so even it wasn't "perfect".

Tom


TomE
Specialist II

I've been taking some measurements.

"0x55555555 / 1" takes 35 clocks.

"0 / 1" takes 9 clocks.

"0xffff / 1" takes 21 clocks.

"0xffffff / 1" takes 29 clocks.

"0x40000000 / 1" takes 35 clocks.

"0x40000000 / 0x100" takes 34 clocks.

With all of the above tests the compiler was generating "DIVS.L %d3,%d0" instructions.

So 35 clocks max, and the fewer high-order bits set in the numerator, the faster it gets.

So it is slow enough to be worth avoiding if you can (for example, by multiplying by the inverse).
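To make "multiply by inverse" concrete, here is a minimal sketch for one fixed divisor. It divides an unsigned 16-bit value by 10 using a 32-bit multiply and a shift. The function name and the choice of divisor are only for illustration; the constant 52429 is 2^19 / 10 rounded up, and the product always fits in 32 bits for 16-bit inputs. Whether this actually beats DIVU on a given ColdFire would have to be measured.

```c
#include <stdint.h>

/* Divide a 16-bit unsigned value by 10 without a divide instruction.
 * 52429 = ceil(2^19 / 10); the 32-bit product cannot overflow because
 * 65535 * 52429 < 2^32. Exact for every 16-bit input. */
static uint16_t div10_u16(uint16_t x)
{
    return (uint16_t)(((uint32_t)x * 52429u) >> 19);
}
```

This only works when the divisor is known in advance; for a run-time divisor the reciprocal has to be computed once (with a real divide) and reused.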

In the code I was looking at, the divide is called as part of variable scaling in a product that is user-configurable. And in most cases the variable isn't scaled. So I can make these operations faster in my application by changing:

result = num / dem;

to

result = (dem == 1) ? num : num / dem;
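Wrapped up as a helper, the same shortcut might look like the sketch below. The function name and the int32_t types are assumptions for illustration; the original code's types are not shown. Like the plain divide, it does nothing to guard against a zero divisor.

```c
#include <stdint.h>

/* Scale num by dem, skipping the divide in the common unscaled case.
 * Hypothetical helper; the caller must ensure dem != 0. */
static inline int32_t scale_value(int32_t num, int32_t dem)
{
    return (dem == 1) ? num : num / dem;
}
```

The extra compare and branch cost a few clocks on every call, so this only pays off when the unscaled case really is the common one.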

Tom
