Hardware multiply and divide

lpcware · ‎06-15-2016

Content originally posted in LPCWare by JohnR on Tue Aug 10 05:21:31 MST 2010
Hi,

Probably a stupid question, but do the LPC1311/43 series contain a hardware multiplier or divider? There is no mention of one in the product literature, but since the corresponding ARM Cortex-M3 offerings from ST and TI (Luminary Micro) do have multipliers and, in some cases, dividers, I wondered whether these capabilities had been overlooked in the NXP literature.

John.

Original Attachment has been moved to: 1100591_LPC17xx.txt.zip

lpcware · ‎06-15-2016

Content originally posted in LPCWare by CodeRedSupport on Mon Jun 10 02:21:26 MST 2013
Your 0x555 example looks like an optimisation oddity when compiling -Os. If you compile at other optimisation levels (e.g. -O0 or -O2), then a mul instruction will get used.

Regards,
CodeRedSupport

lpcware · ‎06-15-2016

Content originally posted in LPCWare by cyberstudio on Fri Jun 07 16:38:37 MST 2013
Thank you for your reply, technical support.

I used the example you provided and did a lot of experimentation. It seems like the compiler is using the following rules.

1. If the constant fits within 8 bits and at least 2 bits are ones, a muls instruction is issued. This proves the compiler knows about fast multiply.

2. If the constant does not fit within 8 bits and has only 2 ones, shifts and adds are generated. That is only 3 instructions, so it is the same as or cheaper than muls, because the constant cannot be loaded with a single instruction anyway.

So, to generate the "strange" case, the constant must not fit within 8 bits and must have more than 3 ones. Here is such a strange constant:

#define MYCONST 0x555U
int main(int x) {
    return x * MYCONST;
 300:   0043        lsls    r3, r0, #1
 302:   1818        adds    r0, r3, r0
 304:   00c3        lsls    r3, r0, #3
 306:   1a18        subs    r0, r3, r0
 308:   0183        lsls    r3, r0, #6
 30a:   18c0        adds    r0, r0, r3
}
 30c:   4770        bx      lr

I really doubt that a load instruction is so expensive that 6 shift/add instructions are better.
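
For reference, the six-instruction sequence in the listing can be traced in plain C. The sketch below mirrors each shift/add/subtract step and can be checked against x * 0x555. The function name mul_0x555_shift_add is made up for illustration, and unsigned arithmetic is used so that wraparound is well defined.

```c
#include <stdint.h>

/* Hypothetical helper: mirrors the six-instruction listing step by step.
 * r0 and r3 stand in for the registers of the same name. */
uint32_t mul_0x555_shift_add(uint32_t x)
{
    uint32_t r0 = x;
    uint32_t r3;

    r3 = r0 << 1;   /* lsls r3, r0, #1   r3 = 2x             */
    r0 = r3 + r0;   /* adds r0, r3, r0   r0 = 3x             */
    r3 = r0 << 3;   /* lsls r3, r0, #3   r3 = 24x            */
    r0 = r3 - r0;   /* subs r0, r3, r0   r0 = 21x            */
    r3 = r0 << 6;   /* lsls r3, r0, #6   r3 = 1344x          */
    r0 = r0 + r3;   /* adds r0, r0, r3   r0 = 1365x = 0x555x */

    return r0;
}
```

Calling mul_0x555_shift_add(x) should give the same value as x * 0x555U for any 32-bit x, since 3 * 7 * 65 = 1365 = 0x555.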

lpcware · ‎06-15-2016

Content originally posted in LPCWare by CodeRedSupport on Fri Jun 07 13:15:46 MST 2013
I'm not sure why you are complaining. As far as I can tell the compiler shipped in LPCXpresso 5.2.4 will use a MUL instruction for multiply by constants except in very simple cases. But where it uses a shift (plus potentially an addition/subtraction) as far as I can tell this does not increase the cycle count or the code size. Note that compilers for ARM have always historically done multiplies by powers of 2 (+/- 1) in this way.

For example if I compile this simple function...

int mymul (int x) {
return (x * MYCONST);
}

this compiles to

#define MYCONST 42
00000000 <mymul>:
   0:   232a        movs    r3, #42 ; 0x2a
   2:   4358        muls    r0, r3
   4:   4770        bx      lr

or

#define MYCONST 3
00000000 <mymul>:
   0:   0043        lsls    r3, r0, #1
   2:   1818        adds    r0, r3, r0
   4:   4770        bx      lr

If you think there is a problem, then please provide an example.

Regards,
CodeRedSupport
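
To reproduce listings like the ones in this post with different constants, a single source file with an overridable constant is enough. This is a minimal sketch: the file name mymul.c and the #ifndef default are assumptions, and the commands in the comment assume a GNU arm-none-eabi toolchain on the path rather than the LPCXpresso IDE build.

```c
/* mymul.c: hypothetical test file for inspecting multiply-by-constant code.
 *
 * Possible build and disassembly commands (toolchain name assumed):
 *   arm-none-eabi-gcc -mcpu=cortex-m0 -mthumb -Os -DMYCONST=0x555 -c mymul.c
 *   arm-none-eabi-objdump -d mymul.o
 */
#ifndef MYCONST
#define MYCONST 42   /* default used when -DMYCONST=... is not given */
#endif

int mymul(int x)
{
    return x * MYCONST;
}
```

Changing -Os to -O0 or -O2, or changing the value passed with -DMYCONST, makes it easy to compare the generated code for each case.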

lpcware · ‎06-15-2016

Content originally posted in LPCWare by cyberstudio on Thu Jun 06 17:58:25 MST 2013
I am using LPCXpresso 5.2.4 too, and just to make sure we are talking about the same thing,
1, the chip I used was LPC1114 (this discussion does NOT apply to Cortex M3), and,
2, the multiplied constant must be simple, with only 2 or 3 ones in the multiplied constant, otherwise the compiler would reason that MULS is cheaper for multiplying a complicated constant than a series of shifts and adds.

I searched through my disassembly listing: the only time the compiler generated a plain MULS was when both operands were variables rather than constants.

We know the root cause of this is the compiler assumes MULS takes 32 cycles to execute when in reality it takes only one on LPC1114. Even if gcc now supports Cortex M0, it still needs to be told that LPC1114's particular implementation is single cycle not 32, and it seems like no one has ever told the compiler about that, no?
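
The trade-off can be made concrete with a rough cycle count. The sketch below is only a back-of-the-envelope model, not what GCC actually computes: it assumes one cycle per shift/add/subtract and per MOVS, two cycles for an LDR from a literal pool (the usual Cortex-M0 figures), and takes the MULS cost as a parameter. All function names are hypothetical.

```c
#include <stdbool.h>

/* Cycles for "load constant, then MULS" in this simplified model.
 * const_fits_8_bits: constant can be loaded with a 1-cycle MOVS;
 *                    otherwise a 2-cycle LDR from a literal pool is assumed.
 * muls_cycles:       1 for the fast multiplier, 32 for the iterative one. */
unsigned mul_path_cycles(bool const_fits_8_bits, unsigned muls_cycles)
{
    unsigned load_cycles = const_fits_8_bits ? 1u : 2u;
    return load_cycles + muls_cycles;
}

/* Returns true when a shift/add sequence of the given length is strictly
 * cheaper than the load + MULS path in this model. */
bool shift_add_is_cheaper(unsigned shift_add_instructions,
                          bool const_fits_8_bits,
                          unsigned muls_cycles)
{
    return shift_add_instructions
           < mul_path_cycles(const_fits_8_bits, muls_cycles);
}
```

In this model, a constant that is too wide for MOVS and needs six shift/add instructions gives shift_add_is_cheaper(6, false, 32) as true but shift_add_is_cheaper(6, false, 1) as false, so a long shift/add sequence only pays off if MULS is assumed to be the 32-cycle variant.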

lpcware · ‎06-15-2016

Content originally posted in LPCWare by R2D2 on Thu Jun 06 14:42:07 MST 2013

Quote: cyberstudio
Since 5.1.0, LPCXpresso has been using gcc 4.6.2, so it is supposed to have Cortex M0 support, but it is still generating many shifts instead of using the single cycle multiply.

It is up to the chip implementor to choose between the fast multiplier and the slow multiplier, so there's got to be a way to tell the compiler the choice, so there's got to be a compiler command-line option to differentiate between the two, no?

:confused:

I'm not sure what you are talking about. My LPCXpresso v5.2.4 [Build 2122] [2013-04-29] is using a clear MULS. Are we talking about LPCXpresso?

lpcware · ‎06-15-2016

Content originally posted in LPCWare by cyberstudio on Thu Jun 06 14:30:33 MST 2013
Since 5.1.0, LPCXpresso has been using gcc 4.6.2, so it is supposed to have Cortex M0 support, but it is still generating many shifts instead of using the single cycle multiply.

It is up to the chip implementor to choose between the fast multiplier and the slow multiplier, so there's got to be a way to tell the compiler the choice, so there's got to be a compiler command-line option to differentiate between the two, no?

lpcware · ‎06-15-2016

Content originally posted in LPCWare by CodeRedSupport on Fri Jan 07 09:02:18 MST 2011
I believe that the "mainstream" version of gcc 4.3.3 did not have Cortex-M0 support. However, the version that is used by LPCXpresso does.

Regards,
CodeRedSupport

lpcware · ‎06-15-2016

Content originally posted in LPCWare by fastmapper on Thu Jan 06 22:37:31 MST 2011

Quote: CodeRedSupport
Unfortunately, GCC currently assumes that you have the slower multiplier (even though the LPC11xx actually uses the fast implementation), and thus when it sees a multiplication by a constant, it will do this using adds/shifts - which it thinks will be faster.

Unfortunately I don't believe there is currently any way of modifying this behaviour by the compiler.

My understanding is that LPCXpresso by default uses GCC version 4.3.3 and Cortex-M0 support was added with GCC version 4.5.0. I expect this may have some influence on code generation for Cortex-M0.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by CodeRedSupport on Thu Jan 06 04:44:14 MST 2011
The Cortex-M0 has a choice of multiply implementations as the following information from the Cortex-M0 Technical Reference Manual states:

Quote:
The MULS instruction provides a 32-bit x 32-bit multiply that yields the least-significant 32-bits. The processor can implement MULS in one of two ways:
• as a fast single-cycle array
• as a 32-cycle iterative multiplier.

[Links to ARM documentation at http://support.code-red-tech.com/CodeRedWiki/ArmCpuInfo]

Unfortunately, GCC currently assumes that you have the slower multiplier (even though the LPC11xx actually uses the fast implementation), and thus when it sees a multiplication by a constant, it will do this using adds/shifts - which it thinks will be faster.

Unfortunately I don't believe there is currently any way of modifying this behaviour by the compiler.

[Note that multiplying two variables together will cause the generation of a MULS instruction, as in such cases the compiler has no way of knowing what combination of add/shifts to use.]

Regards,
CodeRedSupport

lpcware · ‎06-15-2016

Content originally posted in LPCWare by cyberstudio on Wed Jan 05 20:41:51 MST 2011
I just looked at some assembly code from LPCXpresso. The test is to multiply a 32-bit integer by a constant. The compiler turns the constant multiply into up to about 3 sets of shift and add-or-subtract instructions. The compiler does an outstanding job of finding the smallest sequence of shift and add/subtract instructions, but if multiply is single cycle on the LPC1114, it would always be cheaper than any shift/add combination.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by cyberstudio on Wed Jan 05 17:56:48 MST 2011
Does LPC1114 have a single cycle multiply or a 32-cycle multiply? Wasn't clear to me from the user manual, with ARM saying 1 cycle or 32-cycle depending on (NXP's) multiplier implementation.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by JohnR on Wed Aug 11 05:41:00 MST 2010
Thank you all.

Maybe somebody could nudge NXP into updating their specsheets for the -M3 devices with the fact that the hardware multiplier/divider is present. This omission had me sidelining NXP for use in a new project for the apparent lack of this capability.

John.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by leon_heller on Tue Aug 10 07:11:27 MST 2010
They have a Cortex-M3 core so will have hardware divide and 32-bit multiply:

http://www.arm.com/products/processors/cortex-m/cortex-m3.php

lpcware · ‎06-15-2016

Content originally posted in LPCWare by igorsk on Tue Aug 10 07:00:58 MST 2010
Cortex-M0 (such as LPC11xx) has only a 32x32->32 bit multiply (MUL). Cortex-M3 chips additionally support multiply-accumulate/subtract (MLA/MLS), 32x32->64 multiply (SMULL/UMULL) and hardware divide (SDIV/UDIV). Interestingly, hardware divide is NOT present in the "big" cores like Cortex-A8.
For more details, see the ARMv7-M Architecture Reference Manual (DDI 0403).
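
In C terms, the difference shows up in operations like the following. This is a sketch with made-up function names. The instruction and helper names in the comments describe what a GCC-style toolchain typically emits and are not guaranteed for every compiler or optimisation level.

```c
#include <stdint.h>

/* 32x32->32 multiply: available as a single multiply instruction
 * on both Cortex-M0 and Cortex-M3. */
uint32_t mul32(uint32_t a, uint32_t b)
{
    return a * b;
}

/* 32x32->64 multiply: typically UMULL on Cortex-M3. Cortex-M0 has no such
 * instruction, so the compiler must build the result from 32-bit
 * operations or call a runtime helper. */
uint64_t mul64(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

/* Unsigned divide: typically UDIV on Cortex-M3. On Cortex-M0 it becomes a
 * call to a software routine (__aeabi_uidiv under the ARM EABI). */
uint32_t div32(uint32_t a, uint32_t b)
{
    return a / b;   /* caller must ensure b != 0 */
}

/* Multiply-accumulate: a candidate for MLA on Cortex-M3. */
uint32_t mac32(uint32_t acc, uint32_t a, uint32_t b)
{
    return acc + a * b;
}
```

Compiling these once with -mcpu=cortex-m0 and once with -mcpu=cortex-m3 and comparing the disassembly is a quick way to see which operations map to hardware on each core.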

lpcware · ‎06-15-2016

Content originally posted in LPCWare by CodeRedSupport on Tue Aug 10 05:34:57 MST 2010
Yes, hardware divide and multiply are standard on the Cortex-M3.

Regards,
CodeRedSupport