# Multiply in ASM 16x16

1,445 Views
Contributor III

Hello,

I am looking for an ASM routine that multiplies 16 bits x 16 bits, with the temporary results saved on the stack.

Maybe somebody can help me?


9 Replies
348 Views
Contributor I

Here is another implementation of a 16 x 16 bit unsigned multiply.

/*
*
* void Mul16(unsigned short a, unsigned short b, unsigned long *result)
*
*  Unsigned 16 bit multiply - generates 32 bit unsigned result
*
*  Algorithm
*    Uses the algebraic formula
*      (x + y) (w + z) = xw + xz + yw + yz
*    result = ah * bh * 2^16 + (ah * bl + al * bh) * 2^8 + al * bl
*        where:  ah = high byte of a
*                al = low byte of a
*                bh = high byte of b
*                bl = low byte of b
*
*   Execution cycles - 175 maximum - includes call and return
*   Stack usage - 9 bytes
*
* Stack (add to stack offset when X is pushed on stack)
*  SP + 1,2 result address
*  SP + 3,4 return address
*  SP + 5 ah
*  SP + 6 al
*  SP + 7 bh
*  SP + 8 bl
*  */
void Mul16(unsigned short a, unsigned short b, unsigned long *result)
{
_asm {
pshh        ; C uses HX for result address argument

; // ah * bh * 2^16  calculation
tsx         ; X addressing is faster/ smaller (HX = SP + 1)
lda  4,x   ; get ah and bh
ldx  6,x
mul         ; // ah * bh * 2^16
pshx
ldhx  2,sp  ; pointer to result
sta  1,x   ; save result
pula
sta  ,x   ; save result

; // al * bl  calculation
tsx
lda  5,x   ; get al and bl
ldx  7,x
mul         ; // al * bl
pshx
ldhx  2,sp
sta  3,x   ; save result
pula
sta  2,x   ; save result
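; // ah * bl * 2^8  calculation
tsx
lda  4,x ; get ah and bl
ldx  7,x
mul       ; // ah * bl * 2^8
pshx
ldhx  2,sp
add   2,x ; add low byte of the partial product to result byte 2
sta   2,x
pula
adc   1,x ; add high byte plus carry to result byte 1
sta   1,x
bcc   L1
inc  ,x ; carry into MS byte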

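L1:
; // al * bh * 2^8  calculation
tsx
lda  5,x ; get al and bh
ldx  6,x
mul       ; // al * bh * 2^8
pshx
ldhx  2,sp
add   2,x ; add low byte of the partial product to result byte 2
sta   2,x
pula
adc   1,x ; add high byte plus carry to result byte 1
sta   1,x
bcc   L2
inc  ,x ; carry into MS byte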

L2:
}
}
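
As a cross-check for the routine above, the same four-partial-product scheme can be modelled in portable C. This is only a reference sketch to compare an assembly version against, not a replacement for it; the names `mul8`, `mul16_bytes` and `mul16_selfcheck` are made up for illustration, with `mul8` standing in for the 8 x 8 `MUL` instruction.

```c
#include <assert.h>
#include <stdint.h>

/* 8 x 8 -> 16 bit multiply, standing in for the HCS08 MUL instruction. */
static uint16_t mul8(uint8_t x, uint8_t y)
{
    return (uint16_t)((uint16_t)x * y);
}

/* 16 x 16 -> 32 bit unsigned multiply built from four 8 x 8 products:
 * result = ah*bh*2^16 + (ah*bl + al*bh)*2^8 + al*bl */
uint32_t mul16_bytes(uint16_t a, uint16_t b)
{
    uint8_t ah = (uint8_t)(a >> 8), al = (uint8_t)(a & 0xFFu);
    uint8_t bh = (uint8_t)(b >> 8), bl = (uint8_t)(b & 0xFFu);

    /* ah*bh and al*bl occupy disjoint bytes, so they can simply be stored. */
    uint32_t result = ((uint32_t)mul8(ah, bh) << 16) | mul8(al, bl);

    /* The two cross products overlap the middle bytes and must be added,
     * with carries propagating into the most significant byte. */
    result += (uint32_t)mul8(ah, bl) << 8;
    result += (uint32_t)mul8(al, bh) << 8;
    return result;
}

/* Spot checks against the compiler's own 32-bit multiply. */
void mul16_selfcheck(void)
{
    assert(mul16_bytes(0xFFFFu, 0xFFFFu) == 0xFFFE0001UL);
    assert(mul16_bytes(0x1234u, 0x5678u) == (uint32_t)0x1234u * 0x5678u);
    assert(mul16_bytes(0u, 0xFFFFu) == 0u);
}
```

Operands such as 0xFFFF x 0xFFFF are worth testing on the target too, since they exercise every carry path between the partial products.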

348 Views
Contributor I

This routine is "almost" fast enough for what I'm doing.

I'm trying to make it as fast as I can.

What I need to do is to square a 16 bit number and get a 32 bit result: a*a=b  (a 16 bits) (b 32 bits).

Could anyone help me?
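
For the squaring case, the two cross terms are equal, so three 8 x 8 products are enough instead of four: a^2 = ah^2 * 2^16 + 2 * ah * al * 2^8 + al^2. The following is a minimal C sketch of that arithmetic, meant as a model to check an assembly version against rather than a tuned HCS08 routine; the names `square16` and `square16_selfcheck` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Square a 16-bit value using three 8 x 8 products instead of four. */
uint32_t square16(uint16_t a)
{
    uint8_t ah = (uint8_t)(a >> 8);
    uint8_t al = (uint8_t)(a & 0xFFu);

    uint32_t hh    = (uint32_t)ah * ah;   /* weight 2^16 */
    uint32_t ll    = (uint32_t)al * al;   /* weight 2^0  */
    uint32_t cross = (uint32_t)ah * al;   /* appears twice, weight 2^8 */

    /* Doubling the cross term and shifting by 8 is one shift by 9. */
    return (hh << 16) + (cross << 9) + ll;
}

/* Spot checks against the compiler's own 32-bit multiply. */
void square16_selfcheck(void)
{
    assert(square16(0xFFFFu) == 0xFFFE0001UL);
    assert(square16(1000u) == 1000000UL);
    assert(square16(0x0100u) == 0x00010000UL);
}
```

In assembly the doubled cross term needs one extra bit, so the shift has to carry into the most significant byte of the result.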

348 Views
Senior Contributor II

If you happen to be using the ASM8 assembler, you could use the libraries (STAKMATH and related wrapper files) found here and then (with the use of the included macros) it would be as simple as the attached example.

Ans16 is for 16-bit result.

Ans32 for 32-bit result (from 16-bit number).

348 Views
Senior Contributor II

Hi Brax,

Any particular processor that you had in mind?

Do you want code that returns the product on the stack, or just uses the stack for temporary storage?

Do you wish the multiplier and multiplicand to be passed on the stack as well?

I have code for HC05, HC08 and S08. The code uses a 32-bit pseudo-accumulator, but can be modified to use the stack for the parameters and result. It uses the cpu registers for temporary storage, not the stack.

348 Views
Contributor III

I use an HCS08. My idea is to use the stack for the temporary storage. If possible, I don't want to declare temporary variables. I don't need to return the product on the stack. I also don't need to pass the multiplier and multiplicand on the stack.

If you use only the CPU registers for temporary storage, that's great!

348 Views
Senior Contributor II

Hi Brax,

OK, but if you don't pass the variables on the stack, how do you pass the operands into the subroutine? Two 16-bit operands won't fit in the registers. Do you have a pseudo-accumulator? That is how I get away with keeping the partial products in registers. But there are no issues with using the stack for the partial products, if need be.

If you simply have the two operands already sitting in memory, and want the product deposited in memory as well, then I have macros that can do that, rather than a subroutine, but they also use a pseudo-accumulator.

If you can describe better what you need I may be able to find something.

348 Views
Contributor III

Hi Mark,

I have one RAM variable that takes values from 0x0 to 0x3FF (10 bits).

I want to multiply this value by a 16-bit constant.

I don't need a subroutine to do that because I will do it only once in the main loop, so no parameters have to be passed to a subroutine.

If possible, I don't want a 32-bit pseudo-accumulator, but I would like to save the temporary results on the stack (I am not sure yet whether this solution is possible).
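
To make the requirement concrete, here is a small C model of a 10-bit value multiplied by a 16-bit constant, with all temporaries held in locals (which a compiler would typically keep in registers or on the stack). It is a sketch under assumptions: `SCALE_K` is a hypothetical placeholder for the real constant, and `scale10` is a made-up name. Because the variable never exceeds 0x3FF, its high byte is at most 3 and the product always fits in 26 bits.

```c
#include <assert.h>
#include <stdint.h>

#define SCALE_K 0xABCDu   /* hypothetical 16-bit constant, for illustration only */

/* Multiply a 10-bit value (0x000..0x3FF) by the 16-bit constant SCALE_K. */
uint32_t scale10(uint16_t v)
{
    const uint8_t kh = (uint8_t)(SCALE_K >> 8);
    const uint8_t kl = (uint8_t)(SCALE_K & 0xFFu);
    const uint8_t vh = (uint8_t)(v >> 8);      /* at most 0x03 */
    const uint8_t vl = (uint8_t)(v & 0xFFu);
    uint32_t product;

    assert(v <= 0x3FFu);                       /* documented input range */

    product  = (uint32_t)vl * kl;                            /* weight 2^0  */
    product += ((uint32_t)vl * kh + (uint32_t)vh * kl) << 8; /* weight 2^8  */
    product += ((uint32_t)vh * kh) << 16;                    /* weight 2^16 */
    return product;
}
```

Since `kh` and `kl` are compile-time constants, an assembly version can load them as immediate operands, and the two products involving `vh` are small enough that shifts and adds may be cheaper than `MUL`.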

348 Views
Senior Contributor II

Hi Brax,

I looked over all of my code going back 30 years, and it seems I have always used a pseudo-accumulator on 8-bit micros, unfortunately. This routine seems to fit the best, even though it's about 15 years old (early HC08). If you replace the early references of the pseudo-accumulator, the ones that reference the multiplicand, with your constant (as immediate operands), and then replace the remaining references with the location of your result, you should have what you need without needing a pseudo-accumulator and with using only one temporary byte on the stack. Between this and Mac's code, you should be able to put something together.

```
;
;  The 32 bit pseudo-accumulator
;
ACCUM3:    ds.b    1    ;Most significant byte
ACCUM2:    ds.b    1
ACCUM1:    ds.b    1
ACCUM0:    ds.b    1    ;Least significant byte
;
;
; Multiply a 16 bit, unsigned integer in the pseudo-accumulator
; (multiplicand, in ACCUM1:ACCUM0) by a 16 bit unsigned integer in X:A (multiplier).
; Exits with a 32 bit, unsigned integer product in the pseudo-accumulator.
; Uses one byte of stack space for temporary storage.
;
M16x16:    PSHA            ;don't lose the low 8 bits of multiplier
;  and reserve a byte on the stack
STX    ACCUM2        ;or the high 8 bits of multiplier either

LDX    ACCUM0        ;get low byte of multiplicand into X
MUL                 ;multiply lo multiplier with lo-byte multiplicand
STX    ACCUM3        ;temporary store mid-lo-byte of partial product
LDX    ACCUM0        ;get low byte of multiplicand into X, last time
STA    ACCUM0        ;and store lo-byte of product in Pseudo-accumulator

LDA    ACCUM2        ;get high byte of multiplier
MUL            ;multiply high multiplier with lo multiplicand
ADD    ACCUM3        ;add to the mid-lo partial product
STA    ACCUM3        ;and replace partial product temporarily
TXA                 ;put mid-hi partial product in A
ADC    #0           ;add the carry out of mid-lo
TAX                 ;put mid-hi with carry back in X

LDA    1,SP        ;get the low byte of multiplier again, last time
STX    1,SP        ;put mid-hi partial product aside
LDX    ACCUM1        ;get the high byte of multiplicand
MUL                 ;multiply low byte multiplier with high byte multiplicand
ADD    ACCUM3        ;add to the mid-lo partial product
STA    ACCUM3        ;mid-lo is now complete, but misplaced
TXA                 ;get latest mid-hi partial product
ADC    1,SP        ;add earlier mid-hi plus the carry out of mid-lo
STA    1,SP        ;put mid-hi aside again

LDX    ACCUM1        ;get high byte of multiplicand, last time (carry untouched)
LDA    #0
ADC    #0           ;A = carry out of mid-hi (MUL would clear it)
STA    ACCUM1        ;park that carry; ACCUM1 is free now
LDA    ACCUM2        ;get high byte of multiplier, last time
MUL                 ;multiply high byte with high byte
ADD    1,SP        ;add the mid-hi partial product
STA    ACCUM2        ;store where mid-hi is supposed to be
TXA                 ;get highest byte
ADC    ACCUM1        ;add this carry and the parked carry
LDX    ACCUM3        ;get complete but misplaced mid-lo byte
STX    ACCUM1        ;and place it correctly
STA    ACCUM3        ;and store to make things complete

PULA            ;clean the stack
RTS            ;and return with 32 bits of product
;

```

Sorry for the formatting . . . I can't get this board to behave . . . It truly sucks.

348 Views
Specialist III

Hello,

The following 16 x 16 multiply function is written in C, but makes extensive use of inline assembler. It should be easy to adapt as "proper" assembly code. The EQU directive can be used in lieu of each #define. The stack is used extensively.

/********************************************************************/
// Unsigned multiply 16 x 16
// Execution cycles: ~230
// Stack usage: 17

// Offset values for stack frame structure
#define MCAND16_0  0    // MS byte Multiplicand
#define MCAND16_1  1    //  LS byte
#define MULT16_0   2    // MS byte Multiplier
#define MULT16_1   3    //  LS byte
#define PROD32_0   4    // MS byte Product
#define PROD32_1   5    //  3rd
#define PROD32_2   6    //   2nd
#define PROD32_3   7    //    LS byte
#define TEMP16     8    // Temporary storage

void UMULT16( word mult1, word mult2, dword *product)
{
__asm {
// Setup stack frame structure
AIS    #-5            // Temp storage & product result
LDHX   @mult2         // Multiplier
LDA    1,X            // LS byte
PSHA
LDA    ,X             // MS byte
PSHA

LDHX   @mult1         // Multiplicand
LDA    1,X            // LS byte
PSHA
LDA    ,X             // MS byte
PSHA

TSX
CLR    PROD32_0,X
LDA    MULT16_1,X     // Multiplier LS byte
LDX    MCAND16_1,X    // Multiplicand LS byte
MUL
STX    PROD32_2+1,SP
TSX
STA    PROD32_3,X

LDA    MULT16_1,X     // Multiplier LS byte again
LDX    MCAND16_0,X    // Multiplicand MS byte
MUL
STX    PROD32_1+1,SP
TSX
ADD    PROD32_2,X     // Accumulate into 2nd byte
STA    PROD32_2,X
BCC    SKIP1
INC    PROD32_1,X
SKIP1:
LDA    MULT16_0,X     // Multiplier MS byte
LDX    MCAND16_1,X    // Multiplicand LS byte
MUL
STX    TEMP16+1,SP
TSX
ADD    PROD32_2,X     // Accumulate into 2nd byte
STA    PROD32_2,X
BCC    SKIP2
INC    TEMP16,X
SKIP2:  LDA    PROD32_1,X
ADD    TEMP16,X       // Accumulate into 3rd byte
STA    PROD32_1,X
BCC    SKIP3
INC    PROD32_0,X
SKIP3:
LDA    MULT16_0,X     // Multiplier MS byte again
LDX    MCAND16_0,X    // Multiplicand MS byte
MUL
STX    TEMP16+1,SP
TSX
ADD    PROD32_1,X     // Accumulate into 3rd byte
STA    PROD32_1,X
BCC    SKIP4
INC    TEMP16,X
SKIP4:  LDA    PROD32_0,X
ADD    TEMP16,X       // Accumulate into MS byte
STA    PROD32_0,X

AIS    #4             // Adjust stack pointer
LDHX   product
PULA
STA    ,X
PULA
STA    1,X
PULA
STA    2,X
PULA
STA    3,X
AIS    #1             // Adjust stack pointer
}
}

Regards,

Mac