Hello,
I am looking for an assembly routine that multiplies 16 bits x 16 bits, with the temporary results saved on the stack.
Can anybody help me?
Thanks in advance.
Here is another implementation of a 16 x 16 bit unsigned multiply.
/*
*
* void Mul16(unsigned short a, unsigned short b, unsigned long *result)
*
* Unsigned 16 bit multiply - generates 32 bit unsigned result
*
* Algorithm
* Uses the algebraic formula
* (x + y) (w + z) = xw + xz + yw + yz
* result = ah * bh * 2^16 + (ah * bl + al * bh) * 2^8 + al * bl
* where: ah = high byte of a
* al = low byte of a
* bh = high byte of b
* bl = low byte of b
*
* Execution cycles - 175 maximum - includes call and return
* Stack usage - 9 bytes
*
* Stack (add to stack offset when X is pushed on stack)
* SP + 1,2 result address
* SP + 3,4 return address
* SP + 5 ah
* SP + 6 al
* SP + 7 bh
* SP + 8 bl
* */
void Mul16(unsigned short a, unsigned short b, unsigned long *result)
{
_asm {
pshx ; save result address
pshh ; C uses HX for result address argument
; // ah * bh * 2^16 calculation
tsx ; X addressing is faster/ smaller (HX = SP + 1)
lda 4,x ; get ah and bh
ldx 6,x
mul ; // ah * bh * 2^16
pshx
ldhx 2,sp ; pointer to result
sta 1,x ; save result
pula
sta ,x ; save result
; // al * bl calculation
tsx
lda 5,x ; get al and bl
ldx 7,x
mul ; // al * bl
pshx
ldhx 2,sp
sta 3,x ; save result
pula
sta 2,x ; save result
; // ah * bl * 2^8 calculation
tsx
lda 4,x ; get ah and bl
ldx 7,x
mul ; // ah * bl * 2^8
pshx
ldhx 2,sp
add 2,x ; add result
sta 2,x
pula
adc 1,x ; add result
sta 1,x
bcc L1
inc ,x ; advance MS byte
L1:
// al * bh * 2^8 calculation
tsx
lda 5,x ; get al and bh
ldx 6,x
mul ; // al * bh * 2^8
pshx
ldhx 2,sp
add 2,x ; add result
sta 2,x
pula
adc 1,x ; add result
sta 1,x
bcc L2
inc ,x ; advance MS byte
L2:
ais #2 ; discard result address
}
}
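For anyone who wants to check the assembly above on a PC first, here is a minimal host-side sketch of the same partial-product formula from the header comment. The name mul16_ref is made up for illustration and is not part of the original post; it only models the arithmetic, not the stack layout or the cycle count.

```c
#include <stdint.h>

/* Hypothetical reference model: builds the 32-bit product from four
 * 8 x 8 -> 16 bit partial products, mirroring how the routine above
 * uses the MUL instruction. */
uint32_t mul16_ref(uint16_t a, uint16_t b)
{
    uint8_t ah = (uint8_t)(a >> 8), al = (uint8_t)(a & 0xFF);
    uint8_t bh = (uint8_t)(b >> 8), bl = (uint8_t)(b & 0xFF);

    uint32_t hh = (uint32_t)ah * bh;   /* weight 2^16 */
    uint32_t ll = (uint32_t)al * bl;   /* weight 2^0  */
    uint32_t hl = (uint32_t)ah * bl;   /* weight 2^8  */
    uint32_t lh = (uint32_t)al * bh;   /* weight 2^8  */

    /* result = ah*bh*2^16 + (ah*bl + al*bh)*2^8 + al*bl */
    return (hh << 16) + ((hl + lh) << 8) + ll;
}
```

Comparing its output with the assembly result for a few corner cases (0xFFFF x 0xFFFF, operands with a zero byte) is a quick way to catch carry-handling mistakes.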
This routine is "almost" fast enough for what I'm doing.
I'm trying to make it faster, but I can't.
What I need to do is to square a 16 bit number and get a 32 bit result: a*a=b (a 16 bits) (b 32 bits).
Could anyone help me ?
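One idea that might help with squaring: the two cross products are identical, so a*a = ah*ah*2^16 + 2*ah*al*2^8 + al*al needs three 8 x 8 multiplies instead of four. The following C sketch only shows the arithmetic; square16 is a hypothetical name, and whether this actually saves cycles in assembly on the HCS08 has not been measured here.

```c
#include <stdint.h>

/* Hypothetical sketch: square a 16-bit value using three 8 x 8
 * multiplies. The cross product ah*al is computed once and doubled. */
uint32_t square16(uint16_t a)
{
    uint8_t ah = (uint8_t)(a >> 8);
    uint8_t al = (uint8_t)(a & 0xFF);

    uint32_t hh    = (uint32_t)ah * ah;   /* weight 2^16 */
    uint32_t ll    = (uint32_t)al * al;   /* weight 2^0  */
    uint32_t cross = (uint32_t)ah * al;   /* weight 2^8, counted twice */

    /* shifting by 9 is the doubling (<<1) plus the 2^8 weight (<<8) */
    return (hh << 16) + (cross << 9) + ll;
}
```

In assembly, note that the doubled cross product can be 17 bits wide, so the doubling has to carry into the most significant byte of the result.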
If you happen to be using the ASM8 assembler, you could use the libraries (STAKMATH and related wrapper files) found here and then (with the use of the included macros) it would be as simple as the attached example.
Ans16 is for 16-bit result.
Ans32 for 32-bit result (from 16-bit number).
Hi Brax,
Any particular processor that you had in mind?
Do you want code that returns the product on the stack, or just uses the stack for temporary storage?
Do you wish the multiplier and multiplicand to be passed on the stack as well?
I have code for HC05, HC08 and S08. The code uses a 32-bit pseudo-accumulator, but can be modified to use the stack for the parameters and result. It uses the cpu registers for temporary storage, not the stack.
I use an HCS08. My idea is to use the stack for temporary storage; if possible, I don't want to declare temporary variables. I don't need to return the product on the stack, and I don't need to pass the multiplier and multiplicand on the stack either.
If your code uses only the CPU registers for temporary storage, that's great!
Hi Brax,
OK, but if you don't pass the variables on the stack, how do you pass the operands into the subroutine? Two 16-bit operands won't fit in the registers. Do you have a pseudo-accumulator? That is how I get away with keeping the partial products in registers. But there are no issues with using the stack for the partial products, if need be.
If you simply have the two operands already sitting in memory, and want the product deposited in memory as well, then I have macros that can do that, rather than a subroutine, but they also use a pseudo-accumulator.
If you can describe better what you need I may be able to find something.
Hi Mark,
Thanks for your quick answer.
I have one variable in RAM that takes values between 0x0 and 0x3FF (10 bits).
I want to multiply this value by a 16-bit constant.
I don't need a subroutine for that, because I will do it only once in the main loop. So no parameters have to be passed to a subroutine.
If possible, I don't want to have a 32-bit pseudo-accumulator. Instead, I want to save the temporary results on the stack (I am not sure yet whether this solution is possible).
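To make the requirement concrete, here is a small C sketch of that case: a 10-bit variable times a 16-bit constant. SCALE_K and scale10 are hypothetical names, and the constant value is just a placeholder. Because the variable never exceeds 0x3FF, its high byte is only 0..3 and the product fits in 26 bits; the bytes of the constant are known at build time, so in assembly they could be used as immediate operands.

```c
#include <stdint.h>

#define SCALE_K 0xABCDu   /* hypothetical 16-bit constant, placeholder value */

/* Hypothetical sketch: v is assumed to be in the range 0..0x3FF. */
uint32_t scale10(uint16_t v)
{
    uint8_t vh = (uint8_t)(v >> 8);      /* 0..3 for a 10-bit value */
    uint8_t vl = (uint8_t)(v & 0xFF);
    const uint8_t kh = (uint8_t)(SCALE_K >> 8);
    const uint8_t kl = (uint8_t)(SCALE_K & 0xFF);

    uint32_t p = (uint32_t)vl * kl;                      /* weight 2^0  */
    p += ((uint32_t)vl * kh + (uint32_t)vh * kl) << 8;   /* weight 2^8  */
    p += ((uint32_t)vh * kh) << 16;                      /* weight 2^16 */
    return p;
}
```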
Hi Brax,
I looked over all of my code going back 30 years, and it seems I have always used a pseudo-accumulator on 8-bit micros, unfortunately. This routine seems to fit the best, even though it's about 15 years old (early HC08). If you replace the early references of the pseudo-accumulator, the ones that reference the multiplicand, with your constant (as immediate operands), and then replace the remaining references with the location of your result, you should have what you need without needing a pseudo-accumulator and with using only one temporary byte on the stack. Between this and Mac's code, you should be able to put something together.
;
; The 32 bit pseudo-accumulator
;
ACCUM3: ds.b 1 ;Most significant byte
ACCUM2: ds.b 1
ACCUM1: ds.b 1
ACCUM0: ds.b 1 ;Least significant byte
;
;
; Multiply a 16 bit, unsigned integer in the pseudo-accumulator
; (multiplicand) by a 16 bit unsigned integer in X:A (multiplier).
; Exits with a 32 bit, unsigned integer product in the pseudo-accumulator.
; Uses one byte of stack space for temporary storage.
;
M16x16: PSHA ;don't lose the low 8 bits of multiplier
; and reserve a byte on the stack
STX ACCUM2 ;or the high 8 bits of multiplier either
LDX ACCUM0 ;get low byte of multiplicand into X
MUL ;multiply lo multiplier with lo-byte multiplicand
STX ACCUM3 ;temporary store mid-lo-byte of partial product
LDX ACCUM0 ;get low byte of multiplicand into X, last time
STA ACCUM0 ;and store lo-byte of product in Pseudo-accumulator
LDA ACCUM2 ;get high byte of multiplier
MUL ;multiply high multiplier with lo multiplicand
ADD ACCUM3 ;add previous mid-lo part.prod to new mid-lo part.prod
STA ACCUM3 ;and replace partial product temporarily
TXA ;put mid-hi partial product in A
ADC #0 ;put carry from previous ADD in
TAX ;put mid-hi with carry back in X
LDA 1,SP ;get the low byte of multiplier again, last time
STX 1,SP ;put mid-hi partial product aside
LDX ACCUM1 ;get the high byte of multiplicand
MUL ;multiply low byte multiplier with high byte multiplicand
ADD ACCUM3 ;add previous mid-lo partial product to last mid-lo piece
STA ACCUM3 ;mid-lo is now complete, but misplaced
TXA ;get latest mid-hi partial product
ADC 1,SP ;add carry and previous mid-hi part
STA 1,SP ;put mid-hi aside again
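LDX ACCUM1 ;get high byte of multiplicand, last time
CLRA ;MUL clears the carry flag, so capture the carry
ROLA ; out of the mid-hi addition first (A = 0 or 1)
STA ACCUM1 ;park it; ACCUM1 is free now that X holds its value
LDA ACCUM2 ;get high byte of multiplier, last time
MUL ;multiply high byte with high byte
ADD 1,SP ;add previous mid-hi byte to new mid-hi byte
STA ACCUM2 ;store where mid-hi is supposed to be
TXA ;get highest byte
ADC ACCUM1 ;add carry from previous add plus the parked mid-hi carry
TAX ;hold highest byte while the mid-lo byte is moved
LDA ACCUM3 ;get complete but misplaced mid-lo byte
STA ACCUM1 ;and place it correctly
STX ACCUM3 ;and store highest byte to make things complete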
PULA ;clean the stack
RTS ;and return with 32 bits of product
;
Sorry for the formatting . . . I can't get this board to behave . . . It truly sucks.
Hello,
The following 16 x 16 multiply function is written in C, but makes extensive use of inline assembler. It should be easily adapted as "proper" assembly code. The EQU directive can be used in lieu of each #define. The stack is used extensively.
/********************************************************************/
// Unsigned multiply 16 x 16
// Execution cycles: ~230
// Stack usage: 17
// Offset values for stack frame structure
#define MCAND16_0 0 // MS byte Multiplicand
#define MCAND16_1 1 // LS byte
#define MULT16_0 2 // MS byte Multiplier
#define MULT16_1 3 // LS byte
#define PROD32_0 4 // MS byte Product
#define PROD32_1 5 // 3rd
#define PROD32_2 6 // 2nd
#define PROD32_3 7 // LS byte
#define TEMP16 8 // Temporary storage
void UMULT16( word mult1, word mult2, dword *product)
{
__asm {
// Setup stack frame structure
AIS #-5 // Temp storage & product result
LDHX @mult2 // Multiplier
LDA 1,X // LS byte
PSHA
LDA ,X // MS byte
PSHA
LDHX @mult1 // Multiplicand
LDA 1,X // LS byte
PSHA
LDA ,X // MS byte
PSHA
TSX
CLR PROD32_0,X
LDA MULT16_1,X // Multiplier LS byte
LDX MCAND16_1,X // Multiplicand LS byte
MUL
STX PROD32_2+1,SP
TSX
STA PROD32_3,X
LDA MULT16_1,X // Multiplier LS byte again
LDX MCAND16_0,X // Multiplicand MS byte
MUL
STX PROD32_1+1,SP
TSX
ADD PROD32_2,X
STA PROD32_2,X
BCC SKIP1
INC PROD32_1,X
SKIP1:
LDA MULT16_0,X // Multiplier MS byte
LDX MCAND16_1,X // Multiplicand LS byte
MUL
STX TEMP16+1,SP
TSX
ADD PROD32_2,X
STA PROD32_2,X
BCC SKIP2
INC TEMP16,X
SKIP2: LDA PROD32_1,X
ADD TEMP16,X
STA PROD32_1,X
BCC SKIP3
INC PROD32_0,X
SKIP3:
LDA MULT16_0,X // Multiplier MS byte again
LDX MCAND16_0,X // Multiplicand MS byte
MUL
STX TEMP16+1,SP
TSX
ADD PROD32_1,X
STA PROD32_1,X
BCC SKIP4
INC TEMP16,X
SKIP4: LDA PROD32_0,X
ADD TEMP16,X
STA PROD32_0,X
// Unload stack frame structure
AIS #4 // Adjust stack pointer
LDHX product
PULA
STA ,X
PULA
STA 1,X
PULA
STA 2,X
PULA
STA 3,X
AIS #1 // Adjust stack pointer
}
}
Regards,
Mac