Multiply in ASM 16x16

Brax02 · ‎11-19-2012

Hello,

I am looking for an ASM routine to multiply 16 bits x 16 bits, with the temporary results saved on the stack.

Maybe somebody can help me?

Thanks in advance.

georgevandebelt · 11-21-2012

Here is another implementation of a 16 x 16 bit unsigned multiply.

/*
*
* void Mul16(unsigned short a, unsigned short b, unsigned long *result)
*
* Unsigned 16 bit multiply - generates 32 bit unsigned result
*
* Algorithm
*    Uses the algebraic formula
*      (x + y)(w + z) = xw + xz + yw + yz
*    result = ah * bh * 2^16 + (ah * bl + al * bh) * 2^8 + al * bl
*        where: ah = high byte of a
*                al = low byte of a
*                bh = high byte of b
*                bl = low byte of b
*
*   Execution cycles - 175 maximum - includes call and return
*   Stack usage - 9 bytes
*
* Stack (add 1 to each offset while an extra byte of X is pushed on the stack)
* SP + 1,2 result address
* SP + 3,4 return address
* SP + 5 ah
* SP + 6 al
* SP + 7 bh
* SP + 8 bl
* */
void Mul16(unsigned short a, unsigned short b, unsigned long *result)
{
  _asm {
      pshx            ; save result address
      pshh            ; C uses HX for result address argument

      ; ah * bh * 2^16 calculation
      tsx             ; X addressing is faster/smaller (HX = SP + 1)
      lda   4,x       ; get ah and bh
      ldx   6,x
      mul             ; X:A = ah * bh
      pshx            ; save high byte of product
      ldhx  2,sp      ; pointer to result
      sta   1,x       ; low byte -> byte 1 of *result
      pula
      sta   ,x        ; high byte -> byte 0 (MS) of *result

      ; al * bl calculation
      tsx
      lda   5,x       ; get al and bl
      ldx   7,x
      mul             ; X:A = al * bl
      pshx
      ldhx  2,sp
      sta   3,x       ; low byte -> byte 3 (LS) of *result
      pula
      sta   2,x       ; high byte -> byte 2 of *result

      ; ah * bl * 2^8 calculation
      tsx
      lda   4,x       ; get ah and bl
      ldx   7,x
      mul             ; X:A = ah * bl
      pshx
      ldhx  2,sp
      add   2,x       ; add low byte into byte 2
      sta   2,x
      pula            ; PULA leaves C intact
      adc   1,x       ; add high byte plus carry into byte 1
      sta   1,x
      bcc   L1
      inc   ,x        ; propagate carry into MS byte

L1:
      ; al * bh * 2^8 calculation
      tsx
      lda   5,x       ; get al and bh
      ldx   6,x
      mul             ; X:A = al * bh
      pshx
      ldhx  2,sp
      add   2,x       ; add low byte into byte 2
      sta   2,x
      pula
      adc   1,x       ; add high byte plus carry into byte 1
      sta   1,x
      bcc   L2
      inc   ,x        ; propagate carry into MS byte

L2:
      ais   #2        ; discard result address
  }
}
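The byte decomposition used by Mul16 can be checked off-target with a small C model. This is a sketch, not the poster's code: `mul16_model` is a hypothetical name, and it assumes the big-endian result layout the asm writes (byte 0 of `*result` is the MS byte). It mirrors the four MULs and the add/adc/inc carry chain, and asserts the invariant that the MS byte never wraps.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical byte-level model of Mul16: four 8x8 -> 16 products,
 * stored big-endian in r[0..3] the way the asm writes through `result`. */
static uint32_t mul16_model(uint16_t a, uint16_t b)
{
    uint8_t ah = a >> 8, al = a & 0xFF;
    uint8_t bh = b >> 8, bl = b & 0xFF;
    uint8_t r[4];
    uint16_t p;

    p = (uint16_t)(ah * bh);              /* ah * bh * 2^16 -> bytes 0,1 */
    r[0] = p >> 8;
    r[1] = p & 0xFF;

    p = (uint16_t)(al * bl);              /* al * bl -> bytes 2,3 */
    r[2] = p >> 8;
    r[3] = p & 0xFF;

    /* ah * bl * 2^8, then al * bh * 2^8: add/adc into bytes 2 and 1 */
    const uint16_t cross[2] = { (uint16_t)(ah * bl), (uint16_t)(al * bh) };
    for (int i = 0; i < 2; i++) {
        uint16_t s = (uint16_t)(r[2] + (cross[i] & 0xFF));   /* add 2,x */
        r[2] = s & 0xFF;
        s = (uint16_t)(r[1] + (cross[i] >> 8) + (s >> 8));  /* adc 1,x */
        r[1] = s & 0xFF;
        if (s > 0xFF) {                                     /* inc ,x */
            assert(r[0] < 0xFF);  /* product fits in 32 bits, so byte 0 never wraps */
            r[0]++;
        }
    }
    return ((uint32_t)r[0] << 24) | ((uint32_t)r[1] << 16) |
           ((uint32_t)r[2] << 8) | r[3];
}
```

Comparing `mul16_model(a, b)` with `(uint32_t)a * b` on a host PC is a quick way to catch a wrong offset or a dropped carry before touching the target.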

chrled · 01-13-2014

This routine is "almost" fast enough for what I'm doing.

I'm trying to make it faster, but I can't.

What I need to do is square a 16-bit number and get a 32-bit result: a*a = b (a is 16 bits, b is 32 bits).

Could anyone help me?
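For squaring, the two cross products are equal (ah*al = al*ah), so a*a = ah^2 * 2^16 + 2*ah*al * 2^8 + al^2, and one of the four MULs can be dropped, with the doubling done by a shift instead. A minimal C sketch of that identity; `square16` is a hypothetical name, and it models the arithmetic only, not cycle counts.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: a*a from three 8x8 products instead of four.
 * a*a = ah^2 * 2^16 + 2*ah*al * 2^8 + al^2, and 2 * 2^8 = 2^9. */
static uint32_t square16(uint16_t a)
{
    uint8_t ah = a >> 8, al = a & 0xFF;
    uint32_t hi  = (uint32_t)ah * ah;   /* one MUL */
    uint32_t mid = (uint32_t)ah * al;   /* one MUL, doubled by the shift below */
    uint32_t lo  = (uint32_t)al * al;   /* one MUL */
    /* the sum is at most 0xFFFE0001, so nothing overflows 32 bits */
    return (hi << 16) + (mid << 9) + lo;
}
```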

tonyp · 11-21-2012

If you happen to be using the ASM8 assembler, you could use the libraries (STAKMATH and related wrapper files) found here and then (with the use of the included macros) it would be as simple as the attached example.

Ans16 is for a 16-bit result.

Ans32 is for a 32-bit result (from a 16-bit number).

rocco · 11-19-2012

Hi Brax,

Any particular processor that you had in mind?

Do you want code that returns the product on the stack, or just uses the stack for temporary storage?

Do you wish the multiplier and multiplicand to be passed on the stack as well?

I have code for HC05, HC08 and S08. The code uses a 32-bit pseudo-accumulator, but can be modified to use the stack for the parameters and result. It uses the cpu registers for temporary storage, not the stack.

Brax02 · 11-20-2012

I use an HCS08. My idea is to use the stack for the temporary storage. If possible, I don't want to declare temporary variables. I don't need to return the product on the stack, and I don't need to pass the multiplier and multiplicand on the stack either.

If you use only the CPU registers for temporary storage, that's great!

rocco · 11-20-2012

Hi Brax,

OK, but if you don't pass the variables on the stack, how do you pass the operands into the subroutine? Two 16-bit operands won't fit in the registers. Do you have a pseudo-accumulator? That is how I get away with keeping the partial products in registers. But there is no problem with using the stack for the partial products, if need be.

If you simply have the two operands already sitting in memory, and want the product deposited in memory as well, then I have macros that can do that, rather than a subroutine, but they also use a pseudo-accumulator.

If you can describe better what you need I may be able to find something.

Brax02 · 11-20-2012

Hi Mark,

Thanks for your quick answer.

I have one RAM variable that takes values from 0x0 to 0x3FF (10 bits).

I want to multiply this value by a 16-bit constant.

I don't need a subroutine for that, because I will do it only once in the main loop, so no parameters have to be passed to a subroutine.

If possible, I don't want a 32-bit pseudo-accumulator. Instead, I want to save the temporary results on the stack (I am not sure yet whether this is possible).
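Because the RAM value is at most 0x3FF, its high byte is 0..3 and the product with a 16-bit constant fits in 26 bits (0x3FF * 0xFFFF = 0x3FEFC01), so the MS byte of a 32-bit result is at most 0x03. A C sketch under those assumptions: `mul10x16` is a hypothetical name, and the constant is passed as a parameter only so the model can be tested, whereas in the asm it would appear as immediate operands.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: 10-bit RAM value x (0..0x3FF) times a 16-bit constant k.
 * In the real code k would be immediate operands; it is a parameter here
 * only so the sketch can be tested. */
static uint32_t mul10x16(uint16_t x, uint16_t k)
{
    assert(x <= 0x3FF);                   /* 10-bit input */
    uint8_t xh = x >> 8, xl = x & 0xFF;   /* xh is 0..3 */
    uint8_t kh = k >> 8, kl = k & 0xFF;

    uint32_t p = (uint32_t)xl * kl                               /* weight 1    */
               + (((uint32_t)xl * kh + (uint32_t)xh * kl) << 8)  /* weight 2^8  */
               + (((uint32_t)xh * kh) << 16);                    /* weight 2^16 */
    assert(p <= 0x3FEFC01u);              /* 0x3FF * 0xFFFF: MS byte is at most 0x03 */
    return p;
}
```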

rocco · 11-20-2012

Hi Brax,

I looked over all of my code going back 30 years, and it seems I have always used a pseudo-accumulator on 8-bit micros, unfortunately. This routine seems to fit best, even though it's about 15 years old (early HC08). If you replace the early references to the pseudo-accumulator (the ones that reference the multiplicand) with your constant as immediate operands, and then replace the remaining references with the location of your result, you should have what you need without a pseudo-accumulator and with only one temporary byte on the stack. Between this and Mac's code, you should be able to put something together.

;
; The 32 bit pseudo-accumulator
;
ACCUM3: ds.b 1          ;Most significant byte
ACCUM2: ds.b 1
ACCUM1: ds.b 1
ACCUM0: ds.b 1          ;Least significant byte
;
; Multiply a 16 bit unsigned integer in the pseudo-accumulator
; (multiplicand) by a 16 bit unsigned integer in X:A (multiplier).
; Exits with a 32 bit unsigned product in the pseudo-accumulator.
; Uses one byte of stack space for temporary storage.
;
M16x16: PSHA            ;don't lose the low 8 bits of multiplier
                        ; and reserve a byte on the stack
        STX ACCUM2      ;or the high 8 bits of multiplier either
        LDX ACCUM0      ;get low byte of multiplicand into X
        MUL             ;multiply lo multiplier with lo-byte multiplicand
        STX ACCUM3      ;temporary store mid-lo-byte of partial product
        LDX ACCUM0      ;get low byte of multiplicand into X, last time
        STA ACCUM0      ;and store lo-byte of product in pseudo-accumulator
        LDA ACCUM2      ;get high byte of multiplier
        MUL             ;multiply high multiplier with lo multiplicand
        ADD ACCUM3      ;add previous mid-lo part.prod to new mid-lo part.prod
        STA ACCUM3      ;and replace partial product temporarily
        TXA             ;put mid-hi partial product in A
        ADC #0          ;put carry from previous ADD in
        TAX             ;put mid-hi with carry back in X
        LDA 1,SP        ;get the low byte of multiplier again, last time
        STX 1,SP        ;put mid-hi partial product aside
        LDX ACCUM1      ;get the high byte of multiplicand
        MUL             ;multiply low byte multiplier with high byte multiplicand
        ADD ACCUM3      ;add previous mid-lo partial product to last mid-lo piece
        STA ACCUM3      ;mid-lo is now complete, but misplaced
        TXA             ;get latest mid-hi partial product
        ADC 1,SP        ;add carry and previous mid-hi part (may carry out)
        STA 1,SP        ;put mid-hi aside again (STA leaves C intact)
        LDX ACCUM1      ;get high byte of multiplicand, last time
        LDA ACCUM3      ;get complete but misplaced mid-lo byte
        STA ACCUM1      ;and place it correctly
        CLRA            ;save the carry from the mid-hi ADC,
        ROLA            ; since MUL clears C
        STA ACCUM3      ; (held in the MS byte until the end)
        LDA ACCUM2      ;get high byte of multiplier, last time
        MUL             ;multiply high byte with high byte
        ADD 1,SP        ;add previous mid-hi byte to new mid-hi byte
        STA ACCUM2      ;store where mid-hi is supposed to be
        TXA             ;get highest byte
        ADC ACCUM3      ;add carry from previous ADD and the saved mid-hi carry
        STA ACCUM3      ;and store to make things complete
        PULA            ;clean the stack
        RTS             ;and return with 32 bits of product
;

Sorry for the formatting . . . I can't get this board to behave . . . It truly sucks.
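The carry handling in M16x16 is easiest to see in a C model of its data flow. In this sketch (`m16x16_model` and its `keep_carry` flag are hypothetical), the second mid-hi sum, hi(mlo*chi) + mid-hi + carry, can exceed 0xFF, and because MUL clears C on the HCS08 that carry has to be saved before the last MUL. With `keep_carry = 0` the model drops it: 0xFFFF * 0xFFFF then comes out as 0xFEFE0001 instead of 0xFFFE0001, and 0x3FF * 0xFFFF, inside the range Brax described, comes out as 0x02FEFC01 instead of 0x03FEFC01.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C model of M16x16's data flow: multiplicand c (ACCUM1:ACCUM0),
 * multiplier m (X:A). keep_carry = 0 drops the carry out of the second
 * mid-hi sum, as happens if it is not saved before the last MUL. */
static uint32_t m16x16_model(uint16_t c, uint16_t m, int keep_carry)
{
    uint8_t chi = c >> 8, clo = c & 0xFF;
    uint8_t mhi = m >> 8, mlo = m & 0xFF;
    uint16_t p, s;

    p = (uint16_t)(mlo * clo);
    uint8_t b0 = p & 0xFF;                           /* final LS byte (ACCUM0) */
    uint8_t midlo = p >> 8;                          /* mid-lo partial (ACCUM3) */

    p = (uint16_t)(mhi * clo);
    s = (uint16_t)(midlo + (p & 0xFF));              /* ADD ACCUM3 */
    midlo = s & 0xFF;
    uint8_t midhi = (uint8_t)((p >> 8) + (s >> 8));  /* ADC #0: at most 0xFF */

    p = (uint16_t)(mlo * chi);
    s = (uint16_t)(midlo + (p & 0xFF));              /* ADD ACCUM3: mid-lo complete */
    uint8_t b1 = s & 0xFF;
    s = (uint16_t)((p >> 8) + midhi + (s >> 8));     /* ADC 1,SP: may exceed 0xFF */
    midhi = s & 0xFF;
    uint8_t carry = keep_carry ? (uint8_t)(s >> 8) : 0;

    p = (uint16_t)(mhi * chi);
    s = (uint16_t)(midhi + (p & 0xFF));              /* ADD 1,SP */
    uint8_t b2 = s & 0xFF;
    uint16_t top = (uint16_t)((p >> 8) + (s >> 8) + carry);  /* final ADC */
    if (keep_carry)
        assert(top <= 0xFF);                         /* full product fits in 32 bits */

    return ((uint32_t)(top & 0xFF) << 24) | ((uint32_t)b2 << 16) |
           ((uint32_t)b1 << 8) | b0;
}
```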

bigmac · 11-20-2012

Hello,

The following 16 x 16 multiply function is written in C, but makes extensive use of inline assembler. It should be easy to adapt into "proper" assembly code; the EQU directive can be used in place of each #define. The stack is used extensively: for the operands, the product and a temporary byte.

/********************************************************************/
// Unsigned multiply 16 x 16
// Execution cycles: ~230
// Stack usage: 17

// Offset values for stack frame structure
#define MCAND16_0 0    // MS byte Multiplicand
#define MCAND16_1 1    // LS byte
#define MULT16_0   2    // MS byte Multiplier
#define MULT16_1   3    // LS byte
#define PROD32_0   4    // MS byte Product
#define PROD32_1   5    // 3rd
#define PROD32_2   6    //   2nd
#define PROD32_3   7    //    LS byte
#define TEMP16     8    // Temporary storage

void UMULT16( word mult1, word mult2, dword *product)
{
__asm {
        // Setup stack frame structure
        AIS    #-5            // Temp storage & product result
        LDHX   @mult2         // Multiplier
        LDA    1,X            // LS byte
        PSHA
        LDA    ,X             // MS byte
        PSHA

        LDHX   @mult1         // Multiplicand
        LDA    1,X            // LS byte
        PSHA
        LDA    ,X             // MS byte
        PSHA

        TSX
        CLR    PROD32_0,X
        LDA    MULT16_1,X     // Multiplier LS byte
        LDX    MCAND16_1,X    // Multiplicand LS byte
        MUL
        STX    PROD32_2+1,SP
        TSX
        STA    PROD32_3,X

        LDA    MULT16_1,X     // Multiplier LS byte again
        LDX    MCAND16_0,X    // Multiplicand MS byte
        MUL
        STX    PROD32_1+1,SP
        TSX
        ADD    PROD32_2,X
        STA    PROD32_2,X
        BCC    SKIP1
        INC    PROD32_1,X
SKIP1:
        LDA    MULT16_0,X     // Multiplier MS byte
        LDX    MCAND16_1,X    // Multiplicand LS byte
        MUL
        STX    TEMP16+1,SP
        TSX
        ADD    PROD32_2,X
        STA    PROD32_2,X
        BCC    SKIP2
        INC    TEMP16,X
SKIP2: LDA    PROD32_1,X
        ADD    TEMP16,X
        STA    PROD32_1,X
        BCC    SKIP3
        INC    PROD32_0,X
SKIP3:
        LDA    MULT16_0,X     // Multiplier MS byte again
        LDX    MCAND16_0,X    // Multiplicand MS byte
        MUL
        STX    TEMP16+1,SP
        TSX
        ADD    PROD32_1,X
        STA    PROD32_1,X
        BCC    SKIP4
        INC    TEMP16,X
SKIP4: LDA    PROD32_0,X
        ADD    TEMP16,X
        STA    PROD32_0,X

        // Unload stack frame structure
        AIS    #4             // Adjust stack pointer
        LDHX   product
        PULA
        STA    ,X
        PULA
        STA    1,X
        PULA
        STA    2,X
        PULA
        STA    3,X
        AIS    #1             // Adjust stack pointer
}
}

Regards,

Mac
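The stack frame in UMULT16 can be written down as a C struct whose field offsets match the #define values as seen through HX after TSX (HX = SP + 1). That is also why the code writes `TEMP16+1,SP` and `PROD32_2+1,SP`: n,X after TSX and (n + 1),SP name the same byte. A sketch only: `umult16_model`, `umult16_frame` and `add_byte` are hypothetical names, and the model assumes the big-endian dword layout that the PULA/STA sequence produces (MS byte at `product[0]`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the UMULT16 frame as seen from HX after TSX. */
struct umult16_frame {
    uint8_t mcand[2];   /* MCAND16_0 (MS), MCAND16_1 (LS): offsets 0, 1 */
    uint8_t mult[2];    /* MULT16_0 (MS), MULT16_1 (LS):   offsets 2, 3 */
    uint8_t prod[4];    /* PROD32_0 (MS) .. PROD32_3 (LS): offsets 4..7 */
    uint8_t temp;       /* TEMP16: only one byte is used:  offset 8 */
};
_Static_assert(offsetof(struct umult16_frame, prod) == 4, "PROD32_0");
_Static_assert(offsetof(struct umult16_frame, temp) == 8, "TEMP16");

/* Hypothetical helper for ADD/STA then BCC/INC: add v into *dst and
 * increment *carry_dst if the addition carries out. */
static void add_byte(uint8_t *dst, uint8_t v, uint8_t *carry_dst)
{
    uint16_t s = (uint16_t)(*dst + v);
    *dst = (uint8_t)s;
    if (s > 0xFF)
        (*carry_dst)++;
}

static void umult16_model(uint16_t mult1, uint16_t mult2, uint8_t product[4])
{
    struct umult16_frame f = {
        { (uint8_t)(mult1 >> 8), (uint8_t)mult1 },
        { (uint8_t)(mult2 >> 8), (uint8_t)mult2 },
        { 0 }, 0 };                                 /* CLR PROD32_0 */
    uint16_t p;

    p = (uint16_t)(f.mult[1] * f.mcand[1]);         /* LS x LS */
    f.prod[2] = p >> 8;
    f.prod[3] = p & 0xFF;

    p = (uint16_t)(f.mult[1] * f.mcand[0]);         /* multiplier LS x multiplicand MS */
    f.prod[1] = p >> 8;
    add_byte(&f.prod[2], p & 0xFF, &f.prod[1]);     /* SKIP1 */

    p = (uint16_t)(f.mult[0] * f.mcand[1]);         /* multiplier MS x multiplicand LS */
    f.temp = p >> 8;
    add_byte(&f.prod[2], p & 0xFF, &f.temp);        /* SKIP2 */
    add_byte(&f.prod[1], f.temp, &f.prod[0]);       /* SKIP3 */

    p = (uint16_t)(f.mult[0] * f.mcand[0]);         /* MS x MS */
    f.temp = p >> 8;
    add_byte(&f.prod[1], p & 0xFF, &f.temp);        /* SKIP4 */
    assert(f.prod[0] + f.temp <= 0xFF);             /* no carry out of a 32-bit product */
    f.prod[0] += f.temp;

    for (int i = 0; i < 4; i++)                     /* PULA / STA: MS byte first */
        product[i] = f.prod[i];
}
```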
