Multiply in ASM 16x16


Brax02
Contributor III

Hello,

 

I am looking for an assembly routine that multiplies 16 bits x 16 bits, with the temporary results saved on the stack.

 

Maybe somebody can help me ?

 

Thanks in advance.

9 Replies

georgevandebelt
Contributor I

Here is another implementation of a 16 x 16 bit unsigned multiply.

/*
*
* void Mul16(unsigned short a, unsigned short b, unsigned long *result)
*
*  Unsigned 16 bit multiply - generates 32 bit unsigned result
*
*  Algorithm
*    Uses the algebraic formula
*      (x + y) (w + z) = xw + xz + yw + yz
*    result = ah * bh * 2^16 + (ah * bl + al * bh) * 2^8 + al * bl
*        where:  ah = high byte of a
*                al = low byte of a
*                bh = high byte of b
*                bl = low byte of b
*
*   Execution cycles - 175 maximum - includes call and return
*   Stack usage - 9 bytes
*
* Stack (add to stack offset when X is pushed on stack)
*  SP + 1,2 result address
*  SP + 3,4 return address
*  SP + 5 ah
*  SP + 6 al
*  SP + 7 bh
*  SP + 8 bl
*  */
void Mul16(unsigned short a, unsigned short b, unsigned long *result)
{
  _asm {
    pshx            ; save result address
    pshh            ; C uses HX for result address argument

    ; // ah * bh * 2^16  calculation
    tsx             ; X addressing is faster/smaller (HX = SP + 1)
    lda   4,x       ; get ah and bh
    ldx   6,x
    mul             ; // ah * bh * 2^16
    pshx
    ldhx  2,sp      ; pointer to result
    sta   1,x       ; save result
    pula
    sta   ,x        ; save result

    ; // al * bl  calculation
    tsx
    lda   5,x       ; get al and bl
    ldx   7,x
    mul             ; // al * bl
    pshx
    ldhx  2,sp
    sta   3,x       ; save result
    pula
    sta   2,x       ; save result

    ; // ah * bl * 2^8  calculation
    tsx
    lda   4,x       ; get ah and bl
    ldx   7,x
    mul             ; // ah * bl * 2^8
    pshx
    ldhx  2,sp
    add   2,x       ; add result
    sta   2,x
    pula
    adc   1,x       ; add result
    sta   1,x
    bcc   L1
    inc   ,x        ; advance MS byte

L1:
    ; // al * bh * 2^8  calculation
    tsx
    lda   5,x       ; get al and bh
    ldx   6,x
    mul             ; // al * bh * 2^8
    pshx
    ldhx  2,sp
    add   2,x       ; add result
    sta   2,x
    pula
    adc   1,x       ; add result
    sta   1,x
    bcc   L2
    inc   ,x        ; advance MS byte

L2:
    ais   #2        ; discard result address
  }
}   
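
A host-side C model of the same decomposition can be handy when checking an assembly routine like this one. The sketch below is not from the original post, and the name `mul16_ref` is made up for illustration. It forms the four 8 x 8 partial products that the four `mul` instructions would produce and sums them with the weights from the formula in the header comment.

```c
#include <stdint.h>

/* Hypothetical reference model (not from the original post):
 * builds a 16 x 16 -> 32 bit unsigned product from four 8 x 8 products,
 * mirroring what four MUL instructions would compute. */
static uint32_t mul16_ref(uint16_t a, uint16_t b)
{
    uint8_t ah = (uint8_t)(a >> 8);
    uint8_t al = (uint8_t)(a & 0xFF);
    uint8_t bh = (uint8_t)(b >> 8);
    uint8_t bl = (uint8_t)(b & 0xFF);

    uint32_t hh = (uint32_t)ah * bh;   /* weight 2^16 */
    uint32_t hl = (uint32_t)ah * bl;   /* weight 2^8  */
    uint32_t lh = (uint32_t)al * bh;   /* weight 2^8  */
    uint32_t ll = (uint32_t)al * bl;   /* weight 2^0  */

    return (hh << 16) + ((hl + lh) << 8) + ll;
}
```

Comparing the assembly output with `mul16_ref` for edge cases such as 0xFFFF x 0xFFFF (expected 0xFFFE0001) exercises the carry paths into the most significant byte.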

chrled
Contributor I

This routine is "almost" fast enough for what I'm doing.

I'm trying to make it faster, but I can't.

What I need to do is to square a 16 bit number and get a 32 bit result: a*a=b  (a 16 bits) (b 32 bits).

Could anyone help me ?
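
For squaring, the two cross terms of the general multiply are equal, so three 8 x 8 multiplies are enough instead of four: a*a = ah*ah * 2^16 + 2 * ah*al * 2^8 + al*al. The C sketch below only illustrates that arithmetic; it is not from the thread, the name `square16` is hypothetical, and the cycle savings on a real HCS08 would depend on how the doubling and carries are coded in assembly.

```c
#include <stdint.h>

/* Hypothetical helper: squares a 16-bit value with three 8 x 8 multiplies. */
static uint32_t square16(uint16_t a)
{
    uint8_t ah = (uint8_t)(a >> 8);
    uint8_t al = (uint8_t)(a & 0xFF);

    uint32_t hh    = (uint32_t)ah * ah;   /* weight 2^16 */
    uint32_t ll    = (uint32_t)al * al;   /* weight 2^0  */
    uint32_t cross = (uint32_t)ah * al;   /* computed once, counted twice */

    /* 2 * cross * 2^8 is the same as cross shifted left by 9 bits */
    return (hh << 16) + (cross << 9) + ll;
}
```

Note that the doubled cross term can need 17 bits, so an assembly version has to carry the bit shifted out of the doubling into the most significant byte.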

tonyp
Senior Contributor II

If you happen to be using the ASM8 assembler, you could use the libraries (STAKMATH and related wrapper files) found here and then (with the use of the included macros) it would be as simple as the attached example.

Ans16 is for 16-bit result.

Ans32 for 32-bit result (from 16-bit number).

rocco
Senior Contributor II

Hi Brax,

Any particular processor that you had in mind?

Do you want code that returns the product on the stack, or just uses the stack for temporary storage?

Do you wish the multiplier and multiplicand to be passed on the stack as well?

I have code for HC05, HC08 and S08. The code uses a 32-bit pseudo-accumulator, but can be modified to use the stack for the parameters and result. It uses the cpu registers for temporary storage, not the stack.

Brax02
Contributor III

I use an HCS08. My idea is to use the stack for temporary storage. If possible, I don't want to declare temporary variables. I don't need the product returned on the stack, and I don't need to pass the multiplier and multiplicand on the stack either.

If you use only the CPU registers for temporary storage, that's great!

rocco
Senior Contributor II

Hi Brax,

OK, but if you don't pass the variables on the stack, how do you pass the operands into the subroutine? Two 16-bit operands won't fit in the registers. Do you have a pseudo-accumulator? That is how I get away with keeping the partial products in registers. But there are no issues with using the stack for the partial products, if need be.

If you simply have the two operands already sitting in memory, and want the product deposited in memory as well, then I have macros that can do that, rather than a subroutine, but they also use a pseudo-accumulator.

If you can describe better what you need I may be able to find something.

Brax02
Contributor III

Hi Mark,

Thanks for your quick answer.

I have one RAM variable that takes values from 0x0 to 0x3FF (10 bits).

I want to multiply this value by a 16-bit constant.

I don't need a subroutine for this because I will do it only once in the main loop, so no parameters have to be passed to a subroutine.

If possible, I don't want a 32-bit pseudo-accumulator. Instead I want to save the temporary results on the stack (I am not sure yet whether this is possible).
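
To make the size of this problem concrete: with a 10-bit value the high byte is at most 0x03, so the product never exceeds 0x3FF x 0xFFFF = 0x03FEFC01 and fits in 26 bits. The C sketch below models the byte-wise arithmetic on a host PC. It is only an illustration; the constant `SCALE_K` and the function name `scale10` are hypothetical and do not come from the thread.

```c
#include <stdint.h>

#define SCALE_K 0xABCDu   /* hypothetical 16-bit constant, chosen for illustration */

/* Hypothetical helper: multiplies a 10-bit value (0..0x3FF) by SCALE_K
 * using 8 x 8 partial products, as an 8-bit CPU would. */
static uint32_t scale10(uint16_t v)
{
    uint8_t vh = (uint8_t)((v >> 8) & 0x03);       /* only 2 significant bits */
    uint8_t vl = (uint8_t)(v & 0xFF);
    const uint8_t kh = (uint8_t)(SCALE_K >> 8);    /* known at build time */
    const uint8_t kl = (uint8_t)(SCALE_K & 0xFF);  /* known at build time */

    uint32_t p = (uint32_t)vl * kl;                         /* weight 2^0  */
    p += ((uint32_t)vl * kh + (uint32_t)vh * kl) << 8;      /* weight 2^8  */
    p += ((uint32_t)vh * kh) << 16;                         /* weight 2^16 */
    return p;
}
```

Because the constant's bytes are fixed, an assembly version could load them as immediate operands rather than from RAM.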

rocco
Senior Contributor II

Hi Brax,

I looked over all of my code going back 30 years, and it seems I have always used a pseudo-accumulator on 8-bit micros, unfortunately. This routine seems to fit the best, even though it's about 15 years old (early HC08). If you replace the early references of the pseudo-accumulator, the ones that reference the multiplicand, with your constant (as immediate operands), and then replace the remaining references with the location of your result, you should have what you need without needing a pseudo-accumulator and with using only one temporary byte on the stack. Between this and Mac's code, you should be able to put something together.

;
;  The 32-bit pseudo-accumulator
;
ACCUM3:    ds.b    1        ;Most significant byte
ACCUM2:    ds.b    1
ACCUM1:    ds.b    1
ACCUM0:    ds.b    1        ;Least significant byte
;
; Multiply a 16-bit unsigned integer in the pseudo-accumulator
; (multiplicand, in ACCUM1:ACCUM0) by a 16-bit unsigned integer in X:A (multiplier).
; Exits with a 32-bit unsigned integer product in the pseudo-accumulator.
; Uses one byte of stack space for temporary storage.
;
M16x16:    PSHA             ;don't lose the low 8 bits of multiplier
                            ;  and reserve a byte on the stack
    STX    ACCUM2           ;or the high 8 bits of multiplier either
    LDX    ACCUM0           ;get low byte of multiplicand into X
    MUL                     ;multiply lo multiplier with lo-byte multiplicand
    STX    ACCUM3           ;temporary store mid-lo-byte of partial product
    LDX    ACCUM0           ;get low byte of multiplicand into X, last time
    STA    ACCUM0           ;and store lo-byte of product in pseudo-accumulator
    LDA    ACCUM2           ;get high byte of multiplier
    MUL                     ;multiply high multiplier with lo multiplicand
    ADD    ACCUM3           ;add previous mid-lo part.prod to new mid-lo part.prod
    STA    ACCUM3           ;and replace partial product temporarily
    TXA                     ;put mid-hi partial product in A
    ADC    #0               ;put carry from previous ADD in
    TAX                     ;put mid-hi with carry back in X
    LDA    1,SP             ;get the low byte of multiplier again, last time
    STX    1,SP             ;put mid-hi partial product aside
    LDX    ACCUM1           ;get the high byte of multiplicand
    MUL                     ;multiply low byte multiplier with high byte multiplicand
    ADD    ACCUM3           ;add previous mid-lo partial product to last mid-lo piece
    STA    ACCUM3           ;mid-lo is now complete, but misplaced
    TXA                     ;get latest mid-hi partial product
    ADC    1,SP             ;add carry and previous mid-hi part
    STA    1,SP             ;put mid-hi aside again
    LDX    ACCUM1           ;get high byte of multiplicand, last time
    LDA    ACCUM3           ;get complete but misplaced mid-lo byte
    STA    ACCUM1           ;and place it correctly
    CLRA                    ;capture the carry out of the mid-hi sum before
    ROLA                    ;  MUL and ADD overwrite it (A = 0 or 1)
    STA    ACCUM3           ;park it where the highest byte will go
    LDA    ACCUM2           ;get high byte of multiplier, last time
    MUL                     ;multiply high byte with high byte
    ADD    1,SP             ;add previous mid-hi byte to new mid-hi byte
    STA    ACCUM2           ;store where mid-hi is supposed to be
    TXA                     ;get highest byte
    ADC    ACCUM3           ;add carry from previous add plus the parked carry
    STA    ACCUM3           ;and store to make things complete
    PULA                    ;clean the stack
    RTS                     ;and return with 32 bits of product
;

Sorry for the formatting . . . I can't get this board to behave . . . It truly sucks.
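
When a routine like this accumulates partial products one byte at a time, every column sum can overflow, and each of those carries has to reach the next more significant byte. The C sketch below models that column-by-column accumulation on a host PC so the result bytes of an assembly routine can be compared against it. It is an illustration only; the function name `mul16_bytes` is hypothetical, and `prod[0]` is assumed to correspond to the most significant byte (ACCUM3).

```c
#include <stdint.h>

/* Hypothetical host-side model: sums four 8 x 8 partial products
 * column by column, keeping each carry explicit.
 * prod[0] is the most significant byte, prod[3] the least. */
static void mul16_bytes(uint16_t mcand, uint16_t mult, uint8_t prod[4])
{
    uint8_t ch = (uint8_t)(mcand >> 8), cl = (uint8_t)(mcand & 0xFF);
    uint8_t mh = (uint8_t)(mult >> 8),  ml = (uint8_t)(mult & 0xFF);

    uint16_t p0 = (uint16_t)(cl * ml);   /* lo x lo */
    uint16_t p1 = (uint16_t)(cl * mh);   /* lo multiplicand x hi multiplier */
    uint16_t p2 = (uint16_t)(ch * ml);   /* hi multiplicand x lo multiplier */
    uint16_t p3 = (uint16_t)(ch * mh);   /* hi x hi */
    uint16_t sum;

    prod[3] = (uint8_t)(p0 & 0xFF);

    sum = (uint16_t)((p0 >> 8) + (p1 & 0xFF) + (p2 & 0xFF));
    prod[2] = (uint8_t)(sum & 0xFF);     /* sum >> 8 is the carry (0..2) */

    sum = (uint16_t)((sum >> 8) + (p1 >> 8) + (p2 >> 8) + (p3 & 0xFF));
    prod[1] = (uint8_t)(sum & 0xFF);     /* carry again passed upward */

    sum = (uint16_t)((sum >> 8) + (p3 >> 8));
    prod[0] = (uint8_t)(sum & 0xFF);
}
```

A test input of 0xFFFF x 0xFFFF is useful here because it produces a carry out of both middle columns; the expected bytes are FF FE 00 01.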

bigmac
Specialist III

Hello,

The following 16 x 16 multiply function is written in C, but makes extensive use of inline assembler. It should be easy to adapt as "proper" assembly code: the EQU directive can be used in lieu of each #define. The stack is used extensively.

/********************************************************************/
// Unsigned multiply 16 x 16
// Execution cycles: ~230
// Stack usage: 17

// Offset values for stack frame structure
#define MCAND16_0  0    // MS byte Multiplicand
#define MCAND16_1  1    //  LS byte
#define MULT16_0   2    // MS byte Multiplier
#define MULT16_1   3    //  LS byte
#define PROD32_0   4    // MS byte Product
#define PROD32_1   5    //  3rd
#define PROD32_2   6    //   2nd
#define PROD32_3   7    //    LS byte
#define TEMP16     8    // Temporary storage

void UMULT16( word mult1, word mult2, dword *product)
{
  __asm {
        // Setup stack frame structure
        AIS    #-5            // Temp storage & product result
        LDHX   @mult2         // Multiplier
        LDA    1,X            // LS byte
        PSHA
        LDA    ,X             // MS byte
        PSHA
       

        LDHX   @mult1         // Multiplicand
        LDA    1,X            // LS byte
        PSHA
        LDA    ,X             // MS byte
        PSHA

        TSX
        CLR    PROD32_0,X
        LDA    MULT16_1,X     // Multiplier LS byte
        LDX    MCAND16_1,X    // Multiplicand LS byte
        MUL
        STX    PROD32_2+1,SP
        TSX
        STA    PROD32_3,X

        LDA    MULT16_1,X     // Multiplier LS byte again
        LDX    MCAND16_0,X    // Multiplicand MS byte
        MUL
        STX    PROD32_1+1,SP
        TSX
        ADD    PROD32_2,X
        STA    PROD32_2,X
        BCC    SKIP1
        INC    PROD32_1,X
SKIP1:   
        LDA    MULT16_0,X     // Multiplier MS byte
        LDX    MCAND16_1,X    // Multiplicand LS byte
        MUL
        STX    TEMP16+1,SP
        TSX
        ADD    PROD32_2,X
        STA    PROD32_2,X
        BCC    SKIP2
        INC    TEMP16,X
SKIP2:  LDA    PROD32_1,X
        ADD    TEMP16,X
        STA    PROD32_1,X
        BCC    SKIP3
        INC    PROD32_0,X
SKIP3:
        LDA    MULT16_0,X     // Multiplier MS byte again
        LDX    MCAND16_0,X    // Multiplicand MS byte
        MUL
        STX    TEMP16+1,SP
        TSX
        ADD    PROD32_1,X
        STA    PROD32_1,X
        BCC    SKIP4
        INC    TEMP16,X
SKIP4:  LDA    PROD32_0,X
        ADD    TEMP16,X
        STA    PROD32_0,X

       

        // Unload stack frame structure
        AIS    #4             // Adjust stack pointer
        LDHX   product
        PULA
        STA    ,X
        PULA
        STA    1,X
        PULA
        STA    2,X
        PULA
        STA    3,X
        AIS    #1             // Adjust stack pointer
  }
}

Regards,

Mac

