Signed multiplication using MUL

JohanF · ‎06-14-2007

Hello out there,

I need a fast executing assembly code for signed multiplication (8bitx8bit) using the MUL instruction

/Johan

JohanF · ‎06-18-2007

Hi,

I run my application on a HC08.

/Johan

CompilerGuru · ‎06-18-2007

I did look it up, and on a HC08 STA DIR is 3 cycles as well, so placing the multiplication result into the zero page does help for the HC08 too.
Also note that the STA is one cycle faster than noted in the previous message.

>STA SUM+2 ; 4 cycles

Should be

STA SUM+2 ; 3 cycles

Make sure the assembler is using DIR addressing mode for that operation, with EXT it is really 4 cycles but it does buy anything as the INC/DEC needs DIR anyway.

Daniel

tonyp · ‎06-15-2007

And here's yet another version.

CompilerGuru · ‎06-15-2007

The CW _BMULS does take advantage of that the unsigned 8x8=16 and the signed 8x8=16 result are closely related. So it does the unsigned multiplication first and then patches the resulting high byte to match the signed multiplication behavior. It's 23 bytes, I did not count the cycles.

Daniel

JohanF · ‎06-15-2007

Thanks again,

Where do I find this lib\hc08c\src\rtshc08.c? Is it somwhere in the freescale webpage?

/Johan

bigmac · ‎06-15-2007

Hello,

The rtshc08.c file can be found within the CW installation. Once you find the file, you will need to search for the function _BMULS. This is an inline assembler routine within a C function wrapper.

For those interested, I have done a performance comparison between the three methods outlined -

Daniel:

Size: 26 bytes

Both values positive: 46 cycles

One value negative: 50 cycles

Both values negative: 54 cycles

TonyP:

Size: 30 bytes

Both values positive: 47 cycles

One value negative: 54 or 55 cycles

Both values negative: 49 cycles

Bigmac:

Size: 35 bytes

Both values positive: 43 cycles

One value negative: 57 or 58 cycles

Both values negative: 57 cycles

All results are based on the HCS08, and assume the routine is called with a JSR instruction (6 cycles). Not surprisingly, the performance differences are small, with the method outlined by Daniel having the best overall efficiency (as might be expected).

Regards,

Mac

JohanF · ‎06-15-2007

Interesting...

My application is that I want to calculate a DFT to detect a few discret fequencies, from a set of sampled signals.

The signalsamples are 8 bits, always positive.

The phasekoefficients are 8 bits either positive or negative.

So the algoritm is to multiply each sample by its pasekoefficient (into 16 bits result) and add them all together (into 24bits sum).

Based on the number of clockcycles the signed multiplication algoritm takes, I believe I will get away with less clockcycles by:

1. Check sign of phasecoefficient. If negative, negate it to make it positive

2. Calculate unsigned multiplication of phasecoefficient and sample.

3. If the phasecoefficient was positive from the beginning, add 16bit result to the sum.

4. If the phasecoefficient was negative from the beginning, subtract 16bit result from the sum.

As my application is very timecritical, I really need to minimize processing time. In this algoritm, what takes the most time is the adding or subtracting of 16bits to the 24bits sum, but I have to do that weather or not I use a signed multiply or if I do according to above.

bigmac · ‎06-16-2007

Hello Johan,

If my understanding of your proposal is correct, there appears to be some advantage in arranging so that the sign of the coefficient be placed in a separate register prior to DFT processing. This should increase the (unsigned) resolution of the coefficient from 7-bits to 8-bits.

The following code would seem to achieve what you describe -

;**************************************************************
; ACCUMULATE SCALED SAMPLE VALUE TO 24-BIT SUM
; On entry,
; ACC = sample value (8-bit unsigned),
; X = Phase coefficient (positive 8-bit value),
; PH_SIGN register (MSB) contains the sign of the coefficient.

; On exit, 24-bit value of SUM is updated.
; RAM registers PH_SIGN and SUM must reside in page zero.

PROCSAMP: MUL
BRCLR 7,PH_SIGN,PS1 ; Branch if positive sign

          ; Subtract scaled result from 24-bit sum
          PSHA
          PSHX
          LDA    SUM+2
          SUB    2,SP
          STA    SUM+2
          LDA    SUM+1
          SBC    1,SP
          STA    SUM+1
          BCC    *+4            ; Skip next if no carry
          DEC    SUM
          AIS    #2             ; Adjust stack pointer
          RTS

PS1:      ; Add scaled result to 24-bit sum
          ADD    SUM+2
          STA    SUM+2
          TXA
          ADC    SUM+1
          STA    SUM+1
          BCC    *+4            ; Skip next if no carry
          INC    SUM
          RTS

The processing of the routine should take 43 cycles for a positive coefficient, and 56 cycles for a negative coefficient. The question is can you live with this sort of sample processing period?

I also looked at the possibility, in the case of a negative coefficient, of negating the result of the multiplication, and then adding to SUM - this would appear to save only two cycles (assuming my coding is correct), and also quite a few bytes of code.

Regards,
Mac

Message Edited by bigmac on 2007-06-17 02:02 AM

JohanF · ‎06-17-2007

You have understood correctly. Your code is almost identical to what I came up with (but you won by a couple of cycles...).

I have pasted your code below and calculated cycles, I get 7 cycles less than you have calculated in both cases. Am I wrong?

Are you willing to share your code for taking the negative of a 16-bit number? I think I would benefit by gaining even just a few clockcycles for the negative coefficient.

PROCSAMP: MUL       ; 5 cycles
          BRCLR 7,PH_SIGN,PS1 ; Branch if positive sign ; 5 cycles

          ; Subtract scaled result from 24-bit sum
          PSHA       ; 2 cycles
          PSHX       ; 2 cycles
          LDA    SUM+2      ; 3 cycles
          SUB    2,SP      ; 4 cycles
          STA    SUM+2      ; 4 cycles
          LDA    SUM+1      ; 3 cycles
          SBC    1,SP      ; 4 cycles
          STA    SUM+1      ; 4 cycles
          BCC    *+4            ; Skip next if no carry  ; 3 cycles
          DEC    SUM      ; 4 cycles
          AIS    #2             ; Adjust stack pointer  ; 2 cycles
          RTS       ; 4 cycles

PS1:      ; Add scaled result to 24-bit sum
          ADD    SUM+2      ; 3 cycles
          STA    SUM+2      ; 4 cycles
          TXA       ; 1 cycle
          ADC    SUM+1      ; 3 cycles
          STA    SUM+1      ; 4 cycles
          BCC    *+4            ; Skip next if no carry  ; 3 cycles
          INC    SUM      ; 4 cycles
          RTS       ; 4 cycles

I get the result 36 cycles for a positive coefficient and 49 cycles for a negative coefficient.

CompilerGuru · ‎06-18-2007

Is this for a HC08 or a S08?
At least for an S08, STA DIR is just 3 cycles, then it would even help 2 cycles to store the multiplication result into the zero page instead of the stack.

Daniel

rocco · ‎06-17-2007

Oh, and here is a routine that negates the 16-bit value, using H:X

        com     value16        ; 4  - compliment the high byte        ldhx    value16        ; 4  - load all 16 bits into H:X        comx                   ; 1  - compliment the low byte        ais     #1             ; 2  - increment all 16 bits
        sthx    value16        ; 4  - save it in place (optional)
;                              =15

On the HC08, it takes only 15 cycles. It's a first draft, so it might be able to be improved upon.

peg · ‎06-17-2007

Hi rocco,

That would be aix (not ais)

Still time to edit?

rocco · ‎06-18-2007

OOps. Thanks Peg.

Unfortunately, it is too late to edit, although it has only been 45 minutes. What is the timeout, anyway?

So here is the corrected version:

        com     value16        ; 4  - compliment the high byte        ldhx    value16        ; 4  - load all 16 bits into H:X        comx                   ; 1  - compliment the low byte        aix     #1             ; 2  - increment all 16 bits        sthx    value16        ; 4  - save it in place (optional);                              =15

AIX and AIS both take 2 cycles.

peg · ‎06-18-2007

Hi,

I believe the edit time depends on the relative alignment of the planets. Whether or not pluto is included in the calculation seems to vary. The short answer - Try it!

bigmac · ‎06-18-2007

Hello all,

Since the negated 16-bit value must then be added to the 24 bit accumulator SUM, I would suspect that the negation process should also include the instruction COM SUM, an additional 5 cycles, to prepare for the addition.

Because the 16-bit value resides in X:A after the MUL instruction, the following represents my thoughts on the negation process -

COMX ; 1 cycle

NEGA ; 1 cycle

BNE *+3 ; 3 cycles

INCX ; 1 cycle

COM SUM ; 5 cycles

If my reckoning is correct, in the case of a negative coefficient, this should give a total cycle count of 53/54 for the routine.

Another possible reason for the cycle count discrepancy - when I estimate the cycle count of a sub-routine, I always allow for a JSR instruction (6 cycles) to call the routine.

Regards,

Mac

Message Edited by bigmac on 2007-06-18 01:45 PM

JohanF · ‎06-18-2007

Hi,

I have to ask the question, with the risk of looking stupid. Why to you need "COM SUM". To subtract a value from SUM, I would just take the 2-complement of the value and add to SUM, without manipulating SUM first.

What is your opinion?

/Johan

CompilerGuru · ‎06-18-2007

Hmm I think the COM SUM is a typo, it should be a DEC SUM instead.
Basically just adding the complement is correct, but you have to add the 24 bit complement. As we know
the multiplication as 24 bit signed value is negative for that case, adding the high byte means adding $FF to the high byte, so its the same as a DEC SUM.
Hmmm. That DEC SUM could probably be combined with the existing conditional INC SUM to just a conditional DEC SUM.

No guarantee that the attached code is correct. I get 45 cycles worst case for the subtraction case.

JohanF · ‎06-18-2007

Yes, off course, sometimes I just don't think it through!

When adding a negative 16-bit value to a 24-bit sum, the 16-bit value must naturally be sign-extended to 24-bits first, otherwise we have a serious bug.

That means it should probably be a DEC SUM, or something like that.

bigmac · ‎06-18-2007

Hello Daniel,

Yes, you are correct about the use of DEC SUM - I was wrong.

Your other alternative of using zero page RAM in lieu of stack, for temporary storage - I cannot see any advantage. Using the stack requires 2 cycles for each push, and 4 cycles for the subtract. Using zero page requires 3 cycles for store, and 3 cycles for subtract. The situation seems to be identical for both HC08 and S08.

For the code given, the differences between HC08 and S08 appear to be -

Calling JSR - HC08 = 5 cycles, S08 = 6 cycles

DEC SUM or INC SUM - HC08 = 4 cycles, S08 = 5 cycles

RTS - HC08 = 4 cycles, S08 = 6 cycles

For the final "negate version" of the routine, these are the cycles I get -

HC08:

Negative coefficient: 40, 41, 44 or 45 (44 mostly)

Positive coefficient: 35 or 39 (35 mostly)

S08:

Negative coefficient: 43, 44, 48 or 49 (48 mostly)

Positive coefficient: 38 or 43 (38 mostly)

Regards,

Mac

CompilerGuru · ‎06-18-2007

The benefit of the zeropage version is that it does not have to cleanup the stack, it does not need the AIS #2 and is 2 cycles quicker therefore.

Daniel

Signed multiplication using MUL

Signed multiplication using MUL

General