Signed multiplication using MUL

cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 

Signed multiplication using MUL

6,347 Views
JohanF
Contributor I
Hello out there,
I need a fast executing assembly code for signed multiplication (8bitx8bit) using the MUL instruction
/Johan 
Labels (1)
0 Kudos
29 Replies

1,471 Views
JohanF
Contributor I
Hi,
 
I run my application on a HC08.
/Johan
0 Kudos

1,471 Views
CompilerGuru
NXP Employee
NXP Employee
I did look it up, and on a HC08 STA DIR is 3 cycles as well, so placing the multiplication result into the zero page does help for the HC08 too.
Also note that the STA is one cycle faster than noted in the previous message.

>STA    SUM+2      ; 4 cycles

Should be

STA    SUM+2      ; 3 cycles

Make sure the assembler is using DIR addressing mode for that operation, with EXT it is really 4 cycles but it does buy anything as the INC/DEC needs DIR anyway.

Daniel
0 Kudos

1,471 Views
tonyp
Senior Contributor II
And here's yet another version.
0 Kudos

1,471 Views
CompilerGuru
NXP Employee
NXP Employee
The CW _BMULS does take advantage of that the unsigned 8x8=16 and the signed 8x8=16 result are closely related. So it does the unsigned multiplication first and then patches the resulting high byte to match the signed multiplication behavior. It's 23 bytes, I did not count the cycles.

Daniel


0 Kudos

1,471 Views
JohanF
Contributor I
Thanks again,
 
Where do I find this lib\hc08c\src\rtshc08.c? Is it somwhere in the freescale webpage?
 
/Johan
0 Kudos

1,471 Views
bigmac
Specialist III
Hello,
 
The rtshc08.c file can be found within the CW installation.  Once you find the file, you will need to search for the function _BMULS.  This is an inline assembler routine within a C function wrapper.
 
For those interested, I have done a performance comparison between the three methods outlined -
 
Daniel:
Size:  26 bytes
Both values positive:  46 cycles
One value negative:  50 cycles
Both values negative: 54 cycles
 
TonyP:
Size:  30 bytes
Both values positive:  47 cycles
One value negative:  54 or 55 cycles
Both values negative: 49 cycles
 
Bigmac:
Size:  35 bytes
Both values positive:  43 cycles
One value negative:  57 or 58 cycles
Both values negative: 57 cycles
 
All results are based on the HCS08, and assume the routine is called with a JSR instruction (6 cycles).  Not surprisingly, the performance differences are small, with the method outlined by Daniel having the best overall efficiency (as might be expected).
 
Regards,
Mac
 
0 Kudos

1,471 Views
JohanF
Contributor I
Interesting...
 
My application is that I want to calculate a DFT to detect a few discret fequencies, from a set of sampled signals.
The signalsamples are 8 bits, always positive.
The phasekoefficients are 8 bits either positive or negative.
 
So the algoritm is to multiply each sample by its pasekoefficient (into 16 bits result) and add them all together (into 24bits sum).
Based on the number of clockcycles the signed multiplication algoritm takes, I believe I will get away with less clockcycles by:
 
1. Check sign of phasecoefficient. If negative, negate it to make it positive
2. Calculate unsigned multiplication of phasecoefficient and sample.
3. If the phasecoefficient was positive from the beginning, add 16bit result to the sum.
4. If the phasecoefficient was negative from the beginning, subtract 16bit result from the sum.   
 
As my application is very timecritical, I really need to minimize processing time. In this algoritm, what takes the most time is the adding or subtracting of 16bits to the 24bits sum, but I have to do that weather or not I use a signed multiply or if I do according to above.
0 Kudos

1,471 Views
bigmac
Specialist III
Hello Johan,
 
If my understanding of your proposal is correct, there appears to be some advantage in arranging so that the sign of the coefficient be placed in a separate register prior to DFT processing.  This should increase the (unsigned) resolution of the coefficient from 7-bits to 8-bits.
 
The following code would seem to achieve what you describe -
 
;**************************************************************
; ACCUMULATE SCALED SAMPLE VALUE TO 24-BIT SUM
; On entry,
; ACC = sample value (8-bit unsigned),
; X = Phase coefficient (positive 8-bit value),
; PH_SIGN register (MSB) contains the sign of the coefficient.
 
; On exit, 24-bit value of SUM is updated.
; RAM registers PH_SIGN and SUM must reside in page zero.
 
PROCSAMP: MUL
          BRCLR  7,PH_SIGN,PS1  ; Branch if positive sign
 
          ; Subtract scaled result from 24-bit sum
          PSHA
          PSHX
          LDA    SUM+2
          SUB    2,SP
          STA    SUM+2
          LDA    SUM+1
          SBC    1,SP
          STA    SUM+1
          BCC    *+4            ; Skip next if no carry
          DEC    SUM
          AIS    #2             ; Adjust stack pointer
          RTS
 
PS1:      ; Add scaled result to 24-bit sum
          ADD    SUM+2
          STA    SUM+2
          TXA
          ADC    SUM+1
          STA    SUM+1
          BCC    *+4            ; Skip next if no carry
          INC    SUM
          RTS

The processing of the routine should take 43 cycles for a positive coefficient, and 56 cycles for a negative coefficient.  The question is can you live with this sort of sample processing period?
 
I also looked at the possibility, in the case of a negative coefficient, of negating the result of the multiplication, and then adding to SUM - this would appear to save only two cycles (assuming my coding is correct), and also quite a few bytes of code.

Regards,
Mac

 




Message Edited by bigmac on 2007-06-17 02:02 AM
0 Kudos

1,471 Views
JohanF
Contributor I
You have understood correctly. Your code is almost identical to what I came up with (but you won by a couple of cycles...).
I have pasted your code below and calculated cycles, I get 7 cycles less than you have calculated in both cases. Am I wrong? 
Are you willing to share your code for taking the negative of a 16-bit number? I think I would benefit by gaining even just a few clockcycles for the negative coefficient. 
 
PROCSAMP: MUL       ; 5 cycles
          BRCLR  7,PH_SIGN,PS1  ; Branch if positive sign ; 5 cycles
 
          ; Subtract scaled result from 24-bit sum
          PSHA       ; 2 cycles
          PSHX       ; 2 cycles
          LDA    SUM+2      ; 3 cycles
          SUB    2,SP      ; 4 cycles
          STA    SUM+2      ; 4 cycles
          LDA    SUM+1      ; 3 cycles
          SBC    1,SP      ; 4 cycles
          STA    SUM+1      ; 4 cycles
          BCC    *+4            ; Skip next if no carry  ; 3 cycles
          DEC    SUM      ; 4 cycles
          AIS    #2             ; Adjust stack pointer  ; 2 cycles
          RTS       ; 4 cycles
 
PS1:      ; Add scaled result to 24-bit sum
          ADD    SUM+2      ; 3 cycles
          STA    SUM+2      ; 4 cycles
          TXA       ; 1 cycle
          ADC    SUM+1      ; 3 cycles
          STA    SUM+1      ; 4 cycles
          BCC    *+4            ; Skip next if no carry  ; 3 cycles
          INC    SUM      ; 4 cycles
          RTS       ; 4 cycles

I get the result 36 cycles for a positive coefficient and 49 cycles for a negative coefficient.
0 Kudos

1,471 Views
CompilerGuru
NXP Employee
NXP Employee
Is this for a HC08 or a S08?
At least for an S08, STA DIR is just 3 cycles, then it would even help 2 cycles to store the multiplication result into the zero page instead of the stack.

Daniel
0 Kudos

1,471 Views
rocco
Senior Contributor II
Oh, and here is a routine that negates the 16-bit value, using H:X
        com     value16        ; 4  - compliment the high byte        ldhx    value16        ; 4  - load all 16 bits into H:X        comx                   ; 1  - compliment the low byte        ais     #1             ; 2  - increment all 16 bits
        sthx    value16        ; 4  - save it in place (optional)
;                              =15

On the HC08, it takes only 15 cycles. It's a first draft, so it might be able to be improved upon.
0 Kudos

1,471 Views
peg
Senior Contributor IV
Hi rocco,
 
That would be aix (not ais)
Still time to edit?
 
0 Kudos

1,471 Views
rocco
Senior Contributor II
OOps. Thanks Peg.

Unfortunately, it is too late to edit, although it has only been 45 minutes. What is the timeout, anyway?

So here is the corrected version:
        com     value16        ; 4  - compliment the high byte        ldhx    value16        ; 4  - load all 16 bits into H:X        comx                   ; 1  - compliment the low byte        aix     #1             ; 2  - increment all 16 bits        sthx    value16        ; 4  - save it in place (optional);                              =15

 AIX and AIS both take 2 cycles.

0 Kudos

1,471 Views
peg
Senior Contributor IV
Hi,
I believe the edit time depends on the relative alignment of the planets. Whether or not pluto is included in the calculation seems to vary. The short answer - Try it!
 
0 Kudos

1,471 Views
bigmac
Specialist III
Hello all,
 
Since the negated 16-bit value must then be added to the 24 bit accumulator SUM, I would suspect that the negation process should also include the instruction COM SUM, an additional 5 cycles, to prepare for the addition.
 
Because the 16-bit value resides in X:A after the MUL instruction, the following represents my thoughts on the negation process -
 
     COMX       ; 1 cycle
     NEGA       ; 1 cycle
     BNE  *+3   ; 3 cycles
     INCX       ; 1 cycle
     COM  SUM   ; 5 cycles
 
If my reckoning is correct, in the case of a negative coefficient, this should give a total cycle count of 53/54 for the routine.

Another possible reason for the cycle count discrepancy - when I estimate the cycle count of a sub-routine, I always allow for a JSR instruction (6 cycles) to call the routine.
 
Regards,
Mac
 


Message Edited by bigmac on 2007-06-18 01:45 PM
0 Kudos

1,471 Views
JohanF
Contributor I
Hi,
 
I have to ask the question, with the risk of looking stupid. Why to you need "COM SUM". To subtract a value from SUM, I would just take the 2-complement of the value and add to SUM, without manipulating SUM first.
What is your opinion?
 
/Johan 
0 Kudos

1,471 Views
CompilerGuru
NXP Employee
NXP Employee
Hmm I think the COM SUM is a typo, it should be a DEC SUM instead.
Basically just adding the complement is correct, but you have to add the 24 bit complement. As we know
the multiplication as 24 bit signed value is negative for that case, adding the high byte means adding $FF to the high byte, so its the same as a DEC SUM.
Hmmm. That DEC SUM could probably be combined with the existing conditional INC SUM to just a conditional DEC SUM.

No guarantee that the attached code is correct. I get 45 cycles worst case for the subtraction case.
0 Kudos

1,471 Views
JohanF
Contributor I
Yes, off course, sometimes I just don't think it through!
When adding a negative 16-bit value to a 24-bit sum, the 16-bit value must naturally be sign-extended to 24-bits first, otherwise we have a serious bug.
That means it should probably be a DEC SUM, or something like that.
0 Kudos

1,471 Views
bigmac
Specialist III
Hello Daniel,
 
Yes, you are correct about the use of DEC SUM - I was wrong.
 
Your other alternative of using zero page RAM in lieu of stack, for temporary storage - I cannot see any advantage.  Using the stack requires 2 cycles for each push, and 4 cycles for the subtract.  Using zero page requires 3 cycles for store, and 3 cycles for subtract.  The situation seems to be identical for both HC08 and S08.
 
For the code given, the differences between HC08 and S08 appear to be -
Calling JSR - HC08 = 5 cycles, S08 = 6 cycles
DEC SUM or INC SUM - HC08 = 4 cycles, S08 = 5 cycles
RTS - HC08 = 4 cycles, S08 = 6 cycles
 
For the final "negate version" of the routine, these are the cycles I get -
HC08:
Negative coefficient:  40, 41, 44 or 45 (44 mostly)
Positive coefficient:  35 or 39 (35 mostly)
S08:
Negative coefficient:  43, 44, 48 or 49 (48 mostly)
Positive coefficient:  38 or 43 (38 mostly)
 
Regards,
Mac
 
0 Kudos

1,471 Views
CompilerGuru
NXP Employee
NXP Employee
The benefit of the zeropage version is that it does not have to cleanup the stack, it does not need the AIS #2 and is 2 cycles quicker therefore.

Daniel
0 Kudos