Hello out there,

I need a fast executing assembly code for signed multiplication (8bitx8bit) using the MUL instruction

/Johan

Hello out there,

I need a fast executing assembly code for signed multiplication (8bitx8bit) using the MUL instruction

/Johan

- Hello Johan, and welcome to the forum.As mentioned by Daniel, an approach to 8-bit signed multiply, with 16-bit result, is to convert any negative input value to positive, multiply, then process the signs separately, and convert the result to a negative value, if required.The following assembly routine appears to work - and assumes ACC and X contain the 8-bit signed values. On exit, X:A should contain the signed result.SMUL8: AIS #-1

CLR 1,SP ; Sign calculation on stack

TSTA

BPL SM1 ; Branch if ACC is positive

NEGA

COM 1,SP

SM1: TSTX

BPL SM2

NEGX

COM 1,SP

SM2: MUL

TST 1,SP ; Test sign of result

BPL SM3 ; Exit if positive

COMA

COMX

ADD #1

BCC SM3

INCX

SM3: AIS #1 ; Adjust stack pointer

RTSAs shown, the routine will take between 43 and 65 cycles for a HCS08 device. Using a temporary, zero page RAM register, in lieu of the stack, will save a few cycles.Regards,Mac

- The CW _BMULS does take advantage of that the unsigned 8x8=16 and the signed 8x8=16 result are closely related. So it does the unsigned multiplication first and then patches the resulting high byte to match the signed multiplication behavior. It's 23 bytes, I did not count the cycles.

Daniel- Thanks again,Where do I find this lib\hc08c\src\rtshc08.c? Is it somwhere in the freescale webpage?/Johan
- Hello,The rtshc08.c file can be found within the CW installation. Once you find the file, you will need to search for the function _BMULS. This is an inline assembler routine within a C function wrapper.For those interested, I have done a performance comparison between the three methods outlined -Daniel:Size: 26 bytesBoth values positive: 46 cyclesOne value negative: 50 cyclesBoth values negative: 54 cyclesTonyP:Size: 30 bytesBoth values positive: 47 cyclesOne value negative: 54 or 55 cyclesBoth values negative: 49 cyclesBigmac:Size: 35 bytesBoth values positive: 43 cyclesOne value negative: 57 or 58 cyclesBoth values negative: 57 cyclesAll results are based on the HCS08, and assume the routine is called with a JSR instruction (6 cycles). Not surprisingly, the performance differences are small, with the method outlined by Daniel having the best overall efficiency (as might be expected).Regards,Mac
- When manually counting the bytes in _BMULS I do get to 26 bytes too, however when the compiler compiles it does

perform some frame optimizations (replace SP accesses with H:X relative ones), so the compiler does really emit 23 bytes, I did count correctly .

I think the frame optimization is the only optimization done for HLI by default, it can be disabled with -onx.

BTW. The TonyP version can save a byte by combining the inca and the coma to a nega

Daniel

- Interesting...My application is that I want to calculate a DFT to detect a few discret fequencies, from a set of sampled signals.The signalsamples are 8 bits, always positive.The phasekoefficients are 8 bits either positive or negative.So the algoritm is to multiply each sample by its pasekoefficient (into 16 bits result) and add them all together (into 24bits sum).Based on the number of clockcycles the signed multiplication algoritm takes, I believe I will get away with less clockcycles by:1. Check sign of phasecoefficient. If negative, negate it to make it positive2. Calculate unsigned multiplication of phasecoefficient and sample.3. If the phasecoefficient was positive from the beginning, add 16bit result to the sum.4. If the phasecoefficient was negative from the beginning, subtract 16bit result from the sum.As my application is very timecritical, I really need to minimize processing time. In this algoritm, what takes the most time is the adding or subtracting of 16bits to the 24bits sum, but I have to do that weather or not I use a signed multiply or if I do according to above.
- Hello Johan,If my understanding of your proposal is correct, there appears to be some advantage in arranging so that the sign of the coefficient be placed in a separate register prior to DFT processing. This should increase the (unsigned) resolution of the coefficient from 7-bits to 8-bits.The following code would seem to achieve what you describe -;**************************************************************

; ACCUMULATE SCALED SAMPLE VALUE TO 24-BIT SUM

; On entry,

; ACC = sample value (8-bit unsigned),

; X = Phase coefficient (positive 8-bit value),

; PH_SIGN register (MSB) contains the sign of the coefficient.; On exit, 24-bit value of SUM is updated.

; RAM registers PH_SIGN and SUM must reside in page zero.PROCSAMP: MUL

BRCLR 7,PH_SIGN,PS1 ; Branch if positive sign; Subtract scaled result from 24-bit sum

PSHA

PSHX

LDA SUM+2

SUB 2,SP

STA SUM+2

LDA SUM+1

SBC 1,SP

STA SUM+1

BCC *+4 ; Skip next if no carry

DEC SUM

AIS #2 ; Adjust stack pointer

RTSPS1: ; Add scaled result to 24-bit sum

ADD SUM+2

STA SUM+2

TXA

ADC SUM+1

STA SUM+1

BCC *+4 ; Skip next if no carry

INC SUM

RTS

The processing of the routine should take 43 cycles for a positive coefficient, and 56 cycles for a negative coefficient. The question is can you live with this sort of sample processing period?I also looked at the possibility, in the case of a negative coefficient, of negating the result of the multiplication, and then adding to SUM - this would appear to save only two cycles (assuming my coding is correct), and also quite a few bytes of code.Regards,

Mac

Message Edited by bigmac on 2007-06-17 02:02 AM- You have understood correctly. Your code is almost identical to what I came up with (but you won by a couple of cycles...).I have pasted your code below and calculated cycles, I get 7 cycles less than you have calculated in both cases. Am I wrong?Are you willing to share your code for taking the negative of a 16-bit number? I think I would benefit by gaining even just a few clockcycles for the negative coefficient.PROCSAMP: MUL ; 5 cycles

BRCLR 7,PH_SIGN,PS1 ; Branch if positive sign ; 5 cycles

; Subtract scaled result from 24-bit sum

PSHA ; 2 cycles

PSHX ; 2 cycles

LDA SUM+2 ; 3 cycles

SUB 2,SP ; 4 cycles

STA SUM+2 ; 4 cycles

LDA SUM+1 ; 3 cycles

SBC 1,SP ; 4 cycles

STA SUM+1 ; 4 cycles

BCC *+4 ; Skip next if no carry ; 3 cycles

DEC SUM ; 4 cycles

AIS #2 ; Adjust stack pointer ; 2 cycles

RTS ; 4 cycles

PS1: ; Add scaled result to 24-bit sum

ADD SUM+2 ; 3 cycles

STA SUM+2 ; 4 cycles

TXA ; 1 cycle

ADC SUM+1 ; 3 cycles

STA SUM+1 ; 4 cycles

BCC *+4 ; Skip next if no carry ; 3 cycles

INC SUM ; 4 cycles

RTS ; 4 cycles

I get the result 36 cycles for a positive coefficient and 49 cycles for a negative coefficient.

You did not mention which processor you were using, but Mac mentioned that his cycle count was based on the S08. Could your cycle counts be based on the HC08? There is a small difference in cycles between the two versions for some instructions. I did not count cycles myself, as I bet you and Mac are both correct.

As for negating a sixteen-bit number, here is the code I have been using since the early eighties:** Negate the 16 bit signed integer in the psuedo-accumulator.*NEG16 MACRO COM ACCUM1 ; 4 LDA ACCUM0 ; 3 COMA ; 1 ADD #1 ; 2 STA ACCUM0 ; 3 LDA ACCUM1 ; 3 ADC #0 ; 2 STA ACCUM1 ; 3 ENDM ; =21*

This is a macro that negates the low sixteen bits of my 32-bit pseudo-accumulator, but it should be easy to adapt. It takes 21 cycles on an HC08 (I don't use any S08s yet).

If you find a method that is faster, please post it, as everything I do is math-intensive.

- On the HC08, it takes only 15 cycles. It's a first draft, so it might be able to be improved upon.
com value16 ; 4 - compliment the high byte ldhx value16 ; 4 - load all 16 bits into H:X comx ; 1 - compliment the low byte ais #1 ; 2 - increment all 16 bits

sthx value16 ; 4 - save it in place (optional)

; =15

Unfortunately, it is too late to edit, although it has only been 45 minutes. What is the timeout, anyway?

So here is the corrected version:com value16 ; 4 - compliment the high byte ldhx value16 ; 4 - load all 16 bits into H:X comx ; 1 - compliment the low byte aix #1 ; 2 - increment all 16 bits sthx value16 ; 4 - save it in place (optional); =15

AIX and AIS both take 2 cycles.- Hello all,Since the negated 16-bit value must then be added to the 24 bit accumulator SUM, I would suspect that the negation process should also include the instruction COM SUM, an additional 5 cycles, to prepare for the addition.

Because the 16-bit value resides in X:A after the MUL instruction, the following represents my thoughts on the negation process -COMX ; 1 cycleNEGA ; 1 cycleBNE *+3 ; 3 cyclesINCX ; 1 cycleCOM SUM ; 5 cyclesIf my reckoning is correct, in the case of a negative coefficient, this should give a total cycle count of 53/54 for the routine.

Another possible reason for the cycle count discrepancy - when I estimate the cycle count of a sub-routine, I always allow for a JSR instruction (6 cycles) to call the routine.Regards,Mac

Message Edited by bigmac on 2007-06-18 01:45 PM- Hi,I have to ask the question, with the risk of looking stupid. Why to you need "COM SUM". To subtract a value from SUM, I would just take the 2-complement of the value and add to SUM, without manipulating SUM first.What is your opinion?/Johan
- Hmm I think the COM SUM is a typo, it should be a DEC SUM instead.

Basically just adding the complement is correct, but you have to add the 24 bit complement. As we know

the multiplication as 24 bit signed value is negative for that case, adding the high byte means adding $FF to the high byte, so its the same as a DEC SUM.

Hmmm. That DEC SUM could probably be combined with the existing conditional INC SUM to just a conditional DEC SUM.

No guarantee that the attached code is correct. I get 45 cycles worst case for the subtraction case.- Thanks to all for the interesting discussion.Are there any tools commonly available to calculate raw cycles for a given asm code fragment and a given processor, or is this done by hand in each case?And just out of curiosity: what kind of performance does "C" deliver on this problem?Regards,Curt
- Hello Curt,The method I used to determine bus cycles consumed was to run the code in the debugger under full chip simulation, and then single step through the code, whilst noting the cycle count. But you still have to be aware of whether conditional branches are taken, or not, to ascertain the worst case. By single stepping, you can immediately see the number of cycles for each instruction.This method worked fine for the signed multiplication routines, that did not loop. If the code makes a considerable number of loops in normal operation, you migh single step the first time through the loop, and then set a breakpoint to exit the loop.Similar methods can also be used for C code.Regards,Mac

Message Edited by bigmac on 2007-06-19 02:15 AM- The debugger has a RESETCYCLES command which can be used to set the start cycle count to 0, that's what I did use.

Also the trace window is quite nice, open it ("open trace" in the command window), choose enable trace in the context menu and also go to Instruction mode (also context menu).

Then step or run through the function and you see where the cycles have been spent.

Alternatively, both the compiler and the decoder support to annotate the cycles in their listing file.

In case you are using the assembler, the decoder can be used. While this avoids that you have to lookup the instruction details, it does not actually add up the spent cycles.

- Hello Daniel,Yes, you are correct about the use of DEC SUM - I was wrong.Your other alternative of using zero page RAM in lieu of stack, for temporary storage - I cannot see any advantage. Using the stack requires 2 cycles for each push, and 4 cycles for the subtract. Using zero page requires 3 cycles for store, and 3 cycles for subtract. The situation seems to be identical for both HC08 and S08.For the code given, the differences between HC08 and S08 appear to be -Calling JSR - HC08 = 5 cycles, S08 = 6 cyclesDEC SUM or INC SUM - HC08 = 4 cycles, S08 = 5 cyclesRTS - HC08 = 4 cycles, S08 = 6 cyclesFor the final "negate version" of the routine, these are the cycles I get -HC08:Negative coefficient: 40, 41, 44 or 45 (44 mostly)Positive coefficient: 35 or 39 (35 mostly)S08:Negative coefficient: 43, 44, 48 or 49 (48 mostly)Positive coefficient: 38 or 43 (38 mostly)Regards,Mac

- I did look it up, and on a HC08 STA DIR is 3 cycles as well, so placing the multiplication result into the zero page does help for the HC08 too.

Also note that the STA is one cycle faster than noted in the previous message.

>STA SUM+2 ; 4 cycles

Should be

STA SUM+2 ; 3 cycles

Make sure the assembler is using DIR addressing mode for that operation, with EXT it is really 4 cycles but it does buy anything as the INC/DEC needs DIR anyway.

Daniel

Especially the 8x8=8 bit signed multiplication in simple, its the same as the unsigned one

For the 8x8=16 bit signed multiplication, the CW _BMULS runtime routine in lib\hc08c\src\rtshc08.c does just that, it first multiplies unsigned and then adapts the result if the operands have been negative.

And are both of your operands eventually negative, or is one known to be positive?

What do you need it for?

Daniel

Message Edited by CompilerGuru on 2007-06-14 11:21 PM