The EMACS instruction has no native guard bits to accommodate overflow when accumulating the 32-bit result.
Has anyone worked out the logic to implement a "manual" guard byte?
I'm interested in the fastest method possible, since my EMACS instruction is in the "hot spot" in a FIR filter loop.
It seems that the basic strategy is to check the overflow flag after the EMACS, and increment (or decrement) the guard byte. The EMACS accumulator in memory also needs to be adjusted correspondingly.
I should have mentioned that I'm using the S12X, and speed is a critical issue. I considered using just a plain EMULS and doing the addition "by hand", but that seems to end up costing more cycles than just trying to build off the EMACS (which does 32-bits of the addition for only 8 cycle cost).
After thinking more about the accumulator overflow, I've come up with something like the code below. When I carry or borrow into the high-order guard byte, I also adjust the EMACS accumulator by 0x8000:0000.
1. I believe that the same adjustment of 0x8000:0000 is needed for both overflow and underflow, i.e., I don't think I need to use 0x7FFF:FFFF in the underflow case.
2. I believe that doing the bit set (or bit clear) is the same as subtracting 0x8000:0000 from the accumulator in memory, but much faster. Does anyone see a problem with this?
3. Does anyone see a better (faster) way to do the branching? It seems that I need to separately determine if overflow occurred, and then in which direction.
??? ; Adjust IX for next multiplier
??? ; Adjust IY for next multiplicand
EMACS accum ; Sets the overflow flag.
BVC Loop ; [3/1] Branch if no overflow
; There was an overflow. Determine direction of overflow.
BMI IsNeg ; [3/1] Branch if overflowed in negative direction
; Overflow direction was positive.
INC guard ;  Overflowed in the positive direction.
BCLR accum,0x80 ;  Clearing the MSB is same as subtracting 0x8000:0000 from accum, but faster.
BRA Loop ; 
IsNeg: ; Overflow direction was negative (i.e., underflow)
DEC guard ;  Underflowed in the negative direction.
BSET accum,0x80 ;  Setting the MSB is the same as adding 0x8000:0000 to accum, but faster.
BRA Loop ; 
Worst case time after EMACS instruction for guard adjust is  cycles.
Yes, adjusting the accumulator on overflow is absolutely necessary. When adding +ve to +ve overflows to -ve, the accumulator should be adjusted and kept +ve. Otherwise, the next time you add +ve to -ve (which should be +ve but is -ve due to the previous overflow), the overflow won't happen when it has to, and the guard count will be lost.
Interesting idea, and it looks like it works on paper.
1. Yes, adding 0x8000:0 should work in both cases. -0x60...-0x20...=-0x80... is not an overflow case. And 0x60+0x20=-0x80 is an overflow case.
2. Yes, bset/bclr #0x80 can replace adding 0x8000:0. But I see them reversed in your code. In IsNeg case you should clear MSB bit (because it is already set) and in +ve case you should set it.
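To illustrate point 2, here is a small C model (not S12X code) of why a single-bit operation on the top byte can stand in for the 32-bit add. Adding 0x8000:0000 modulo 2^32 only ever flips the MSB, so BCLR does the job when the bit is currently set and BSET when it is clear. The function names are made up for this sketch.

```c
#include <stdint.h>

/* Add 0x80000000 to a 32-bit accumulator image, wrapping modulo 2^32. */
uint32_t add_half_range(uint32_t acc)
{
    return (uint32_t)(acc + UINT32_C(0x80000000));
}

/* The same adjustment as a one-bit operation on the MSB:
   equivalent to BCLR when the bit is set, BSET when it is clear. */
uint32_t toggle_msb(uint32_t acc)
{
    return acc ^ UINT32_C(0x80000000);
}
```

Both functions return the same value for every input, which is why the choice between BSET and BCLR has to follow the current state of the sign bit.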
On the S12X, EMULS code is a bit faster. For a 48-bit accumulator, with the X register not touched:
TFR Y,D ; 16->32 bit SEX source is D reg only
SEX D,Y ; sign extend product
> yes, adjusting accumulator on overflows is absolutely necessary.
> When adding +ve to +ve overflows to -ve, acc should be adjusted and kept +ve,
> else next time you add +ve to -ve(which should be +ve but is -ve due previous overflow),
> overflow won't happen when it has to happen and guard count will be lost.
I cannot follow, what is +ve? Some huge value which overflows when added?
I consider that guard counts the number of overflows outside of the 32 bit signed range. It gets incremented when the addition results in a value >= 2^31, it gets decremented when the result is < -2^31. When the 32 bit accumulator crosses 0 from -ve to +ve no overflow takes place, seems expected and ok to me.
The guard count is not just some additional bits for the sum, instead the value is
val = sval32 + 2^32*guard
with sval32 the signed accumulator and guard the overflow count (incremented for overflow, decremented for underflow, initially 0).
To take a scaled-down 16-bit example (val = sval16 + 2^16*guard): for val = 40000, sval16 would be -25536 and guard == 1. The guard is not the same as bits 32..39 of the sum; it is off by one if sval32 < 0.
So apart from having to treat the guard properly at the end, when comparing the final result with a 40-bit value, I don't see why the adjustment is necessary.
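The relation val = sval32 + 2^32*guard can be checked with a small C model. This is only a sketch of the arithmetic, not S12X code: the struct and function names are invented here, and the V flag is emulated with the usual rule (operands share a sign, result's sign differs).

```c
#include <stdint.h>

/* Hypothetical model: the wrapped 32-bit accumulator plus an overflow count. */
typedef struct {
    uint32_t bits;   /* raw accumulator bits, left untouched after overflow */
    int32_t  guard;  /* +1 per positive overflow, -1 per negative overflow */
} guarded_acc;

/* Add one signed 32-bit product with two's-complement wraparound. */
void guarded_add(guarded_acc *g, int32_t product)
{
    uint32_t a = g->bits;
    uint32_t b = (uint32_t)product;
    uint32_t r = (uint32_t)(a + b);                 /* wraps modulo 2^32 */
    /* Emulated V flag: operands have equal signs, result sign differs. */
    uint32_t v = (uint32_t)(~(a ^ b) & (a ^ r)) >> 31;
    if (v) {
        if (r & UINT32_C(0x80000000))
            g->guard++;   /* result went negative: positive overflow */
        else
            g->guard--;   /* result went positive: negative overflow */
    }
    g->bits = r;          /* no 0x8000:0000 adjustment in this model */
}

/* Reconstruct the true sum: val = sval32 + 2^32 * guard. */
int64_t guarded_value(const guarded_acc *g)
{
    int64_t sval32 = (g->bits & UINT32_C(0x80000000))
                         ? (int64_t)g->bits - 4294967296LL
                         : (int64_t)g->bits;
    return sval32 + (int64_t)g->guard * 4294967296LL;
}
```

In this model the accumulator is never adjusted, and the reconstructed value still tracks the true sum as long as the guard count itself does not overflow.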
Still wondering if I miss something here.
Thank you very much, I understand signed overflow better now.
Sorry for +ve and -ve. They mean positive (+ve) and negative (-ve). I saw the notation in some (probably non-Freescale) datasheet and thought everyone was familiar with it.
The oVerflow bit is set in two cases: a) when adding a positive number to a positive number gives a negative result (+ve + +ve = -ve); b) when adding a negative number to a negative number gives a positive result (-ve + -ve = +ve). I was concerned that the V bit might be falsely triggered or falsely not triggered, because on overflow the sign of the accumulator is sort of lost. In fact it is much simpler. The sign of the result isn't lost; it is encoded in two places: the sign of the accumulator and the guard count. That's why I used to think that the sign of the accumulator had to be preserved by adding or subtracting 0x80000000 on each overflow.
Summarizing, please correct me if I'm wrong:
1. On overflow, in case
result is negative: increment guard count because addend was positive
result is positive: decrement guard count because addend was negative
2. To convert the accumulator and guard count to a wider number, first sign-extend the accumulator, then add the guard count to the higher-order bits
For example for signed octets: 0x70 + 0x70 = 0xE0, V = 1, N = 1, Guard+1 = 1
16bits result: 1) sign extending 0xE0 to 16bits gives negative 0xFFE0. 2) 0xFFE0 + 2^8*Guard = 0x00E0
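The octet example can be written out as a short C sketch. The function name and types are assumptions made for illustration; the steps are the two from the summary: sign-extend the wrapped accumulator, then add guard * 2^8 modulo 2^16.

```c
#include <stdint.h>

/* Widen an 8-bit wrapped accumulator plus overflow count to 16 bits. */
uint16_t widen8_to_16(uint8_t acc8, int8_t guard)
{
    /* Step 1: sign-extend the 8-bit accumulator to 16 bits. */
    uint16_t ext = (acc8 & 0x80u) ? (uint16_t)(0xFF00u | acc8)
                                  : (uint16_t)acc8;
    /* Step 2: add the guard count to the high-order byte, modulo 2^16. */
    return (uint16_t)(ext + ((uint32_t)(uint16_t)guard << 8));
}
```

With acc8 = 0xE0 and guard = 1 this gives 0x00E0, matching the hand calculation. For the mirror case 0x90 + 0x90 (acc8 = 0x20, guard = -1) it gives 0xFF20, which is -224.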
Is the adjustment of the accumulator for the overflow case necessary? Or is it not sufficient to just increment/decrement the guard (considering the guard to be bits 32..39 of the value) and to take the guard into account when reading the final answer?
You could also store the guard value in the A or B registers if those are free.
Then the loop can be rotated so it starts with IsNeg. The advantage is that both "worst cases" now need the same number of branches and the worst case is therefore a bit better.
       CLRA
IsNeg: DECA ;guard ;  Underflowed in the negative direction.
Loop   EMACS accum ;  Sets the overflow flag.
       BVC Loop ; [3/1] Branch if no overflow
       ; There was an overflow. Determine direction of overflow.
       BMI IsNeg ; [3/1] Branch if overflowed in negative direction
       ; Overflow direction was positive.
       INCA ;guard ;  Overflowed in the positive direction.
       BRA Loop ;
It is not as simple as incrementing/decrementing the high bits on signed oVerflow. After the accumulator overflows, the next EMACS can leave the accumulator still overflowed (+ or -), but the V bit won't be set. For example, with overflow at +-100, you add 80 + 80 and get V=1 and -40. Next time you add -40 + 80 and get V=0 and +40, but you still have to increment the hi bits...
I think it would be simpler to just use the EMULS instruction, sign-extend the product, and add it to the accumulator. For a 40-bit accumulator on non-S12X parts it should be something like this:
SEX A,X ; sign extend product to X
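For reference, the multiply, sign-extend, and add approach can be modelled in C as below. This is only a sketch of the arithmetic, not a translation of the assembly: the function names are invented, and an int64_t stands in for the 40-bit accumulator in memory.

```c
#include <stdint.h>

/* One multiply-accumulate step into a wide accumulator. */
int64_t mac40(int64_t acc, int16_t x, int16_t y)
{
    int32_t product = (int32_t)x * (int32_t)y;  /* like EMULS: 16x16 -> signed 32 */
    return acc + (int64_t)product;              /* sign-extend, then wide add */
}

/* Bits 32..39 of the accumulator, i.e. what a guard byte in memory would hold. */
uint8_t guard_byte(int64_t acc)
{
    return (uint8_t)(((uint64_t)acc >> 32) & 0xFFu);
}
```

Because the product is sign-extended before the add, the high byte is ordinary two's-complement data here and needs no overflow-flag handling.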