Thanks again for all the inputs, and it's nice to see a friendly debate!
I now see how to do the accumulation to more than 8 bits.
I suppose I was rather "leading the witness" with my question about right-shifting to do the divide. I had expected there was a way to do as many right-shifts as wanted in one operation, but that was perhaps overly optimistic.
I had been led to believe that division was slow at this level, but looking at the operations available, it seems to me that I may as well use DIV in this case. If I take 8 samples, I need to do 3 LSRs and 3 RORs, each of which is 4 cycles, for a total of 24 cycles, whereas I think the following would work (please tell me if it won't):
LDHX accumulated total (2 bytes)
LDA lower byte of accumulated total
LDX number of samples (so number to divide by)
DIV
then this will be 4+3+3+7 = 17 cycles.
(I've based the above on the data sheet, which describes DIV as A <- (H:A)/(X), but I may have misunderstood.)
This also allows more flexibility in the number of samples, e.g. I'm not restricted to 2, 4, 8, 16, ... as I would be if I were right-shifting.
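To check my own arithmetic, here is a rough C model of the two approaches. It is only a sketch of the maths, not the assembly itself. The function names are made up for illustration, and it assumes my reading of the data sheet is right: DIV divides the 16-bit value H:A by X and leaves the quotient in A. It also assumes the remainder ends up in H and that the quotient has to fit in 8 bits, which should always hold for an average of 8-bit samples.

```c
#include <assert.h>
#include <stdint.h>

/* Sum 8-bit samples into a 16-bit total (the "accumulated total"). */
static uint16_t accumulate(const uint8_t *samples, uint8_t count)
{
    uint16_t total = 0;
    for (uint8_t i = 0; i < count; i++) {
        total = (uint16_t)(total + samples[i]);
    }
    return total;
}

/* Model of DIV as I read it: A <- (H:A)/X.
   Assumption: the remainder is left in H. */
static uint8_t div_h_a_by_x(uint16_t total, uint8_t n, uint8_t *remainder)
{
    assert(n != 0);              /* no divide by zero */
    assert((total >> 8) < n);    /* quotient fits in 8 bits only if H < X */
    *remainder = (uint8_t)(total % n);
    return (uint8_t)(total / n);
}

/* Shift alternative: per bit, one LSR on the high byte
   followed by one ROR on the low byte. */
static uint8_t average_by_shift(uint16_t total, uint8_t shifts)
{
    while (shifts--) {
        uint8_t hi = (uint8_t)(total >> 8);
        uint8_t lo = (uint8_t)total;
        uint8_t carry = hi & 1u;                    /* LSR: bit 0 goes to carry */
        hi >>= 1;
        lo = (uint8_t)((lo >> 1) | (carry << 7));   /* ROR: carry enters bit 7 */
        total = (uint16_t)((hi << 8) | lo);
    }
    return (uint8_t)total;                          /* result is in the low byte */
}
```

With 8 samples both routes should give the same average, while only the divide route handles a count such as 6.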
In terms of dither, I think that in my real-world application of reading sensors, the readings will be changing every sample, so dither won't be an issue. I am happy enough to get an average of however many samples.
My device is battery-operated, and away from the mains, so I don't think mains hum will be an issue.
Just to try to add something to the filter discussion, I have seen use of the following algorithm:
Constant A1 = exp(-sample time/filter time constant)
Constant B0 = 1 - A1
filtered value = A1 * previous filtered value + B0 * new unfiltered value
which is apparently from "Introduction to Dynamics and Control", Power and Simpson, McGraw Hill.
For interest, is this algorithm any good, or is it junk? Would this be implemented in assembly by scaling A1 and B0 to proportions of 256, then multiplying and adding, then just taking the upper byte of the 16-bit result?
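To make the question concrete, this is roughly what I have in mind, written as a C model rather than assembly. It is a sketch under my own assumptions: the function name is made up, A1 is passed in already scaled to 0..255 (i.e. roughly 256 * exp(-sample time/filter time constant)), and B0 is taken as 256 minus that, so the two always sum to 256 and the 16-bit sum cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

/* One step of: filtered = A1 * previous + B0 * new,
   with A1 and B0 scaled so that A1 + B0 = 256.
   a1_scaled is a hypothetical parameter: A1 * 256, truncated to 0..255. */
static uint8_t filter_step(uint8_t previous, uint8_t sample, uint8_t a1_scaled)
{
    uint16_t b0_scaled = (uint16_t)(256u - a1_scaled);   /* 1..256 */

    /* Largest possible sum is 256 * 255 = 65280, so 16 bits are enough. */
    uint16_t sum = (uint16_t)((uint16_t)a1_scaled * previous
                              + b0_scaled * sample);

    assert(b0_scaled >= 1u && b0_scaled <= 256u);
    return (uint8_t)(sum >> 8);                          /* take the upper byte */
}
```

One thing I notice from this model is that taking only the upper byte truncates: with a1_scaled = 230, a previous value of 100 and a new value of 101, the result stays at 100. So the output may never quite reach a slowly changing input unless the low byte is kept between steps.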
I'll carry on trying to make sense of all that everyone has offered.