How fast is fast?
This really boils down to dividing the binary number by ten four times and saving the remainders (plus the final quotient) as the BCD digits, which is straightforward on the HCS08 given its fast division instruction. The only "trick" is that the first two divisions by ten may overflow: DIV divides the 16-bit value in H:A by the 8-bit value in X and can only return an 8-bit quotient, whereas 65535/10 and 6553/10 are both larger than 255. Each of those divisions therefore has to be synthesized from a pair of divisions in which the remainder from the MSB division carries into the LSB division. Some contortions are needed to get the right values into the proper places, but that's just how optimized code goes on the terminally register-starved S08.
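To make the chained division concrete, here is a rough C model of the idea. It is only a sketch: the names div_step and div10_u16 are made up for illustration, and div_step merely imitates what DIV does with H, A and X.

```c
#include <assert.h>
#include <stdint.h>

/* Imitates one HCS08 DIV: (H:A) / X gives the quotient in A and the
   remainder in H. The quotient only fits in 8 bits when H < X on entry. */
uint8_t div_step(uint8_t *h, uint8_t a, uint8_t x)
{
    assert(*h < x);                       /* otherwise the real DIV overflows */
    uint16_t dividend = (uint16_t)((*h << 8) | a);
    *h = (uint8_t)(dividend % x);         /* remainder stays in H */
    return (uint8_t)(dividend / x);       /* quotient comes back in A */
}

/* Divides *value by 10 in place using two chained 8-bit steps and
   returns the remainder, which is the next decimal digit. */
uint8_t div10_u16(uint16_t *value)
{
    uint8_t h = 0;
    uint8_t q_hi = div_step(&h, (uint8_t)(*value >> 8), 10);   /* MSB first */
    uint8_t q_lo = div_step(&h, (uint8_t)(*value & 0xFF), 10); /* remainder carries in */
    *value = (uint16_t)((q_hi << 8) | q_lo);
    return h;
}
```

Because the remainder of the MSB step is always below ten, the LSB step never overflows, which is why the pair of divisions is safe where a single one is not.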
The following is my half-baked attempt, which takes the 16-bit bin variable as input and writes a 3-byte (big-endian) packed BCD buffer, bcd, as output. It should take 103 cycles, assuming zero-page variables.
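lda bin+0 ;1st division (0..65535 / 10), MSB first
ldhx #10 ;H = 0, X = divisor
div ;A = MSB / 10, H = remainder
psha ;save quotient MSB
lda bin+1
div ;A = quotient LSB, H = ones digit
sta bcd+1 ;park quotient LSB in bcd+1 (scratch)
pula ;A = quotient MSB
pshh ;save ones digit
ldhx #10 ;2nd division (0..6553 / 10)
div ;A = quotient MSB, H = remainder
psha ;save quotient MSB
lda bcd+1 ;A = previous quotient LSB
div ;A = quotient LSB, H = tens digit
sthx bcd+1 ;bcd+1 = tens digit (bcd+2 gets X, overwritten later)
pulh ;3rd division (0..655 / 10), H = quotient MSB, X is still 10
div ;A = quotient (0..65), H = hundreds digit
pshh ;save hundreds digit
ldhx #10 ;4th division (0..65 / 10)
div ;A = ten-thousands digit, H = thousands digit
pshh ;save thousands digit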
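sta bcd+0 ;Save and pack the final BCD nibbles (bcd+0 = ten-thousands digit)
tsx ;H:X = SP+1, so 0,x = thousands, 1,x = hundreds, 2,x = ones
lda bcd+1 ;A = tens digit
nsa ;move it to the high nibble
ora 2,x ;merge in the ones digit
sta bcd+2 ;bcd+2 = tens:ones
pula ;A = thousands digit
nsa ;move it to the high nibble
ora 1,x ;merge in the hundreds digit
sta bcd+1 ;bcd+1 = thousands:hundreds
ais #2 ;drop the two remaining stack bytes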
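If you want to sanity-check the routine, one option is to compare its output against a plain host-side model of the same output layout. The sketch below is an assumption about how you might do that, not part of the S08 code; bin16_to_bcd is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

/* Reference model: 16-bit binary to 3-byte big-endian packed BCD,
   using the same layout as the bcd buffer in the assembly routine. */
void bin16_to_bcd(uint16_t bin, uint8_t bcd[3])
{
    uint8_t digit[5];                     /* digit[0] = ones ... digit[4] = ten-thousands */
    for (int i = 0; i < 5; i++) {
        digit[i] = (uint8_t)(bin % 10);
        bin /= 10;
    }
    assert(bin == 0);                     /* a 16-bit value has at most 5 digits */
    bcd[0] = digit[4];                                /* ten-thousands */
    bcd[1] = (uint8_t)((digit[3] << 4) | digit[2]);   /* thousands:hundreds */
    bcd[2] = (uint8_t)((digit[1] << 4) | digit[0]);   /* tens:ones */
}
```

For an input of 65535 this model yields the bytes 06 55 35, which is what the bcd buffer should hold after the assembly routine runs on the same value.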