Hello Curt,
While the AN1219 code is relatively efficient, I notice that the Pack() and Unpack() functions are certainly not. In fact, the four calls seem to require a total of about 9000 cycles, compared with 3500+ cycles for the 32 x 16 division function itself.
Note that the division figure could be improved somewhat by allocating its variables in zero page RAM, and improved further by using the instructions applicable to zero page RAM, probably for a total saving of about 500 cycles.
However, the situation is simplified if a structure is created to hold the results during the calculation. The quantities can then be written directly to the required locations within the structure, and read back directly at the completion of the calculation, so the Pack() and Unpack() calls are no longer needed. The inline assembly code portion makes extensive use of indexed addressing. As a result, the code that previously required a total of about 12k cycles (division plus packing and unpacking) now requires about 3500.
When this code is extended to 64 x 32 division, the number of cycles required is about 11k.
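To illustrate the structure approach for the 32 x 16 case, here is a minimal sketch in portable C. It is not the AN1219 code or my inline assembly version: the names DivWork and div32x16, and the field layout, are assumptions made up for this example, and a plain shift-and-subtract loop stands in for the assembly routine. The point is only that the caller writes the operands straight into the structure and reads the results straight back, with no packing or unpacking step.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical workspace structure (names and layout are assumptions,
   not taken from AN1219). The caller writes the operands directly and
   reads the results directly, so no Pack()/Unpack() calls are needed. */
typedef struct {
    uint32_t dividend;   /* input, written by the caller   */
    uint16_t divisor;    /* input, written by the caller   */
    uint32_t quotient;   /* output, read back by the caller */
    uint16_t remainder;  /* output, read back by the caller */
} DivWork;

/* Portable shift-and-subtract division, standing in for the inline
   assembly routine. The divisor must be non-zero. */
void div32x16(DivWork *w)
{
    uint32_t rem = 0;    /* partial remainder, needs at most 17 bits */
    uint32_t quo = 0;
    int i;

    assert(w != 0 && w->divisor != 0);

    for (i = 31; i >= 0; i--) {
        /* Shift the next dividend bit into the partial remainder. */
        rem = (rem << 1) | ((w->dividend >> i) & 1u);
        quo <<= 1;
        if (rem >= w->divisor) {
            rem -= w->divisor;
            quo |= 1u;
        }
    }
    w->quotient  = quo;
    w->remainder = (uint16_t)rem;
}
```

In use, you would declare a DivWork variable, assign dividend and divisor, call div32x16() with its address, then read quotient and remainder from the same variable. On the real target the structure could be placed in zero page RAM and accessed by the assembly code with indexed addressing, but that part depends on the compiler and is not shown here.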
Regards,
Mac