Hi,
Delay loops can be very tricky. Instructions take different numbers of cycles (see section 7.3.4 of the reference manual), but the worst part is that whenever you change compile options (optimized code for production versus unoptimized code for debug) or system configuration options, the timing of every delay loop changes.
For this reason, when my system boots, I "calibrate" a delay loop against a real-time-clock interrupt (that is what I use on the QE128). I configure the RTC to interrupt every 1 ms, run a delay loop for 20 ms (largely unnoticed at boot), and work out my calibration factor. After that, I normally delay by waiting for the RTC to count off the appropriate number of 1 ms interrupts; only if I attempt to delay with interrupts disabled do I fall back to the calibrated delay loop.
If you want to see my code (it supports way more than the QE128, so you can ignore everything else), main.c shows the delay loop calibration, timer.c shows the rtc interrupt, and util.c shows the delay() function.
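For a rough idea of the shape, here is a stripped-down sketch of the calibrate-then-fall-back scheme. It is not a copy of those files: every name in it (rtc_isr, calibrate_delay, cal_factor, delay_ms, interrupts_on, CAL_CHUNK) is made up for illustration, and interrupts_on is only a stand-in for however you read the CPU's interrupt mask.

```c
#include <stdint.h>

#define CAL_MS     20u    /* calibration window in ms, as described above */
#define CAL_CHUNK  100u   /* loop iterations per measurement step (arbitrary) */

static volatile uint32_t rtc_ticks;      /* bumped by the 1 ms RTC interrupt */
static volatile int interrupts_on = 1;   /* stand-in for reading the interrupt mask */
static uint32_t loops_per_ms = 1;        /* calibration factor */

/* Body of the (hypothetical) 1 ms RTC interrupt handler. */
void rtc_isr(void) { rtc_ticks++; }

/* The raw busy loop; the volatile counter keeps it from being optimized away. */
static void spin(uint32_t n)
{
    volatile uint32_t i;
    for (i = 0; i < n; i++) { }
}

/* Pure arithmetic: iterations completed in `ms` milliseconds -> iterations per ms. */
uint32_t cal_factor(uint32_t iterations, uint32_t ms)
{
    uint32_t f;
    if (ms == 0) return 1;               /* avoid dividing by zero */
    f = iterations / ms;
    return f ? f : 1;                    /* never return a factor of 0 */
}

/* Run once at boot, with interrupts enabled and the RTC already ticking. */
void calibrate_delay(void)
{
    uint32_t iterations = 0;
    uint32_t start = rtc_ticks;
    while (rtc_ticks == start) { }       /* align to a tick edge */
    start = rtc_ticks;
    while ((uint32_t)(rtc_ticks - start) < CAL_MS) {
        spin(CAL_CHUNK);
        iterations += CAL_CHUNK;
    }
    loops_per_ms = cal_factor(iterations, CAL_MS);
}

void delay_ms(uint32_t ms)
{
    if (interrupts_on) {
        uint32_t start = rtc_ticks;
        while ((uint32_t)(rtc_ticks - start) < ms) { }   /* count RTC ticks */
    } else {
        while (ms--) spin(loops_per_ms);                 /* calibrated fallback */
    }
}
```

Because the factor is measured with whatever compiler options and clock setup are actually in effect, it follows those changes automatically. One caveat with this sketch: on an 8-bit core a 32-bit tick counter is not read atomically, so real code would need to guard that read.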
To answer your specific question, on the QE128, the CPU core (whose timing is shown in 7.3.4) is clocked by ICSOUT; see figure 1-2 and figure 1-3 in the RM. One thing I feel compelled to mention here (since I got bitten by it) is that there is a fixed divide-by-2 for the internal peripherals *after* the BDIV divide. BDIV itself, if you are like me, you will want set to 1:1, which is not the default.
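To make the arithmetic concrete, here is a small sketch of the two clocks. It assumes BDIV is a 2-bit field giving divide ratios of 1, 2, 4 and 8 (check that against the RM); the function name, the struct, and the idea of passing in the pre-BDIV source frequency are all just for illustration.

```c
#include <stdint.h>

typedef struct {
    uint32_t cpu_hz;   /* ICSOUT: clocks the CPU core */
    uint32_t bus_hz;   /* ICSOUT / 2: clocks the internal peripherals */
} clocks_t;

/* source_hz: frequency feeding the BDIV divider (hypothetical input).
 * bdiv: assumed 2-bit field value, divide ratio = 1 << bdiv (1, 2, 4, 8). */
clocks_t qe128_clocks(uint32_t source_hz, unsigned bdiv)
{
    clocks_t c;
    c.cpu_hz = source_hz >> (bdiv & 3u);  /* BDIV divide gives ICSOUT */
    c.bus_hz = c.cpu_hz / 2u;             /* fixed divide-by-2 after BDIV */
    return c;
}
```

With a made-up 32 MHz source, qe128_clocks(32000000, 0) gives a 32 MHz CPU clock and a 16 MHz peripheral clock, while a divide-by-2 BDIV setting (bdiv = 1) halves both to 16 MHz and 8 MHz.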
-- Rich