I typically use test points for timing. My HW guy and I agree that if we have a spare pin we should use it for a generic test point. I'm usually checking that things still happen on time when the system is busy.
I've been using CW with beans, and I find that a TP generally gets implemented as a single indivisible instruction on the S08s I've been using lately. On other architectures there is some page accessing or other overhead in the "poke" to the pin.
Depending on the accuracy you need, you may want to just wiggle your test point (set it and clear it back to back), measure the minimum on-time of the TP, and subtract that from the execution time you measure for your function.
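Here is a rough sketch of that calibrate-then-subtract idea in plain C. Everything named here is made up for illustration: `tp_port` is a stand-in variable for whatever port data register your spare pin lives on, `TP_MASK` is its bit, and the helper names are mine, not anything CW or the beans generate. The pulse widths are whatever you read off the scope, in whatever unit you like as long as both use the same one.

```c
#include <stdint.h>

/* ASSUMPTION: stand-in for the memory-mapped port data register of the
 * spare pin. On real hardware, point this at the actual register. */
static volatile uint8_t tp_port;
#define TP_MASK 0x01u   /* hypothetical bit position of the test point */

static inline void tp_high(void) { tp_port |= TP_MASK; }
static inline void tp_low(void)  { tp_port &= (uint8_t)~TP_MASK; }

/* Calibration: set and clear back to back. The pulse seen on the scope
 * is the minimum on-time, i.e. the overhead of the test point itself. */
void tp_calibrate(void)
{
    tp_high();
    tp_low();
}

/* Measurement: wrap the function under test in the same set/clear pair. */
void tp_measure(void (*fn)(void))
{
    tp_high();
    fn();
    tp_low();
}

/* Subtract the calibrated minimum pulse from a measured pulse.
 * Clamps to zero instead of wrapping if the readings are inconsistent. */
uint32_t tp_corrected(uint32_t measured, uint32_t min_pulse)
{
    return (measured > min_pulse) ? (measured - min_pulse) : 0u;
}
```

So if the back-to-back wiggle shows 250 ns and the pulse around your function shows 1250 ns, `tp_corrected(1250, 250)` gives 1000 ns for the function itself. Whether the set and clear each compile to one instruction depends on your part and compiler, so it is worth checking the calibration pulse rather than assuming.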
Using internal timers is good, but you have to get the results back out. You will want an array to save your timing results, say for the first hundred iterations, so you can look at them afterwards and see whether they come out even.
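Something along these lines would do for the array approach. It is only a sketch: I'm assuming a free-running 16-bit timer counter that you read before and after the code you care about, and the names (`samples`, `timing_record`, `timing_spread`) are invented for the example. You would dump or inspect `samples` in the debugger once it has filled up.

```c
#include <stdint.h>

#define N_SAMPLES 100u   /* keep the first hundred iterations */

static uint16_t samples[N_SAMPLES];
static uint16_t sample_count;

/* Store the elapsed ticks between two reads of a free-running 16-bit
 * timer (ASSUMPTION: your timer is 16 bits wide and counts up).
 * Unsigned subtraction gives the right answer across one rollover.
 * Samples after the first N_SAMPLES are ignored. */
void timing_record(uint16_t start, uint16_t end)
{
    if (sample_count < N_SAMPLES) {
        samples[sample_count++] = (uint16_t)(end - start);
    }
}

/* Difference between the slowest and fastest recorded sample, in ticks.
 * A small spread means the timing is coming out even. */
uint16_t timing_spread(void)
{
    uint16_t lo, hi, i;

    if (sample_count == 0u) {
        return 0u;
    }
    lo = hi = samples[0];
    for (i = 1u; i < sample_count; i++) {
        if (samples[i] < lo) { lo = samples[i]; }
        if (samples[i] > hi) { hi = samples[i]; }
    }
    return (uint16_t)(hi - lo);
}
```

Recording only costs a subtract and a store per iteration, so it should not disturb the timing much, but the timer reads themselves have some overhead that you can calibrate out the same way as with a test point.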