When I first read your post I assumed you'd found the CFV3 "random branch time bug" that I detailed here:
https://community.nxp.com/thread/324006#404184
The problem with the CFV3 core is that they added a bit in the CONDITION REGISTER that flips the branch prediction, and some "optimised" compiler code can load that with garbage, changing the prediction when you don't want it to.
But you're using the CFV4, and it doesn't seem to have this sort of bug. It has a more complicated one.
I see you're demonstrating this branch-prediction change by using a rather complicated debugger, which seems to be doing "real time tracing" of the CPU. Are you sure it isn't causing the problem? Are you sure this change in execution speed also happens when the CPU is running on its own without the debugger? Can you prove that?
If that isn't the problem, I would guess that the branch cache entry is being invalidated by the interrupt, and isn't being refreshed when your loop resumes. It is possible the cache only works properly when the code is running linearly and then finds the branch instruction. and that returning from an interrupt "in the middle of a loop isn't something it can handle.
But if the cache wasn't working at all I'd expect a far higher delay of up to 8 clocks. The CPU has a "Branch Cache" and a "Prediction Table". I suspect the CPU may be using the latter in your case, taking one extra clock.
> Do you have an idea what we can do to solve this problem?
Deal with it. Don't expect "constant execution speed" on modern CPUs. If you need precise timing, then set up a hardware counter and read it for timing. You might be able to use a GPT or an SLT.
Otherwise, if you do need a "constant execution time loop", see if making the loop longer by adding NOPs [1] can get it to fix the problem and reload the cache (or whatever is going wrong). Use different assembly codes.
I had a similar problem with the i.MX CPU. It is running Linux and has a "Bogomips" loop which it uses for short delays, and calibrated on startup. it is a "countdown loop" like yours. It runs a different number of clocks when compiled on different code boundaries, probably related to the cache line length. It doesn't change during execution, but it is different for different kernel compiles. We just had to deal with it.
Note 1: Don't use the "NOP" instruction as a "NOP". NOP doesn't do nothing on this CPU as it "Synchronises the Pipeline". It takes SIX clocks. But it might do what you want and give you constant execution timing, so try it. If you want a real "NOP" then use "TPF". See the Programmer's Reference Manual for details.
Tom