> I understand the 'Forward' and 'Backward', but I don't know what the 'Taken' and 'Not Taken' mean.
If the condition is "true" (for "bne" the test was not-equal) then the branch is "taken" and the next instruction is the target of the branch. If the condition is "false" (it was equal) then the branch is "not taken" and the next instruction is the one following the branch.
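As a rough illustration only (this is a host-side model, not anything from the manuals), the rule can be written in C. The function names and the `insn_len` parameter are made up for this sketch; the caller supplies the length of the branch instruction in bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of "bne": the branch is TAKEN when the compared
 * values were NOT equal, otherwise execution falls through. */
uint32_t bne_next_pc(uint32_t branch_pc, uint32_t target_pc,
                     uint32_t insn_len, bool operands_equal)
{
    assert(insn_len > 0);            /* an instruction occupies at least one byte */
    bool taken = !operands_equal;    /* "bne" tests for not-equal */
    return taken ? target_pc               /* taken: go to the branch target */
                 : branch_pc + insn_len;   /* not taken: next sequential instruction */
}

/* "Backward" means the target is at or before the branch itself,
 * which is the usual shape of the branch that closes a loop. */
bool is_backward(uint32_t branch_pc, uint32_t target_pc)
{
    return target_pc <= branch_pc;
}
```

So a loop-closing "bne" at 0x1008 that targets 0x1000 is "Backward Taken" on every pass except the last one, where it is "Backward Not Taken".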
> what differences between V2 and V4 core in Bcc Instruction Execution Times?
"What" or "Why"? I assume you mean "Why" because you answered the "What" in the tables you included in your post.
Reading the "Core Overview" chapters of representative manuals gives the following differences:
The V1 ColdFire core pipeline stages include the following:
- Two-stage instruction fetch pipeline (IFP) (plus optional instruction buffer stage)
- Two-stage operand execution pipeline (OEP)
The V2 ColdFire core pipeline stages include the following:
- Two-stage instruction fetch pipeline (IFP) (plus optional instruction buffer stage)
- Two-stage operand execution pipeline (OEP)
The V3 ColdFire core pipeline stages include the following:
- Four-stage instruction fetch pipeline (IFP) (plus optional instruction buffer stage)
- Two-stage operand execution pipeline (OEP)
V4 architecture features are defined as follows:
- Two independent, decoupled pipelines: four-stage instruction fetch pipeline (IFP) and five-stage operand execution pipeline (OEP) for increased performance
- Ten-instruction, FIFO buffer that decouples the IFP and OEP
- Limited superscalar design approaches dual-issue performance with the cost of a scalar execution pipeline
- Two-level branch acceleration mechanism with a branch cache, plus a prediction table for increased performance of conditional Bcc instructions
The deeper the pipeline the faster the CPU can run, until you derail it with a conditional branch. Then it has to pick itself up and start all over. A misprediction in the 2+2-stage CF2 costs one extra clock. In the 4+2-stage CF3 it costs an extra 4 clocks. In the 4+5-stage CF4 it costs an extra SEVEN clocks, so they added a Branch Cache to make this less of a problem.
You got the CF4 "Predicted Incorrectly" entry wrong in your table. It is not "1(0/0)". It is "8(0/0)".
> How much clock cycle does 'bne lup2 ' Instruction execution take in MCF52258 and MCF54415
That is a "Backward Taken" branch, so 2 clocks for CF2 and 0 clocks for CF4 once the Branch Cache is loaded. It will take a LOT longer the first time while the cache line holding the instruction sequence gets loaded from memory, maybe 20 to 30 clocks on a CF4.
For reference, the CF3 is worse than either the CF2 or CF4. It takes "1(0/0)" for "Forward Not Taken" and "Backward Taken" branches, and "5(0/0)" for the other combinations. Except when it is the reverse, which happens randomly if you're using the gcc compiler.
The CF3 design added a "swap the prediction" bit in the CCR which the gcc compiler randomly flips around when performing some bit-test-and-branch instructions. That makes the branches take FIVE times as long as they should, depending on what the compiler did and the previous code flow:
https://community.freescale.com/message/404184#404184
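To put numbers on that, here is a small back-of-the-envelope sketch in C. It only counts the clocks spent in the loop-closing branch, using the CF3 figures quoted above (1 clock and 5 clocks); the function name is made up for illustration and the loop body itself is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Clocks spent in the backward branch that closes a do/while style loop.
 * The branch is taken on every pass except the last, where it falls through.
 * The per-branch clock counts are supplied by the caller from the
 * instruction timing tables for the core in question. */
uint32_t loop_branch_clocks(uint32_t iterations,
                            uint32_t taken_clocks,
                            uint32_t not_taken_clocks)
{
    assert(iterations >= 1);   /* the loop body runs at least once */
    return (iterations - 1u) * taken_clocks + not_taken_clocks;
}
```

With 100 iterations, `loop_branch_clocks(100, 1, 5)` gives 104 clocks with the normal prediction, and `loop_branch_clocks(100, 5, 1)` gives 496 clocks when the prediction has been swapped. Same code, nearly five times the branch overhead.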
> void time_delay(long delay:__D0)
Delay loops are usually a bad idea. It is best to avoid them if you can. Why do you need them? There are usually a few spare PIT or DMA Timers in the CPU that can be set up to free-run at 1 MHz (or more) and thus allow precise microsecond delays where required. The other approach (where very short delays are required) is to do what Linux does with its "BogoMIPS" counter: calibrate the delay loop at power-up against a hardware timer. This can't work with the CF3 and gcc, as loops like yours sometimes take 2 clocks and other times take 6 clocks.
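A minimal sketch of the timer-based approach follows. It assumes you have already set up a 32-bit counter that free-runs at 1 MHz and can read it through some function. `timer_read_fn`, `delay_us` and the fake timer are names invented for this example, not part of any Freescale header; on real hardware you would pass a function that reads the counter register of whichever timer you configured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook: returns the current value of a free-running
 * 32-bit counter that ticks once per microsecond. */
typedef uint32_t (*timer_read_fn)(void);

/* Busy-wait until at least 'microseconds' ticks have elapsed.
 * The unsigned subtraction keeps the comparison correct even when
 * the counter wraps around from 0xFFFFFFFF to 0 during the wait. */
void delay_us(timer_read_fn read_us, uint32_t microseconds)
{
    assert(read_us != NULL);
    uint32_t start = read_us();
    while ((uint32_t)(read_us() - start) < microseconds) {
        /* spin: the elapsed time comes from the timer, not from loop timing */
    }
}

/* Stand-in timer for trying this on a PC: advances one tick per read. */
uint32_t fake_now;
uint32_t fake_timer_read(void)
{
    return fake_now++;
}
```

Because the delay is measured against the hardware counter, it does not matter how many clocks the branch in the polling loop takes on a given core or with a given compiler.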
Tom