Bcc Instruction Execution Times

leocheng · ‎03-24-2015

I want to know the Bcc instruction execution times on the MCF52258, so I referred to the MCF52259RM Rev4 and the ColdFire Family Programmer's Reference Manual.

Table 3-19. Bcc Instruction Execution Times

Opcode	Forward Taken	Forward Not Taken	Backward Taken	Backward Not Taken
Bcc	3(0/0)	1(0/0)	2(0/0)	3(0/0)

I understand 'Forward' and 'Backward', but I don't know what 'Taken' and 'Not Taken' mean.

By the way, what are the differences between the V2 and V4 cores in Bcc instruction execution times?

Opcode	Branch Cache Correctly Predicts Taken	Prediction Table Correctly Predicts Taken	Predicted Correctly as Not Taken	Predicted Incorrectly
Bcc	0(0/0)	1(0/0)

Here is a code snippet. How many clock cycles does the 'bne lup2' instruction take on the MCF52258 and on the MCF54415?

void time_delay(long delay:__D0)
{
    asm
    {
        move.l d3,-(a7)     // save D3 on stack
    lup1:                   // outer loop
        // move.l #17,d3    // for 75 MHz MCF5232
        move.l #36,d3       // for 100 MHz MCF54415
    lup2:                   // inner loop
        subi #1,d3
        bne lup2
        subi #1,d0
        bne lup1
        move.l (a7)+,d3     // restore D3
    }
}

TomE · ‎03-24-2015

> I understand the 'Forward' and 'Backward', but I don't know what the 'Taken' and 'Not Taken' mean.

If the condition is "true" (for "bne" the test was not-equal) then the branch is "taken" and the next instruction is the target of the branch. If the condition is "false" (it was equal) then the branch is "not taken" and the next instruction is the one following the branch.
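As an illustration only, the four columns of Table 3-19 in the question can be written as a small C lookup. The function name and the address arguments are made up for this sketch; the clock counts are the V2 values quoted in that table.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns the V2 Bcc clock count from Table 3-19
   as quoted in this thread. "Backward" means the target address is not
   above the branch itself; "taken" means the condition was true. */
static unsigned cf2_bcc_clocks(uint32_t branch_addr, uint32_t target_addr,
                               bool condition_true)
{
    bool backward = target_addr <= branch_addr;
    bool taken = condition_true;

    if (backward) {
        return taken ? 2u : 3u;   /* Backward Taken / Backward Not Taken */
    }
    return taken ? 3u : 1u;       /* Forward Taken / Forward Not Taken */
}
```

For the 'bne lup2' in the question, the target is behind the branch and the condition is true on every pass except the last, so the lookup gives 2 clocks while looping and 3 clocks on exit.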

> what differences between V2 and V4 core in Bcc Instruction Execution Times?

"What" or "Why"? I assume you mean "Why" because you answered the "What" in the tables you included in your post.

Reading the "Core Overview" chapters of representative manuals gives the following differences:

The V1 ColdFire core pipeline stages include the following:

- Two-stage instruction fetch pipeline (IFP) (plus optional instruction buffer stage)

- Two-stage operand execution pipeline (OEP)

The V2 ColdFire core pipeline stages include the following:

- Two-stage instruction fetch pipeline (IFP) (plus optional instruction buffer stage)

- Two-stage operand execution pipeline (OEP)

The V3 ColdFire core pipeline stages include the following:

- Four-stage instruction fetch pipeline (IFP) (plus optional instruction buffer stage)

- Two-stage operand execution pipeline (OEP)

V4 architecture features are defined as follows:

- Two independent, decoupled pipelines: four-stage instruction fetch pipeline (IFP) and five-stage operand execution pipeline (OEP) for increased performance

- Ten-instruction, FIFO buffer that decouples the IFP and OEP

- Limited superscalar design approaches dual-issue performance with the cost of a scalar execution pipeline

- Two-level branch acceleration mechanism with a branch cache, plus a prediction table for increased performance of conditional Bcc instructions

The deeper the pipeline the faster the CPU can run, until you derail it with a conditional branch. Then it has to pick itself up and start all over. A misprediction in the 2+2-stage CF2 costs one extra clock. In the 4+2-stage CF3 it costs an extra 4 clocks. In the 4+5-stage CF4 it costs an extra SEVEN clocks, so they added a Branch Cache to make this less of a problem.

You got the CF4 "Predicted Incorrectly" entry wrong in your table. It is not "1(0/0)". It is "8(0/0)".

> How much clock cycle does 'bne lup2 ' Instruction execution take in MCF52258 and MCF54415

Branch Backwards Taken, so 2 clocks for CF2 and 0 clocks for CF4 once the Branch Cache is loaded. It will take a LOT longer the first time while the cache line holding the instruction sequence gets loaded from memory, maybe 20 to 30 clocks on a CF4.

For reference, the CF3 is worse than either the CF2 or CF4. It takes "1(0/0)" for "Forward Not Taken" and "Backward Taken" branches, and "5(0/0)" for the other combinations. Except when it is the reverse, which happens randomly if you're using the gcc compiler.

The CF3 design added a "swap the prediction" bit in the CCR which the gcc compiler randomly flips around when performing some bit-test-and-branch instructions. That makes the branches take FIVE times as long as they should, depending on what the compiler did and the previous code flow:

https://community.freescale.com/message/404184#404184

> void time_delay(long delay:__D0)

Delay loops are usually a bad idea. It is best to avoid them if you can. Why do you need them? There are usually a few spare PIT or DMA Timers in the CPU that can be set up to free-run at 1MHz (or more) and thus allow precise microsecond delays where required. The other approach (where very short delays are required) is what Linux does with its "BogoMIPS" counter: it calibrates the delay loop on power up from a hardware timer. This can't work with the CF3 and gcc, as loops like yours sometimes take 2 clocks and other times take 6 clocks.
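A minimal sketch of the free-running-timer approach, assuming a 32-bit counter that ticks at 1 MHz. `read_timer_us()` is a hypothetical stand-in, simulated here with a plain variable so the code runs anywhere; on real hardware it would read whichever PIT or DMA Timer count register you set up, and those register names depend on the part.

```c
#include <stdint.h>

/* Simulated free-running 1 MHz counter (assumption for this sketch).
   It starts near the wrap point on purpose to exercise the wrap case. */
static uint32_t fake_counter = 0xFFFFFFF0u;

/* Hypothetical timer read: each call advances simulated time by 1 us. */
static uint32_t read_timer_us(void)
{
    return fake_counter++;
}

/* Busy-wait for at least `us` microseconds. The unsigned subtraction
   stays correct when the counter wraps from 0xFFFFFFFF to 0. */
static void delay_us(uint32_t us)
{
    uint32_t start = read_timer_us();

    while ((uint32_t)(read_timer_us() - start) < us) {
        /* spin until enough ticks have elapsed */
    }
}
```

Because the wait is measured against the counter rather than counted in instructions, the delay does not depend on branch timing or on which core the code runs on.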

Tom

TomE · ‎03-25-2015

I said:

> Branch Backwards Taken, so 2 clocks for CF2 and 0 clocks for CF4 once the Branch Cache is loaded.

Assuming you remember to enable the CF4 Branch Cache. It defaults to being disabled, and needs to be cleared before use as part of initialisation. If you've forgotten to enable this your loops may take 8 times as long.

There is not enough information in the Reference Manual to know what happens if the cache is disabled. There's an 8 entry "Branch cache" backed by a 128 entry "Prediction Table". The Manual doesn't detail what the CACR[BEC] bit actually enables. It says it enables the "Branch Cache", but doesn't say if that includes the "Prediction Table" or not.

Nothing in the manual gives the branch execution times in the case of the Branch Cache being disabled. Does the CPU fall back to the CF2 and CF3 "predict backward branches are taken", does it default to predicting not-taken for all branches, or do all branches (forward or backwards, taken or not) take 8 clocks?

There's nothing in the CF4 manuals, nothing in CFPRM and no App Notes that I can find. Searching Freescale's site for "branch prediction table" gets a huge number of hits on the Power chips, so I suspect that to learn about the CF4 Branch Prediction I'd have to read some MPC manuals, as that's probably where the CF4 technology was copied from. For instance the e200z4 core has a Branch Cache, no Prediction Table, but has controllable Static Prediction, so that's not where the CF4 came from. The E300 has no Branch Hardware but uses the "B0" bit in the instruction.

Google finds this from 1998 which suggests the CF4 Branch Cache came from the 68060. It did, but not the Prediction Table.

http://www.freescale.com/files/32bit/doc/eng_bulletin/COLDFIRE4MPR.pdf

And also:

move.l #36,d3 // for 100 MHz MCF54415

lup2: // inner loop

Once in the program cache your delay loop will take 35 clocks to run 35 times on the CF4, then on the 36th one will take EIGHT clocks due to the branch misprediction. That's 23% just to get out of the loop!
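A rough way to put numbers on this is the C sketch below. It is only a model: it assumes the code is already cached, that `subi` costs 1 clock on both cores (my assumption, not stated in this thread), and it uses the branch figures quoted earlier. The function name is made up.

```c
#include <stdint.h>

/* Clocks for a countdown loop of `iterations` passes under a simple model:
   every pass pays `body_clocks`, the first iterations-1 passes pay
   `taken_clocks` for the backward branch, and the final pass pays
   `exit_clocks` for the branch that falls through. */
static uint32_t loop_clocks(uint32_t iterations, uint32_t body_clocks,
                            uint32_t taken_clocks, uint32_t exit_clocks)
{
    if (iterations == 0u) {
        return 0u;
    }
    return iterations * body_clocks
         + (iterations - 1u) * taken_clocks
         + exit_clocks;
}
```

Under those assumptions `loop_clocks(36, 1, 0, 8)` gives 44 clocks for one pass of the inner loop on the CF4 (branch cache hit while looping, 8-clock mispredict on exit), and `loop_clocks(36, 1, 2, 3)` gives 109 clocks on the CF2. The exit branch alone accounts for 8 of the 44 CF4 clocks, in line with the rough 35 + 8 estimate above.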

Tom

matthey · ‎03-30-2015

Tom Evans wrote:

Google finds this from 1998 which suggests the CF4 Branch Cache came from the 68060. It did, but not the Prediction Table.

http://www.freescale.com/files/32bit/doc/eng_bulletin/COLDFIRE4MPR.pdf

I wonder if it is just a change in terminology. The 68060 has a branch cache and branch history prediction using saturating 2 bit prediction. The article you found implies that the ColdFire and 68060 branch prediction are very similar. The 68060 User Manual is certainly documented differently.
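For readers unfamiliar with the term, a saturating 2 bit predictor can be sketched in C as below. This is the generic textbook scheme, not a description of the actual 68060 or ColdFire hardware; the type and function names are invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Counter states: 0 and 1 predict "not taken", 2 and 3 predict "taken". */
typedef uint8_t counter2_t;

static bool predict_taken(counter2_t c)
{
    return c >= 2u;
}

/* Move one step towards the observed outcome, saturating at 0 and 3. */
static counter2_t update(counter2_t c, bool taken)
{
    if (taken) {
        return (counter2_t)(c < 3u ? c + 1u : 3u);
    }
    return (counter2_t)(c > 0u ? c - 1u : 0u);
}
```

The point of the second bit is that a single loop exit (one "not taken" after many "taken") only moves the counter from 3 to 2, so the branch is still predicted taken the next time the loop is entered.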

http://cache.freescale.com/files/32bit/doc/ref_manual/MC68060UM.pdf

Here is the documentation for the 68060 CACR config bits:

EBC—Enable Branch Cache

0 = The branch cache is disabled and branch cache information is not used in the branch prediction strategy.

1 = The on-chip branch cache is enabled. Branches are cached. A predicted branch executes more quickly, and often can be folded onto another instruction.

CABC—Clear All Entries in the Branch Cache

This bit is always read as zero.

0 = No operation is done on the branch cache.

1 = The entire content of the MC68060 branch cache is invalidated.

CUBC—Clear All User Entries in the Branch Cache

This bit is always read as zero.

0 = No operation is performed on the branch cache.

1 = All user-mode entries in the MC68060 branch cache are invalidated; supervisor-mode branch cache entries remain valid.

The 68060 branch prediction timing chart is more interesting and easier to read than for the CF, IMO.

Instruction	Not Predicted, Forward, Taken	Not Predicted, Forward, Not Taken	Not Predicted, Backward, Taken	Not Predicted, Backward, Not Taken	Predicted Correctly as Taken	Predicted Correctly as Not Taken	Predicted Incorrectly
Bcc	7 (0/0)	1 (0/0)	3 (0/0)	7 (0/0)	0 (0/0)	1 (0/0)	7 (0/0)
BRA	3 (0/0)	-	3 (0/0)	-	0 (0/0)	-	-
BSR	3 (0/1)	-	3 (0/1)	-	1 (0/1)	-	-
DBcc	3 (0/0)	8 (0/0)	3 (0/0)	8 (0/0)	2 (0/0)	2 (0/0)	8 (0/0)
DBRA	3 (0/0)	7 (0/0)	3 (0/0)	7 (0/0)	1 (0/0)	1 (0/0)	7 (0/0)
FBcc	8 (0/0)	2 (0/0)	8 (0/0)	2 (0/0)	2 (0/0)	2 (0/0)	8 (0/0)
JMP (d16,PC)	3 (0/0)	-	3 (0/0)	-	0 (0/0)	-	-
JMP xxx.WL	3 (0/0)	-	3 (0/0)	-	0 (0/0)	-	-
Remaining JMP	5 (0/0)	-	5 (0/0)	-	5 (0/0)	-	-
JSR (d16,PC)	3 (0/1)	-	3 (0/1)	-	1 (0/1)	-	-
JSR xxx.WL	3 (0/1)	-	3 (0/1)	-	1 (0/1)	-	-
Remaining JSR	5 (0/1)	-	5 (0/1)	-	5 (0/1)	-	-

After studying the chart I can conclude with no certainty that "Not Predicted" equals no entry in the branch cache and:

The default static branch prediction of BTFN (Backward Taken Forward Not) is used.

There is no instruction folding so there is a 3 cycle penalty on static predicted Bcc branches which are not in the branch cache.

A static mis-predicted branch is no worse than a branch history table mis-predicted branch.
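Those three readings can be checked against the Bcc row of the table with a small C sketch. It assumes, as guessed above, that "Not Predicted" branches fall back to BTFN; the function names are made up, and the clock counts are copied from the Bcc row.

```c
#include <stdbool.h>
#include <stdint.h>

/* Static BTFN rule: predict taken only when the target is behind the branch. */
static bool btfn_predicts_taken(uint32_t branch_addr, uint32_t target_addr)
{
    return target_addr < branch_addr;
}

/* 68060 Bcc clocks for a branch with no branch cache entry
   ("Not Predicted" columns of the table quoted above). */
static unsigned bcc_not_predicted_clocks(uint32_t branch_addr,
                                         uint32_t target_addr, bool taken)
{
    bool backward = target_addr < branch_addr;

    if (backward) {
        return taken ? 3u : 7u;
    }
    return taken ? 7u : 1u;
}

/* True when the static BTFN guess matched what the branch actually did. */
static bool btfn_guess_was_right(uint32_t branch_addr, uint32_t target_addr,
                                 bool taken)
{
    return btfn_predicts_taken(branch_addr, target_addr) == taken;
}
```

In this model the two cases where BTFN guesses right cost 3 clocks (taken) or 1 clock (not taken), against 0 and 1 for a branch cache hit, and the two cases where it guesses wrong cost 7 clocks, the same as the "Predicted Incorrectly" column.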

Also, we might guess that:

FBcc does not use the branch cache or branch prediction.

Only PC relative and absolute JMP and JSR use the branch cache.

Of course any of this could have been retuned for the ColdFire v4 or v5 but my guess is that they are more similar than not (with the pipelines becoming more like the 68060). Joe Circello created a wonderful design on the 68060 so why change what works. It's too bad Motorola didn't realize what they had instead of betting the farm on the green pasture on the other side of the fence called PPC.

TomE · ‎04-03-2015

> It's too bad Motorola didn't realize what they had instead of betting the farm on the green pasture on the other side of the fence called PPC.

Apple had already jumped to PPC by the time the 68060 came out. Apple were competing with Intel-based computers, and were falling behind. Motorola risked Apple jumping all the way to Intel (as if that would ever happen :-), which is why they let someone else take on the huge expense of running the CPU architecture, clock rate, process shrink race and compiler development and support.

Motorola did put the 68k architecture to good use in ColdFire.

When was the last ColdFire core development? The V4e dates from 15 years ago in 2000. Freescale's "CPU architecture development group" is now ARM, like pretty much everyone else.

Tom

matthey · ‎04-06-2015

Tom Evans wrote:

> It's too bad Motorola didn't realize what they had instead of betting the farm on the green pasture on the other side of the fence called PPC.

Apple had already jumped to PPC by the time the 68060 came out. Apple were competing with Intel-based computers, and were falling behind. Motorola risked Apple jumping all the way to Intel (as if that would ever happen :-), which is why they let someone else take on the huge expense of running the CPU architecture, clock rate, process shrink race and compiler development and support.

Motorola did put the 68k architecture to good use in ColdFire.

When was the last ColdFire core development? The V4e dates from 15 years ago in 2000. Freescale's "CPU architecture development group" is now ARM, like pretty much everyone else.

Motorola was telling customers that the 68k was end of line and the future was RISC and PPC (it's not surprising what customers decided to do). The idea was that RISC would clock higher and hardware complexity could be moved into the compiler. These proved to be no advantage as RISC needs to clock significantly higher to be competitive and compilers are limited in the assumptions they can make. Motorola had trouble clocking up the PPC also. I wouldn't be surprised if they deliberately decided not to clock the 68060 up because they didn't want it to compete with PPC (rev 6 68060@50MHz overclocks to 100MHz most of the time). Amigas with 68060 accelerators and Macintosh emulators were the fastest Macintoshes for a while. Of course the 68060 couldn't survive with anti-marketing and no improvements.

The ColdFire was weakened considerably compared to the 68060. The performance/MHz dropped dramatically and was never restored even with the v5 CF and code density deteriorated. A superscalar FPGA 68k CPU (in Altera Cyclone V) should soon be outperforming the fastest 68060 and the fastest v4 ColdFire CPU. I believe code density can be improved 10%-25% over CF code and better than Thumb 2. Instead of organically developing a competitive 68k/CF, Freescale (or NXP Semiconductors now) will be licensing ARM technology. Sorry, I can't agree that the cut down ColdFire is good use of the 68k architecture.

TomE · ‎04-12-2015

> A superscalar FPGA 68k CPU (in Altera Cyclone V)

Would necessarily run at half the clock rate and double the power of real silicon. Good reference here:

http://boinc.berkeley.edu/Thesis_Eastlack_Nov09.pdf

> I believe code density can be improved 10%-25% over CF code

I looked into part of this more than a decade ago. I needed to know how to set up GCC to compile for a CPU32 which didn't have some of the fancier 68020 addressing modes. It didn't matter as internally gcc was incapable of representing those addressing modes, so it never used them. That was part of the justification for removing them from the CF. That and them not being faster (or much faster) than the equivalent simple instructions. I suspect the same would still apply. So unless you write a compiler specifically for the 680x0 instructions they're just wasting transistors.

> Freescale (or NXP Semiconductors now) will be licensing ARM technology.

"Will"? They have been doing that for over 14 YEARS, starting with the i.MX1 in 2001 and then the MAC7100 in 2003 or 2004.

> Sorry, I can't agree that the cut down ColdFire is good use of the 68k architecture.

I don't consider "dominance of the desktop, needing a huge heatsink" to be the pinnacle of success. The PPC had a nice long run in Macintoshes, but there's a lot of them running in embedded devices. There are multiple MPC55xx series for Automotive use, one of which we use. Likewise, the MCF in all the versions makes a very nice embedded controller. We've used the MCF5329 in an Automotive product, and it runs very nicely at 240MHz. Without a heatsink too.

Tom

matthey · ‎04-14-2015

Tom Evans wrote:

> A superscalar FPGA 68k CPU (in Altera Cyclone V)

Would necessarily run at half the clock rate and double the power of real silicon. Good reference here:

http://boinc.berkeley.edu/Thesis_Eastlack_Nov09.pdf

I'm aware of the limitations of FPGA technology. I have worked with the group making the Apollo/Phoenix 68k FPGA.

APOLLO - High Performance Processor

This FPGA superscalar 68k CPU has a clock speed of ~80MHz in an 8000LE Altera Cyclone II FPGA where it is very cramped but still manages 68060 level performance. A higher performance 3 integer unit version of the core in a Cyclone V is being tested and optimized but at 100-150MHz is already giving about 3 times the performance. This is better performance than the fastest ColdFire v4. An FPGA is limited in clock speed but parallel processing and taking advantage of memory bandwidth can give a lower clocked processor with performance better than some hard processors. Our 68k code analysis found that the 68k has short enough instructions (~3 bytes/instruction) to make 3 integer units worthwhile and this can be improved with ISA changes. It may be possible to dual port the cache memory allowing for 2 cache reads per cycle (takes advantage of excess FPGA memory bandwidth) which I believe would give significantly better performance as this is the limitation in most processors. CPU performance in an FPGA is not a problem for embedded uses where lower clock speed is an advantage. Power efficiency would not be as good as a hard processor because of leakage but there are relatively low cost options to improve this like the eASIC. An ARM CPU probably makes sense where power efficiency is more important than performance. A high performance enhanced 68k has the potential to be a better Atom processor. Instruction sizes, code density and instruction decoding are better than the x86 and it is not necessary to add more registers like x86_64 to reduce cache accesses.

Tom Evans wrote:

> I believe code density can be improved 10%-25% over CF code

I looked into part of this more than a decade ago. I needed to know how to set up GCC to compile for a CPU32 which didn't have some of the fancier 68020 addressing modes. It didn't matter as internally gcc was incapable of representing those addressing modes, so it never used them. That was part of the justification for removing them from the CF. That and them not being faster (or much faster) than the equivalent simple instructions. I suspect the same would still apply. So unless you write a compiler specifically for the 680x0 instructions they're just wasting transistors.

I've analyzed a lot of 68k code (I improved and modified a "smart" 68k disassembler to produce statistics) including many different versions of GCC. GCC has used the double indirect addressing modes for a long time (at least since GCC 3.x where these addressing modes can be found in the compiler's executables). These addressing modes are useful for object oriented code like C++ and sometimes save a register and improve code density. They are challenging to execute quickly without OoO execution though. A superscalar in order processor can give faster code in cases where these addressing modes can be split into separate instructions and re-scheduled.

The 68060 and ColdFire dropped the 32*32=64 hardware integer multiplication which GCC was also using for optimizations despite Motorola deciding it wasn't necessary (it was useful enough for most ARM processors to add though). GCC since 2.x has turned 32 bit division by a constant into a multiplication which requires the high 32 bits of the product. I'm actually working on similar functionality I hope can be used by vbcc for 32 bit and 16 bit divisions (GCC does not optimize 16 bit integer divisions which should work with the 68k and CF 16*16=32). Advanced compilers use LEA optimizations as you pointed out in the other thread even though this is challenging with the 68k where there is a lack of orthogonality. The 68000 used a split register file while all following (and future) 68k processors use a monolithic register file which allows opening up address register sources. This reduces the number of instructions improving performance, improves code density and simplifies compiler code generation (OoO processors could open up address register destinations but this can give a load use bubble without it and there isn't as much of an advantage). There is a good place to encode LEA EA,Dn which would improve orthogonality. It's possible to add a very simple but very powerful addressing mode which would do the same as LEA but could do an ALU calculation in the same pipe in the same cycle (up to 5 operations per pipe per cycle but with load use bubbles without OoO). Hardware designers seem to ignore compiler designers where good communication is crucial. Motorola was reducing performance while getting rid of functionality which was added a while later to lower end ARM processors.
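The division-by-constant trick mentioned above can be sketched in C as follows. The constants are the well-known reciprocals for dividing by 10 and are my choice of example, not taken from GCC or vbcc output; the function names are made up. The 32 bit version needs the high half of a 32*32=64 product, while the 16 bit version fits in a 16*16=32 multiply.

```c
#include <stdint.h>

/* n / 10 for 32 bit unsigned n: multiply by ceil(2^35 / 10) = 0xCCCCCCCD
   and keep the bits above bit 35. This needs the upper half of the
   64 bit product, which is why 32*32=64 multiply hardware matters. */
static uint32_t div10_u32(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}

/* n / 10 for 16 bit unsigned n: multiply by ceil(2^19 / 10) = 0xCCCD.
   The product is below 2^32, so a 16*16=32 multiply is enough. */
static uint16_t div10_u16(uint16_t n)
{
    return (uint16_t)(((uint32_t)n * 0xCCCDu) >> 19);
}
```

Both functions replace a slow divide with one multiply and one shift; the 16 bit case is small enough to check exhaustively against the `/` operator.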

Here are some of the ideas we came up with for an enhanced 68k ISA.

http://www.heywheel.com/matthey/Amiga/68kF_PRM.pdf

I wanted to create a new compatible open 68k+CF ISA standard which could also be used for other FPGA 68k processors and emulators but Gunnar von Boehn added his own radical ISA enhancements which are not documented (although using some of the ideas from the link above).

Tom Evans wrote:

> Freescale (or NXP Semiconductors now) will be licensing ARM technology.

"Will"? They have been doing that for over 14 YEARS, starting with the i.MX1 in 2001 and then the MAC7100 in 2003 or 2004.

Sure. I didn't try to make it sound like a new thing but rather that they have no problem, up to now, with paying their competitors when Motorola was once one of the leading processor innovators. Now Motorola is a Chinese company and Freescale will be a Dutch company. It's sad to see but they ignored their own organic technology innovations and didn't listen enough to their customers.

Tom Evans wrote:

> Sorry, I can't agree that the cut down ColdFire is good use of the 68k architecture.

I don't consider "dominance of the desktop, needing a huge heatsink" to be the pinnacle of success. The PPC had a nice long run in Macintoshes, but there's a lot of them running in embedded devices. There are multiple MPC55xx series for Automotive use, one of which we use. Likewise, the MCF in all the versions makes a very nice embedded controller. We've used the MCF5329 in an Automotive product, and it runs very nicely at 240MHz. Without a heatsink too.

My overclocked 68060@75MHz runs very cool. I doubt it would need a fan at all. The 68060 was designed to be usable in a laptop after the 68040 was one of the hottest processors of all time. It's a shame that Apple didn't create a 68060 Mac laptop where efficient resource use is more of an advantage.

I don't have anything against the PPC. It has some innovative features although the general thinking at the time of ISA creation was that hardware complexity could be moved into the compiler which failed. Instruction acronyms got out of hand also. I just love working with the 68k because of how readable the code is. Of course PPC and 68k/CF are dying and nearly done due to marketing reasons and lack of development. There really isn't much difference between PPC and ARM v8 which will likely replace PPC. I would rather stay with PPC considering how little difference there is and the work necessary to support a new ISA. The PPC backend of vbcc is better than for any other processor. Volker Barthelmann has experience in automotive embedded and created vbcc with that in mind. It supports MISRA C for example. I'm not sure the ARM backend even works in comparison despite the "embedded" focus.
