<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Bcc Instruction Execution Times in ColdFire/68K Microcontrollers and Processors</title>
    <link>https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/Bcc-Instruction-Execution-Times/m-p/344489#M12256</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;BLOCKQUOTE&gt;
&lt;P&gt;Tom Evans wrote:&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&amp;gt; A superscalar FPGA 68k CPU (in Altera Cyclone V)&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Would necessarily run at half the clock rate and double the power of real silicon. Good reference here:&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&lt;A class="jive-link-external-small" data-content-finding="Community" href="http://boinc.berkeley.edu/Thesis_Eastlack_Nov09.pdf" onclick="" rel="nofollow noopener noreferrer" target="_blank"&gt;http://boinc.berkeley.edu/Thesis_Eastlack_Nov09.pdf&lt;/A&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm aware of the limitations of FPGA technology. I have worked with the group making the Apollo/Phoenix 68k FPGA.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="http://apollo-core.com/" rel="nofollow noopener noreferrer" title="http://apollo-core.com/" target="_blank"&gt;APOLLO - High Performance Processor&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This FPGA superscalar 68k CPU has a clock speed of ~80MHz in an 8000 LE Altera Cyclone II FPGA, where it is very cramped but still manages 68060-level performance. A higher-performance version of the core with 3 integer units in a Cyclone V is being tested and optimized, but at 100-150MHz it is already giving about 3 times the performance. That is better than the fastest ColdFire v4. An FPGA is limited in clock speed, but parallel processing and taking advantage of memory bandwidth can give a lower-clocked processor better performance than some hard processors. Our 68k code analysis found that the 68k has short enough instructions (~3 bytes/instruction) to make 3 integer units worthwhile, and this can be improved with ISA changes. It may be possible to dual-port the cache memory, allowing 2 cache reads per cycle (taking advantage of excess FPGA memory bandwidth), which I believe would give significantly better performance, as this is the limitation in most processors. CPU performance in an FPGA is not a problem for embedded uses where a lower clock speed is an advantage. Power efficiency would not be as good as a hard processor's because of leakage, but there are relatively low-cost options to improve this, like the eASIC. An ARM CPU probably makes sense where power efficiency is more important than performance. A high-performance enhanced 68k has the potential to be a better Atom processor.
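As a back-of-envelope check of the 3-integer-unit argument, here is a small sketch. The 16-byte-per-cycle fetch width is an assumed figure chosen for illustration, not a documented parameter of the Apollo core:

```python
# Rough arithmetic behind the "3 integer units" claim: with an average
# 68k instruction size near 3 bytes, an assumed 16-byte-per-cycle fetch
# (illustrative figure only) supplies more instructions per cycle than
# three integer pipes can consume.
FETCH_BYTES_PER_CYCLE = 16   # assumption for illustration
AVG_68K_INSN_BYTES = 3       # measured average cited above
INTEGER_PIPES = 3

insns_fetched_per_cycle = FETCH_BYTES_PER_CYCLE / AVG_68K_INSN_BYTES
assert insns_fetched_per_cycle > INTEGER_PIPES
```

On that arithmetic, fetch supplies roughly 5.3 instructions per cycle, so decode and issue width rather than fetch bandwidth would be the limiting factor.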
Instruction sizes, code density, and instruction decoding are better than on x86, and it is not necessary to add more registers, as x86_64 did, to reduce cache accesses.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE class="jive_text_macro jive_macro_quote"&gt;
&lt;P&gt;Tom Evans wrote:&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&amp;gt; I believe code density can be improved 10%-25% over CF code&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I looked into part of this more than a decade ago. I needed to know how to set up GCC to compile for a CPU32 which didn't have some of the fancier 68020 addressing modes. It didn't matter as internally gcc was incapable of representing those addressing modes, so it never used them. That was part of the justification for removing them from the CF. That and them not being faster (or much faster) than the equivalent simple instructions. I suspect the same would still apply. So unless you write a compiler specifically for the 680x0 instructions they're just wasting transistors.&lt;/P&gt;
&lt;/PRE&gt;&lt;P&gt;&lt;BR /&gt;I've analyzed a lot of 68k code (I improved and modified a "smart" 68k disassembler to produce statistics), including many different versions of GCC. GCC has used the double indirect addressing modes for a long time (at least since GCC 3.x, where these addressing modes can be found in the compiler's executables). These addressing modes are useful for object-oriented code like C++ and sometimes save a register and improve code density. They are challenging to execute quickly without OoO execution though. A superscalar in-order processor can give faster code in cases where these addressing modes can be split into separate instructions and re-scheduled.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The 68060 and ColdFire dropped the 32*32=64 hardware integer multiplication, which GCC was also using for optimizations, despite Motorola deciding it wasn't necessary (it was useful enough for most ARM processors to add, though). GCC since 2.x has turned 32-bit division by a constant into a multiplication which requires the high 32 bits of the product. I'm actually working on similar functionality that I hope can be used by vbcc for 32-bit and 16-bit divisions (GCC does not optimize 16-bit integer divisions, which should work with the 68k and CF 16*16=32). Advanced compilers use LEA optimizations, as you pointed out in the other thread, even though this is challenging with the 68k, where there is a lack of orthogonality. The 68000 used a split register file, while all following (and future) 68k processors use a monolithic register file, which allows opening up address register sources. This reduces the number of instructions (improving performance), improves code density, and simplifies compiler code generation (OoO processors could also open up address register destinations, but without OoO this can give a load-use bubble, and there isn't as much of an advantage). There is a good place to encode LEA EA,Dn, which would improve orthogonality.
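The divide-by-constant optimization mentioned above can be sketched in a few lines. This is only an illustrative model of the technique, not GCC's code generation; the constant and shift are the standard unsigned divide-by-10 "magic number":

```python
# Illustrative model of turning an unsigned 32-bit division by 10 into
# a 32x32-to-64-bit multiply followed by a shift that keeps only the
# high bits of the product. The magic constant is the rounded-up
# reciprocal of 10 scaled by 2**35, valid for all 32-bit inputs.
MAGIC = 0xCCCCCCCD

def udiv10(n):
    # full 64-bit product, then discard the low 35 bits
    return (n * MAGIC) >> 35

# spot-check against real division
for n in (0, 9, 10, 12345, 2**32 - 1):
    assert udiv10(n) == n // 10
```

On a CPU with a 32*32=64 multiply, this replaces a slow division with one multiply-high and a short shift, which is exactly why dropping that multiply hurt the compiler.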
It's possible to add a very simple but very powerful addressing mode which would do the same as LEA but could also do an ALU calculation in the same pipe in the same cycle (up to 5 operations per pipe per cycle, but with load-use bubbles without OoO). Hardware designers seem to ignore compiler designers, even though good communication between them is crucial. Motorola was reducing performance while getting rid of functionality which was added a while later to lower-end ARM processors.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Here are some of the ideas we came up with for an enhanced 68k ISA.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="http://www.heywheel.com/matthey/Amiga/68kF_PRM.pdf" rel="nofollow noopener noreferrer" title="http://www.heywheel.com/matthey/Amiga/68kF_PRM.pdf" target="_blank"&gt;http://www.heywheel.com/matthey/Amiga/68kF_PRM.pdf&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I wanted to create a new compatible open 68k+CF ISA standard which could also be used for other FPGA 68k processors and emulators, but Gunnar von Boehn added his own radical ISA enhancements which are not documented (although they use some of the ideas from the link above).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE class="jive_text_macro jive_macro_quote"&gt;
&lt;P&gt;Tom Evans wrote:&lt;/P&gt;
&lt;P&gt;&amp;gt; Freescale (or NXP Semiconductors now) will be licensing ARM technology.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;"Will"? They have been doing that for over 14 YEARS, starting with the i.MX1 in 2001 and then the MAC7100 in 2003 or 2004.&lt;/P&gt;
&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Sure. I didn't mean to make it sound like a new thing, but rather to point out that they have had no problem, up to now, with paying their competitors, when Motorola was once one of the leading processor innovators. Now Motorola is a Chinese company and Freescale will be a Dutch company. It's sad to see, but they ignored their own organic technology innovations and didn't listen enough to their customers.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE class="jive_text_macro jive_macro_quote"&gt;
&lt;P&gt;Tom Evans wrote:&lt;/P&gt;
&lt;P&gt;&amp;gt; Sorry, I can't agree that the cut down ColdFire is good use of the 68k architecture.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I don't consider "dominance of the desktop, needing a huge heatsink" to be the pinnacle of success. The PPC had a nice long run in Macintoshes, but there's a lot of them running in embedded devices. There are multiple MCP55xx series for Automotive use, one of which we use.. Likewise, the MCF in all the versions makes a very nice embedded controller. We've used the MCF5329 in an Automotive product, and it runs very nicely at 240MHz. Without a heatsink too.&lt;/P&gt;
&lt;/PRE&gt;&lt;P&gt;&lt;BR /&gt;My overclocked 68060@75MHz runs very cool. I doubt it would need a fan at all. The 68060 was designed to be usable in a laptop, after the 68040 was one of the hottest processors of all time. It's a shame that Apple didn't create a 68060 Mac laptop, where efficient resource use is more of an advantage.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I don't have anything against the PPC. It has some innovative features, although the general thinking at the time of its ISA creation was that hardware complexity could be moved into the compiler, which failed. Instruction acronyms also got out of hand. I just love working with the 68k because of how readable the code is. Of course, PPC and 68k/CF are dying and nearly done due to marketing reasons and lack of development. There really isn't much difference between PPC and ARM v8, which will likely replace PPC. I would rather stay with PPC considering how little difference there is and the work necessary to support a new ISA. The PPC backend of vbcc is better than its backend for any other processor. Volker Barthelmann has experience in embedded automotive work and created vbcc with that in mind. It supports MISRA C, for example. I'm not sure the ARM backend even works in comparison, despite the "embedded" focus.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Tue, 14 Apr 2015 22:36:44 GMT</pubDate>
    <dc:creator>matthey</dc:creator>
    <dc:date>2015-04-14T22:36:44Z</dc:date>
    <item>
      <title>Bcc Instruction Execution Times</title>
      <link>https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/Bcc-Instruction-Execution-Times/m-p/344482#M12249</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;I want to know Bcc Instruction Execution Times in MCF52258. So I &lt;SPAN style="font-size: 10pt; line-height: 1.5em;"&gt;referred to MCF52259RM Rev4 and ColdFire Family Programmer's Reference Manual.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Table 3-19. Bcc Instruction Execution Times&lt;/P&gt;&lt;DIV class="j-rte-table"&gt;&lt;TABLE border="1" class="jiveBorder" height="69" style="border: 1px solid rgb(0, 0, 0); width: 386px; height: 30px;"&gt;&lt;THEAD&gt;&lt;TR&gt;&lt;TH style="border:1px solid black;border: 1px solid rgb(0, 0, 0);padding: 2px;color: #ffffff;background-color: #6690bc;text-align: center;" valign="middle"&gt;Opcode&lt;/TH&gt;&lt;TH style="border:1px solid black;border: 1px solid rgb(0, 0, 0);padding: 2px;color: #ffffff;background-color: #6690bc;text-align: center;" valign="middle"&gt;&lt;P&gt;Forward&lt;/P&gt;&lt;P&gt;Taken&lt;/P&gt;&lt;/TH&gt;&lt;TH style="border:1px solid black;border: 1px solid rgb(0, 0, 0);padding: 2px;color: #ffffff;background-color: #6690bc;text-align: center;" valign="middle"&gt;&lt;P&gt;Forward&lt;/P&gt;&lt;P&gt;Not Taken&lt;/P&gt;&lt;/TH&gt;&lt;TH style="border:1px solid black;border: 1px solid rgb(0, 0, 0);padding: 2px;color: #ffffff;background-color: #6690bc;text-align: center;" valign="middle"&gt;&lt;P&gt;Backward&lt;/P&gt;&lt;P&gt;Taken&lt;/P&gt;&lt;/TH&gt;&lt;TH style="border:1px solid black;border: 1px solid rgb(0, 0, 0);padding: 2px;color: #ffffff;background-color: #6690bc;text-align: center;" valign="middle"&gt;&lt;P&gt;Backward&lt;/P&gt;&lt;P&gt;Not Taken&lt;/P&gt;&lt;/TH&gt;&lt;/TR&gt;&lt;/THEAD&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD style="border:1px solid black;border: 1px solid rgb(0, 0, 0);padding: 2px;"&gt;Bcc &lt;/TD&gt;&lt;TD style="border:1px solid black;border: 1px solid rgb(0, 0, 0);padding: 2px;"&gt;3(0/0) &lt;/TD&gt;&lt;TD style="border:1px solid black;border: 1px solid rgb(0, 0, 0);padding: 2px;"&gt;1(0/0) 
&lt;/TD&gt;&lt;TD style="border:1px solid black;border: 1px solid rgb(0, 0, 0);padding: 2px;"&gt;2(0/0)&lt;/TD&gt;&lt;TD style="border:1px solid black;border: 1px solid rgb(0, 0, 0);padding: 2px;"&gt;3(0/0)&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;/DIV&gt;&lt;P style="min-height: 8pt; padding: 0px;"&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I understand the 'Forward' and 'Backward', but I don't konw what the 'Taken' and 'Not Taken' mean.&lt;/P&gt;&lt;P style="min-height: 8pt; padding: 0px;"&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;By the way, what differences between V2 and V4 core in Bcc Instruction Execution Times?&lt;/P&gt;&lt;P style="min-height: 8pt; padding: 0px;"&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV class="j-rte-table"&gt;&lt;TABLE border="1" class="jiveBorder" height="114" style="border: 1px solid rgb(0, 0, 0); width: 463px; height: 32px;"&gt;&lt;THEAD&gt;&lt;TR&gt;&lt;TH style="border:1px solid black;border: 1px solid rgb(0, 0, 0);padding: 2px;color: #ffffff;background-color: #6690bc;text-align: center;" valign="middle"&gt;Opcode&lt;/TH&gt;&lt;TH style="border:1px solid black;border: 1px solid rgb(0, 0, 0);padding: 2px;color: #ffffff;background-color: #6690bc;text-align: center;" valign="middle"&gt;&lt;P&gt;Branch Cache&lt;/P&gt;&lt;P&gt;Correctly Predicts&lt;/P&gt;&lt;P&gt;Taken&lt;/P&gt;&lt;/TH&gt;&lt;TH style="border:1px solid black;border: 1px solid rgb(0, 0, 0);padding: 2px;color: #ffffff;background-color: #6690bc;text-align: center;" valign="middle"&gt;&lt;P&gt;Prediction Table&lt;/P&gt;&lt;P&gt;Correctly Predicts&lt;/P&gt;&lt;P&gt;Taken&lt;/P&gt;&lt;/TH&gt;&lt;TH style="border:1px solid black;border: 1px solid rgb(0, 0, 0);padding: 2px;color: #ffffff;background-color: #6690bc;text-align: center;" valign="middle"&gt;&lt;P&gt;Predicted&lt;/P&gt;&lt;P&gt;Correctly as Not&lt;/P&gt;&lt;P&gt;Taken&lt;/P&gt;&lt;/TH&gt;&lt;TH style="border:1px solid black;border: 1px solid rgb(0, 0, 0);padding: 2px;color: #ffffff;background-color: #6690bc;text-align: center;" 
valign="middle"&gt;&lt;P&gt;Predicted&lt;/P&gt;&lt;P&gt;Incorrectly&lt;/P&gt;&lt;/TH&gt;&lt;/TR&gt;&lt;/THEAD&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD style="border:1px solid black;border: 1px solid rgb(0, 0, 0);padding: 2px;"&gt;Bcc&lt;/TD&gt;&lt;TD style="border:1px solid black;border: 1px solid rgb(0, 0, 0);padding: 2px;"&gt;0(0/0)&lt;/TD&gt;&lt;TD style="border:1px solid black;border: 1px solid rgb(0, 0, 0);padding: 2px;"&gt;1(0/0) &lt;/TD&gt;&lt;TD style="border:1px solid black;border: 1px solid rgb(0, 0, 0);padding: 2px;"&gt;1(0/0) &lt;/TD&gt;&lt;TD style="border:1px solid black;border: 1px solid rgb(0, 0, 0);padding: 2px;"&gt;1(0/0) &lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;/DIV&gt;&lt;P style="min-height: 8pt; padding: 0px;"&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;There is a code snippet, How much clock cycle does 'bne lup2 ' Instruction execution take in MCF52258 and MCF54415?&lt;/P&gt;&lt;P style="min-height: 8pt; padding: 0px;"&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;void time_delay(long delay:__D0)&lt;/P&gt;&lt;P&gt;{&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; asm&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; {&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; move.l&amp;nbsp;&amp;nbsp; d3,-(a7)&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; // save D3 on stack&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; lup1:&amp;nbsp; // outer loop&lt;/P&gt;&lt;P&gt;//&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; move.l&amp;nbsp; #17,d3&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; // for 75 MHz MCF5232&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; move.l&amp;nbsp; #36,d3&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; // for 100 MHz MCF54415&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; lup2:&amp;nbsp; // inner 
loop&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; subi #1,d3&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; bne lup2&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; subi #1,d0&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; bne lup1&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; move.l&amp;nbsp;&amp;nbsp; (a7)+,d3&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; // restore D3&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&amp;nbsp;&amp;nbsp; &lt;/P&gt;&lt;P&gt;}&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Tue, 24 Mar 2015 08:40:46 GMT</pubDate>
      <guid>https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/Bcc-Instruction-Execution-Times/m-p/344482#M12249</guid>
      <dc:creator>leocheng</dc:creator>
      <dc:date>2015-03-24T08:40:46Z</dc:date>
    </item>
    <item>
      <title>Re: Bcc Instruction Execution Times</title>
      <link>https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/Bcc-Instruction-Execution-Times/m-p/344483#M12250</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;&amp;gt; I understand the 'Forward' and 'Backward', but I don't know what the 'Taken' and 'Not Taken' mean.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If the condition is "true" (for "bne" the test was not-equal) then the branch is "taken" and the next instruction is the target of the branch. If the condition is "false" (it was equal) then the branch is "not taken" and the next instruction is the one following the branch.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&amp;gt; what are the differences between the V2 and V4 cores in Bcc instruction execution times?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;"What" or "Why"? I assume you mean "Why" because you answered the "What" in the tables you included in your post.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Reading the "Core Overview" chapters of representative manuals gives the following differences:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P style="padding-left: 30px;"&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;The V1 ColdFire core pipeline stages include the following:&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="padding-left: 30px;"&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;- &lt;STRONG&gt;Two-stage instruction fetch pipeline (IFP)&lt;/STRONG&gt; (plus optional instruction buffer stage)&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="padding-left: 30px;"&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;- &lt;STRONG&gt;Two-stage operand execution pipeline (OEP)&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="padding-left: 30px;"&gt;&lt;/P&gt;&lt;P style="padding-left: 30px;"&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;The V2 ColdFire core pipeline stages include the following:&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="padding-left: 30px;"&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;- &lt;STRONG&gt;Two-stage instruction fetch pipeline (IFP)&lt;/STRONG&gt; (plus optional instruction buffer stage)&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="padding-left: 
30px;"&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;- &lt;STRONG&gt;Two-stage operand execution pipeline (OEP)&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="padding-left: 30px;"&gt;&lt;/P&gt;&lt;P style="padding-left: 30px;"&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;The V3 ColdFire core pipeline stages include the following:&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="padding-left: 30px;"&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;- &lt;STRONG&gt;Four-stage instruction fetch pipeline (IFP)&lt;/STRONG&gt; (plus optional instruction buffer stage)&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="padding-left: 30px;"&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;- &lt;STRONG&gt;Two-stage operand execution pipeline (OEP)&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="padding-left: 30px;"&gt;&lt;/P&gt;&lt;P style="padding-left: 30px;"&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;V4 architecture features are defined as follows:&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="padding-left: 30px;"&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;- Two independent, decoupled pipelines—&lt;STRONG&gt;four-stage instruction fetch pipeline (IFP) and five-stage&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="padding-left: 30px;"&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;operand execution pipeline (OEP)&lt;/STRONG&gt; for increased performance&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="padding-left: 30px;"&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;- Ten-instruction, FIFO buffer that decouples the IFP and OEP&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="padding-left: 30px;"&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;- Limited superscalar design approaches dual-issue performance with the cost of a scalar execution&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="padding-left: 30px;"&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;pipeline&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="padding-left: 
30px;"&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;- Two-level branch acceleration mechanism with a branch cache, plus a prediction table for&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="padding-left: 30px;"&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;increased performance of conditional Bcc instructions&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-size: 10pt; font-family: arial,helvetica,sans-serif;"&gt;The deeper the pipeline the faster the CPU can run, until you derail it with a conditional branch. Then it has to pick itself up and start all over. A misprediction in the&amp;nbsp; 2+2-stage CF2 costs one extra clock. In the 4+2-stage CF3 it costs an extra 4 clocks. In the 4+5-stage CF4 it costs an extra &lt;STRONG style="color: #ff0000;"&gt;SEVEN&lt;/STRONG&gt; clocks, so they added a Branch Cache to make this less of a problem.&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-size: 10pt; font-family: arial,helvetica,sans-serif;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-size: 10pt; font-family: arial,helvetica,sans-serif;"&gt;You got the CF4 "Predicted Incorrectly" entry wrong in your table. It is not "1(0/0)". 
It is "&lt;SPAN style="color: #ff0000;"&gt;&lt;STRONG&gt;8&lt;/STRONG&gt;&lt;/SPAN&gt;(0/0)".&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: arial,helvetica,sans-serif; font-size: 10pt;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: arial,helvetica,sans-serif; font-size: 10pt;"&gt;&amp;gt; How much clock cycle does 'bne lup2 ' Instruction execution take in MCF52258 and MCF54415&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: arial,helvetica,sans-serif; font-size: 10pt;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: arial,helvetica,sans-serif; font-size: 10pt;"&gt;Branch Backwards Taken, so 2 clocks for CF2 and 0 clocks for CF4 once the Branch Cache is loaded. It will take a LOT longer the first time while the cache line holding the instruction sequence gets loaded from memory, maybe 20 to 30 clocks on a CF4.&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: arial,helvetica,sans-serif; font-size: 10pt;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: arial,helvetica,sans-serif; font-size: 10pt;"&gt;For reference, the CF3 is worse than either the CF2 or CF4. It takes "1(0/0" for "Forward Not Taken" and "Backward Taken" branches, and "5(0/0)" for the other combinations. Except when it is the reverse, which happens randomly if you're using the gcc compiler.&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: arial,helvetica,sans-serif; font-size: 10pt;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: arial,helvetica,sans-serif; font-size: 10pt;"&gt;The CF3 design added a "swap the prediction" bit in the CCR which the gcc compiler randomly flips around when performing some bit-test-and-branch instructions. 
That makes the branches take FIVE times as long as they should, depending on what the compiler did and the previous code flow:&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: arial,helvetica,sans-serif;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;A href="https://community.freescale.com/message/404184#404184" target="_blank"&gt;https://community.freescale.com/message/404184#404184&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;gt; void time_delay(long delay:__D0)&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-size: 10pt; font-family: arial,helvetica,sans-serif;"&gt;Delay loops are usually a bad idea. It is best to avoid them if you can. Why do you need them? There are usually a few spare PIT or DMA Timers in the CPU that can be set up to free-run at 1MHz (or more) and thus allow precise microsecond delays where required. The other approach (where very short delays are required) is what Linux does with its "Bogomips" counter. It calibrates the delay loop on power-up from a hardware timer. 
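That calibration idea can be sketched abstractly. This is a hedged illustration only: read_timer_us is a hypothetical stand-in for a real free-running hardware timer read, simulated here so the sketch runs anywhere, and the Python loop merely models the subi/bne pair:

```python
# Hedged sketch of timer-based delay-loop calibration (Bogomips style).
import time

def read_timer_us():
    # hypothetical stand-in for a 1 MHz free-running hardware timer count
    return time.perf_counter() * 1_000_000

def delay_loop(iterations):
    # models the subi/bne inner loop from the code snippet above
    n = iterations
    while n > 0:
        n = n - 1

def calibrate(iterations=1_000_000):
    # time a known number of iterations against the free-running timer
    start = read_timer_us()
    delay_loop(iterations)
    elapsed = read_timer_us() - start
    return iterations / elapsed  # loop iterations per microsecond

loops_per_us = calibrate()

def delay_us(us):
    # delay using the calibrated rate instead of a hard-coded count
    delay_loop(int(us * loops_per_us))
```

The point of calibrating at power-up is that the hard-coded counts in the snippet above (17 vs 36) stop being tied to one specific part and clock speed.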
This can't work with the CF3 and gcc as loops like yours sometimes take 2 clocks and other times take 6 clocks.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-size: 10pt; font-family: arial,helvetica,sans-serif;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-size: 10pt; font-family: arial,helvetica,sans-serif;"&gt;Tom&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-size: 10pt; font-family: arial,helvetica,sans-serif;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 25 Mar 2015 00:26:57 GMT</pubDate>
      <guid>https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/Bcc-Instruction-Execution-Times/m-p/344483#M12250</guid>
      <dc:creator>TomE</dc:creator>
      <dc:date>2015-03-25T00:26:57Z</dc:date>
    </item>
    <item>
      <title>Re: Bcc Instruction Execution Times</title>
      <link>https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/Bcc-Instruction-Execution-Times/m-p/344484#M12251</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;I said:&lt;/P&gt;&lt;P&gt;&amp;gt; &lt;SPAN style="font-family: arial,helvetica,sans-serif; font-size: 10pt;"&gt;Branch Backwards Taken, so 2 clocks for CF2 and 0 clocks for CF4 once the Branch Cache is loaded.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;That assumes you remember to enable the CF4 Branch Cache. It defaults to being disabled, and needs to be cleared before use as part of initialisation. If you've forgotten to enable it, your loops may take 8 times as long.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;There is not enough information in the Reference Manual to know what happens if the cache is disabled. There's an 8-entry "Branch Cache" backed by a 128-entry "Prediction Table". The Manual doesn't detail what the CACR[BEC] bit actually enables. It says it enables the "Branch Cache", but doesn't say whether that includes the "Prediction Table" or not.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Nothing in the manual gives the branch execution times in the case of the Branch Cache being disabled. Does the CPU fall back to the CF2 and CF3 "predict backward branches are taken" behaviour, does it default to predicting not-taken for all branches, or do all branches (forward or backward, taken or not) take 8 clocks?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;There's nothing in the CF4 manuals, nothing in the CFPRM, and no App Notes that I can find. Searching Freescale's site for "branch prediction table" gets a huge number of hits on the Power chips, so I suspect that to learn about the CF4 Branch Prediction I'd have to read some MPC manuals, as that's probably where the CF4 technology was copied from. For instance, the e200z4 core has a Branch Cache, no Prediction Table, but has controllable Static Prediction, so that's not where the CF4 came from. 
The e300 has no Branch Hardware but uses the "BO" prediction bit in the instruction.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Google finds this from 1998, which suggests the CF4 Branch Cache came from the 68060. It did, but not the Prediction Table.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="http://www.freescale.com/files/32bit/doc/eng_bulletin/COLDFIRE4MPR.pdf" title="http://www.freescale.com/files/32bit/doc/eng_bulletin/COLDFIRE4MPR.pdf"&gt;http://www.freescale.com/files/32bit/doc/eng_bulletin/COLDFIRE4MPR.pdf&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;And also:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; move.l&amp;nbsp; #36,d3&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; // for 100 MHz MCF54415&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; lup2:&amp;nbsp; // inner loop&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Once in the program cache, your delay loop will take 35 clocks to run 35 times on the CF4, then the 36th pass will take EIGHT clocks due to the branch misprediction. That's 23% just to get out of the loop!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Tom&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 25 Mar 2015 23:32:50 GMT</pubDate>
      <guid>https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/Bcc-Instruction-Execution-Times/m-p/344484#M12251</guid>
      <dc:creator>TomE</dc:creator>
      <dc:date>2015-03-25T23:32:50Z</dc:date>
    </item>
    <item>
      <title>Re: Bcc Instruction Execution Times</title>
      <link>https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/Bcc-Instruction-Execution-Times/m-p/344485#M12252</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;BLOCKQUOTE&gt;
&lt;P&gt;Tom Evans wrote:&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Google finds this from 1998 which suggests the CF4 Branch Cache came from the 68060. It did, but not the Prediction Table.&lt;/P&gt;
&lt;P style="min-height: 8pt; height: 8pt; padding: 0px;"&gt;&lt;/P&gt;
&lt;P&gt;&lt;A class="jive-link-external-small" data-content-finding="Community" href="http://www.freescale.com/files/32bit/doc/eng_bulletin/COLDFIRE4MPR.pdf" onclick="" target="_blank"&gt;http://www.freescale.com/files/32bit/doc/eng_bulletin/COLDFIRE4MPR.pdf&lt;/A&gt;&lt;/P&gt;


&lt;/BLOCKQUOTE&gt;&lt;P&gt;I wonder if it is just a change in terminology. The 68060 has a branch cache and branch history prediction using saturating 2-bit prediction counters. The article you found implies that the ColdFire and 68060 branch prediction are very similar. The 68060 User Manual certainly documents it differently.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="http://cache.freescale.com/files/32bit/doc/ref_manual/MC68060UM.pdf" title="http://cache.freescale.com/files/32bit/doc/ref_manual/MC68060UM.pdf"&gt;http://cache.freescale.com/files/32bit/doc/ref_manual/MC68060UM.pdf&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Here is the documentation for the 68060 CACR config bits:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P data-canvas-width="255.6600000000001" style="font-size: 20px; font-family: sans-serif;"&gt;&lt;SPAN style="font-size: 10pt;"&gt;EBC—Enable Branch Cache&lt;/SPAN&gt;&lt;/P&gt;&lt;P data-canvas-width="731.4999999999998" style="font-family: sans-serif;"&gt;&lt;SPAN style="font-size: 10pt;"&gt;0 = The branch cache is disabled and branch cache information is not used in the &lt;/SPAN&gt;&lt;/P&gt;&lt;P data-canvas-width="235.66000000000005" style="font-family: sans-serif;"&gt;&lt;SPAN style="font-size: 10pt;"&gt;branch prediction strategy.&lt;/SPAN&gt;&lt;/P&gt;&lt;P data-canvas-width="752.6199999999998" style="font-family: sans-serif;"&gt;&lt;SPAN style="font-size: 10pt;"&gt;1 = The on-chip branch cache is enabled. Branches are cached. 
A predicted branch &lt;/SPAN&gt;&lt;/P&gt;&lt;P data-canvas-width="638.1200000000001" style="font-family: sans-serif;"&gt;&lt;SPAN style="font-size: 10pt;"&gt;executes more quickly, and often can be folded onto another instruction.&lt;/SPAN&gt;&lt;/P&gt;&lt;P data-canvas-width="638.1200000000001" style="font-family: sans-serif;"&gt;&lt;SPAN style="font-size: 10pt;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P data-canvas-width="405.6800000000002" style="font-family: sans-serif;"&gt;&lt;SPAN style="font-size: 10pt;"&gt;CABC—Clear All Entries in the Branch Cache&lt;/SPAN&gt;&lt;/P&gt;&lt;P data-canvas-width="273.42" style="font-family: sans-serif;"&gt;&lt;SPAN style="font-size: 10pt;"&gt;This bit is always read as zero.&lt;/SPAN&gt;&lt;/P&gt;&lt;P data-canvas-width="416.92" style="font-family: sans-serif;"&gt;&lt;SPAN style="font-size: 10pt;"&gt;0 = No operation is done on the branch cache.&lt;/SPAN&gt;&lt;/P&gt;&lt;P data-canvas-width="601.4400000000002" style="font-family: sans-serif;"&gt;&lt;SPAN style="font-size: 10pt;"&gt;1 = The entire content of the MC68060 branch cache is invalidated.&lt;/SPAN&gt;&lt;/P&gt;&lt;P data-canvas-width="454.5600000000001" style="font-family: sans-serif;"&gt;&lt;SPAN style="font-size: 10pt;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P data-canvas-width="454.5600000000001" style="font-family: sans-serif;"&gt;&lt;SPAN style="font-size: 10pt;"&gt;CUBC—Clear All User Entries in the Branch Cache&lt;/SPAN&gt;&lt;/P&gt;&lt;P data-canvas-width="273.42" style="font-family: sans-serif;"&gt;&lt;SPAN style="font-size: 10pt;"&gt;This bit is always read as zero.&lt;/SPAN&gt;&lt;/P&gt;&lt;P data-canvas-width="463.5800000000001" style="font-family: sans-serif;"&gt;&lt;SPAN style="font-size: 10pt;"&gt;0 = No operation is performed on the branch cache.&lt;/SPAN&gt;&lt;/P&gt;&lt;P data-canvas-width="751.4200000000001" style="font-family: sans-serif;"&gt;&lt;SPAN style="font-size: 10pt;"&gt;1 = All user-mode entries in the MC68060 branch cache are 
invalidated; supervisor-&lt;/SPAN&gt;&lt;/P&gt;&lt;P data-canvas-width="360.14000000000004" style="font-family: sans-serif;"&gt;&lt;SPAN style="font-size: 10pt;"&gt;mode branch cache entries remain valid.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The 68060 branch prediction timing chart is more interesting and easier to read than for the CF, IMO.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;TABLE border="1" class="jiveBorder" style="border: 1px solid #000000; width: 100%;"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TH style="text-align: center; background-color: #6690bc; color: #ffffff; padding: 2px;" valign="middle"&gt;&lt;P&gt;&lt;STRONG&gt;&lt;BR /&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;BR /&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Instruction&lt;/STRONG&gt;&lt;/P&gt;&lt;/TH&gt;&lt;TH style="text-align: center; background-color: #6690bc; color: #ffffff; padding: 2px;" valign="middle"&gt;&lt;P data-canvas-width="29.159999999999997" style="font-size: 15px; font-family: sans-serif;"&gt;Not&lt;/P&gt;&lt;P data-canvas-width="76.69500000000001" style="font-size: 15px; font-family: sans-serif;"&gt;Predicted,&lt;/P&gt;&lt;P data-canvas-width="67.51500000000001" style="font-size: 15px; font-family: sans-serif;"&gt;Forward,&lt;/P&gt;&lt;P data-canvas-width="43.349999999999994" style="font-size: 15px; font-family: sans-serif;"&gt;Taken&lt;/P&gt;&lt;/TH&gt;&lt;TH style="text-align: center; background-color: #6690bc; color: #ffffff; padding: 2px;" valign="middle"&gt;&lt;P data-canvas-width="29.159999999999997" style="font-size: 15px; font-family: sans-serif;"&gt;Not&lt;/P&gt;&lt;P data-canvas-width="76.69500000000001" style="font-size: 15px; font-family: sans-serif;"&gt;Predicted,&lt;/P&gt;&lt;P data-canvas-width="67.51500000000001" style="font-size: 15px; font-family: sans-serif;"&gt;Forward,&lt;/P&gt;&lt;P data-canvas-width="72.50999999999999" style="font-size: 15px; font-family: sans-serif;"&gt;Not Taken&lt;/P&gt;&lt;/TH&gt;&lt;TH style="text-align: center; 
background-color: #6690bc; color: #ffffff; padding: 2px;" valign="middle"&gt;&lt;P data-canvas-width="29.159999999999997" style="font-size: 15px; font-family: sans-serif;"&gt;Not&lt;/P&gt;&lt;P data-canvas-width="76.69500000000001" style="font-size: 15px; font-family: sans-serif;"&gt;Predicted,&lt;/P&gt;&lt;P data-canvas-width="79.20000000000002" style="font-size: 15px; font-family: sans-serif;"&gt;Backward,&lt;/P&gt;&lt;P data-canvas-width="43.349999999999994" style="font-size: 15px; font-family: sans-serif;"&gt;Taken&lt;/P&gt;&lt;/TH&gt;&lt;TH style="text-align: center; background-color: #6690bc; color: #ffffff; padding: 2px;" valign="middle"&gt;&lt;P data-canvas-width="29.159999999999997" style="font-size: 15px; font-family: sans-serif;"&gt;Not&lt;/P&gt;&lt;P data-canvas-width="76.69500000000001" style="font-size: 15px; font-family: sans-serif;"&gt;Predicted,&lt;/P&gt;&lt;P data-canvas-width="79.20000000000002" style="font-size: 15px; font-family: sans-serif;"&gt;Backward,&lt;/P&gt;&lt;P data-canvas-width="72.50999999999999" style="font-size: 15px; font-family: sans-serif;"&gt;Not Taken&lt;/P&gt;&lt;/TH&gt;&lt;TH style="text-align: center; background-color: #6690bc; color: #ffffff; padding: 2px;" valign="middle"&gt;&lt;P data-canvas-width="72.525" style="font-size: 15px; font-family: sans-serif;"&gt;&lt;/P&gt;&lt;P data-canvas-width="72.525" style="font-size: 15px; font-family: sans-serif;"&gt;Predicted&lt;/P&gt;&lt;P data-canvas-width="87.12000000000002" style="font-size: 15px; font-family: sans-serif;"&gt;Correctly as&lt;/P&gt;&lt;P data-canvas-width="43.349999999999994" style="font-size: 15px; font-family: sans-serif;"&gt;Taken&lt;/P&gt;&lt;/TH&gt;&lt;TH style="text-align: center; background-color: #6690bc; color: #ffffff; padding: 2px;" valign="middle"&gt;&lt;P data-canvas-width="70.65" style="font-size: 15px; font-family: sans-serif;"&gt;&lt;/P&gt;&lt;P data-canvas-width="70.65" style="font-size: 15px; font-family: sans-serif;"&gt;Predicted&lt;/P&gt;&lt;P 
data-canvas-width="87.12000000000002" style="font-size: 15px; font-family: sans-serif;"&gt;Correctly as&lt;/P&gt;&lt;P data-canvas-width="72.50999999999999" style="font-size: 15px; font-family: sans-serif;"&gt;Not Taken&lt;/P&gt;&lt;/TH&gt;&lt;TH style="text-align: center; background-color: #6690bc; color: #ffffff; padding: 2px;" valign="middle"&gt;&lt;P data-canvas-width="72.525" style="font-size: 15px; font-family: sans-serif;"&gt;&lt;/P&gt;&lt;P data-canvas-width="72.525" style="font-size: 15px; font-family: sans-serif;"&gt;&lt;/P&gt;&lt;P data-canvas-width="72.525" style="font-size: 15px; font-family: sans-serif;"&gt;Predicted&lt;/P&gt;&lt;P data-canvas-width="76.695" style="font-size: 15px; font-family: sans-serif;"&gt;Incorrectly&lt;/P&gt;&lt;/TH&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;Bcc&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;7 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;1 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;3 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;7 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;0 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;1 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;7 (0/0)&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;BRA&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;3 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;3 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;0 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;BSR&lt;/TD&gt;&lt;TD 
style="padding: 2px; text-align: center;"&gt;3 (0/1)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;3 (0/1)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;1 (0/1)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;DBcc&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;3 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;8 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;3 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;8 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;2 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;2 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;8 (0/0)&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;DBRA&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;3 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;7 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;3 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;7 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;1 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;1 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;7 (0/0)&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;FBcc&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;8 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;2 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;8 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;2 (0/0)&lt;/TD&gt;&lt;TD 
style="padding: 2px; text-align: center;"&gt;2 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;2 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;8 (0/0)&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;JMP (d16,PC)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;3 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;3 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;0 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;JMP xxx.WL&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;3 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;3 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;0 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;Remaining JMP&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;5 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;5 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;5 (0/0)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;JSR (d16,PC)&lt;/TD&gt;&lt;TD 
style="padding: 2px; text-align: center;"&gt;3 (0/1)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;3 (0/1)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;1 (0/1)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;JSR xxx.WL&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;3 (0/1)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;3 (0/1)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;1 (0/1)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;Remaining JSR&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;5 (0/1)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;5 (0/1)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;5 (0/1)&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;TD style="padding: 2px; text-align: center;"&gt;-&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;After studying the chart I can conclude with no certainty that "Not Predicted" equals no entry in the branch cache and:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The default static branch prediction of BTFN (Backward Taken Forward Not) is used.&lt;/P&gt;&lt;P&gt;There is no instruction folding so there is a 3 cycle penalty on static predicted Bcc 
branches which are not in the branch cache. &lt;/P&gt;&lt;P&gt;A static mis-predicted branch is no worse than a branch history table mis-predicted branch.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Also, we might guess that:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;FBcc does not use the branch cache or branch prediction.&lt;/P&gt;&lt;P&gt;Only PC relative and absolute JMP and JSR use the branch cache.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Of course any of this could have been retuned for the ColdFire v4 or v5 but my guess is that they are more similar than not (with the pipelines becoming more like the 68060). Joe Circello created a wonderful design on the 68060 so why change what works. It's too bad Motorola didn't realize what they had instead of betting the farm on the green pasture on the other side of the fence called PPC.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
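The scheme inferred above (a branch cache of 2-bit saturating counters with a static BTFN fallback on a miss) can be sketched as a toy model. The dictionary keyed by branch address and the initial counter value are illustrative assumptions, not the 68060's actual implementation:

```python
# Toy model of the inferred 68060 scheme: a branch cache of 2-bit
# saturating counters (0..3), with static BTFN (Backward Taken,
# Forward Not-taken) used when a branch has no cache entry.

def predict(cache, pc, target):
    if pc in cache:
        return cache[pc] >= 2   # dynamic: counter in upper half means taken
    return pc > target          # static BTFN: backward branch, predict taken

def update(cache, pc, taken):
    # nudge the 2-bit counter toward 3 (taken) or 0 (not taken)
    c = cache.get(pc, 1)        # assumed initial counter on first entry
    cache[pc] = min(c + 1, 3) if taken else max(c - 1, 0)
```

A loop-closing (backward) branch is predicted taken even before it earns a cache entry, which is why the BTFN default works well for loop code.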
      <pubDate>Tue, 31 Mar 2015 06:52:15 GMT</pubDate>
      <guid>https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/Bcc-Instruction-Execution-Times/m-p/344485#M12252</guid>
      <dc:creator>matthey</dc:creator>
      <dc:date>2015-03-31T06:52:15Z</dc:date>
    </item>
    <item>
      <title>Re: Bcc Instruction Execution Times</title>
      <link>https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/Bcc-Instruction-Execution-Times/m-p/344486#M12253</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;&amp;gt; It's too bad Motorola didn't realize what they had instead of betting the farm on the green pasture on the other side of the fence called PPC.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Apple had already jumped to PPC by the time the 68060 came out. Apple were competing with Intel-based computers, and were falling behind. Motorola risked Apple jumping all the way to Intel (as if that would ever happen :-), which is why they let someone else take on the huge expense of running the CPU architecture, clock rate, process shrink race and compiler development and support.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Motorola did put the 68k architecture to good use in ColdFire.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;When was the last ColdFire core development? The V4e dates from 15 years ago in 2000. Freescale's "CPU architecture development group" is now ARM, like pretty much everyone else.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Tom&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Sat, 04 Apr 2015 03:42:17 GMT</pubDate>
      <guid>https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/Bcc-Instruction-Execution-Times/m-p/344486#M12253</guid>
      <dc:creator>TomE</dc:creator>
      <dc:date>2015-04-04T03:42:17Z</dc:date>
    </item>
    <item>
      <title>Re: Bcc Instruction Execution Times</title>
      <link>https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/Bcc-Instruction-Execution-Times/m-p/344487#M12254</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;BLOCKQUOTE&gt;
&lt;P&gt;Tom Evans wrote:&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&amp;gt; It's too bad Motorola didn't realize what they had instead of betting the farm on the green pasture on the other side of the fence called PPC.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Apple had already jumped to PPC by the time the 68060 came out. Apple were competing with Intel-based computers, and were falling behind. Motorola risked Apple jumping all the way to Intel (as if that would ever happen :-), which is why they let someone else take on the huge expense of running the CPU architecture, clock rate, process shrink race and compiler development and support.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Motorola did put the 68k architecture to good use in ColdFire.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;When was the last ColdFire core development? The V4e dates from 15 years ago in 2000. Freescale's "CPU architecture development group" is now ARM, like pretty much everyone else.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Motorola was telling customers that the 68k was end-of-line and the future was RISC and PPC (it's not surprising what customers decided to do). The idea was that RISC would clock higher and that hardware complexity could be moved into the compiler. Neither proved to be an advantage, as RISC needs to clock significantly higher to be competitive and compilers are limited in the assumptions they can make. Motorola had trouble clocking up the PPC also. I wouldn't be surprised if they deliberately decided not to clock the 68060 up because they didn't want it to compete with PPC (a rev 6 68060@50MHz will overclock to 100MHz most of the time). Amigas with 68060 accelerators and Macintosh emulators were the fastest Macintoshes for a while. Of course, the 68060 couldn't survive anti-marketing and no improvements.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The ColdFire was weakened considerably compared to the 68060. The performance/MHz dropped dramatically and was never restored, even with the v5 CF, and code density deteriorated. A superscalar FPGA 68k CPU (in Altera Cyclone V) should soon be outperforming the fastest 68060 and the fastest v4 ColdFire CPU. I believe code density can be improved 10%-25% over CF code, to better than Thumb 2. Instead of organically developing a competitive 68k/CF, Freescale (or NXP Semiconductors now) will be licensing ARM technology. Sorry, I can't agree that the cut-down ColdFire is a good use of the 68k architecture.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Mon, 06 Apr 2015 20:50:10 GMT</pubDate>
      <guid>https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/Bcc-Instruction-Execution-Times/m-p/344487#M12254</guid>
      <dc:creator>matthey</dc:creator>
      <dc:date>2015-04-06T20:50:10Z</dc:date>
    </item>
    <item>
      <title>Re: Bcc Instruction Execution Times</title>
      <link>https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/Bcc-Instruction-Execution-Times/m-p/344488#M12255</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;&amp;gt; A superscalar FPGA 68k CPU (in Altera Cyclone V)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Would necessarily run at half the clock rate and double the power of real silicon. Good reference here:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="http://boinc.berkeley.edu/Thesis_Eastlack_Nov09.pdf" title="http://boinc.berkeley.edu/Thesis_Eastlack_Nov09.pdf"&gt;http://boinc.berkeley.edu/Thesis_Eastlack_Nov09.pdf&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&amp;gt; I believe code density can be improved 10%-25% over CF code&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I looked into part of this more than a decade ago. I needed to know how to set up GCC to compile for a CPU32, which didn't have some of the fancier 68020 addressing modes. It didn't matter, as internally GCC was incapable of representing those addressing modes, so it never used them. That was part of the justification for removing them from the CF. That, and them not being faster (or much faster) than the equivalent simple instructions. I suspect the same would still apply. So unless you write a compiler specifically for the 680x0 instructions, they're just wasting transistors.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&amp;gt; Freescale (or NXP Semiconductors now) will be licensing ARM technology.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;"Will"? They have been doing that for over 14 YEARS, starting with the i.MX1 in 2001 and then the MAC7100 in 2003 or 2004.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&amp;gt; Sorry, I can't agree that the cut down ColdFire is good use of the 68k architecture.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I don't consider "dominance of the desktop, needing a huge heatsink" to be the pinnacle of success. The PPC had a nice long run in Macintoshes, but there's a lot of them running in embedded devices. There are multiple MPC55xx series for Automotive use, one of which we use.
Likewise, the MCF in all its versions makes a very nice embedded controller. We've used the MCF5329 in an Automotive product, and it runs very nicely at 240MHz. Without a heatsink too.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Tom&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Sun, 12 Apr 2015 23:51:38 GMT</pubDate>
      <guid>https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/Bcc-Instruction-Execution-Times/m-p/344488#M12255</guid>
      <dc:creator>TomE</dc:creator>
      <dc:date>2015-04-12T23:51:38Z</dc:date>
    </item>
    <item>
      <title>Re: Bcc Instruction Execution Times</title>
      <link>https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/Bcc-Instruction-Execution-Times/m-p/344489#M12256</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;BLOCKQUOTE&gt;
&lt;P&gt;Tom Evans wrote:&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&amp;gt; A superscalar FPGA 68k CPU (in Altera Cyclone V)&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Would necessarily run at half the clock rate and double the power of real silicon. Good reference here:&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&lt;A class="jive-link-external-small" data-content-finding="Community" href="http://boinc.berkeley.edu/Thesis_Eastlack_Nov09.pdf" onclick="" rel="nofollow noopener noreferrer" target="_blank"&gt;http://boinc.berkeley.edu/Thesis_Eastlack_Nov09.pdf&lt;/A&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm aware of the limitations of FPGA technology. I have worked with the group making the Apollo/Phoenix 68k FPGA.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="http://apollo-core.com/" rel="nofollow noopener noreferrer" title="http://apollo-core.com/" target="_blank"&gt;APOLLO - High Performance Processor&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This FPGA superscalar 68k CPU has a clock speed of ~80MHz in an 8000LE Altera Cyclone II FPGA where it is very cramped but still manages 68060 level performance. A higher performance 3 integer unit version of the core in a Cyclone V is being tested and optimized but at 100-150MHz is already giving about 3 times the performance. This is better performance than the fastest ColdFire v4. An FPGA is limited in clock speed but parallel processing and taking advantage of memory bandwidth can give a lower clocked processor with performance better than some hard processors. Our 68k code analysis found that the 68k has short enough instructions (~3 bytes/instruction) to make 3 integer units worthwhile and this can be improved with ISA changes. It may be possible to dual port the cache memory allowing for 2 cache reads per cycle (takes advantage of excess FPGA memory bandwidth) which I believe would give significantly better performance as this is the limitation in most processors. CPU performance in an FPGA is not a problem for embedded uses where lower clock speed is an advantage. Power efficiency would not be as good as a hard processor because of leakage but there are relatively low cost options to improve this like the eASIC. An ARM CPU probably makes sense where power efficiency is more important than performance. A high performance enhanced 68k has the potential to be a better Atom processor. 
Instruction sizes, code density and instruction decoding are better than on x86, and it is not necessary to add more registers, as x86_64 did, to reduce cache accesses.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE class="jive_text_macro jive_macro_quote"&gt;
&lt;P&gt;Tom Evans wrote:&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&amp;gt; I believe code density can be improved 10%-25% over CF code&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I looked into part of this more than a decade ago. I needed to know how to set up GCC to compile for a CPU32 which didn't have some of the fancier 68020 addressing modes. It didn't matter as internally gcc was incapable of representing those addressing modes, so it never used them. That was part of the justification for removing them from the CF. That and them not being faster (or much faster) than the equivalent simple instructions. I suspect the same would still apply. So unless you write a compiler specifically for the 680x0 instructions they're just wasting transistors.&lt;/P&gt;
&lt;/PRE&gt;&lt;P&gt;&lt;BR /&gt;I've analyzed a lot of 68k code (I improved and modified a "smart" 68k disassembler to produce statistics) including many different versions of GCC. GCC has used the double indirect addressing modes for a long time (at least since GCC 3.x where these addressing modes can be found in the compiler's executables). These addressing modes are useful for object oriented code like C++ and sometimes save a register and improve code density. They are challenging to execute quickly without OoO execution though. A superscalar in order processor can give faster code in cases where these addressing modes can be split into separate instructions and re-scheduled.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The 68060 and ColdFire dropped the 32*32=64 hardware integer multiplication which GCC was also using for optimizations despite Motorola deciding it wasn't necessary (it was useful enough for most ARM processors to add though). GCC since 2.x has turned 32 bit division by a constant into a multiplication which requires the high 32 bits of the product. I'm actually working on similar functionality I hope can be used by vbcc for 32 bit and 16 bit divisions (GCC does not optimize 16 bit integer divisions which should work with the 68k and CF 16*16=32). Advanced compilers use LEA optimizations as you pointed out in the other thread even though this is challenging with the 68k where there is a lack of orthogonality. The 68000 used a split register file while all following (and future) 68k processors use a monolithic register file which allows opening up address register sources. This reduces the number of instructions improving performance, improves code density and simplifies compiler code generation (OoO processors could open up address register destinations but this can give a load use bubble without it and there isn't as much of an advantage). There is a good place to encode LEA EA,Dn which would improve orthogonality. 
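The divide-by-constant transformation described above can be sketched in a couple of lines. The magic constant here is for unsigned 32-bit division by 10; it follows the standard multiply-high method and is only an illustration, not vbcc's or GCC's actual code:

```python
# Replace x // 10 (unsigned 32-bit x) with a multiply whose 32x32=64
# product is shifted down; only the high bits of the product survive,
# which is why the widening multiply matters.  MAGIC is 2**35 / 10
# rounded up.
MAGIC = 0xCCCCCCCD

def udiv10(x):
    # exact for every 32-bit unsigned x
    return (x * MAGIC) >> 35
```

On hardware this is one multiply that keeps the upper 32 bits of the product, plus a shift by 3, which is why dropping the 32*32=64 multiply hurts this optimization.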
It's possible to add a very simple but very powerful addressing mode which would do the same as LEA but could do an ALU calculation in the same pipe in the same cycle (up to 5 operations per pipe per cycle, but with load-use bubbles without OoO). Hardware designers seem to ignore compiler designers, even though good communication between them is crucial. Motorola was reducing performance while getting rid of functionality which was added a while later to lower-end ARM processors.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Here are some of the ideas we came up with for an enhanced 68k ISA.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="http://www.heywheel.com/matthey/Amiga/68kF_PRM.pdf" rel="nofollow noopener noreferrer" title="http://www.heywheel.com/matthey/Amiga/68kF_PRM.pdf" target="_blank"&gt;http://www.heywheel.com/matthey/Amiga/68kF_PRM.pdf&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I wanted to create a new compatible open 68k+CF ISA standard which could also be used for other FPGA 68k processors and emulators, but Gunnar von Boehn added his own radical ISA enhancements which are not documented (although using some of the ideas from the link above).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE class="jive_text_macro jive_macro_quote"&gt;
&lt;P&gt;Tom Evans wrote:&lt;/P&gt;
&lt;P&gt;&amp;gt; Freescale (or NXP Semiconductors now) will be licensing ARM technology.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;"Will"? They have been doing that for over 14 YEARS, starting with the i.MX1 in 2001 and then the MAC7100 in 2003 or 2004.&lt;/P&gt;
&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Sure. I didn't mean to make it sound like a new thing, but rather that they have had no problem, up to now, with paying their competitors, when Motorola was once one of the leading processor innovators. Now Motorola is a Chinese company and Freescale will be a Dutch company. It's sad to see, but they ignored their own organic technology innovations and didn't listen enough to their customers.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE class="jive_text_macro jive_macro_quote"&gt;
&lt;P&gt;Tom Evans wrote:&lt;/P&gt;
&lt;P&gt;&amp;gt; Sorry, I can't agree that the cut down ColdFire is good use of the 68k architecture.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I don't consider "dominance of the desktop, needing a huge heatsink" to be the pinnacle of success. The PPC had a nice long run in Macintoshes, but there's a lot of them running in embedded devices. There are multiple MPC55xx series for Automotive use, one of which we use. Likewise, the MCF in all its versions makes a very nice embedded controller. We've used the MCF5329 in an Automotive product, and it runs very nicely at 240MHz. Without a heatsink too.&lt;/P&gt;
&lt;/PRE&gt;&lt;P&gt;&lt;BR /&gt;My overclocked 68060@75MHz runs very cool. I doubt it would need a fan at all. The 68060 was designed to be usable in a laptop, after the 68040 was one of the hottest processors of all time. It's a shame that Apple didn't create a 68060 Mac laptop, where efficient resource use is more of an advantage.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I don't have anything against the PPC. It has some innovative features, although the general thinking at the time of ISA creation was that hardware complexity could be moved into the compiler, an idea which failed. Instruction acronyms got out of hand also. I just love working with the 68k because of how readable the code is. Of course, PPC and 68k/CF are dying and nearly done due to marketing reasons and lack of development. There really isn't much difference between PPC and ARM v8, which will likely replace PPC. I would rather stay with PPC, considering how little difference there is and the work necessary to support a new ISA. The PPC backend of vbcc is better than for any other processor. Volker Barthelmann has experience in automotive embedded development and created vbcc with that in mind. It supports MISRA C, for example. I'm not sure the ARM backend even works in comparison, despite the "embedded" focus.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Tue, 14 Apr 2015 22:36:44 GMT</pubDate>
      <guid>https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/Bcc-Instruction-Execution-Times/m-p/344489#M12256</guid>
      <dc:creator>matthey</dc:creator>
      <dc:date>2015-04-14T22:36:44Z</dc:date>
    </item>
  </channel>
</rss>

