Coldfire compatible FPGA core with ISA enhancement - Brainstorming

GunnarVB · ‎05-14-2013

Hello,

I work as a chip developer.

While creating a superscalar, Coldfire ISA-C compatible FPGA core implementation, I've noticed some possible "enhancements" to the ISA.

I would like to hear your opinion on how useful they would be.

Many thanks in advance.

1) Support for BYTE and WORD instructions.

I've noticed that re-adding support for the Byte and Word modes to the Coldfire is relatively cheap.

The cost in the FPGA of having "Byte, Word, Longword" for arithmetic and logic instructions like

ADD, SUB, CMP, OR, AND, EOR, ADDI, SUBI, CMPI, ORI, ANDI, EORI - turned out to be negligible.

Both the increase in FPGA size and the impact on the clock rate were insignificant.

2) Support for more/all EA-Modes in all instructions

In the current Coldfire ISA the instruction length is limited to 6 bytes, therefore some instructions have EA mode limitations.

E.g. the EA modes available to the immediate instructions are limited.

That instructions can currently only be 2, 4, or 6 bytes long reduces the complexity of the instruction fetch buffer logic.

The complexity of this unit increases with the number of lengths the CPU supports - therefore not supporting a range from 2 to over 20 bytes like the 68K does reduce chip complexity.

Nevertheless, my tests showed that adding support for 8-byte encodings came at relatively low cost.

With support for an 8-byte instruction length, the FPU instructions can now use all normal EA modes - which makes them a lot more versatile.

MOVE instructions become a lot more versatile, and the immediate instructions could now also operate on memory in a lot more flexible ways.

With support for 10-byte instructions there are no EA mode limitations from the user's perspective. However, in our core 10-byte support starts to impact the clock rate: with it enabled we no longer reached the 200 MHz in a Cyclone FPGA that the core reached before.
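To make the 6/8/10 byte limits concrete, here is a rough sketch that adds up instruction lengths from the opcode words and the EA extension words, using the standard 68k extension sizes. The function names and the mode labels are only illustrative, and the example instructions are chosen for illustration, not taken from the core's documentation.

```python
# Extension bytes that each effective-address (EA) mode adds after the opcode.
# The keys are illustrative labels, not assembler syntax.
EA_EXTENSION_BYTES = {
    "Dn": 0,          # data register direct
    "An": 0,          # address register direct
    "(An)": 0,        # address register indirect
    "(An)+": 0,       # postincrement
    "-(An)": 0,       # predecrement
    "(d16,An)": 2,    # 16-bit displacement
    "(d8,An,Xn)": 2,  # brief extension word
    "(xxx).W": 2,     # absolute short
    "(xxx).L": 4,     # absolute long
    "#imm.L": 4,      # 32-bit immediate
}


def instruction_length(opcode_bytes, *ea_modes):
    """Total length in bytes: opcode words plus all EA extension words."""
    return opcode_bytes + sum(EA_EXTENSION_BYTES[mode] for mode in ea_modes)


def minimum_length_limit(length):
    """Smallest of the limits discussed here (6, 8, 10 bytes) that fits."""
    for limit in (6, 8, 10):
        if length <= limit:
            return limit
    raise ValueError(f"{length} bytes exceeds the 10-byte limit")


# Integer instructions have a 2-byte opcode word; FPU instructions have
# an opcode word plus a command word (4 bytes) before any EA extension.
EXAMPLES = {
    "ADDI.L #imm,D0": instruction_length(2, "#imm.L", "Dn"),
    "ADDI.L #imm,(d16,An)": instruction_length(2, "#imm.L", "(d16,An)"),
    "MOVE.L #imm,(d16,An)": instruction_length(2, "#imm.L", "(d16,An)"),
    "FADD.D (xxx).L,FP0": instruction_length(4, "(xxx).L"),
    "MOVE.L (xxx).L,(xxx).L": instruction_length(2, "(xxx).L", "(xxx).L"),
}

for text, length in EXAMPLES.items():
    print(f"{text:26s} {length:2d} bytes, needs limit >= {minimum_length_limit(length)}")
```

In this model the first example fits the existing 6-byte limit, the next three need the 8-byte limit, and only the absolute-to-absolute MOVE needs the full 10 bytes.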

I'm interested in your opinion on the usefulness of having Byte/Word support for arithmetic and logic operations on the Coldfire.

Do you think that re-adding them would improve code density or make it easier to operate on byte data?

I would also like to know if you think that re-enabling the EA modes in all instructions would improve the versatility of the core.

Many thanks in advance.

Gunnar

TomE · ‎05-14-2013

One problem would be the compiler. You could probably use gcc in 68020 mode, but you couldn't develop software for this using any of Freescale's tools.

Any extensions beyond ColdFire or 68k would be impossible to generate code for. You wouldn't even be able to use assembly as the current assemblers would refuse to compile your code.

Nice idea, just 20 to 30 years too late.

One justification for removing the EA modes in Coldfire was that compilers seldom or never use these modes. Gcc for instance had an internal model that meant it couldn't generate these instructions. I remember reading that benchmarking showed the more complex address modes to execute SLOWER than multiple simple instructions doing the same thing. Your implementation may not have this problem, but still it would never get used.

Compilers are optimised to move data into registers, operate on it there and write it back. They probably wouldn't use the byte-and-word-to-memory addressing modes, even if they could.

Tom

GunnarVB · ‎05-15-2013

Tom Evans wrote:

Nice idea, just 20 to 30 years too late.

FPGA technology has improved a lot in recent years. :-D

20 years ago it was not possible to create a good CPU in an FPGA.

But today you can create a Coldfire FPGA core which runs at anything from 200 MHz on a low-end FPGA to 400 MHz on a high-end FPGA. Such an FPGA Coldfire provides more Mips, more Flops and more memory bandwidth than any available ASIC Coldfire core does right now.

:-D

TomE · ‎05-15-2013

GunnarVB said:

I said:

Nice idea, just 20 to 30 years too late.

20 years ago it was not possible to create a good CPU in an FPGA.

I meant that the changes to the architecture might have been nice that long ago, but the push from CISC to RISC was done for a reason. That's partly why the ARM cores are everywhere now.

more Mips and more Flops and more Memory bandwidth than any available ASIC Coldfire Core does right now

They max out at 266MHz because they're not trying to compete head-on with 1.2GHz ARM cores and 2.2GHz PPC cores.

Why would you choose a very fast ColdFire core over those other CPU architectures if you needed that power?

I don't think it would be either easy or worthwhile to try and get a compiler to use the "extended features" you describe. The compilers "know" what modes do and don't work, and would need to be rewritten or at least reconfigured to remove the built-in restrictions. That change might expose bugs in the compiler that would then need fixing. All the different ARM cores out there seem to keep causing new problems for the compiler (I've been following the arm-gnu mailing list for years). There's an advantage to an architecture that hasn't changed in decades.

Tom

GunnarVB · ‎05-15-2013

Tom Evans wrote:

I meant that the changes to the architecture might have been nice that long ago, but the push from CISC to RISC was done for a reason.

Whether RISC or CISC is the better architecture, or whether CISC or RISC offers more performance, is a discussion which does not help and goes off-topic.

One thing is clear: "CISC can offer more dense code" - and dense code is a strong point of the 68k/Coldfire architecture.

That the code density of the 68K was better than that of the Coldfire is a fact.

Re-adding the lost features of the 68K to the Coldfire to get denser code again is very simple.

Tom Evans wrote:

Why would you choose a very fast ColdFire core over those other CPU architectures if you needed that power?

Why not?

I have a Coldfire core here which runs in an FPGA at 400 MHz.

If we were to "bake" this very same core into a proper ASIC, it would easily run at over 1.6 GHz.

Why should I (or any customer) be forced to switch architecture - if the architecture we use can fulfil all performance needs?

Tom Evans wrote:

I don't think it would be either easy or worthwhile to try and get a compiler to use the "extended features" you describe.

1) Support for these features was in GCC since the beginning.

2) GCC is open source.

3) I know a couple of people who work full time as GCC maintainers.

Compiler support is the smallest problem IMHO.

Cheers

TomE · ‎05-16-2013

> Why should I (or any customer) be forced to switch architecture

I'm not familiar with the detailed inner working of the ARM architecture, but the PPC has some useful advantages over the Coldfire cores I'm familiar with. One is its ability to keep executing instructions while waiting for previous register loads to complete (scoreboarding and out-of-order execution). With the V3 Coldfire, every memory read that misses in the cache seems to stall the CPU dead. It can't do anything else until the data is available. A measure of this is that on a 240MHz MCF5329 I can write to external RAM at 207MB/s, but can only read at 87MB/s due to the CPU stalls.
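As a rough way to read those two bandwidth figures, they can be converted into CPU clock cycles per 32-bit longword transferred. This is only a back-of-the-envelope sketch: it assumes "MB" means 10^6 bytes and that the full 240MHz is the clock doing the counting, and the function name is just illustrative.

```python
def cycles_per_longword(clock_hz, bytes_per_second):
    """CPU clock cycles spent per 4-byte longword at a sustained transfer rate."""
    return clock_hz * 4 / bytes_per_second


CLOCK_HZ = 240e6  # MCF5329 core clock quoted above
MB = 1e6          # assumption: decimal megabytes

write_cycles = cycles_per_longword(CLOCK_HZ, 207 * MB)
read_cycles = cycles_per_longword(CLOCK_HZ, 87 * MB)

print(f"write: {write_cycles:.1f} cycles per longword")
print(f"read:  {read_cycles:.1f} cycles per longword")
```

Under those assumptions a longword costs about 4.6 cycles to write but about 11 cycles to read, so each read takes well over twice as long as a write.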

Is the V4 core you're working with better at this? It is documented to have some "renaming" and has limited "dual issue", but can it perform proper scoreboarding and out-of-order execution?

Of course if the application of the core is to run in an ASIC with enough zero-wait-state RAM for the required application, then this doesn't matter.

Tom

GunnarVB · ‎05-17-2013

Tom Evans wrote:

A measure of this is that on a 240MHz MCF5329 I can write to external RAM at 207MB/s, but can only read at 87MB/s due to the CPU stalls.

We measured STREAM memcopy performance on a Stratix-4 dev board one year ago.

The core reached 980 MB/sec in this benchmark, running on the DDR3 memory on the board.

We were actually disappointed by that number - the core should have reached more, but the FPGA memory interface was not perfect in this setup.

I think most important for this benchmark is the cache design.

1) The core comes with a memory stream detection engine which will automatically prefetch.

This is a big advantage. I'm not aware of any Freescale PowerPC core that can do this.

2) The cache can keep several (4) memory reads in flight in parallel.

3) The data cache of this core offers parallel read and write from the ALU, and also a parallel write from the memory prefetch engine.

Tom Evans wrote:

With the V3 Coldfire ..

Is the V4 core you're working with better at this? It is documented to have some "renaming" and has limited "dual issue", but can it perform proper scoreboarding and out-of-order execution?

Yes the Coldfire V3 is "erm" not that impressive.

The core I'm working with is not a V4 - it is a new core not based on any of the previous Freescale cores.

In a nutshell the features look somewhat like this:

- Up to 2 instructions per cycle.

Of these 2 instructions only 1 can do a memory access.

The other instruction needs to be a RISC-like register operation.

For example, these two instructions could be executed together in a single cycle:

ADDq.L #8,D1

AND.L D2,(d16,A1)

The reason that only 1 instruction can do memory access is that only one MMU address translation is done per cycle.

- The ALU can do 1 memory read and 1 memory write per cycle.

This means

NEG.L (A1)+

executes in a single cycle.

- Very important for real-life performance is that the core does automatic instruction cache prefetching and data cache prefetching.

This means you do not need to write special code for this.

Normal code like

.loop

move.l (A0)+,(A1)+

subq.L #1,D0

bne .loop

will already run very well.

- The core provides superscalar execution - but it does NOT reorder normal ALU instructions.

In this regard the core is very similar to the 68060, the Coldfire V4 and the Coldfire V5.

Superscalar performance improves when the compiler switch is set accordingly.

This means compiler-generated code for these 3 CPUs will run very well on the core.

- Memory miss behavior:

The core does limited scoreboarding and it can actually do a cache read under a miss.

This means if a number of conditions are fulfilled (they normally are) then the core will not stall on a cache miss.
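The dual-issue rule from the first point of the list (2 instructions per cycle, but only 1 of them may access memory) can be sketched as a small in-order pairing model. This models only that one rule. It ignores data dependencies, branches and any other pairing conditions the core may have, and the names `Instr`, `can_dual_issue` and `count_issue_cycles` are made up for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    text: str
    accesses_memory: bool  # True if any operand is a memory EA


def can_dual_issue(first, second):
    """Memory rule only: at most one instruction of the pair may access memory."""
    return not (first.accesses_memory and second.accesses_memory)


def count_issue_cycles(program):
    """Greedy in-order pairing of adjacent instructions, no reordering."""
    cycles = 0
    i = 0
    while i < len(program):
        if i + 1 < len(program) and can_dual_issue(program[i], program[i + 1]):
            i += 2  # both instructions issue in the same cycle
        else:
            i += 1  # the instruction issues alone
        cycles += 1
    return cycles


addq = Instr("ADDQ.L #8,D1", accesses_memory=False)
and_mem = Instr("AND.L D2,(d16,A1)", accesses_memory=True)
neg_mem = Instr("NEG.L (A1)+", accesses_memory=True)

print(count_issue_cycles([addq, and_mem]))     # register op + memory op
print(count_issue_cycles([and_mem, neg_mem]))  # two memory ops
```

In this model the ADDQ/AND pair from the example issues in one cycle, while two memory instructions in a row take one cycle each.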

At the end of the day what makes a good core is a good combination.

I think cache features like prefetching, write combining, parallel read and write, and handling of misaligned accesses are what count. Branch prediction is also really important.

I recall a user compared the Coldfire core with a PowerPC 440 using the Sieve benchmark.

The Coldfire beat the PowerPC hands down - which showed that the feature mix is good.

Cheers

GunnarVB · ‎05-15-2013

Hello Tom,

> One justification for removing the EA modes in Coldfire was that compilers seldom or never use these modes.

I agree that, for example, memory indirect EA modes are seldom used, and as they can easily be replaced by 2 instructions they are not really needed.

Also, all the modes that cannot use the brief extension word but need the full extension word have a painful encoding.

Removing these modes makes decoding several instructions per cycle, as needed for a superscalar design, a lot easier.

Therefore I agree that removing all modes requiring the full extension word format was a good choice on the Coldfire.

What I was trying to say is that the normal EA-Modes which the compiler can of course use

like:

OPP (d16,Sp)

or

OPP $12345678

These modes are all supported by Coldfire but not equally available to all instructions.

The reason for this is just the Coldfire's instruction length limit of 6 bytes.

Compiler support for these modes is all there - so compilers would be able to use these modes.

If you lift the 6-byte limit then you can enable instructions like this for free:

ADDI.L #1235,(-48,A7)

This makes working on stack variables a lot easier, and should improve code density.
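A quick size comparison for that example, as a sketch. It assumes that with the 6-byte limit ADDI.L can only target a data register, so the constant first has to go through a scratch register. The two-instruction replacement shown is just one possible sequence, and D0 is assumed to be free.

```python
# Each entry is (instruction text, size in bytes):
# 2-byte opcode word + 4-byte immediate and/or 2-byte d16 extension.
EXTENDED = [
    ("ADDI.L #1235,(-48,A7)", 2 + 4 + 2),
]
WITH_6_BYTE_LIMIT = [
    ("MOVE.L #1235,D0", 2 + 4),     # load the constant into a scratch register
    ("ADD.L D0,(-48,A7)", 2 + 2),   # add it to the stack variable
]


def code_size(sequence):
    """Total size in bytes of a list of (text, size) instructions."""
    return sum(size for _, size in sequence)


print("extended encoding:", code_size(EXTENDED), "bytes")
print("6-byte limit:     ", code_size(WITH_6_BYTE_LIMIT), "bytes")
```

Under these assumptions the single 8-byte instruction replaces 10 bytes of code and does not need a scratch register.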
