Tom Evans wrote:
A measure of this is that on a 240MHz MCF5329 I can write to external RAM at 207MB/s, but can only read at 87MB/s due to the CPU stalls.
We measured STREAM memcopy performance on a Stratix-4 dev board a year ago.
The core reached 980 MB/sec in this benchmark, running from the DDR3 memory on the board.
We were actually disappointed by that number: the core should have reached more, but the FPGA memory interface was not perfect in this setup.
I think most important for this benchmark is the cache design.
1) The core comes with a memory stream detection engine which will automatically prefetch.
This is a big advantage. I'm not aware of any Freescale PowerPC that can do this.
2) The cache can keep several (4) memory reads in flight in parallel.
3) The data cache of this core offers parallel read and write from the ALU,
and also a parallel write from the memory prefetch engine.
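For anyone who wants to produce a comparable number on their own board, a copy-bandwidth measurement can be sketched in C roughly as below. This is not the STREAM source and not the code behind the 980 MB/sec figure. The helper names `mb_per_second` and `measure_copy_rate` are made up for illustration, and the sketch assumes that counting each copied byte once and timing with the standard `clock()` function are good enough for a rough figure.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Convert a byte count and an elapsed time into MB/s (1 MB = 10^6 bytes).
   Returns 0.0 if the elapsed time is too short to measure. */
static double mb_per_second(double bytes, double seconds)
{
    return seconds > 0.0 ? bytes / seconds / 1e6 : 0.0;
}

/* Hypothetical helper: copy `size` bytes from src to dst `repeats` times
   and return the observed copy rate in MB/s.
   Assumption: each copied byte is counted once (payload bytes only). */
static double measure_copy_rate(void *dst, const void *src,
                                size_t size, int repeats)
{
    clock_t start = clock();
    for (int r = 0; r < repeats; r++)
        memcpy(dst, src, size);
    clock_t stop = clock();

    double seconds = (double)(stop - start) / CLOCKS_PER_SEC;
    return mb_per_second((double)size * repeats, seconds);
}
```

To measure memory rather than cache, the two buffers should be several times larger than the data cache, and the repeat count high enough that the run lasts well above the resolution of `clock()`.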
Tom Evans wrote:
With the V3 Coldfire ..
Is the V4 core you're working with better at this? It is documented to have some "renaming" and has limited "dual issue", but can it perform proper scoreboarding and out-of-order execution?
Yes, the Coldfire V3 is, erm, not that impressive.
The core I'm working with is not a V4. It is a new core, not based on any of the previous Freescale cores.
In a nutshell the features look somewhat like this:
- Up to 2 instructions per cycle.
Of these 2 instructions, only 1 can do a memory access.
The other instruction needs to be a RISC-like register operation.
For example, these two instructions could be executed together in a single cycle:
ADDq.L #8,D1
AND.L D2,(d16,A1)
The reason that only 1 instruction can do a memory access is that only one MMU address translation is done per cycle.
- The ALU can do 1 memory read and 1 memory write per cycle.
This means that
NEG.L (A1)+
executes in a single cycle.
- Very important for real-life performance is that the core does automatic
instruction cache prefetching and data cache prefetching.
This means you do not need to write special code for this.
Normal code like
.loop
move.l (A0)+,(A1)+
subq.L #1,D0
bne .loop
will already run very well.
- The core provides superscalar execution, but it does NOT reorder normal ALU instructions.
In this regard the core is very similar to the 68060, the Coldfire V4 and the Coldfire V5.
Superscalar performance improves when the compiler switch is set accordingly.
This means compiler-generated code for these 3 CPUs will run very well on the core.
- Memory miss behavior:
The core does limited scoreboarding and can actually do a cache read under a miss.
This means that if a number of conditions are fulfilled (they normally are), the code will not stall on a cache miss.
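For readers more at home in C than in 68k assembly, the longword copy loop from the prefetching bullet above corresponds roughly to the sketch below. The function name `copy_longwords` is made up for illustration, and whether a given compiler really emits that exact three-instruction loop for it is an assumption. Like the assembly, the body runs before the count is tested, so the count must be at least 1.

```c
#include <stdint.h>

/* Hypothetical C counterpart of the three-instruction copy loop.
   Precondition: count >= 1, because the body runs before the test,
   exactly as in the assembly version. */
static void copy_longwords(uint32_t *dst, const uint32_t *src, uint32_t count)
{
    do {
        *dst++ = *src++;      /* move.l (A0)+,(A1)+ : A0 = src, A1 = dst */
        count--;              /* subq.l #1,D0 */
    } while (count != 0);     /* bne .loop */
}
```

The point of the bullet is that nothing more is needed: there are no prefetch hints or manual unrolling in this loop, and the hardware stream detection is what keeps it fed.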
At the end of the day, what makes a good core is a good combination of features.
I think cache features such as prefetching, write combining, parallel read and write, and handling of misaligned accesses
are what count. Branch prediction is also really important.
I recall a user comparing the Coldfire core with a PowerPC 440 using the Sieve benchmark.
The Coldfire beat the PowerPC hands down, which showed that the feature mix is good.
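The exact Sieve program used in that comparison is not known here. For reference, a typical Sieve of Eratosthenes kernel looks something like the sketch below. The function name `count_primes` and the byte-per-number flag array are illustrative choices, not details of that user's benchmark.

```c
#include <stddef.h>
#include <stdlib.h>

/* Count the primes in [2, limit] with a Sieve of Eratosthenes.
   Returns -1 if the flag array cannot be allocated. */
static long count_primes(size_t limit)
{
    if (limit < 2)
        return 0;

    /* One flag byte per number; 0 means "not yet marked composite". */
    unsigned char *composite = calloc(limit + 1, 1);
    if (composite == NULL)
        return -1;

    long count = 0;
    for (size_t i = 2; i <= limit; i++) {
        if (composite[i])
            continue;
        count++;
        /* Mark multiples starting at i*i; skip when i*i would exceed limit
           (this test also avoids overflow in the multiplication). */
        if (i <= limit / i) {
            for (size_t j = i * i; j <= limit; j += i)
                composite[j] = 1;
        }
    }

    free(composite);
    return count;
}
```

A kernel like this is dominated by short loops, conditional branches and strided byte writes, which is why branch prediction and the cache features listed above show up so clearly in it.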
Cheers