Hello all.
I'm working on a rather software-intensive application (video encoder) at the moment, and am trying everything possible to optimise my code. I have discovered that the on-board 32kB SRAM does not seem to be as fast as it should be:
After writing some simple data-copying benchmarks, I have the following results:
DDR to DDR copy = 40.8 MB/s (copyback mode)
DDR to DDR copy = 79.2 MB/s (write-through mode)
SRAM to SRAM copy = 129.6 MB/s
Cache to cache copy = 399.2 MB/s
(Platform is MCF54450 at 240MHz with mobile-DDR external memory. Copies run with simple MOVEM loops - not DMA)
Now, the datasheet says the SRAM is single-cycle and on the processor's local high-speed bus. Should this not therefore give a bandwidth of 960MB/s, or copy performance of 480MB/s for 32-bit wide memory? I'm getting this sort of speed from the cache, but the SRAM seems far too slow. If the SRAM is this slow, what's the point of it? Code or data structures will always be faster in cached external memory.
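The expectation above can be written out as a quick calculation. This is only a sketch of the arithmetic, assuming one 32-bit access per core clock and a copy that needs one read plus one write per word; the function names are made up for illustration and do not come from any datasheet.

```python
def peak_bandwidth_mb_s(clock_mhz, bus_bytes, cycles_per_access=1):
    """Peak bandwidth in MB/s (1 MB = 10**6 bytes) for a memory that
    completes one bus-wide access every `cycles_per_access` clocks."""
    return clock_mhz * bus_bytes / cycles_per_access


def peak_copy_rate_mb_s(clock_mhz, bus_bytes, cycles_per_access=1):
    """A copy moves every byte twice (one read, one write), so the best
    possible copy rate is half the peak bandwidth."""
    return peak_bandwidth_mb_s(clock_mhz, bus_bytes, cycles_per_access) / 2


# 240 MHz core, 32-bit (4-byte) single-cycle SRAM, as described in the post
sram_bandwidth = peak_bandwidth_mb_s(240, 4)   # 960.0
sram_copy_rate = peak_copy_rate_mb_s(240, 4)   # 480.0
print(sram_bandwidth, sram_copy_rate)
```

Against that ceiling, the measured 129.6 MB/s SRAM copy is roughly a quarter of the ideal, while the 399.2 MB/s cache copy is much closer to it.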
I have played about with the RAMBAR settings, and am accessing the SRAM properly - not through the backdoor!
Thanks for any advice!
Steve.
PS. If anyone's interested in my benchmarking routines, let me know and I'll post them.
Me again. I suggested checking the RAMBAR settings in your other post on optimisation before reading this one.
On memory speed: I'm using a 240MHz MCF5329. It is documented (but not well) as having a maximum memory bandwidth of 128MB/s with an 80MHz SDRAM clock. I can't get anything (even DMA) running at that speed.
That throughput assumes burst reads or writes to and from SDR or DDR. The memory controller is limited to taking 10 clocks to issue Precharge, Bank Select and then four Reads (four 32-bit reads for SDR, eight double-clocked 16-bit reads at exactly the same data rate for DDR). However, I'm measuring 11 clocks instead of 10, and neither the CPU nor the DMA can keep up with it.
The SDRAM controller will NOT hold banks open, and insists on taking 6 clocks on precharge and bank-select, even between bursts to successive memory addresses. It looks like the V4 one does the same.
https://community.freescale.com/thread/59655
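The 128MB/s figure follows directly from the clock counts quoted above. Here is a minimal sketch of that arithmetic, assuming a 16-byte burst and the 80MHz SDRAM clock mentioned in the post; the helper name is hypothetical.

```python
def burst_bandwidth_mb_s(mem_clock_mhz, bytes_per_burst, clocks_per_burst):
    """Sustained bandwidth in MB/s (1 MB = 10**6 bytes) when every burst
    of `bytes_per_burst` bytes occupies `clocks_per_burst` memory clocks."""
    return mem_clock_mhz * bytes_per_burst / clocks_per_burst


# 6 clocks of precharge and bank select plus 4 data clocks = 10 clocks
documented = burst_bandwidth_mb_s(80, 16, 10)   # 128.0
# With the 11 clocks actually measured, the ceiling is lower
measured = burst_bandwidth_mb_s(80, 16, 11)     # about 116.4
print(documented, round(measured, 1))
```

So the extra clock per burst alone costs roughly 9% of the documented maximum, before any CPU or DMA overhead is counted.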
The only instructions in the V3 Coldfire that can WRITE to memory in a four-clock burst are the MOVEM.L ones. Normal Moves seem to write one word at a time, non-bursting. I see you're using these already.
https://community.freescale.com/message/60698#60698
These are copy speeds; double them to get bandwidth figures.
Function        Four runs in microseconds   Average  Stdev  MB/s   Theory
memcpy          4369  4310  4312  4308      4324.75  29.55  38.85  60.70%
memcpy_gcc_2    4399  4400  4399  4399      4399.25  0.5    38.19  59.67%
memcpy_gcc_4_3  4439  4439  4438  4439      4438.75  0.5    37.85  59.14%
memcpy_gcc_4_4  4833  4831  4832  4832      4832     0.82   34.77  54.33%
memcpy_moveml   3625  3625  3625  3625      3625     0      46.34  72.41%
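The "Theory" column appears to be each measured copy rate as a percentage of 64MB/s, that is, half of the 128MB/s bandwidth quoted earlier. That reading is inferred from the numbers rather than stated in the post, so treat the sketch below as a consistency check under that assumption.

```python
# Assumption: theoretical copy rate = half of the 128 MB/s peak bandwidth,
# because a copy needs one read and one write for every byte moved.
THEORY_COPY_MB_S = 128 / 2

# Measured copy rates (MB/s) taken from the table above
measured_mb_s = {
    "memcpy": 38.85,
    "memcpy_gcc_2": 38.19,
    "memcpy_gcc_4_3": 37.85,
    "memcpy_gcc_4_4": 34.77,
    "memcpy_moveml": 46.34,
}

percent_of_theory = {
    name: round(100 * rate / THEORY_COPY_MB_S, 2)
    for name, rate in measured_mb_s.items()
}

for name, pct in percent_of_theory.items():
    print(f"{name:15s} {pct:6.2f}%")
```

The percentages this produces match the table's last column, which supports the 64MB/s interpretation.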
You're getting 79MB/s. If that's the copy rate then you're getting double what I'm getting. If you're quoting "bandwidth" then you're slower than I am with movem.l. I think you're running the memory at 120MHz so you should be 1.5 times faster than the V3 core at 80MHz.
Make sure your source and destination buffers are 16-byte aligned.
The problem with the "copyback" is that it has to READ the destination cache line before you then overwrite ALL of it, so the copy runs at 1/3 of the bandwidth (read, read, write). A Writethrough runs at 1/2 (read, write). What would be nice to do in your copy if you want to use copyback would be to invalidate the cache line before writing to it. The PPC core has instructions for this, but the Coldfire doesn't.
I don't know why your copyback is running at 1/2 of your writethrough speed though. Is this as you expect?
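To put numbers on the 1/3 versus 1/2 argument, here is a rough model. It assumes the only cost is the number of cache-line transfers on the memory bus per line copied, which ignores bank conflicts and any overlap between transfers; the names are made up for illustration.

```python
def copy_rate(bus_bandwidth, transfers_per_line):
    """Copy rate when each copied cache line costs `transfers_per_line`
    line-sized transfers on a bus with the given bandwidth."""
    return bus_bandwidth / transfers_per_line


WRITE_THROUGH = 2  # read source line, write destination line
COPYBACK = 3       # read source line, read destination line, write it back

# The ratio is independent of the actual bus bandwidth, so use 1.0
predicted_ratio = copy_rate(1.0, COPYBACK) / copy_rate(1.0, WRITE_THROUGH)

# Figures from the first post: 40.8 MB/s copyback, 79.2 MB/s write-through
observed_ratio = 40.8 / 79.2

print(round(predicted_ratio, 3), round(observed_ratio, 3))
```

The model predicts copyback at about 0.667 of the write-through rate, while the measured ratio is about 0.515, so the extra destination read does not account for the whole gap on its own.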
Hello TomE,
The figure of 79 MB/s is a copy speed. If I benchmark read, write, and copy separately I get these figures:
Read = 136.7 MB/s
Write = 254.5 MB/s
Copy = 79.3 MB/s
This is with mobile-DDR clocked at 120MHz (16-bit wide).
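As a cross-check of these three figures, one simple model is that a copy just alternates reads and writes with no overlap, so the time per byte is the read time plus the write time. That is an idealisation (it ignores bus turnaround and bank effects), and the function name is hypothetical.

```python
def serial_copy_rate(read_mb_s, write_mb_s):
    """Copy rate if each byte is read and then written with no overlap:
    time per byte = 1/read + 1/write."""
    return 1.0 / (1.0 / read_mb_s + 1.0 / write_mb_s)


read_rate = 136.7    # MB/s, measured
write_rate = 254.5   # MB/s, measured
measured_copy = 79.3 # MB/s, measured

expected_copy = serial_copy_rate(read_rate, write_rate)
print(round(expected_copy, 1), measured_copy)
```

Under that assumption the copy would run at about 88.9 MB/s, so the measured 79.3 MB/s is within roughly 11% of what the separate read and write rates suggest.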
Based on your figure of 10 clocks minimum for a 16-byte transfer, I make the maximum to be 192 MB/s. I can't find any timing diagrams in the datasheet, but it has to be less than 10 clocks to be able to get a write speed of 254 MB/s, in fact, no more than 7.5 clocks.
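The arithmetic behind those two numbers, as a small sketch. It assumes a 16-byte burst and the 120MHz memory clock stated above; the helper names are illustrative only.

```python
def burst_bandwidth_mb_s(mem_clock_mhz, bytes_per_burst, clocks_per_burst):
    """Bandwidth in MB/s if each burst takes `clocks_per_burst` clocks."""
    return mem_clock_mhz * bytes_per_burst / clocks_per_burst


def clocks_per_burst_for(mem_clock_mhz, bytes_per_burst, rate_mb_s):
    """Inverse of the above: clocks per burst implied by a measured rate."""
    return mem_clock_mhz * bytes_per_burst / rate_mb_s


max_at_10_clocks = burst_bandwidth_mb_s(120, 16, 10)     # 192.0
implied_clocks = clocks_per_burst_for(120, 16, 254.5)    # about 7.54
print(max_at_10_clocks, round(implied_clocks, 2))
```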
I'm starting to doubt my benchmark is correct now, although I can't see anything wrong. The code is thus:
move.l #65536,d1 ;number of 16-byte bursts to do (= 1MB)
moveq.l #64,d0 ;constant to add to address
movea.l #SDRAM_SCRATCH,a0 ;somewhere to write to
move.w #$2700,SR ;no interruptions please
move.l DTCN0,d7 ;start the clock
loop
subq.l #4,d1 ;dec loop counter (4 bursts) & update CCR
movem.l d2-d5,(a0) ;burst write
movem.l d2-d5,16(a0)
movem.l d2-d5,32(a0)
movem.l d2-d5,48(a0)
adda.l d0,a0 ;advance address
bgt.s loop ;CCR from subq earlier
move.l DTCN0,d0 ;stop the clock
move.w #$2000,SR
sub.l d7,d0 ;subtract start clock, d0 = microseconds to write 1MB
move.l #1000000000,d1 ;(one billion)
divu.l d0,d1 ;d1 is kB/s (or MB/s to 3 d.p.)
DTCN0 is free-running at 1MHz. I know this is right as it's used by the OS for generating delays. As you can see, I am just writing whatever garbage is in d2-d5, but that's not the point - it's a memory speed test.
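The final conversion in the routine can be mirrored in a few lines to see what unit it really yields. This is a sketch under the assumptions stated in the listing (a 1MHz timer and 65536 bursts of 16 bytes); the elapsed time used below is a hypothetical timer reading, not a measurement from the thread.

```python
BYTES_WRITTEN = 65536 * 16  # 1,048,576 bytes, as set up in the loop counter


def rate_as_computed(elapsed_us):
    """Mirror of the divu.l step: 1,000,000,000 / elapsed microseconds,
    which the listing labels as kB/s."""
    return 1_000_000_000 // elapsed_us


def rate_decimal_mb_s(elapsed_us):
    """Bytes per microsecond, which equals MB/s with 1 MB = 10**6 bytes."""
    return BYTES_WRITTEN / elapsed_us


elapsed_us = 3929  # hypothetical reading, chosen to land near 254.5
as_computed = rate_as_computed(elapsed_us)     # 254517
decimal_rate = rate_decimal_mb_s(elapsed_us)   # about 266.88
print(as_computed / 1000, round(decimal_rate, 2))
```

Because the constant treats 1MB as 10^6 bytes while the loop writes 2^20 bytes, the result is effectively in binary megabytes per second, about 4.9% lower than the decimal figure. That is worth keeping in mind when comparing against datasheet numbers, which may use either convention.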
Maybe I should write a DMA speed test - presumably that would give me the absolute memory speed as there'll be no instructions to slow it down.
As for the copyback, 1/2 the speed sounds right to me for a straight copy routine, as the CPU stalls while the cache writes the old line out before it can accept a new line. In write-through mode the write to DDR and the write to cache are done at the same time. I couldn't say whether it should be 1/2 the speed, or 1/3 of the speed, but I'm not surprised by the figures I'm getting.
Steve.
> Based on your figure of 10 clocks minimum for a 16-byte transfer
That's for the V3 SDRAM controller. The V4 one looks to be better. It has:
- Supports page mode for decreased latency and higher bandwidth; remembers one active row for each bank; four independent active rows per each chip select
If you can get your read and write addresses in different BANKS it should go faster. The worst thing would be alternating short (non-cache-line or single-cache-line length) reads and writes in the same bank. Maybe your writeback memory test is doing that (flush a line, read a line) where your writethrough test can get longer bursts.
> Maybe I should write a DMA speed test - presumably that would give me
> the absolute memory speed as there'll be no instructions to slow it down.
I found the V3 DMA to be as slow as the CPU. Monitoring the memory bus with a CRO showed that after the burst read and write the DMA controller just stalled. My guess is that it isn't pipelined and took multiple clocks to do something, presumably to add the offsets to the source and destination addresses and to decrement and test the counter.
Good point about putting the read and write in different banks:
Copy (same bank) = 79.3 MB/s
Copy (different banks) = 89.5 MB/s
I tried the DMA copy too, with disappointing results:
Copy (DMA, different banks) = 62.8 MB/s
Steve.
I've now found the solution to this one:
First, I must thank TomE for prompting me to double-check the RAMBAR to make sure I wasn't accessing through the backdoor. I tried to get a backdoor access benchmark for comparison, and it came out the same speed. Long story short, my assembler is writing $C04 instead of $C05 for the RAMBAR, so after looking up the MOVEC opcode, I now have reasonable SRAM performance:
SRAM to SRAM copy (backdoor) = 129.6 MB/s
SRAM to SRAM copy (direct) = 351.6 MB/s
Cache to cache copy = 399.2 MB/s
So, still not quite as fast as the cache, but I now see no performance difference in my application code whether I use SRAM or cached DDR for the data.
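For anyone hitting the same assembler problem, one way to check what is being emitted is to work out the MOVEC encoding by hand. The sketch below assumes the usual ColdFire layout for MOVEC Ry,Rc (opcode word $4E7B followed by an extension word holding the A/D bit, the register number and a 12-bit control register code); verify this against the programmer's reference manual before relying on it. The function name is made up.

```python
MOVEC_TO_CONTROL = 0x4E7B  # assumed opcode word for MOVEC Ry,Rc


def encode_movec(ctrl_reg, reg_num, is_address_reg=False):
    """Return (opcode word, extension word) for MOVEC Ry,Rc.

    Assumed extension word layout:
      bit 15     : 0 = data register, 1 = address register
      bits 14-12 : register number (0-7)
      bits 11-0  : control register code (e.g. 0xC05)
    """
    if not 0 <= ctrl_reg <= 0xFFF:
        raise ValueError("control register code must fit in 12 bits")
    if not 0 <= reg_num <= 7:
        raise ValueError("register number must be 0-7")
    ext = (int(is_address_reg) << 15) | (reg_num << 12) | ctrl_reg
    return MOVEC_TO_CONTROL, ext


wrong = encode_movec(0xC04, 0)  # what the assembler was producing
right = encode_movec(0xC05, 0)  # the RAMBAR code given in this thread

print(" ".join(f"${w:04X}" for w in wrong))
print(" ".join(f"${w:04X}" for w in right))
```

The two encodings differ only in the lowest bit of the extension word, which makes the mistake easy to miss in a listing or hex dump.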
Steve.
