240MHz MCF5329 on our own board, with 32-bit SDRAM and FLASH on the FlexBus. The LCD is in use in the product, but disabled for these tests.
I'm testing memory copy and write speed and maxing out at 80MB/s. I'd hope for higher with 80MHz, 32-bit-wide SDRAM.
The clocks are running at the right speed (240MHz/80MHz).
The Cache is on, in Writethrough mode (Writeback is a lot slower), and the cache write queue is on (that also makes a big difference). The Crossbar is set up to allow bursting. I've got interrupts and all DMA disabled during these tests. Code is running from SDRAM, but running it from SRAM only makes a tiny (1%) difference.
I can't find any information in the data sheet or user manual on what the expected SDRAM bandwidth should be. There are simple diagrams in the hardware manual, but they only show a single RAS/CAS cycle and don't detail what back-to-back cycles with the SDRAM pages open should look like.
"AN3606 Understanding LCD Memory and Bus Bandwidth Requirements" gives the bandwidth as 128MB/s, without any derivation. That corresponds to 8 million four-word (16-byte) burst transfers per second, or 10 clocks per burst at 80MHz. That seems like excessive overhead, but it might be right.
I'm measuring 80MB/s for a sustained write to SDRAM (with the library memset), which works out to 16 clocks per burst - six clocks more than AN3606 implies.
Copying 192k blocks of data from SDRAM to SDRAM with a good library memcpy() gives about half that (as expected, since every byte crosses the bus twice).
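For reference, the test harness is essentially the following (a minimal sketch; read_us() stands in for whatever free-running microsecond timer your board provides - it's an assumption here, not a chip API):

    #include <string.h>
    #include <stdint.h>

    #define BUF_BYTES (192u * 1024u)          /* 192k blocks, as above */

    static uint8_t src_buf[BUF_BYTES];
    static uint8_t dst_buf[BUF_BYTES];

    extern uint32_t read_us(void);            /* hypothetical: free-running microsecond timer */

    void bench(void)
    {
        uint32_t t0, us, mb_s, clocks;

        t0 = read_us();
        memset(dst_buf, 0, BUF_BYTES);        /* sustained write test */
        us = read_us() - t0;
        mb_s = BUF_BYTES / us;                /* bytes per us is (decimal) MB/s */

        /* At 80MHz with 16-byte line bursts:
           clocks/burst = 80MHz / (MB/s / 16 bytes) = 1280 / MB/s,
           so 128MB/s is 10 clocks and 80MB/s is 16 clocks. */
        clocks = 1280u / mb_s;
        (void)clocks;

        t0 = read_us();
        memcpy(dst_buf, src_buf, BUF_BYTES);  /* SDRAM-to-SDRAM copy */
        us = read_us() - t0;
        /* the copy moves 2 x BUF_BYTES over the bus, so expect about half the write rate */
        (void)us;
    }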
Turning on the LCD DMA at 11.5MB/s drops the CPU performance by 15%. Scaling that up, 100% on that measure is 11.5MB/s / 0.15, or about 77MB/s, matching the figures above.
At the moment it looks like the SDRAM timing might be suboptimal - I'm going to check that again.
Would the FLASH on the FlexBus be slowing the SDRAM down? The code isn't executing from there (all code is copied from FLASH to SDRAM and run from there).
Can anyone suggest anything else we might have missed?
Thanks.
I wrote:
> I'm measuring 80MB/s for a sustained write to SDRAM...
>
> Copying ... SDRAM to SDRAM with a good library memcpy() gives
> about half that (as expected).
Which is less than the memory bandwidth implied in the App Note.
More testing indicates that SDRAM on this chip at 80MHz reads or writes 16 bytes in 10 clocks. That's a raw bandwidth of 128MB/s. It looks like that speed is only available to DMA and not to the CPU.
With the CPU moving the data, it stalls waiting for cache line reads or flushes. As it isn't deeply pipelined (like expensive Pentium chips are), it has nothing else to do during these stalls. As a result, the figures above seem to be the maximum sustained memory bandwidth available to the CPU.
More info. I've now got the LCD controller generating an almost-standard VGA signal: 16 bits per pixel at a 26.666MHz pixel clock and 60Hz frame rate. And a WVGA version of the same as well.
DMA Parking on the CPU or the last master doesn't make any difference.
Changing the LCD DMA Control Register (LDCR) from "dynamic" to "fixed burst of 20 words, low limit of 12" (it underruns at 8) appears to make it more efficient, as the bursts are now always a multiple of 16 bytes. "Dynamic" gives bursts that aren't a multiple of 16.
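That change is a single register write. A sketch of what I mean, with the BURST/HM/TM field positions taken from my reading of the LCDC chapter - verify them (and the register address) against the reference manual before use:

    #include <stdint.h>

    /* Assumed field layout: BURST at bit 31 (fixed-length bursts),
       high mark (burst length in words) at bits 20:16, trigger/low
       mark at bits 4:0. Check these against the MCF5329 manual. */
    #define LDCR_BURST   (1ul << 31)
    #define LDCR_HM(n)   ((uint32_t)(n) << 16)
    #define LDCR_TM(n)   ((uint32_t)(n) << 0)

    extern volatile uint32_t LDCR;   /* map to the LDCR address from the memory map */

    void lcd_dma_fixed_burst(void)
    {
        /* Fixed burst of 20 words, low limit of 12 (8 underruns here) */
        LDCR = LDCR_BURST | LDCR_HM(20) | LDCR_TM(12);
    }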
The CRO shows the LCD DMA controller performing five reads of four words each, spaced 1500ns apart, but taking 11 clocks per read instead of the 10 implied by the LCD bandwidth App Note (AN3606). So the peak bandwidth used is 45.8% instead of the App Note's 41.7%.
According to the App Note and my calculations, VGA should average about 30% of the SDRAM bandwidth. The peak should be (80MHz/3 pixel clock x 2 bytes/pixel) / 128MB/s = 41.7%, which would leave about 60% for the CPU at peak; the average needs adjusting by the horizontal and vertical active/blanking ratios. Measurements (execution time of memory-access-limited code) give something a little less than that.
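Spelling out that arithmetic (nothing chip-specific, just the figures above):

    #include <stdio.h>

    int main(void)
    {
        const double pix_hz  = 80e6 / 3.0;          /* 26.666MHz pixel clock */
        const double lcd_bps = pix_hz * 2.0;        /* 16bpp: ~53.3MB/s during active video */
        const double mem_bps = 80e6 / 10.0 * 16.0;  /* 128MB/s at 10 clocks per 16-byte burst */

        printf("peak at 10-clock bursts: %.1f%%\n", 100.0 * lcd_bps / mem_bps);              /* 41.7 */
        printf("peak at 11-clock bursts: %.1f%%\n", 100.0 * lcd_bps / mem_bps * 11.0/10.0);  /* 45.8 */
        return 0;
    }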
Monitoring the memory bus shows that the recommended "dynamic" setting of the LCD DMA Control Register (LDCR) seems to cause partial memory transfers. A fixed transfer burst of 20 words with a low limit of 12 generates the expected five bursts of four-word memory reads, spaced at the expected 1500ns. If the 16-byte burst accesses were taking 10 clocks as the App Note implies (giving 128MB/s as the maximum), the LCD would be using 41.67% of the bandwidth, but the bursts are 11 clocks long, not 10, so it is taking 45.8%.
When the CPU is memory bound (most of the time) this slows it down by the expected amount. When it is compute-bound it doesn't slow as much.
The CPU can't copy memory all that quickly - even with a movem.l-based copy it only gets 76% of the theoretical throughput. Has anyone tested a DMA-controller-based memory copy to see how fast it can go?
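For reference, the movem.l inner loop is essentially the following (a sketch in GCC inline-asm form - ColdFire's movem.l only supports plain (Ax)/(d16,Ax) addressing, hence the explicit pointer bumps):

    #include <stddef.h>
    #include <stdint.h>

    /* Copy len bytes, len a multiple of 32, both pointers 4-byte
       aligned or better, moving 32 bytes (8 longwords) per iteration. */
    static void movem_copy(uint32_t *dst, const uint32_t *src, size_t len)
    {
        size_t blocks = len / 32;

        while (blocks--) {
            __asm__ volatile (
                "movem.l (%0),%%d2-%%d7/%%a2-%%a3\n\t"  /* read 8 longwords */
                "movem.l %%d2-%%d7/%%a2-%%a3,(%1)"      /* write 8 longwords */
                :
                : "a" (src), "a" (dst)
                : "d2", "d3", "d4", "d5", "d6", "d7", "a2", "a3", "memory");
            src += 8;
            dst += 8;
        }
    }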
TomE wrote:
> Has anyone tested a DMA-controller-based memory copy to see how fast it can go?
I have now. Read the following post for results:
https://community.freescale.com/message/60840#60840
I haven't run any tests to see what happens when the CPU tries to do something useful while the DMA is transferring. With the wrong XBS programming, the two together may take longer "in parallel" than performing the operations in series, as the CPU might be forced into serial operation anyway.
I've also checked the memory alignments for these copies and writes. They're all 4-byte aligned or better (16-byte cache-line size). On these CPUs alignment matters.
I've tried offsetting the alignments between source and destination (4, 8, 12, 16 ...) in case it helps with the phasing of cache reads and writes; an example sweep is sketched below. The differences are measurable, but tiny.
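The sweep itself is nothing special (a sketch; time_copy() is assumed to be whatever timed-copy harness is already in use):

    #include <stdint.h>

    extern uint32_t time_copy(void *dst, const void *src, uint32_t len); /* hypothetical: returns us */

    static uint8_t src_buf[192u * 1024u + 32u];
    static uint8_t dst_buf[192u * 1024u + 32u];

    void sweep_offsets(void)
    {
        uint32_t off;

        /* Offset the destination 0, 4, 8 ... 28 bytes relative to the
           source to vary the phasing of cache-line reads and writes. */
        for (off = 0; off <= 28; off += 4) {
            uint32_t us = time_copy(dst_buf + off, src_buf, 192u * 1024u);
            (void)us;   /* log and compare - differences were measurable but tiny */
        }
    }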
The memcpy() inner loop is using unrolled long move instructions (move.l (a0)+, (a1)+) that should execute in three clocks, so it shouldn't be CPU bound.
There are minor differences for different source addresses, probably due to cache contention and/or SDRAM open pages.