I've been running more tests to try to find the most efficient memory copy functions.
This is on a 240MHz MCF5329.
The SDRAM is clocked at 80MHz and can read 4 bytes per clock, so that's an "ultimate bandwidth" of 320MB/s. The CPU is theoretically 960MB/s. But the normal memcpy() can only manage about 7% of that (remember a copy both reads and writes every byte, so ~34MB/s copied is ~68MB/s of memory traffic)!
One of the App Notes claims the LCDC can read the RAM at 128MB/s, which equates to 10 80MHz clocks to read 4 32-bit words, so 6 clocks of overhead for 4 working clocks.
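Those figures are easy to sanity-check. A few lines of C reproduce the arithmetic (the clock rates and bus widths are the ones quoted above; nothing here is measured):

```c
#include <stdint.h>

/* Bandwidth arithmetic from the text above, in MB/s:
 * 80MHz SDRAM moving 4 bytes per clock, a 240MHz CPU bus,
 * and the LCDC reading 4 longwords (16 bytes) in 10 clocks. */
static uint32_t sdram_mbs(void) { return 80u * 4u; }        /* SDRAM limit    */
static uint32_t cpu_mbs(void)   { return 240u * 4u; }       /* CPU bus limit  */
static uint32_t lcdc_mbs(void)  { return 80u * 16u / 10u; } /* LCDC burst     */
```

These give 320, 960 and 128 MB/s respectively, matching the numbers quoted.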
Here's a table of memory copy functions.
Function            Min   Max  Aver  StDev    Max    Avg  Speed
                     us    us    us     us   kb/s   kb/s
===============================================================
memcpy_gcc_4_4     4073  4246  4202   78.1  32180  31202
memcpy_gcc_4_3_O1  3788  3939  3919   53.0  34601  33448  +17%
memcpy_gcc_4_3_O2  3717  3937  3909   77.5  35262  33543
memcpy_gcc_2       3717  3935  3816  102.0  35262  34367
memcpy(131072)     3734  3915  3829   56.4  35102  34241  Reference
memcpy_moveml      3132  3305  3283   61.2  41849  39932  +17%
memcpy_dma         2993  2994  2994    0.5  43792  43783  +28%
memcpy_moveml_32   2500  2612  2543   42.1  52428  51564  +51%
memcpy_stack       2390  2475  2438   26.6  54841  53762  +57%
memcpy_stack_32    2265  2344  2317   25.2  57868  56572  +65%
The above table gives the minimum, maximum and average time to copy 128 kbytes from SDRAM to SDRAM. These measurements were conducted with interrupts and all DMA disabled. The Cache is set to write-through. All copies are multiples of 16 bytes, all aligned on 16-byte boundaries to match the cache line length. This is an artificial situation for general memory copies, but I'm copying bitmaps around, and they're all 16-byte aligned in memory.
The variation (the Standard Deviation of 8 separate measurements for each test) is due to the cache being rather indeterminate in which "way" it is going to invalidate on successive copies of the same data.
The different "gcc" tests are what different versions of gcc do to a simple C-based memcpy() function.
memcpy_dma() uses the DMA controller and waits for it to finish.
memcpy() is the library one. The inner loop is the old favourite from the 68000 (and PDP-11 :smileyhappy:) days:
40161034: 20d9 movel %a1@+,%a0@+
40161036: 20d9 movel %a1@+,%a0@+
40161038: 20d9 movel %a1@+,%a0@+
4016103a: 20d9 movel %a1@+,%a0@+
4016103c: 5380 subql #1,%d0
4016103e: 6a00 fff4 bplw 40161034 <memcpy+0x50>
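For reference, a plain C version of that unrolled longword loop looks something like this (a sketch in the same shape as the disassembly, not the actual library source; it assumes the length is a multiple of 16 bytes and both pointers are 4-byte aligned):

```c
#include <stdint.h>
#include <stddef.h>

/* Copy 'len' bytes, 16 per iteration, as four longword moves --
 * the same shape as the library inner loop disassembled above.
 * Assumes len is a multiple of 16 and src/dst are 4-byte aligned. */
static void *memcpy_words(void *dst, const void *src, size_t len)
{
    uint32_t *d = dst;
    const uint32_t *s = src;
    size_t n = len / 16;

    while (n--) {
        *d++ = *s++;   /* movel %a1@+,%a0@+ */
        *d++ = *s++;
        *d++ = *s++;
        *d++ = *s++;
    }
    return dst;
}
```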
memcpy_moveml has the following inner loop:
moveq.l #16,%d1 /* d1 is constant 16 */
.L10:
movem.l (%a1),%d4-%d7 /* read a line */
adda.l %d1,%a1 /* src += 16 */
movem.l %d4-%d7,(%a0) /* write the line */
adda.l %d1,%a0 /* dest += 16 */
sub.l %d1,%d0 /* length -= 16 */
bgt.b .L10 /* loop while positive */
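A portable C equivalent of that loop copies one 16-byte "line" per iteration as a struct assignment. Whether a given compiler turns this into MOVEM.L on ColdFire is not guaranteed, so treat it only as a sketch of the idea:

```c
#include <stdint.h>
#include <stddef.h>

/* One cache line's worth of data (16 bytes, like d4-d7). */
typedef struct { uint32_t w[4]; } line16_t;

/* Copy len bytes one 16-byte line at a time.
 * Assumes len is a multiple of 16 and 4-byte alignment. */
static void memcpy_line16(void *dst, const void *src, size_t len)
{
    line16_t *d = dst;
    const line16_t *s = src;

    for (size_t n = len / 16; n != 0; n--)
        *d++ = *s++;          /* read a line, write a line */
}
```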
memcpy_moveml_32() copies 32 bytes at a time and has the following inner loop:
moveq.l #32,%d1 /* d1 is constant 32 */
.L13:
movem.l (%a1),%d4-%d7/%a2-%a5 /* read a line */
movem.l %d4-%d7/%a2-%a5,(%a0) /* write the line */
adda.l %d1,%a1 /* src += 32 */
adda.l %d1,%a0 /* dest += 32 */
sub.l %d1,%d0 /* length -= 32 */
bgt.b .L13 /* loop while positive */
memcpy_stack() is, surprisingly, just this:
uint32_t vnStackBuf[MEMCPY_STACK_SIZE + 4];
uint32_t *pStackBuf = vnStackBuf;   /* bounce buffer on the stack (in SRAM) */
...
while (size >= 16)
{
    nBurst = MIN(size, MEMCPY_STACK_SIZE);
    memcpy_moveml(pStackBuf, src, nBurst);   /* SDRAM -> SRAM */
    memcpy_moveml(dst, pStackBuf, nBurst);   /* SRAM -> SDRAM */
    size -= nBurst;
    src = (void *)((char *)src + nBurst);
    dst = (void *)((char *)dst + nBurst);
}
memcpy_stack_32() is the same but calls memcpy_moveml_32().
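Put together, a self-contained sketch of the bounce-buffer technique looks like this. The names and the buffer size are invented here, and plain memcpy() stands in for memcpy_moveml() to keep it portable:

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>

#define MEMCPY_STACK_SIZE 512   /* burst size in bytes; chosen here for illustration */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Bounce-buffer copy: stage each burst through a stack buffer
 * (which on the MCF5329 would live in SRAM).  memcpy() stands in
 * for memcpy_moveml() so this sketch runs anywhere. */
static void memcpy_stack_sketch(void *dst, const void *src, size_t size)
{
    uint32_t vnStackBuf[MEMCPY_STACK_SIZE / 4 + 4];
    char *d = dst;
    const char *s = src;

    while (size >= 16) {
        size_t nBurst = MIN(size, (size_t)MEMCPY_STACK_SIZE);
        memcpy(vnStackBuf, s, nBurst);   /* SDRAM -> SRAM */
        memcpy(d, vnStackBuf, nBurst);   /* SRAM -> SDRAM */
        size -= nBurst;
        s += nBurst;
        d += nBurst;
    }
    if (size)                            /* tail, if not a multiple of 16 */
        memcpy(d, s, size);
}
```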
The fastest copy functions copy from SDRAM to SRAM (the stack is in SRAM) and then repeat the copy from SRAM back to SDRAM. This has the CPU doing double the number of operations, but it ends up faster, as it seems to keep the SDRAM controller on the same "open page", so it isn't wasting clocks switching pages and banks.
The fastest ones also use MOVEM.L instructions, as these turn into direct burst memory cycles, and 32 bytes at a time is faster than 16.
Tom