I'm using the MCF5329.
I have posted previously about the limited speed of the library memcpy() function supplied with the gcc compiler on this hardware. On this 240MHz part the SDRAM bus has a bandwidth of 128MB/s, but with the supplied copy function I'm getting a maximum of 80MB/s, and usually less.
The Coldfire 3 User Manual (from Freescale's site) says in part:
5.4.3 RAM Initialization
...
... There are various instructions to support this function, including memory-to-memory move instructions, or the MOVEM opcode. The MOVEM instruction is optimized to generate line-sized burst fetches on 0-modulo-16 addresses, so this opcode generally provides maximum performance.
So I should be using MOVEM.L-based library copies.
I could write my own, but it would be better to use routines that are already debugged and optimised.
Does anyone have any good library copy routines for the Coldfire chips that use MOVEM.L instructions in the inner loops? I can't find any examples on Freescale's site.
Even better would be some that are set up to use the EDMA channels. It'd be good to start big copies running on the DMA and then get some other work done with the CPU.
Thanks for any pointers, URLs, code.
Tom
/* quickly copy multiples of 16 bytes.
*/
_memcpy16:
link a6,#-16 /* save a6 and room for 4 longs */
movem.l d4-d7,(sp) /* save registers 4x4 */
move.l 8(a6),a0 /* destination */
move.l 12(a6),a1 /* source */
move.l 16(a6),d0 /* length */
moveq.l #16,d1 /* d1 is constant 16 */
.loopm:
movem.l (a1),d4-d7 /* read a line */
adda.l d1,a1 /* src += 16 */
movem.l d4-d7,(a0) /* write the line */
adda.l d1,a0 /* dest += 16 */
sub.l d1,d0 /* length -= 16 */
bgt.b .loopm /* loop while positive */
movem.l (sp),d4-d7 /* restore registers */
unlk a6
rts
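From C it can be called along these lines (a sketch only; the wrapper name and the 320-byte example length are just illustrative, and the length must be a non-zero multiple of 16 bytes):

#include <stdint.h>

/* C prototype for the assembly routine above: it expects its arguments
 * pushed on the stack, destination first, then source, then length. */
extern void _memcpy16(void *dest, const void *src, unsigned long length);

/* Illustrative use: copy one 320-byte, 16-byte-aligned scan line
 * (320 = 20 * 16 bytes). */
void copy_scanline(uint32_t *dst, const uint32_t *src)
{
    _memcpy16(dst, src, 320);
}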
bkatt wrote:
/* quickly copy multiples of 16 bytes. */
_memcpy16:
...
I've just tested this.
My MCF5329 is supposed to have a raw memory bandwidth of 128 MB/s. That corresponds to the 80MHz SDRAM clock with 10 clocks per 16-byte read or write (80MHz / 10 * 16 bytes). For a memory copy that should correspond to 64MB/s copying speed.
This is what I'm measuring:
Function         MB/s    % of 64MB/s
memcpy           38.85   60.70%
memcpy_gcc_2_9   38.19   59.67%
memcpy_gcc_4_3   37.85   59.14%
memcpy_gcc_4_4   34.77   54.33%
memcpy_moveml    46.34   72.41%
The well-written library memcpy() is getting about 39MB/s, within the measurement error/variability of the gcc V2.9 one I have.
bkatt's one is getting a bit over 46MB/s. That's quite an improvement.
The other rows are for code reportedly generated by GCC 2.9, 4.3 and 4.4. The generated code is getting worse (slower, bigger, less efficient) with each version, but on this chip that only starts to affect the benchmark results with the poor 4.4 code.
I've been running more tests to try and find the most efficient memory copy functions.
This is on a 240MHz MCF5329.
The SDRAM is clocked at 80MHz and can read 4 bytes per clock, so that's an "ultimate bandwidth" of 320MB/s. The CPU is theoretically 960MB/s. But the normal memcpy() can only manage about 7% of that!
One of the App Notes claims the LCDC can read the RAM at 128MB/s, which equates to 10 80MHz clocks to read 4 32-bit words, so 6 clocks of overhead for 4 working clocks.
Here's a table of memory copy functions.
Function            Min    Max   Aver  StDev    Max     Avg   Speed
                     us     us     us     us   kb/s    kb/s
===================================================================
memcpy_gcc_4_4      4073   4246   4202   78.1  32180   31202
memcpy_gcc_4_3_O1   3788   3939   3919   53.0  34601   33448   +17%
memcpy_gcc_4_3_O2   3717   3937   3909   77.5  35262   33543
memcpy_gcc_2        3717   3935   3816  102.0  35262   34367
memcpy(131072)      3734   3915   3829   56.4  35102   34241   Reference
memcpy_moveml       3132   3305   3283   61.2  41849   39932   +17%
memcpy_dma          2993   2994   2994    0.5  43792   43783   +28%
memcpy_moveml_32    2500   2612   2543   42.1  52428   51564   +51%
memcpy_stack        2390   2475   2438   26.6  54841   53762   +57%
memcpy_stack_32     2265   2344   2317   25.2  57868   56572   +65%
The above table gives the minimum, maximum and average time to copy 128 kbytes from SDRAM to SDRAM. These measurements were conducted with interrupts and all DMA disabled. The cache is set to write-through. All copies are multiples of 16 bytes, aligned on 16-byte boundaries to match the cache line length. This is an artificial situation for general memory copies, but I'm copying bitmaps around, and they're all 16-byte aligned in memory.
The variation (the Standard Deviation of 8 separate measurements for each test) is due to the cache being rather indeterminate in which "way" it is going to invalidate on successive copies of the same data.
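The measurement loop is roughly this (a sketch; read_us_timer() is a stand-in for the free-running microsecond timer actually used):

#include <stdint.h>
#include <string.h>

/* Hypothetical microsecond timestamp source, standing in for the
 * free-running timer used for these measurements. */
extern uint32_t read_us_timer(void);

#define RUNS        8
#define COPY_BYTES  (128UL * 1024UL)

/* Time each of 8 copies of the same 16-byte-aligned 128 kbyte block;
 * min/max/average/std-dev are then taken over times_us[]. */
void bench_copy(void *dst, const void *src, uint32_t times_us[RUNS])
{
    for (int i = 0; i < RUNS; i++) {
        uint32_t t0 = read_us_timer();
        memcpy(dst, src, COPY_BYTES);   /* or any other copy function under test */
        times_us[i] = read_us_timer() - t0;
    }
}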
The different "gcc" tests are what different versions of gcc do to a simple C-based memcpy() function.
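That simple C copy is roughly this kind of loop (a sketch, not the exact source I benchmarked; it assumes 4-byte-aligned pointers and a length that is a multiple of 4 bytes):

#include <stddef.h>
#include <stdint.h>

/* Sketch of the sort of simple C copy loop handed to the different gcc
 * versions (not the exact benchmark source). */
void *memcpy_simple(void *dst, const void *src, size_t len)
{
    uint32_t *d = dst;
    const uint32_t *s = src;
    size_t n = len / 4;

    while (n--)
        *d++ = *s++;

    return dst;
}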
memcpy_dma() uses the DMA controller and waits for it to finish.
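Its shape is roughly this (a sketch; dma_start_mem_copy() and dma_channel_done() are hypothetical stand-ins for the eDMA channel programming, which I haven't shown):

#include <stddef.h>

/* Hypothetical helpers standing in for the MCF5329 eDMA channel setup
 * and status check; the real register programming is not shown here. */
extern void dma_start_mem_copy(int channel, void *dst, const void *src, size_t len);
extern int  dma_channel_done(int channel);

#define DMA_COPY_CHANNEL 0   /* illustrative channel number */

/* Blocking DMA copy: start the transfer, then spin until it completes. */
void memcpy_dma(void *dst, const void *src, size_t len)
{
    dma_start_mem_copy(DMA_COPY_CHANNEL, dst, src, len);
    while (!dma_channel_done(DMA_COPY_CHANNEL))
        ;   /* busy-wait for completion */
}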
memcpy() is the library one. The inner loop is the old favourite from the 68000 (and PDP-11) days:
40161034: 20d9 movel %a1@+,%a0@+
40161036: 20d9 movel %a1@+,%a0@+
40161038: 20d9 movel %a1@+,%a0@+
4016103a: 20d9 movel %a1@+,%a0@+
4016103c: 5380 subql #1,%d0
4016103e: 6a00 fff4 bplw 40161034 <memcpy+0x50>
memcpy_moveml has the following inner loop:
moveq.l #16,%d1 /* d1 is constant 16 */
.L10:
movem.l (%a1),%d4-%d7 /* read a line */
adda.l %d1,%a1 /* src += 16 */
movem.l %d4-%d7,(%a0) /* write the line */
adda.l %d1,%a0 /* dest += 16 */
sub.l %d1,%d0 /* length -= 16 */
bgt.b .L10 /* loop while positive */
memcpy_moveml_32() copies 32 bytes at a time and has the following inner loop:
moveq.l #32,%d1 /* d1 is constant 32 */
.L13:
movem.l (%a1),%d4-%d7/%a2-%a5 /* read 32 bytes (two lines) */
movem.l %d4-%d7/%a2-%a5,(%a0) /* write 32 bytes */
adda.l %d1,%a1 /* src += 32 */
adda.l %d1,%a0 /* dest += 32 */
sub.l %d1,%d0 /* length -= 32 */
bgt.b .L13 /* loop while positive */
memcpy_stack() is, surprisingly:
uint32_t vnStackBuf[MEMCPY_STACK_SIZE + 4];
...
while (size >= 16)
{
    nBurst = MIN(size, MEMCPY_STACK_SIZE);
    memcpy_moveml(pStackBuf, src, nBurst);
    memcpy_moveml(dst, pStackBuf, nBurst);
    size -= nBurst;
    src = (void *)(((char *)src) + nBurst);
    dst = (void *)(((char *)dst) + nBurst);
}
memcpy_stack_32() is the same but calls memcpy_moveml_32().
The fastest copy functions copy from SDRAM to SRAM (the stack is in SRAM) and then repeat the copy from SRAM back to SDRAM. This has the CPU doing double the number of operations, but it ends up faster as it seems to keep the SDRAM controller on the same "open page", so it isn't wasting clocks switching pages and banks.
The fastest ones also use MOVEM.L instructions, as these convert into direct burst memory cycles, and copying 32 bytes at a time is faster than 16.
Tom
More testing:
Function              Min    Max   Aver   StD  Max Spd  Avg Spd  Memclk
                       us     us     us    us     kb/s     kb/s
========================================================================
memcpy_gcc_4_4        4277   4278   4277   0.5   30885    30883   41.77
memcpy_gcc_4_3_O1     3956   3958   3957   0.5   33391    33382   38.64
memcpy_gcc_4_3_O2     3956   3957   3957   0.5   33391    33385   38.64
memcpy_gcc_2          3956   3957   3956   0.4   33391    33390   38.63
memcpy(132096)        3957   3958   3957   0.5   33382    33379   38.65
memcpy_moveml         3323   3323   3323   0.0   39752    39752   32.45
memcpy_dma            3022   3023   3022   0.4   43711    43709   29.51
memcpy_moveml_32      2661   2663   2662   0.7   49641    49618   26.00
memcpy_stack          2495   2497   2497   0.8   52944    52912   24.38
memcpy_moveml_192     2443   2445   2444   0.6   54071    54052   23.87
memcpy_moveml_48      2442   2442   2442   0.0   54093    54093   23.85
memcpy_stack_48       2401   2402   2402   0.4   55017    54997   23.46
memcpy_stack_32_mis   2398   2399   2398   0.5   55085    55079   23.42
memcpy_stack_32       2396   2397   2396   0.5   55131    55125   23.40
memcpy_stack_192      2369   2371   2370   0.5   55760    55736   23.14
memcpy_moveml_96_ps   2328   2329   2328   0.4   56742    56739   22.74
memRead_stack_32      1553   1554   1554   0.5   85058    85017   15.17
memRead_moveml_32     1515   1516   1516   0.4   87192    87141   14.80
memWrite_stack_32      671    671    671   0.0  196864   196864    6.55
memWrite_moveml_32     636    637    637   0.5  207698   207535    6.22
memcpy: Library memcpy() function
moveml: 4 register movem.l
moveml_32: 8 register movem.l
moveml_48: 12 register movem.l
moveml_96_ps: 12 register movem.l doubled-up (see code below)
moveml_192: moveml_48 unrolled 4 times
stack: SDRAM -> Stack (in SRAM), then Stack -> SDRAM.
Read: Read-only test
Write: Write-only test
Memclk: The number of memory clocks per cache line.
The last column shows how many memory clocks each copy took per cache line (16 bytes). The "Read" and "Write" tests are the most interesting. They are reading the SDRAM to registers (and throwing the result away) and likewise writing from registers to SDRAM. The SDRAM is capable of 32 bits per clock, or four clocks per cache line. It can be burst-written at about 6 clocks per cache line, quite close to theory. It can only be read at FOURTEEN clocks per cache line. Even when the CPU is trying as hard as it can, it looks like the SDRAM controller is closing the bank and precharging on every read, as this mode of operation is known to take 10 or 11 clocks per cache line.
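For reference, the Memclk figure follows from the average time like this (a sketch, assuming a 128 kbyte / 131072-byte copy and the 80MHz SDRAM clock):

/* Memclk: total SDRAM clocks divided by the number of 16-byte cache
 * lines moved. */
double memclk_per_line(double avg_us, unsigned long bytes)
{
    double clocks = avg_us * 80.0;          /* 80 SDRAM clocks per microsecond */
    double lines  = (double)bytes / 16.0;   /* 16 bytes per cache line */
    return clocks / lines;                  /* e.g. 4277us, 131072 bytes -> ~41.77 */
}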
The fastest memory copy function has this as the inner loop:
.L16:
movem.l (%a1),%d1-%d7/%a2-%a6 /* read first chunk */
movem.l %d1-%d7/%a2-%a6,(%sp) /* write to SRAM */
movem.l 48(%a1),%d1-%d7/%a2-%a6 /* read second chunk */
movem.l %d1-%d7/%a2-%a6,48(%a0) /* write second chunk */
movem.l (%sp),%d1-%d7/%a2-%a6 /* get first line back */
movem.l %d1-%d7/%a2-%a6,(%a0) /* write FIRST chunk */
moveq.l #96,%d1 /* d1 is constant 96 */
adda.l %d1,%a1 /* src += 96 */
adda.l %d1,%a0 /* dest += 96 */
sub.l %d1,%d0 /* length -= 96 */
bgt.b .L16 /* loop while positive */
It has the disadvantage that it only moves multiples of 96 bytes, and it is only slightly faster (3%) than the ones that copy multiples of 32 bytes via the stack in SRAM.
From my previous testing, the SDRAM controller seems to be able to keep the SDRAM page open for WRITE accesses, but not for READ accesses.
I've got the Crossbar parking on the CPU and have set it as the highest priority, so it shouldn't be causing a problem.
The Reference Manual states:
18.5.1.2 Read Command (READ)
When the SDRAMC receives a read request via the internal bus, it first checks the row and bank of the new access. If the address falls within the active row of an active bank, it is a page hit, and the read is issued as soon as possible (pending any delays required by previous commands). If the address is within an inactive bank, the memory controller issues an ACTV followed by the read command. If the address is not within the active row of an active bank, the memory controller issues a pre command to close the active row. Then, the SDRAMC issues ACTV to activate the necessary row and bank for the new access, followed by the read to the SDRAM.
So from the above the SDRAM controller should be able to keep the bank open, but it doesn't seem to be doing that in my tests.
Can anyone suggest what might be preventing the SDRAM controller from running at what should be "full speed"?
I wrote:
> From my previous testing, the SDRAM controller seems to be able to keep the
> SDRAM page open for WRITE accesses, but not for READ accesses.
Not so.
I reported my tests back to Freescale via our local rep and got a great response by the next day.
It included traces showing that the SDRAM controller does keep pages open between reads, but that SOMETHING (unknown) between the CPU and the SDRAM Controller is adding longish delays between the end of one SDRAM burst read and the start of the next one. Some 18-20 or more CPU clocks' worth of delay.
At least the tests at Freescale show that I haven't made some stupid configuration error somewhere that was making it run slow.
"Simple and wrong theory" suggests a maximum write speed of 320MB/s (320,000,000 and not 320 * 1024 * 1024).
Actual tests achieve about 208MB/s. Not bad.
"Simple and wrong theory" suggests a maximum read speed of less than 320MB/s with this CPU, as the core blocks waiting for the data from the previous read before starting the next one. It should be able to manage a four-burst read in 8 clocks, giving 160MB/s.
Actual tests take 13 or more clocks per burst, which means less than 98MB/s. In my tests reported previously in this thread I'm measuring less than 90MB/s.
The DMA controller isn't faster than the CPU, so that isn't an option for this, unless copies can be overlapped with other CPU activities.
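If overlapping does turn out to be worthwhile, the shape would be something like this (a sketch; dma_start_mem_copy(), dma_channel_done() and do_other_work() are hypothetical stand-ins for the eDMA programming and the CPU-side work):

#include <stddef.h>

/* Hypothetical stand-ins: eDMA channel start/poll helpers and some
 * unrelated CPU work to overlap with the transfer. */
extern void dma_start_mem_copy(int channel, void *dst, const void *src, size_t len);
extern int  dma_channel_done(int channel);
extern void do_other_work(void);

#define DMA_COPY_CHANNEL 0   /* illustrative */

/* Overlap a big copy with other CPU work: kick off the DMA, do something
 * useful on the CPU, and only then wait for the transfer to finish. */
void copy_and_work(void *dst, const void *src, size_t len)
{
    dma_start_mem_copy(DMA_COPY_CHANNEL, dst, src, len);
    do_other_work();
    while (!dma_channel_done(DMA_COPY_CHANNEL))
        ;   /* wait for the copy to complete */
}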
Tom
bkatt wrote:
/* quickly copy multiples of 16 bytes. */
_memcpy16:
...
Thanks. Neat and efficient. I'm looking at using the above, or modifying it into a more general memcpy().
But for that I'd like to look at the ABI to know what registers are saved and so on. I note that in a post of yours in this forum dated October 12, "Re: Coldfire register ABI documentation", you said:
> Note that GCC uses something like the standard
> ABI, but with register D2 preserved by functions
> and pointers returned in D0.
I'm using gcc. Any idea where its version of the ABI is documented?
Using m68k-elf-objdump on the libc code supports the "observation" that D0, D1, A0 and A1 seem to be the temporaries, but I'd rather have it written down somewhere than be "coding by reverse engineering" on a currently supported platform.
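What I have in mind for the more general version is roughly this (just a sketch; it only takes the MOVEM.L fast path when both pointers are 16-byte aligned, which matches my use, and falls back to a byte loop otherwise):

#include <stddef.h>
#include <stdint.h>

extern void _memcpy16(void *dst, const void *src, unsigned long len);

/* Sketch of a more general memcpy() built around _memcpy16: MOVEM.L
 * routine for the 16-byte-aligned bulk, simple byte loop for the rest. */
void *memcpy_general(void *dst, const void *src, size_t len)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    /* fast path only when both pointers are 16-byte aligned */
    if ((((uintptr_t)d | (uintptr_t)s) & 15) == 0) {
        size_t bulk = len & ~(size_t)15;    /* largest multiple of 16 */
        if (bulk) {
            _memcpy16(d, s, bulk);
            d += bulk;
            s += bulk;
            len -= bulk;
        }
    }

    /* remainder (or unaligned copies) one byte at a time */
    while (len--)
        *d++ = *s++;

    return dst;
}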
This is not an answer to your question, but have you tried modifying the XBS module to play with the master priorities? The bus arbiter can make a difference to the eDMA. Also, assembler and burst size can make a difference.
Hope this post creates a new thread of discussion for your issue.
Anyone else?