Does anyone have MOVEM.L-based memcpy() libraries for MCF53xx?

TomE
Specialist II

I'm using the MCF5329.

 

I have posted previously about the limited speed of the library memcpy() function supplied with the gcc compiler on this hardware. With the CPU at 240MHz, the SDRAM bus has a bandwidth of 128MB/s, but with the supplied copy function I'm getting a maximum of 80MB/s, and usually less.

 

The Coldfire 3 User Manual (from Freescale's site) says in part:

 

   5.4.3 RAM Initialization
   ...
   ... There are various instructions to support this function, including memory-to-memory move instructions, or the MOVEM opcode. The MOVEM instruction is optimized to generate line-sized burst fetches on 0-modulo-16 addresses, so this opcode generally provides maximum performance.

 

So I should be using MOVEM.L-based library copies.

 

I could write my own, but I'd rather start from ones that are already debugged and optimised.

 

 

Does anyone have any good library copy routines for the Coldfire chips that use MOVEM.L instructions in the inner loops? I can't find any examples on Freescale's site.

 

Even better would be some that are set up to use the EDMA channels. It'd be good to start big copies running on the DMA and then get some other work done with the CPU.

 

Thanks for any pointers, URLs, code.

 

Tom

bkatt
Contributor IV

/* quickly copy multiples of 16 bytes. */

_memcpy16:
         link     a6,#-16        /* save a6 and room for 4 longs */
         movem.l  d4-d7,(sp)     /* save registers 4x4 */
         move.l   8(a6),a0       /* destination */
         move.l   12(a6),a1      /* source */
         move.l   16(a6),d0      /* length */
         moveq.l  #16,d1         /* d1 is constant 16 */
.loopm:
         movem.l  (a1),d4-d7     /* read a line */
         adda.l   d1,a1          /* src += 16 */
         movem.l  d4-d7,(a0)     /* write the line */
         adda.l   d1,a0          /* dest += 16 */
         sub.l    d1,d0          /* length -= 16 */
         bgt.b    .loopm         /* loop while positive */
         movem.l  (sp),d4-d7     /* restore registers */
         unlk     a6
         rts
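
For checking results on a host, the routine above can be modelled in portable C. This is only a sketch of the same control flow, not bkatt's code: the name memcpy16_model is made up for illustration, and a C compiler is not guaranteed to turn the inner copy into MOVEM.L bursts. Like the assembly, the loop tests at the bottom, so the length has to be a non-zero multiple of 16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical host-side model of _memcpy16: moves one 16-byte line per
 * iteration and tests the remaining length at the bottom of the loop. */
void *memcpy16_model(void *dest, const void *src, size_t length)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    long remaining = (long)length;

    /* Same restriction as the assembly: a non-zero multiple of 16. */
    assert(length != 0 && length % 16 == 0);

    do {
        uint32_t line[4];        /* stands in for d4-d7 */
        memcpy(line, s, 16);     /* "read a line" */
        memcpy(d, line, 16);     /* "write the line" */
        s += 16;
        d += 16;
        remaining -= 16;
    } while (remaining > 0);     /* bgt.b: loop while positive */

    return dest;
}
```

A model like this is mainly useful as a reference to compare the assembly version's output against in a unit test.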

 

TomE
Specialist II

bkatt wrote:

> /* quickly copy multiples of 16 bytes. */
> _memcpy16:

 

I've just tested this.

 

My MCF5329 is supposed to have a raw memory bandwidth of 128 MB/s. That corresponds to the 80MHz SDRAM clock with 10 clocks per 16-byte read or write (80MHz / 10 * 16). For a memory copy that should correspond to 64MB/s copying speed.

 

This is what I'm measuring:

 

Function          MB/s   % of 128MB/s
memcpy            38.85  60.70%
memcpy_gcc_2_9    38.19  59.67%
memcpy_gcc_4_3    37.85  59.14%
memcpy_gcc_4_4    34.77  54.33%
memcpy_moveml     46.34  72.41%

(The percentage counts both the read and the write, so it is 2 x MB/s as a fraction of 128MB/s.)
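
The arithmetic behind these figures can be written out as a small C sketch. The function names are made up for illustration, and 1 MB is taken as 1,000,000 bytes. The percentage helper assumes that a copy moves every byte across the bus twice (one read, one write), which is what reproduces the percentages in the table.

```c
#include <assert.h>

/* Raw SDRAM bandwidth in MB/s (1 MB = 1,000,000 bytes) for a given
 * memory clock in MHz and number of clocks per 16-byte burst. */
double raw_bandwidth_mbps(double sdram_mhz, double clocks_per_line)
{
    assert(clocks_per_line > 0.0);
    return sdram_mhz / clocks_per_line * 16.0;
}

/* A copy reads and writes every byte, so it puts two bytes on the
 * bus for each byte copied. Returns the bus utilisation in percent. */
double copy_percent_of_raw(double copy_mbps, double raw_mbps)
{
    assert(raw_mbps > 0.0);
    return 2.0 * copy_mbps / raw_mbps * 100.0;
}
```

With 80MHz and 10 clocks per line the first function gives 128MB/s, and a 38.85MB/s copy then works out to roughly 60.7% of it.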

 

The well-written library memcpy() is getting about 39MB/s, within the measurement error/variability of the gcc V2.9 one I have.

bkatt's one is getting a bit over 46MB/s. That's quite an improvement.

The other ones are for code reportedly generated by GCC 2.9, 4.3 and 4.4. The code is getting a lot worse (slower, bigger, less efficient) with time, but this only starts affecting the benchmark results on this chip with the poor 4.4 code.

 

 

TomE
Specialist II

I've been running more tests to try and find the most efficient memory copy functions.

 

This is on a 240MHz MCF5329.

 

The SDRAM is clocked at 80MHz and can read 4 bytes per clock, so that's an "ultimate bandwidth" of 320MB/s. The CPU is theoretically 960MB/s. But the normal memcpy() can only manage about 7% of that!

 

One of the App Notes claims the LCDC can read the RAM at 128MB/s, which equates to 10 80MHz clocks to read 4 32-bit words, so 6 clocks of overhead for 4 working clocks.

 

Here's a table of memory copy functions.

 

Function           Min   Max   Aver  StDev  Max    Avg    vs
                   us    us    us    us     kB/s   kB/s   memcpy()
==================================================================
memcpy_gcc_4_4     4073  4246  4202  78.1   32180  31202
memcpy_gcc_4_3_O1  3788  3939  3919  53.0   34601  33448
memcpy_gcc_4_3_O2  3717  3937  3909  77.5   35262  33543
memcpy_gcc_2       3717  3935  3816  102.0  35262  34367

memcpy(131072)     3734  3915  3829  56.4   35102  34241  Reference
memcpy_moveml      3132  3305  3283  61.2   41849  39932  +17%
memcpy_dma         2993  2994  2994  0.5    43792  43783  +28%
memcpy_moveml_32   2500  2612  2543  42.1   52428  51564  +51%
memcpy_stack       2390  2475  2438  26.6   54841  53762  +57%
memcpy_stack_32    2265  2344  2317  25.2   57868  56572  +65%

 

The above table gives the minimum, maximum and average time to copy 128 kbytes from SDRAM to SDRAM. These measurements were conducted with interrupts and all DMA disabled. The Cache is set to write-through. All copies are multiples of 16 bytes, all aligned on 16-byte boundaries to match the cache line length. This is an artificial situation for general memory copies, but I'm copying bitmaps around, and they're all 16-byte aligned in memory.
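
As a cross-check, the speed and percentage columns can be recomputed from the times. The sketch below is illustrative only: the helper names are made up, and the convention of 1 kB = 1000 bytes is inferred from the numbers in the table rather than stated in the post.

```c
#include <assert.h>

/* Throughput in kB/s (1 kB = 1000 bytes) for `bytes` copied in
 * `elapsed_us` microseconds. Bytes per microsecond is MB/s, so
 * multiplying by 1000 gives kB/s. */
double throughput_kbps(unsigned long bytes, double elapsed_us)
{
    assert(elapsed_us > 0.0);
    return (double)bytes / elapsed_us * 1000.0;
}

/* Percentage gain of one throughput over a reference throughput. */
double gain_percent(double kbps, double reference_kbps)
{
    assert(reference_kbps > 0.0);
    return (kbps / reference_kbps - 1.0) * 100.0;
}
```

For example, 131072 bytes in 3734 us comes out at about 35102 kB/s, which matches the "Max" speed of the library memcpy() row, and 56572 kB/s against the 34241 kB/s reference is a gain of about 65%.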

 

The variation (the Standard Deviation of 8 separate measurements for each test) is due to the cache being rather indeterminate in which "way" it is going to invalidate on successive copies of the same data.

 

The different "gcc" tests are what different versions of gcc do to a simple C-based memcpy() function.

 

memcpy_dma() uses the DMA controller and waits for it to finish.

 

memcpy() is the library one. The inner loop is the old favourite from the 68000 (and PDP-11) days:

 

40161034:       20d9            movel %a1@+,%a0@+
40161036:       20d9            movel %a1@+,%a0@+
40161038:       20d9            movel %a1@+,%a0@+
4016103a:       20d9            movel %a1@+,%a0@+
4016103c:       5380            subql #1,%d0
4016103e:       6a00 fff4       bplw 40161034 <memcpy+0x50>

 

memcpy_moveml has the following inner loop:

 

    moveq.l    #16,%d1         /* d1 is constant 16 */
.L10:
    movem.l   (%a1),%d4-%d7    /* read a line */
    adda.l    %d1,%a1          /* src += 16 */
    movem.l   %d4-%d7,(%a0)    /* write the line */
    adda.l    %d1,%a0          /* dest += 16 */
    sub.l     %d1,%d0          /* length -= 16 */
    bgt.b     .L10             /* loop while positive */

memcpy_moveml_32() copies 32 bytes at a time and has the following inner loop:

 

    moveq.l    #32,%d1                 /* d1 is constant 32 */
.L13:
    movem.l   (%a1),%d4-%d7/%a2-%a5    /* read a line */
    movem.l   %d4-%d7/%a2-%a5,(%a0)    /* write the line */
    adda.l    %d1,%a1                  /* src += 32 */
    adda.l    %d1,%a0                  /* dest += 32 */
    sub.l     %d1,%d0                  /* length -= 32 */
    bgt.b     .L13                     /* loop while positive */

Surprisingly, memcpy_stack() is just:

 

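    uint32_t    vnStackBuf[MEMCPY_STACK_SIZE + 4];   /* staging buffer on the stack (in SRAM) */

    ...

    /* Copy in bursts of at most MEMCPY_STACK_SIZE bytes: SDRAM -> stack -> SDRAM */
    while (size >= 16)
    {
        nBurst = MIN(size, MEMCPY_STACK_SIZE);
        memcpy_moveml(pStackBuf, src, nBurst);    /* SDRAM -> SRAM */
        memcpy_moveml(dst, pStackBuf, nBurst);    /* SRAM -> SDRAM */
        size -= nBurst;
        src = (void *)(((char *)src) + nBurst);
        dst = (void *)(((char *)dst) + nBurst);
    }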

memcpy_stack_32() is the same but calls memcpy_moveml_32().

 

The fastest copy functions copy from SDRAM to SRAM (the stack is in SRAM) and then repeat the copy from SRAM back to SDRAM. This has the CPU doing double the number of operations, but it ends up faster, as it seems to keep the SDRAM controller on the same "open page" so that it isn't wasting clocks switching pages and banks.
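
The staging idea can be tried out on a host with a self-contained sketch like the one below. Everything named here is hypothetical: BOUNCE_BYTES is an arbitrary size, copy_lines() merely stands in for memcpy_moveml(), and on a PC the stack is of course not on-chip SRAM, so this shows the structure of the copy and not the speed-up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BOUNCE_BYTES 512u   /* hypothetical staging size, a multiple of 16 */

/* Stand-in for memcpy_moveml(): on the target this would be the
 * MOVEM.L routine, here it is plain memcpy() so the sketch runs anywhere. */
static void copy_lines(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Copy through a small on-stack staging buffer: source -> buffer -> dest.
 * size must be a multiple of 16, as in the thread. */
void memcpy_bounce(void *dst, const void *src, size_t size)
{
    uint32_t staging[BOUNCE_BYTES / sizeof(uint32_t)];
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(size % 16 == 0);

    while (size >= 16) {
        size_t burst = size < BOUNCE_BYTES ? size : BOUNCE_BYTES;
        copy_lines(staging, s, burst);   /* SDRAM -> SRAM (stack) */
        copy_lines(d, staging, burst);   /* SRAM (stack) -> SDRAM */
        s += burst;
        d += burst;
        size -= burst;
    }
}
```

The burst size is the knob worth experimenting with: larger bursts mean longer runs of reads or writes to the SDRAM, at the cost of more stack.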

 

The fastest ones also use MOVEM.L instructions, as these convert into direct burst memory cycles, and 32 bytes at a time is faster than 16.

 

Tom

 

TomE
Specialist II

More testing:

 

Function            Min   Max   Aver  StD  Max    Avg    Memclk
                    us    us    us    us   kB/s   kB/s
===============================================================
memcpy_gcc_4_4      4277  4278  4277  0.5  30885  30883  41.77
memcpy_gcc_4_3_O1   3956  3958  3957  0.5  33391  33382  38.64
memcpy_gcc_4_3_O2   3956  3957  3957  0.5  33391  33385  38.64
memcpy_gcc_2        3956  3957  3956  0.4  33391  33390  38.63

memcpy(132096)      3957  3958  3957  0.5  33382  33379  38.65
memcpy_moveml       3323  3323  3323  0.0  39752  39752  32.45
memcpy_dma          3022  3023  3022  0.4  43711  43709  29.51
memcpy_moveml_32    2661  2663  2662  0.7  49641  49618  26.00
memcpy_stack        2495  2497  2497  0.8  52944  52912  24.38
memcpy_moveml_192   2443  2445  2444  0.6  54071  54052  23.87
memcpy_moveml_48    2442  2442  2442  0.0  54093  54093  23.85
memcpy_stack_48     2401  2402  2402  0.4  55017  54997  23.46
memcpy_stack_32_mis 2398  2399  2398  0.5  55085  55079  23.42
memcpy_stack_32     2396  2397  2396  0.5  55131  55125  23.40
memcpy_stack_192    2369  2371  2370  0.5  55760  55736  23.14
memcpy_moveml_96_ps 2328  2329  2328  0.4  56742  56739  22.74
memRead_stack_32    1553  1554  1554  0.5  85058  85017  15.17
memRead_moveml_32   1515  1516  1516  0.4  87192  87141  14.80
memWrite_stack_32    671   671   671  0.0 196864 196864   6.55
memWrite_moveml_32   636   637   637  0.5 207698 207535   6.22


memcpy:        Library memcpy() function
moveml:        4 register movem.l
moveml_32:     8 register movem.l
moveml_48:    12 register movem.l
moveml_96_ps: 12 register movem.l doubled-up (see code below)
moveml_192:   moveml_48 unrolled 4 times

stack:        SDRAM -> Stack (in SRAM), then Stack -> SDRAM.
Read:         Read-only test
Write:        Write-only test

Memclk:       The number of memory clocks per cache line.


The last column shows how many memory clocks each copy took per cache line (16 bytes). The "Read" and "Write" tests are the most interesting. They are reading the SDRAM to registers (and throwing the result away) and likewise writing from registers to SDRAM. The SDRAM is capable of 32 bits per clock, or four clocks per cache line. It can be burst-written at about 6 clocks per cache line, quite close to theory. It can only be read at FOURTEEN clocks per cache line. Even when the CPU is trying as hard as it can, it looks like the SDRAM controller is closing the bank and precharging on every read, as this mode of operation is known to take 10 or 11 clocks per cache line.
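
For reference, the Memclk figure can be derived from a measured time roughly as follows. This is a sketch with a made-up function name. Calling it with a 131072-byte buffer and 80MHz reproduces the Memclk column closely for the rows checked, but that buffer size is an assumption here: the table labels the library call memcpy(132096), so the exact size used may differ.

```c
#include <assert.h>

/* Memory clocks spent per 16-byte cache line for a transfer of `bytes`
 * that took `elapsed_us` microseconds with the SDRAM at `sdram_mhz`. */
double memclk_per_line(unsigned long bytes, double elapsed_us, double sdram_mhz)
{
    assert(bytes >= 16 && bytes % 16 == 0);

    double clocks = elapsed_us * sdram_mhz;   /* microseconds * MHz = clocks */
    double lines  = (double)bytes / 16.0;     /* number of cache lines moved */
    return clocks / lines;
}
```

For instance, 637 us for a write-only pass gives about 6.2 clocks per line, and 1516 us for a read-only pass gives about 14.8.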

 

The fastest memory copy function has this as the inner loop:

 

.L16:
    movem.l    (%a1),%d1-%d7/%a2-%a6    /* read first chunk */
    movem.l    %d1-%d7/%a2-%a6,(%sp)    /* write to SRAM */
    movem.l    48(%a1),%d1-%d7/%a2-%a6  /* read second chunk */
    movem.l    %d1-%d7/%a2-%a6,48(%a0)  /* write second chunk */
    movem.l    (%sp),%d1-%d7/%a2-%a6    /* get first line back */
    movem.l    %d1-%d7/%a2-%a6,(%a0)    /* write FIRST chunk */
    moveq.l    #96,%d1                  /* d1 is constant 96 */
    adda.l     %d1,%a1                  /* src += 96 */
    adda.l     %d1,%a0                  /* dest += 96 */
    sub.l      %d1,%d0                  /* length -= 96 */
    bgt.b      .L16                     /* loop while positive */

 

It has the disadvantage that it only moves multiples of 96 bytes, and it is only slightly faster (3%) than the ones that copy multiples of 32 bytes via the stack in SRAM.
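
One possible way around the multiple-of-96 restriction is a small front end that sends the bulk through the fast loop and lets the ordinary memcpy() handle whatever is left. The sketch below only illustrates that split: memcpy_bulk96() and copy_chunks_96() are hypothetical names, the latter standing in for the assembly routine, and the 16-byte alignment the fast path wants on the target is not checked here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 96u   /* bytes moved per iteration of the fast loop */

/* Stand-in for the 96-byte MOVEM.L routine; plain memcpy() on a host. */
static void copy_chunks_96(void *dst, const void *src, size_t n)
{
    assert(n != 0 && n % CHUNK == 0);
    memcpy(dst, src, n);
}

/* Hypothetical general-purpose front end: fast path for the bulk,
 * ordinary memcpy() for the remainder. */
void *memcpy_bulk96(void *dst, const void *src, size_t n)
{
    size_t bulk = n - n % CHUNK;   /* largest multiple of 96 that is <= n */

    if (bulk != 0)
        copy_chunks_96(dst, src, bulk);
    if (n != bulk)
        memcpy((unsigned char *)dst + bulk,
               (const unsigned char *)src + bulk, n - bulk);
    return dst;
}
```

Skipping the fast path when bulk is zero matters because the assembly loop tests at the bottom and would otherwise copy one chunk regardless.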

 

 

 

 

 

TomE
Specialist II

From my previous testing, the SDRAM controller seems to be able to keep the SDRAM page open for WRITE accesses, but not for READ accesses.

 

I've got the Crossbar parking on the CPU and have set it as the highest priority, so it shouldn't be causing a problem.

 

The Reference Manual states:

 

18.5.1.2 Read Command (READ)
When the SDRAMC receives a read request via the internal bus, it first checks the row and bank of the new access. If the address falls within the active row of an active bank, it is a page hit, and the read is issued as soon as possible (pending any delays required by previous commands). If the address is within an inactive bank, the memory controller issues an ACTV followed by the read command. If the address is not within the active row of an active bank, the memory controller issues a pre command to close the active row. Then, the SDRAMC issues ACTV to activate the necessary row and bank for the new access, followed by the read to the SDRAM.

 

So from the above the SDRAM controller should be able to keep the bank open, but it doesn't seem to be doing that in my tests.
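
The three cases in the quoted READ description can be captured in a toy C model, which makes it easier to see what a sequential read stream ought to look like. This is a sketch of the manual's wording only, not of the real SDRAMC: the type and function names are invented, the four-bank layout is an assumption, and no timing is modelled.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_BANKS 4   /* assumption: a 4-bank SDRAM */

typedef struct {
    bool     active[NUM_BANKS];     /* is a row open in this bank? */
    unsigned open_row[NUM_BANKS];   /* which row, if the bank is active */
} sdram_model;

typedef enum { READ_HIT, ACTV_THEN_READ, PRE_ACTV_THEN_READ } read_seq;

/* Returns the command sequence the quoted text describes for a read
 * and updates the model's open-row state accordingly. */
read_seq sdram_read(sdram_model *m, unsigned bank, unsigned row)
{
    assert(bank < NUM_BANKS);

    if (!m->active[bank]) {              /* inactive bank: ACTV, then READ */
        m->active[bank] = true;
        m->open_row[bank] = row;
        return ACTV_THEN_READ;
    }
    if (m->open_row[bank] == row)        /* page hit: READ straight away */
        return READ_HIT;

    m->open_row[bank] = row;             /* row miss: PRE, ACTV, then READ */
    return PRE_ACTV_THEN_READ;
}
```

In this model a linear copy hits the open row on every burst after the first until it crosses into a new row, which is why a precharge on every read would be unexpected.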

 

Can anyone suggest what might be preventing the SDRAM controller from running at what should be "full speed"?

 

 

TomE
Specialist II

I wrote:

> From my previous testing, the SDRAM controller seems to be able to keep the
> SDRAM page open for WRITE accesses, but not for READ accesses.

 

Not so.

 

I reported my tests back to Freescale via our local rep and got a great response by the next day.

 

It included traces showing that the SDRAM controller does keep pages open between reads, but that SOMETHING (unknown) between the CPU and the SDRAM Controller is adding longish delays between the end of one SDRAM burst read and the start of the next one. Some 18-20 or more CPU clocks' worth of delay.

 

At least the tests at Freescale show that I haven't made some stupid configuration error somewhere that was making it run slow.

 

"Simple and wrong theory" suggests a maximum write speed of 320MB/s (320,000,000 and not 320 * 1024 * 1024).

 

Actual tests achieve about 208MB/s. Not bad.

 

"Simple and wrong theory" suggests a maximum read speed of less than 320MB/s with this CPU, as the core blocks waiting for the data from the previous read before starting the next one. It should be able to manage a four-burst read in 8 clocks, giving 160MB/s.

 

Actual tests take 13 or more clocks per burst, which means at most about 98MB/s. In my tests reported previously in this thread I'm measuring less than 90MB/s.
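
These numbers can be tied back to the delay Freescale reported with a rough calculation. The sketch below is a back-of-envelope check, not a measurement: the function names are made up, the 80MHz and 240MHz clocks are the ones quoted earlier in the thread, and the 8-clock ideal burst is the "simple theory" figure from above.

```c
#include <assert.h>

#define SDRAM_MHZ   80.0    /* SDRAM clock, from the thread */
#define CPU_MHZ     240.0   /* CPU clock, from the thread */
#define LINE_BYTES  16.0    /* one burst is one 16-byte cache line */

/* Memory clocks per 16-byte burst implied by a measured bandwidth in MB/s. */
double clocks_per_burst(double measured_mbps)
{
    assert(measured_mbps > 0.0);
    return LINE_BYTES * SDRAM_MHZ / measured_mbps;
}

/* Extra delay per burst, in CPU clocks, beyond an ideal burst length
 * given in memory clocks. */
double extra_cpu_clocks(double measured_mbps, double ideal_mem_clocks)
{
    double extra_mem = clocks_per_burst(measured_mbps) - ideal_mem_clocks;
    return extra_mem * (CPU_MHZ / SDRAM_MHZ);
}
```

Feeding in a measured read speed of about 87MB/s with an 8-clock ideal gives roughly 20 extra CPU clocks per burst, which is in the same range as the 18-20 clocks seen in the traces.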

 

The DMA controller isn't faster than the CPU, so that isn't an option for this, unless copies can be overlapped with other CPU activities.

 

Tom

 

TomE
Specialist II

bkatt wrote:

> /* quickly copy multiples of 16 bytes. */
> _memcpy16:
> ...

 

Thanks. Neat and efficient. I'm looking at using the above, or modifying it into a more general memcpy().

 

But for that I'd like to look at the ABI to know what registers are saved and so on. I note from a post from you in this forum dated October 12 as "Re: Coldfire register ABI documentation" that you said:

 

> Note that GCC uses something like the standard
> ABI, but with register D2 preserved by functions
> and pointers returned in D0.

 

I'm using gcc. Any idea where its version of the ABI is documented?

 

Using m68k-elf-objdump on the libc code supports the "observation" that D0, D1, A0 and A1 seem to be the temporaries, but I'd rather have it written down somewhere than be "coding by reverse engineering" on a current supported platform.

 

 

PaoloRenzo
Contributor V

This is not an answer to your question, but have you tried modifying the XBS module to play with the master priorities? The bus arbiter can make a difference to the eDMA. Assembler and burst size can also make a difference.

Hope this post creates a new thread of discussion for your issue.

 

Anyone else?
