DMA or memcpy


cmaryan
Contributor I

I'm trying to get DMA working reliably on a 5445x and I'm finding that I get essentially identical performance, maybe even slightly better, from a simple loop-based memcpy. At the moment I'm not interested in interrupt-driven DMA, just in a fast way of copying data, so a memcpy-type function works well enough. The only time DMA is notably faster is when I can get the source and destination aligned to get 16-byte transfers.

 

Is this consistent with what others have seen?

 

Regardless of the above, when using DMA, do you put the source and destination buffers outside of cached memory or do you purge the data cache before you do the transaction?

 

Thanks,

 

Chris

9 Replies

bkatt
Contributor IV

One huge drawback of using DMA for memory copy is that such a routine will not be re-entrant and will be difficult to use safely in an interrupt service routine or a pre-emptive multitasking system like MQX.

 

We had a concern about memcpy being called in an ISR (supplied by Freescale) and solved it by avoiding the operation altogether in 99% of the cases rather than speeding up the copy. We also made sure that long-running ISRs can be pre-empted by higher-level interrupts; this is NOT the default in code produced by CodeWarrior.

 

If you do feel it is necessary to speed up a non-DMA memcpy routine (hopefully you have profiled this need), on ColdFire you can ignore the alignment of the source and base your code speed-up on the alignment of the destination.
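A sketch of that idea (a hypothetical routine, not bkatt's actual code): align the destination with a few byte copies, then do 32-bit stores regardless of source alignment, relying on the ColdFire core's tolerance of misaligned reads. On stricter architectures the misaligned source load would need a byte-wise fallback.

```c
#include <stddef.h>
#include <stdint.h>

/* Speed-up keyed on DESTINATION alignment only; the source is read with
   32-bit loads even when misaligned, which ColdFire permits. */
static void *fast_memcpy(void *dst, const void *src, size_t len)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    while (((uintptr_t)d & 3) && len) {   /* byte copies until dst aligned */
        *d++ = *s++;
        len--;
    }
    while (len >= 4) {                    /* 32-bit stores to aligned dst  */
        *(uint32_t *)(void *)d = *(const uint32_t *)(const void *)s;
        d += 4; s += 4; len -= 4;
    }
    while (len--)                         /* tail bytes                    */
        *d++ = *s++;
    return dst;
}
```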

mjbcswitzerland
Specialist V

Hi

 

I don't think there are any real problems with a DMA-based memcpy(). Of course, if a copy is in progress no other code (interrupt, pre-empting task, etc.) may try to use the DMA controller, but in that case it can simply fall back to the standard memcpy() [optimised or not].

 

In fact there are four DMA channels - if they were not used for anything else, it would be possible to have four DMA memcpy() transfers in parallel with no adverse effects (as long as each caller also waits for its own transfer to terminate before continuing).

 

In the uTasker project we used only DMA3 for memcpy() since the others are typically used by the UARTs.

This means that the structure is simply:

 

uMemcpy()
{
    if (!DMA_SR_BCR3) {                                  // DMA not in use
        // do DMA copy and wait for termination before returning
    }
    else {
        // call standard memcpy()
    }
}

 

We have used this in many ColdFire (V2) projects for 3 years with no known side effects, called from interrupt routines and 'normal' code.

 

Regards

 

Mark

 

bkatt
Contributor IV

mjbcswitzerland wrote:

 

    if (!DMA_SR_BCR3) {                                  // DMA not in use
        // do DMA copy and wait for termination before returning
    }
    else {
        // call standard memcpy()
    }


 


To avoid a race condition after your "if" statement, the next operation on the hardware would have to set the BCR to a non-zero value. You don't show the code, but I assume this is true. If a careless programmer were to load the source or destination register before the BCR, the race condition would bite him sooner than he may think.

 

Also, for certain kinds of memory such as flash, don't you have to translate to a back-door address?

 

Your tests showed gains on a processor without cache. But the original poster's tests, from a different processor (maybe with cache), seemed to indicate that DMA might not provide much improvement.

mjbcswitzerland
Specialist V

Hi

 

>>To avoid a race condition

Yes: the first line writes the size to be transferred to the register, then the routine programs the other configuration bits and polls until the transfer completes. The last command is a write to that register to clear it back to zero.
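That sequence can be sketched as follows; the register names here are stand-ins (plain variables) for the real memory-mapped channel 3 registers, and the transfer itself is simulated with memcpy(), so this only illustrates the claim-first/clear-last ordering:

```c
#include <stdint.h>
#include <string.h>

/* Stand-ins for the DMA channel 3 registers; a real driver would use
   the memory-mapped addresses from the device header. */
static volatile uint32_t  DMA_BCR3; /* byte count: non-zero means busy  */
static volatile uintptr_t DMA_SAR3; /* source address register          */
static volatile uintptr_t DMA_DAR3; /* destination address register     */

static void dma_memcpy(void *dst, const void *src, uint32_t len)
{
    DMA_BCR3 = len;             /* 1. claim the channel FIRST: anyone
                                      testing BCR3 now sees it in use   */
    DMA_SAR3 = (uintptr_t)src;  /* 2. only then program the addresses   */
    DMA_DAR3 = (uintptr_t)dst;

    memcpy(dst, src, len);      /* 3. start and poll for completion
                                      (simulated here with memcpy)      */

    DMA_BCR3 = 0;               /* 4. last step: clear back to zero so
                                      the channel reads as free again   */
}
```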

 

>>Also, for certain kinds of memory such as flash, don't you have to translate to a back-door address?

Yes, this is also true: any access to FLASH needs to go via the back-door address.

 

Furthermore, the DMA controller also needs to be granted rights to access FLASH in the appropriate configuration registers (once at start-up).

 

Regards

 

Mark

 

PaoloRenzo
Contributor V

The V4 DMA module is quite different from the one in the V2 family - starting with the name: it is an eDMA. I agree that all DMAs transfer memory blocks from one memory to another without CPU intervention; the difference is in how they perform this action. For example, on the 547x, with its Harvard architecture, an arbiter, and support for external memories like DDR plus caches, more considerations are needed. Better integration then needs to be considered during eDMA programming to determine the suitable configuration.

 

Have you checked the XBS priorities? Which memories are you using with the eDMA?

cmaryan
Contributor I

Thanks for the info, especially bringing up the point about using DMA in an ISR.

Regarding the ISR issue: it doesn't affect my code base, but for those considering it, I think it can be solved at least partially by using the DMA control registers in some sort of mutex scheme (if DMA is running, wait until DMA is complete...), which might hurt ISR timing but would prevent the problem. Putting the crossbar in round-robin arbitration mode might also help.
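A minimal sketch of that mutex idea, with the byte-count register modelled as a plain variable (non-zero while a transfer is in flight); the wait loop is exactly the part that could hurt ISR timing:

```c
#include <stdint.h>

static volatile uint32_t DMA_BCR3; /* stand-in: non-zero while DMA busy */

/* Spin until the channel is free; a caller would then claim it by
   writing the byte count before any other channel register. */
static void dma_wait_free(void)
{
    while (DMA_BCR3 != 0)
        ;                          /* busy-wait for the current transfer */
}
```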

 

On that note: does anyone here run their crossbar in round-robin arbitration mode, or just the default fixed-priority mode? I found it was necessary to put it in RR mode for the code base I'm working with (I'm porting 5272 code to the 54454); otherwise I would get some conflicts, regardless of crossbar priority settings. (Though it's worth pointing out that I haven't seen RR mode affect DMA speed noticeably.)

 

As for the v2 core timings Mark provided, we saw similar improvements using DMA on our 5272. I'm guessing that the fast core and a clever DDR2 controller on the 5445x can bundle memory transactions so quickly that the limitation is DDR2 bandwidth rather than core speed, whereas the v2 core chips have the situation reversed.

cmaryan
Contributor I

I went through and made some changes to the way my DMA transactions are handled, so for anyone working with DMA on the v4 chips, here are some useful tips:

 

- Ensure you are using the maximum source and destination transaction size for your transfer: your DMA function should start with some logic to work out what the source increment and source port size should be, based on the source address and transfer length, with similar logic for the destination.

 

- Remember that the source is independent of the destination; the DMA module does all of the transaction-size conversion between the two. The above logic for the source should depend only on the transfer length, source start address, and source port size; the destination depends only on the destination port size, transfer length, and destination address.
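As a sketch of that per-side selection (a hypothetical helper, not the actual eDMA register interface): pick the widest size that both the address alignment and the remaining length allow, calling it once for the source and once for the destination:

```c
#include <stddef.h>
#include <stdint.h>

/* Widest usable transfer size in bytes for one side of the transaction:
   16-byte burst, 32-bit long, 16-bit word, or single byte. */
static size_t max_port_size(uintptr_t addr, size_t len)
{
    if ((addr % 16) == 0 && len >= 16) return 16;
    if ((addr % 4)  == 0 && len >= 4)  return 4;
    if ((addr % 2)  == 0 && len >= 2)  return 2;
    return 1;
}
```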

 

- I found I was able to improve performance a fair bit by breaking my DMA transactions into unaligned and aligned chunks, at least for memory-to-memory transfers. For each full transaction I first do a small copy until the destination address is aligned on a 16-byte boundary (I actually just copy those bytes manually, since it's usually only a couple of bytes). Then I do the main part of the transaction, with the remaining length rounded down to a multiple of 16, which ensures the destination port is fully utilized with 16-byte transfers. Finally I do another small transaction for the remaining bytes. This lets me make the most of DMA without particularly trying to align my buffers, and gets me about 20-25% more speed on average.
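The split can be computed up front; this hypothetical helper returns the three chunk lengths (unaligned head, 16-byte-aligned body, tail) for a given destination address and total length:

```c
#include <stddef.h>
#include <stdint.h>

/* Split a copy so the middle chunk starts 16-byte aligned at the
   destination and is a multiple of 16 bytes long. */
static void split_for_dma(uintptr_t dst, size_t len,
                          size_t *head, size_t *body, size_t *tail)
{
    size_t h = (size_t)((16 - (dst & 15)) & 15); /* bytes to align dst  */
    if (h > len)
        h = len;
    size_t b = (len - h) & ~(size_t)15;          /* round down to 16s   */

    *head = h;           /* copied manually (usually a couple of bytes) */
    *body = b;           /* main DMA transaction, 16-byte transfers     */
    *tail = len - h - b; /* small transaction for the leftover bytes    */
}
```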

 

Still wondering if anyone out there uses round robin on their crossbar.

Message Edited by cmaryan on 2009-09-18 03:26 PM
cmaryan
Contributor I
Oh yeah, and to answer the question of which memories we are using for this DMA work: all of our transactions are either DDR2 to DDR2, or peripheral register (especially ATA) to/from DDR2. That is, I haven't found a good use for the SRAM yet; everything I need to access quickly is either small enough to go into a register, or so large it doesn't fit in SRAM and has to go in DDR2.
mjbcswitzerland
Specialist V

Hi Chris

 

http://www.utasker.com/docs/uTasker/uTaskerBenchmarks.PDF

 

These are some (old) benchmarks using the M52235 at 40MHz, which has no cache.

 

Copying 1024 bytes using standard memcpy() - code running in FLASH - 359us

Copying 1024 bytes using standard memcpy() - code running from SRAM - 333us

Copying 1024 bytes using DMA (byte copies) - 175us

 

The uMemcpy() routine used generally with this chip switches to a DMA copy when the copy length is greater than 20 bytes; otherwise it performs a standard code copy, since for short copies the DMA configuration overhead cancels the advantage.
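A sketch of that dispatch (not the actual uTasker code; dma_busy() and dma_copy() are hypothetical helpers, stubbed here so the routine always falls through to the plain copy):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helpers; real versions would drive the DMA registers. */
static int  dma_busy(void) { return 1; }  /* stub: channel always busy */
static void dma_copy(void *d, const void *s, size_t n)
{
    (void)d; (void)s; (void)n;            /* stub: would start DMA     */
}

/* Use DMA only when the length justifies the setup overhead (> 20
   bytes in the measurements above) and the channel is free. */
static void *uMemcpy_sketch(void *dst, const void *src, size_t len)
{
    if (len > 20 && !dma_busy())
        dma_copy(dst, src, len);          /* long copy, channel free    */
    else
        memcpy(dst, src, len);            /* short copy or channel busy */
    return dst;
}
```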

 

It does, however, show that the speed improvement is useful on the V2 MCU devices.

 

Regards

 

Mark

 

www.uTasker.com
- OS, TCP/IP stack, USB, device drivers and simulator for M521X, M521XX, M5221X, M5222X, M5223X, M5225X. One package does them all - "Embedding it better..."

 

Message Edited by mjbcswitzerland on 2009-09-17 12:08 AM