AnsweredAssumed Answered

Memory to Memory DMA performance reference (K60 150MHz)

Question asked by Mark Butcher on Jun 16, 2014
Latest reply on Jun 25, 2014 by Hui_Ma

Hi All

 

I have a test were the following 6 operations are performed using standard memcpy() and memset() [working with a byte pointer and a loop]. On a K60 at 150MHz with M2M operations in SRAM.

1. clear 2k bytes of memory using memset() - 84us

2. perform memcpy() of 2k bytes [both arrays are long word aligned] - 109us

3. perform memcpy() of 2k bytes [source array is long word aligned, destination array is odd address aligned 0xXXXX1] - 109us

4. perform memcpy() of 2k bytes [destination array is long word aligned, source array is odd address aligned 0xXXXX1] - 109us

5. perform memcpy() of 2k bytes [source and destination arrays are odd address aligned both 0xXXXX1] - 109us

6. perform memcpy() of 2k bytes [destination and source arrays are on odd addresses - one with 0xXXXX1 and one with 0xXXXX3] - 109us

 

As expected with SW the memset() is faster since it only moves one pointer and all memcpy() variations are the same.

 

Then I tested memset() and memcpy() based on M2M DMA.

 

1. clear 2k bytes of memory using memset() - 14.4us

2. perform memcpy() of 2k bytes [both arrays are long word aligned] - 14.4us

3. perform memcpy() of 2k bytes [source array is long word aligned, destination array is odd address aligned 0xXXXX1] - 55.2us

4. perform memcpy() of 2k bytes [destination array is long word aligned, source array is odd address aligned 0xXXXX1] - 55.6us

5. perform memcpy() of 2k bytes [source and destination arrays are odd address aligned both 0xXXXX1] - 14.4us

6. perform memcpy() of 2k bytes [destination and source arrays are on odd addresses - one with 0xXXXX1 and one with 0xXXXX3] - 28.0us

 

As can be seen, if the source and destination pointers are long word aligned the performance is maximum but in some cases, due to restrictions of DMA address alignment, byte or half-word transfers have to be used instead. I wonder whether there are any techniques to achieve greater performance? As a reference I have done the same test with an STM32F407 at 168MHz and achieved virtually identical results (just all 12% faster due to the +12% higher clock) suggesting that the DMA is already at its limit.

 

Interestingly a pure SW based solution can beat DMA in case 1. If the code is written in assembler to make use of the STMDBCS R0!, {R2-R5} type commands it can save the memset value to multiple registers and then do an instruction based burst write from these registers to SRAM - a time of 8.6us is measured in this case which I don't think that DMA will be able to match (?). This is however only possible for memset() and not memcpy()...or does anyone know how to beat it??

 

Regards

 

Mark

Outcomes