Hello,
We are developing algorithms with streaming memory access patterns on the LS2085ARDB and consistently see lower throughput than expected. Testing memcpy() throughput on 10 MB blocks (using the assembly-optimized memcpy implementation from glibc) revealed a shockingly low 2 GB/s on a single A57 core on an otherwise idle system.
We dug further by modifying the glibc memcpy assembly; specifically, we tried to (a simplified sketch of such a modified loop follows the list):
* preload the src and dst memory regions with all possible combinations of prfm l1/l2 - ld/st - keep/strm (+10% throughput)
* use non-temporal loads/stores ldnp / stnp (no speedup)
* pre-allocate each destination cache line with zeros (dc zva, addressReg) before issuing stores to it, to avoid store misses that cause cache lines to be loaded from DDR only to be overwritten later anyway
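For illustration, here is a minimal, simplified sketch of the kind of modified inner loop this amounts to. It is not the actual glibc code; copy_prfm_nontemporal is a hypothetical name, and it assumes 16-byte-aligned pointers, a length that is a multiple of 64, and an arbitrary prefetch distance of 256 bytes:

#include <stddef.h>
#include <stdint.h>

/* Sketch only: prefetch the source ahead of the loads and move one
 * 64-byte cache line per iteration using non-temporal pairs. */
static void copy_prfm_nontemporal(uint8_t *dst, const uint8_t *src, size_t len)
{
    for (size_t off = 0; off < len; off += 64) {
        uint64_t a, b, c, d, e, f, g, h;
        __asm__ volatile(
            "prfm pldl1strm, [%8, #256]\n\t" /* prefetch ~4 lines ahead, streaming hint */
            "ldnp %0, %1, [%8]\n\t"          /* non-temporal 16-byte load pairs */
            "ldnp %2, %3, [%8, #16]\n\t"
            "ldnp %4, %5, [%8, #32]\n\t"
            "ldnp %6, %7, [%8, #48]\n\t"
            "stnp %0, %1, [%9]\n\t"          /* non-temporal 16-byte store pairs */
            "stnp %2, %3, [%9, #16]\n\t"
            "stnp %4, %5, [%9, #32]\n\t"
            "stnp %6, %7, [%9, #48]\n\t"
            : "=&r"(a), "=&r"(b), "=&r"(c), "=&r"(d),
              "=&r"(e), "=&r"(f), "=&r"(g), "=&r"(h)
            : "r"(src + off), "r"(dst + off)
            : "memory");
    }
}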
... and it seems the write throughput of the A57 cores is the culprit. With prefetching I can easily reach 10 GB/s when just loading the data (prfm + ldp); however, issuing only stores (stp) seems to be limited to ~3.5 GB/s.
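A minimal sketch of what such a load-only probe can look like (my simplified illustration, not the exact loop used; assumes a 16-byte-aligned src and a length that is a multiple of 32):

#include <stddef.h>
#include <stdint.h>

/* Load-only bandwidth probe: prefetch ahead, then ldp two 16-byte
 * pairs per iteration; the loaded values are simply discarded. */
static void load_only(const uint8_t *src, size_t len)
{
    for (size_t off = 0; off < len; off += 32) {
        uint64_t a, b;
        __asm__ volatile(
            "prfm pldl1keep, [%2, #256]\n\t" /* prefetch ahead of the loads */
            "ldp %0, %1, [%2]\n\t"
            "ldp %0, %1, [%2, #16]\n\t"
            : "=&r"(a), "=&r"(b)
            : "r"(src + off));
    }
}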
Worse, simply zero-filling cache lines continuously (implicitly evicting other cache lines) peaks at ~3.85 GB/s:
1:                      // zero one 64-byte cache line per iteration
    dc zva, x6          // zero the cache line at [x6]
    add x6, x6, #64     // advance to the next cache line
    subs x2, x2, #64    // x2 = remaining byte count
    b.gt 1b             // loop while bytes remain
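As an aside, the block size that dc zva actually zeroes can be confirmed from user space via the DCZID_EL0 register; a small sketch of such a check, which on the A57 should report 64 bytes:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t dczid;
    __asm__("mrs %0, dczid_el0" : "=r"(dczid));
    if (dczid & 0x10)                      /* DZP bit: dc zva prohibited */
        printf("dc zva not permitted\n");
    else
        printf("zva block size: %u bytes\n", 4u << (dczid & 0xf));
    return 0;
}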
Is this low throughput to be expected? I am quite surprised that it takes ~4-5 A57 cores issuing nothing but stores to actually saturate the memory bandwidth of the DDR4 interface.
Any ideas or hints regarding the bottleneck in this situation are highly appreciated.
Best regards & thank you in advance, Clemens Eisserer
Hello Clemens Eisserer,
I'd like to ask several questions about this:
1. The performance of memcpy for a 10 MB data size, using a single core, is 2 GB/s. Please confirm.
2. Was A57 write streaming ON or OFF during the test?
3. Was Prefetching ON or OFF during the test?
Thanks.
Lunmin
1. Exactly: memcpy on 10 MB memory regions results in 1860 MB/s. The snippet at the bottom takes 5.5 s.
2. / 3. We are using the defaults set when running LS2085A-EAR6; we did not modify the prefetch/streaming settings.
Thanks & br, Clemens Eisserer
Test result
root@ls2085ardb:~# gcc -std=c99 -O3 memcpy_test.c -o memcpytest
root@ls2085ardb:~# time ./memcpytest
real 0m5.438s
user 0m5.430s
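(That works out to 1000 iterations × 10 MiB ≈ 10000 MiB copied in 5.438 s, i.e. roughly 1.8 GB/s.)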
Test source code:
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <sys/mman.h>

#define MEM_SIZE (10*1024*1024)

int main(int argc, char** argv) {
    uint8_t* srcmem = malloc(MEM_SIZE);
    uint8_t* dstmem = malloc(MEM_SIZE);

    // lock memory to avoid page faults during the timed loop
    mlockall(MCL_CURRENT);

    // copy 10 MB x 1000 iterations = ~10 GB in total
    for (int i = 0; i < 1000; i++) {
        memcpy(dstmem, srcmem, MEM_SIZE);
    }
    return 0;
}
Hello Clemens Eisserer,
Sorry for the late reply.
Feedback from the performance team is that around 2 GB/s is about the expected performance for memcpy.
Regards
Lunmin
As the real culprit is write throughput, memset is most likely a better memory-bandwidth-sensitive function to illustrate our issue:
for (int i = 0; i < 1000; i++) {
    memset(dstmem, 5, MEM_SIZE);
    //memcpy(dstmem, srcmem, MEM_SIZE);
}
Replacing memcpy with memset, the result is ~3650 MB/s, which matches the results obtained with an assembly loop containing only STP instructions (see the sketch below).
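For reference, a minimal sketch of the shape of that STP-only store loop (my simplified illustration; assumes a 16-byte-aligned dst and a length that is a multiple of 32):

#include <stddef.h>
#include <stdint.h>

/* Store-only bandwidth probe: nothing but 16-byte stp stores of zero. */
static void store_stp_only(uint8_t *dst, size_t len)
{
    for (size_t off = 0; off < len; off += 32) {
        __asm__ volatile(
            "stp xzr, xzr, [%0]\n\t"
            "stp xzr, xzr, [%0, #16]\n\t"
            : /* no outputs */
            : "r"(dst + off)
            : "memory");
    }
}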
Just for comparison, I obtained the following results on an AMD embedded x86 platform (DDR3):
memcpy: 5700 MB/s (3.06x)
memset: 7523 MB/s (2.06x)