Hello,
We are developing algorithms with streaming memory access patterns on the LS2085ARDB and consistently see lower throughput than expected. Testing memcpy() throughput on 10 MB blocks (using the assembly-optimized memcpy implementation from glibc) revealed a shockingly low 2 GB/s on a single A57 core on an otherwise idle system.
We dug further by modifying the glibc memcpy assembly; specifically, we tried to (a simplified sketch of such a modified loop follows the list):
* preload the src and dst memory regions with all possible combinations of prfm l1/l2 - ld/st - keep/strm (+10% throughput)
* use non-temporal loads/stores ldnp / stnp (no speedup)
* pre-allocate each destination cache line with zeros (dc zva, addressReg) before issuing stores to it, to avoid store misses that cause cache lines to be loaded from DDR only to be overwritten later anyway
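For illustration, here is a minimal, simplified sketch of the kind of modified inner loop this amounts to. It is not the actual glibc code; copy_prfm_nontemporal is a hypothetical name, and it assumes 16-byte-aligned pointers, a length that is a multiple of 64, and an arbitrary prefetch distance of 256 bytes:

#include <stddef.h>
#include <stdint.h>

/* Sketch only: prefetch the source ahead of the loads and move one
 * 64-byte cache line per iteration using non-temporal pairs. */
static void copy_prfm_nontemporal(uint8_t *dst, const uint8_t *src, size_t len)
{
    for (size_t off = 0; off < len; off += 64) {
        uint64_t a, b, c, d, e, f, g, h;
        __asm__ volatile(
            "prfm pldl1strm, [%8, #256]\n\t" /* prefetch ~4 lines ahead, streaming hint */
            "ldnp %0, %1, [%8]\n\t"          /* non-temporal 16-byte load pairs */
            "ldnp %2, %3, [%8, #16]\n\t"
            "ldnp %4, %5, [%8, #32]\n\t"
            "ldnp %6, %7, [%8, #48]\n\t"
            "stnp %0, %1, [%9]\n\t"          /* non-temporal 16-byte store pairs */
            "stnp %2, %3, [%9, #16]\n\t"
            "stnp %4, %5, [%9, #32]\n\t"
            "stnp %6, %7, [%9, #48]\n\t"
            : "=&r"(a), "=&r"(b), "=&r"(c), "=&r"(d),
              "=&r"(e), "=&r"(f), "=&r"(g), "=&r"(h)
            : "r"(src + off), "r"(dst + off)
            : "memory");
    }
}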
... and it seems the write throughput of the A57 cores is the culprit. With prefetching I can easily reach 10 GB/s when just loading the data (prfm + ldp); however, issuing only stores (stp) seems to be limited to ~3.5 GB/s.
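A minimal sketch of what such a load-only probe can look like (my simplified illustration, not the exact loop used; assumes a 16-byte-aligned src and a length that is a multiple of 32):

#include <stddef.h>
#include <stdint.h>

/* Load-only bandwidth probe: prefetch ahead, then ldp two 16-byte
 * pairs per iteration; the loaded values are simply discarded. */
static void load_only(const uint8_t *src, size_t len)
{
    for (size_t off = 0; off < len; off += 32) {
        uint64_t a, b;
        __asm__ volatile(
            "prfm pldl1keep, [%2, #256]\n\t" /* prefetch ahead of the loads */
            "ldp %0, %1, [%2]\n\t"
            "ldp %0, %1, [%2, #16]\n\t"
            : "=&r"(a), "=&r"(b)
            : "r"(src + off));
    }
}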
Worse, simply zero-filling cache lines continuously (implicitly evicting other cache lines) peaks at ~3.85 GB/s:
1:                      // zero one 64-byte cache line per iteration
    dc zva, x6          // zero the cache line at [x6]
    add x6, x6, #64     // advance to the next cache line
    subs x2, x2, #64    // x2 = remaining byte count
    b.gt 1b             // loop while bytes remain
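As an aside, the block size that dc zva actually zeroes can be confirmed from user space via the DCZID_EL0 register; a small sketch of such a check, which on the A57 should report 64 bytes:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t dczid;
    __asm__("mrs %0, dczid_el0" : "=r"(dczid));
    if (dczid & 0x10)                      /* DZP bit: dc zva prohibited */
        printf("dc zva not permitted\n");
    else
        printf("zva block size: %u bytes\n", 4u << (dczid & 0xf));
    return 0;
}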
Is this low throughput to be expected? I am quite surprised that it takes ~4-5 A57 cores issuing nothing but stores to actually saturate the memory bandwidth of the DDR4 interface.
Any ideas or hints regarding the bottleneck in this situation are highly appreciated.
Best regards & thank you in advance, Clemens Eisserer
Hello Clemens Eisserer,
I'd like to ask several questions about this:
1. The performance of memcpy for a 10 MB data size, using a single core, is 2 GB/s. Please confirm.
2. Was A57 write streaming ON or OFF during the test?
3. Was Prefetching ON or OFF during the test?
Thanks.
Lunmin
1. Exactly: memcpy on 10 MB memory regions results in 1860 MB/s. The snippet at the bottom takes 5.5 s.
2. / 3. We are using the defaults set when running LS2085A-EAR6; we did not modify the prefetch/streaming settings.
Thanks & br, Clemens Eisserer
Test result
root@ls2085ardb:~# gcc -std=c99 -O3 memcpy_test.c -o memcpytest
root@ls2085ardb:~# time ./memcpytest
real 0m5.438s
user 0m5.430s
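(That works out to 1000 iterations × 10 MiB ≈ 10000 MiB copied in 5.438 s, i.e. roughly 1.8 GB/s.)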
Test source code:
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <sys/mman.h>

#define MEM_SIZE (10*1024*1024)

int main(int argc, char** argv) {
    uint8_t* srcmem = malloc(MEM_SIZE);
    uint8_t* dstmem = malloc(MEM_SIZE);

    // lock memory to avoid page faults during the timed loop
    mlockall(MCL_CURRENT);

    // copy 10 MB x 1000 iterations = ~10 GB in total
    for (int i = 0; i < 1000; i++) {
        memcpy(dstmem, srcmem, MEM_SIZE);
    }
    return 0;
}
Hello Clemens Eisserer,
Sorry for the late reply.
Feedback from the performance team is that around 2 GB/s is about the expected performance for memcpy.
Regards
Lunmin
As the real culprit is write throughput, memset is most likely a better memory-bandwidth-sensitive function to illustrate our issue:
for (int i = 0; i < 1000; i++) {
    memset(dstmem, 5, MEM_SIZE);
    //memcpy(dstmem, srcmem, MEM_SIZE);
}
Replacing memcpy with memset, the result is ~3650 MB/s, which matches the results obtained with an assembly loop containing only STP instructions (see the sketch below).
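For reference, a minimal sketch of the shape of that STP-only store loop (my simplified illustration; assumes a 16-byte-aligned dst and a length that is a multiple of 32):

#include <stddef.h>
#include <stdint.h>

/* Store-only bandwidth probe: nothing but 16-byte stp stores of zero. */
static void store_stp_only(uint8_t *dst, size_t len)
{
    for (size_t off = 0; off < len; off += 32) {
        __asm__ volatile(
            "stp xzr, xzr, [%0]\n\t"
            "stp xzr, xzr, [%0, #16]\n\t"
            : /* no outputs */
            : "r"(dst + off)
            : "memory");
    }
}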
Just for comparison, I obtained the following results on an AMD embedded x86 platform (DDR3):
memcpy: 5700 MB/s (3.06x)
memset: 7523 MB/s (2.06x)