Low write-throughput to DDR on LS2085ARDB


1,754 Views
clemenseisserer
Contributor II

Hello,

We are developing algorithms with streaming memory access patterns on the LS2085ARDB and consistently see lower throughput than expected. Testing memcpy() throughput on 10 MB blocks (using the assembly-optimized memcpy implementation of glibc) revealed a shockingly low 2 GB/s on a single A57 core on an otherwise idle system.

We dug further by modifying the glibc memcpy assembly; specifically, we tried:

* preloading the src and dst memory regions with all possible combinations of prfm l1/l2 - ld/st - keep/strm (+10% throughput)

* using non-temporal loads/stores, stnp / ldnp (no speedup)

* pre-allocating each destination cache line with zeros before issuing stores to it, using dc zva, addressReg. This avoids store misses, which would otherwise load cache lines from DDR that are overwritten later anyway.
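For anyone who wants to try the prefetch variants from the first bullet without editing the glibc assembly, here is a minimal C sketch. It is not the code used in this thread: copy_with_prefetch, CACHE_LINE and PREFETCH_AHEAD are hypothetical names and values, and it relies on the GCC/Clang builtin __builtin_prefetch, whose second argument selects load (0) or store (1) and whose third selects temporal locality (0 = streaming, 3 = keep). Which prfm variant ends up in the binary is decided by the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE     64   /* assumed line size; matches the #64 stride used in this thread */
#define PREFETCH_AHEAD 512  /* hypothetical prefetch distance in bytes; needs tuning */

/* Copy n bytes in cache-line chunks, prefetching ahead in both buffers. */
static void copy_with_prefetch(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    size_t i = 0;
    for (; i + CACHE_LINE <= n; i += CACHE_LINE) {
        if (i + PREFETCH_AHEAD < n) {
            /* rw = 0: prefetch for load; locality 0: streaming data */
            __builtin_prefetch(src + i + PREFETCH_AHEAD, 0, 0);
            /* rw = 1: prefetch for store; locality 3: keep in cache */
            __builtin_prefetch(dst + i + PREFETCH_AHEAD, 1, 3);
        }
        memcpy(dst + i, src + i, CACHE_LINE);
    }
    memcpy(dst + i, src + i, n - i);  /* copy the tail shorter than one line */
}
```

Changing the two constant arguments of each __builtin_prefetch call is the C-level counterpart of trying the different ld/st and keep/strm combinations.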

... and it seems write-throughput of the A57-cores is the culprit. With prefetch I can easily reach 10GB/s when just loading the data (prfm + ldp), however issuing only stores (stp) seems to be limited to ~3.5GB/s.
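The load-only versus store-only comparison can be reproduced roughly from C. The following is a sketch under assumptions, not the assembly loops measured above: load_only_rate, store_only_rate and now_seconds are hypothetical helpers, memset stands in for the stp loop, and the empty asm statements are GCC/Clang compiler barriers that only keep the optimizer from dropping repeated passes.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Read every 64-bit word of buf reps times; returns bytes per second. */
static double load_only_rate(const uint64_t *buf, size_t words, int reps,
                             uint64_t *sink)
{
    assert(buf != NULL && sink != NULL);
    uint64_t sum = 0;
    double t0 = now_seconds();
    for (int r = 0; r < reps; r++) {
        for (size_t i = 0; i < words; i++)
            sum += buf[i];
        __asm__ volatile("" : : "r"(buf) : "memory");  /* force a re-read per pass */
    }
    double t1 = now_seconds();
    *sink = sum;  /* keep the loads observable */
    return (double)words * sizeof(uint64_t) * reps / (t1 - t0);
}

/* Overwrite the buffer reps times; returns bytes per second. */
static double store_only_rate(void *buf, size_t bytes, int reps)
{
    assert(buf != NULL);
    double t0 = now_seconds();
    for (int r = 0; r < reps; r++) {
        memset(buf, 5, bytes);
        __asm__ volatile("" : : "r"(buf) : "memory");  /* keep every pass */
    }
    double t1 = now_seconds();
    return (double)bytes * reps / (t1 - t0);
}
```

To be meaningful, the buffer has to be much larger than the last-level cache (for example the 10 MB blocks used in this thread); otherwise the numbers reflect cache bandwidth, not DDR bandwidth.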

Worse, simply continuously zero-filling cache-lines (implicitly evicting other cache lines) peaks at ~3.85GB/s:

1:
  dc      zva, x6
  add     x6, x6, #64
  subs    x2, x2, #64
  b.ge    1b
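The loop above hard-codes a 64-byte step. On AArch64 the block size zeroed by dc zva is reported in DCZID_EL0 (bits [3:0] hold log2 of the size in 4-byte words, bit 4 set means dc zva is prohibited), so a more defensive zero-fill might look like the sketch below. The helper names are hypothetical, and on non-AArch64 builds it simply falls back to memset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode a DCZID_EL0 value: block size in bytes, or 0 if dc zva is prohibited. */
static size_t dc_zva_block_bytes(uint64_t dczid)
{
    if (dczid & (1u << 4))             /* DZP bit */
        return 0;
    return (size_t)4 << (dczid & 0xf); /* BS field: log2(words), 4 bytes per word */
}

#if defined(__aarch64__)
static uint64_t read_dczid(void)
{
    uint64_t v;
    __asm__ volatile("mrs %0, dczid_el0" : "=r"(v));
    return v;
}
static void dc_zva(void *p)
{
    __asm__ volatile("dc zva, %0" : : "r"(p) : "memory");
}
#else
/* Not AArch64: report dc zva as unavailable so only memset is used. */
static uint64_t read_dczid(void) { return 1u << 4; }
static void dc_zva(void *p) { (void)p; }
#endif

/* Zero n bytes at p, using dc zva for whole aligned blocks where possible. */
static void zero_fill(void *p, size_t n)
{
    assert(p != NULL);
    uint8_t *c = p;
    size_t blk = dc_zva_block_bytes(read_dczid());
    if (blk != 0 && (uintptr_t)c % blk == 0) {
        while (n >= blk) {
            dc_zva(c);
            c += blk;
            n -= blk;
        }
    }
    memset(c, 0, n);  /* remainder, or everything if dc zva was not used */
}
```

A BS value of 4 decodes to 64 bytes, which is the step the assembly loop assumes.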

Is this low throughput to be expected? I am quite surprised that it takes ~4-5 A57 cores issuing nothing but stores to actually saturate the memory bandwidth of the DDR4 interface.

Any ideas or hints regarding the bottleneck in this situation are highly appreciated.

Best regards & thank you in advance, Clemens Eisserer

4 Replies

1,434 Views
lunminliang
NXP Employee

Hello Clemens Eisserer,

I'd like to ask several questions on this:

1. The memcpy performance for a 10 MB data size on a single core is 2 GB/s. Please confirm.

2. Was A57 write streaming ON or OFF during the test?

3. Was Prefetching ON or OFF during the test?

Thanks.

Lunmin


1,434 Views
clemenseisserer
Contributor II

1. Exactly, memcpy on 10 MB memory regions yields 1860 MB/s. The snippet at the bottom takes 5.5 s.

2. / 3. We are using the defaults set when running LS2085A-EAR6; we did not modify the prefetch or write-streaming settings.

Thanks & br, Clemens Eisserer

Test result:

root@ls2085ardb:~# gcc -std=c99 -O3 memcpy_test.c -o memcpytest
root@ls2085ardb:~# time ./memcpytest
real    0m5.438s
user    0m5.430s

Test source code:

#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <sys/mman.h>

#define MEM_SIZE (10*1024*1024)

int main(int argc, char** argv) {
  uint8_t* srcmem = malloc(MEM_SIZE);
  uint8_t* dstmem = malloc(MEM_SIZE);

  //lock memory to avoid page faults
  mlockall(MCL_CURRENT);

  for(int i=0; i < 1000; i++) {
    memcpy(dstmem, srcmem, MEM_SIZE);
  }
}
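To read a throughput figure directly instead of deriving it from the output of time, the copy loop can be timed in the program itself. This is only a sketch: timed_memcpy and mib_per_second are hypothetical helpers that are not part of the test above, and the empty asm statement is a GCC/Clang compiler barrier. With the 5.438 s measured above, 1000 copies of 10 MiB work out to roughly 1839 MiB/s.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Convert a byte count and an elapsed time into MiB per second. */
static double mib_per_second(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / (1024.0 * 1024.0) / seconds;
}

/* Run memcpy reps times and return the elapsed wall-clock time in seconds. */
static double timed_memcpy(void *dst, const void *src, size_t size, int reps)
{
    assert(dst != NULL && src != NULL);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++) {
        memcpy(dst, src, size);
        /* compiler barrier so that no copy is optimized away */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}
```

In the test program this would be called as timed_memcpy(dstmem, srcmem, MEM_SIZE, 1000), and the result passed to mib_per_second together with 1000.0 * MEM_SIZE. Note that this counts bytes copied; each copied byte is both read and written, so the traffic seen by the memory system is about twice that.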


1,434 Views
lunminliang
NXP Employee

Hello Clemens Eisserer,

Sorry for the late reply.

Feedback from the performance team is that around 2 GB/s is about the expected performance for memcpy.

Regards

Lunmin


1,434 Views
clemenseisserer
Contributor II

As the real culprit is write throughput, memset is most likely a better memory-bandwidth-sensitive function to illustrate our issue:

  for(int i=0; i < 1000; i++) {
    memset(dstmem, 5, MEM_SIZE);
    //memcpy(dstmem, srcmem, MEM_SIZE);
  }

Replacing memcpy with memset, the result is ~3650 MB/s, which matches the results obtained with an assembly loop containing only STP instructions.

Just for comparison, I obtained the following results on an AMD embedded x86 platform (DDR3):

memcpy: 5700 MB/s (3.06x the LS2085ARDB result)

memset: 7523 MB/s (2.06x the LS2085ARDB result)
