Low write-throughput to DDR on LS2085ARDB

Question asked by Clemens Eisserer on Apr 25, 2016
Latest reply on May 23, 2016 by lunminliang



We are developing algorithms with streaming memory access patterns on the LS2085ARDB and consistently see lower throughput than expected. Testing memcpy() throughput on 10MB blocks (using glibc's assembly-optimized memcpy implementation) revealed a surprisingly low 2GB/s on a single A57 core of an otherwise idle system.
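For anyone who wants to reproduce the number, a measurement of this kind can be sketched roughly as follows. This is not our exact test harness: the helper name measure_memcpy_gbps, the repetition count and the interpretation of 10MB as 10 * 1024 * 1024 bytes are assumptions for illustration, and the empty inline asm is a GCC/Clang-specific compiler barrier.

```c
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: copies `bytes` bytes `reps` times and returns the
   observed memcpy() throughput in GB/s (1 GB = 1e9 bytes), or -1.0 if the
   buffers could not be allocated. */
static double measure_memcpy_gbps(size_t bytes, int reps)
{
    assert(bytes > 0 && reps > 0);

    char *src = malloc(bytes);
    char *dst = malloc(bytes);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch both buffers so page faults do not end up in the timed region. */
    memset(src, 0x5a, bytes);
    memset(dst, 0x00, bytes);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++) {
        memcpy(dst, src, bytes);
        /* Compiler barrier: keeps the copy from being optimized away. */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;

    free(src);
    free(dst);
    return (double)bytes * (double)reps / secs / 1e9;
}
```

Calling measure_memcpy_gbps((size_t)10 * 1024 * 1024, 100) on an otherwise idle core gives one figure; note that it counts bytes copied, so the actual DDR traffic (reads plus writes) is at least twice as high.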


We dug deeper by modifying the glibc memcpy assembly. Specifically, we tried:

* preloading the src and dst memory regions with all possible combinations of prfm pld/pst - l1/l2 - keep/strm (+10% throughput)

* using non-temporal loads/stores, stnp / ldnp (no speedup)

* pre-allocating each destination cache line with zeros (dc zva, addressReg) before issuing stores to its address, to avoid store misses that cause cache lines to be loaded from DDR only to be overwritten later anyway


... and it seems the write throughput of the A57 cores is the culprit. With prefetching I can easily reach 10GB/s when only loading the data (prfm + ldp), whereas issuing only stores (stp) seems to be limited to ~3.5GB/s.
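To separate the two directions without hand-written assembly, a portable approximation of the load-only and store-only loops might look like the sketch below. It is not the prfm + ldp / stp code we actually measured: the function names are made up for illustration, the compiler decides which instructions get emitted, and the empty inline asm statements are GCC/Clang-specific barriers.

```c
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

static volatile uint64_t load_sink;   /* keeps the load result observable */

static double seconds_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical helper: reads `words` 64-bit words `reps` times, returns GB/s. */
static double load_only_gbps(const uint64_t *buf, size_t words, int reps)
{
    assert(words > 0 && reps > 0);
    uint64_t acc = 0;
    double t0 = seconds_now();
    for (int r = 0; r < reps; r++) {
        for (size_t i = 0; i < words; i++)
            acc ^= buf[i];
        /* Barrier: forces each pass over the buffer to really happen. */
        __asm__ volatile("" : "+r"(acc) : : "memory");
    }
    double secs = seconds_now() - t0;
    load_sink = acc;
    return (double)words * sizeof(uint64_t) * (double)reps / secs / 1e9;
}

/* Hypothetical helper: writes `words` 64-bit words `reps` times, returns GB/s. */
static double store_only_gbps(uint64_t *buf, size_t words, int reps)
{
    assert(words > 0 && reps > 0);
    double t0 = seconds_now();
    for (int r = 0; r < reps; r++) {
        for (size_t i = 0; i < words; i++)
            buf[i] = 0x0123456789abcdefULL;   /* non-repeating byte pattern */
        /* Barrier: prevents the stores from being eliminated as dead. */
        __asm__ volatile("" : : "r"(buf) : "memory");
    }
    double secs = seconds_now() - t0;
    return (double)words * sizeof(uint64_t) * (double)reps / secs / 1e9;
}
```

The buffer should be much larger than the last-level cache and be written once before the load test, otherwise the loop measures cache bandwidth or reads from untouched zero pages rather than DDR.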

Worse, simply zero-filling cache lines continuously (which implicitly evicts other cache lines) peaks at ~3.85GB/s:


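1:
  dc      zva, x6            // zero the block at address x6 without reading it from DDR
  add     x6, x6, #64        // advance to the next 64-byte cache line
  subs    x2, x2, #64        // decrement the remaining byte count
  b.gt    1b                 // loop while bytes remain (branch condition assumed; it was lost from the post)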
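For reference, the same zero-fill loop can be driven from C roughly as sketched below. The function names are hypothetical. The sketch reads DCZID_EL0 instead of hard-coding 64 bytes, because the DC ZVA block size is implementation-defined and the instruction can be prohibited via the DZP bit; on non-AArch64 builds, or when the alignment does not fit, it simply falls back to memset().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the DC ZVA block size in bytes, or 0 if DC ZVA cannot be used
   (DCZID_EL0.DZP is set, or this is not an AArch64 build). */
static size_t dc_zva_block_size(void)
{
#if defined(__aarch64__)
    uint64_t dczid;
    __asm__ volatile("mrs %0, dczid_el0" : "=r"(dczid));
    if (dczid & (1u << 4))               /* DZP: DC ZVA is prohibited */
        return 0;
    return (size_t)4 << (dczid & 0xf);   /* BS is log2 of the size in 4-byte words */
#else
    return 0;
#endif
}

/* Hypothetical helper: zero `bytes` bytes at `dst`, using DC ZVA when the
   region is block-aligned and falling back to memset() otherwise. */
static void zero_fill(void *dst, size_t bytes)
{
    assert(dst != NULL);
    size_t blk = dc_zva_block_size();
    if (blk == 0 || (uintptr_t)dst % blk != 0 || bytes % blk != 0) {
        memset(dst, 0, bytes);           /* portable fallback */
        return;
    }
#if defined(__aarch64__)
    char *p = dst;
    char *end = p + bytes;
    for (; p < end; p += blk)
        __asm__ volatile("dc zva, %0" : : "r"(p) : "memory");
#endif
}
```

Timing zero_fill() over a buffer much larger than the L2 cache should then show whether the C-level loop hits the same ceiling as the assembly version.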


Is this low throughput to be expected? I am quite surprised that it takes ~4-5 A57 cores issuing nothing but stores to actually saturate the memory bandwidth of the DDR4 interface.


Any ideas or hints regarding the bottleneck in this situation are highly appreciated.


Best regards & thank you in advance, Clemens Eisserer