We are developing algorithms with streaming memory access patterns on the LS2085ARDB and consistently see lower throughput than expected. Testing memcpy() throughput on 10MB blocks (using the assembly-optimized memcpy implementation of glibc) revealed a shockingly low 2GB/s on a single A57 core on an otherwise idle system.
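For anyone who wants to reproduce the measurement, here is a minimal sketch of such a memcpy() benchmark in C. The function name memcpy_throughput_gbs and the repetition count are made up for illustration, and it times with clock() (CPU time), which should be close to wall time for a single busy thread; it is not the exact harness we used.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: copies `bytes` bytes `reps` times with memcpy() and
 * returns the throughput in GB/s (1e9 bytes per second).
 * Returns -1.0 on allocation failure and 0.0 if the timer did not advance. */
double memcpy_throughput_gbs(size_t bytes, int reps)
{
    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch both buffers so the pages are mapped before timing starts. */
    memset(src, 0xA5, bytes);
    memset(dst, 0, bytes);

    clock_t t0 = clock();
    for (int i = 0; i < reps; i++) {
        memcpy(dst, src, bytes);
        /* Compiler barrier (GCC/Clang) so the copies are not optimized away. */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    clock_t t1 = clock();

    free(src);
    free(dst);

    double elapsed = (double)(t1 - t0) / (double)CLOCKS_PER_SEC;
    if (elapsed <= 0.0)
        return 0.0;
    return (double)bytes * (double)reps / elapsed / 1e9;
}
```

Calling memcpy_throughput_gbs(10u * 1024 * 1024, 100) corresponds to the 10MB block size mentioned above; the block is much larger than the caches, so the result should mostly reflect DDR throughput.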
We dug further by modifying the glibc memcpy assembly. Specifically, we tried:
* preloading the src and dst memory regions with all possible combinations of prfm l1/l2 - ld/st - keep/strm (+10% throughput)
* using non-temporal loads/stores, ldnp / stnp (no speedup)
* pre-allocating each destination cache line with zeros (dc zva, addressReg) before issuing stores to it, to avoid store misses that load cache lines from DDR only to have them overwritten anyway
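To illustrate the first variant without touching the glibc assembly, the prefetch idea can be sketched in C with the GCC/Clang builtin __builtin_prefetch, which on AArch64 is typically lowered to a prfm instruction. The function name copy_with_prefetch and the PREFETCH_AHEAD distance are assumptions for illustration only; the right distance has to be found by experiment.

```c
#include <stddef.h>
#include <string.h>

#define CACHE_LINE     64   /* cache line size assumed for the A57 */
#define PREFETCH_AHEAD 512  /* hypothetical prefetch distance in bytes, needs tuning */

/* Copies `bytes` bytes one cache line at a time, prefetching the source for
 * reading and the destination for writing a fixed distance ahead. */
void copy_with_prefetch(void *dst, const void *src, size_t bytes)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t off = 0;

    for (; off + CACHE_LINE <= bytes; off += CACHE_LINE) {
        /* Only prefetch addresses that are still inside the buffers. */
        if (off + PREFETCH_AHEAD < bytes) {
            __builtin_prefetch(s + off + PREFETCH_AHEAD, 0, 0); /* read, streaming */
            __builtin_prefetch(d + off + PREFETCH_AHEAD, 1, 0); /* write, streaming */
        }
        memcpy(d + off, s + off, CACHE_LINE);
    }

    /* Copy the tail that is shorter than one cache line. */
    memcpy(d + off, s + off, bytes - off);
}
```

The last argument of __builtin_prefetch selects temporal locality (0 for streaming, 3 for keep), which roughly corresponds to the strm/keep choice listed above.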
... and it seems the write throughput of the A57 cores is the culprit. With prefetch I can easily reach 10GB/s when just loading the data (prfm + ldp); however, issuing only stores (stp) seems to be limited to ~3.5GB/s.
Worse, simply continuously zero-filling cache-lines (implicitly evicting other cache lines) peaks at ~3.85GB/s:
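    dc zva, x6          // zero the block containing address x6 (64 bytes here)
    add x6, x6, #64     // advance to the next cache line
    subs x2, x2, #64    // decrement the remaining byte count and set flags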
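The same zero-fill loop can be sketched in C with inline assembly for easier experimentation. This is only a sketch under assumptions: the names zva_block_size and zero_fill are made up, the block size is read from DCZID_EL0 instead of hard-coding 64, and on non-AArch64 targets or unaligned buffers it simply falls back to memset().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the DC ZVA block size in bytes, or 0 if DC ZVA is unavailable. */
static size_t zva_block_size(void)
{
#if defined(__aarch64__)
    uint64_t dczid;
    __asm__ volatile("mrs %0, dczid_el0" : "=r"(dczid));
    if (dczid & 16)                    /* DZP bit set: DC ZVA is prohibited */
        return 0;
    return (size_t)4 << (dczid & 15);  /* BS field: log2 of block size in words */
#else
    return 0;
#endif
}

/* Zeroes `bytes` bytes at `buf`, using DC ZVA for whole blocks when the
 * buffer is block-aligned and memset() for everything else. */
void zero_fill(void *buf, size_t bytes)
{
    unsigned char *p = buf;
    size_t blk = zva_block_size();

    if (blk == 0 || ((uintptr_t)p % blk) != 0) {
        memset(p, 0, bytes);
        return;
    }

    size_t done = 0;
#if defined(__aarch64__)
    for (; done + blk <= bytes; done += blk)
        __asm__ volatile("dc zva, %0" : : "r"(p + done) : "memory");
#endif
    /* Zero the tail that does not fill a whole block. */
    memset(p + done, 0, bytes - done);
}
```

Timing zero_fill() on a block-aligned buffer much larger than the L2 cache should give a figure comparable to the ~3.85GB/s above, if the loop overhead is negligible.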
Is this low throughput to be expected? I am quite surprised it takes ~4-5 A57 cores issuing nothing but stores to actually saturate the memory bandwidth of the DDR4 interface.
Any ideas or hints regarding the bottleneck in this situation are highly appreciated.
Best regards & thank you in advance, Clemens Eisserer