<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Low write-throughput to DDR on LS2085ARDB in Layerscape</title>
    <link>https://community.nxp.com/t5/Layerscape/Low-write-throughput-to-DDR-on-LS2085ARDB/m-p/475535#M781</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;As the real culprit is write throughput, memset is most likely a better memory-bandwidth-sensitive function to illustrate our issue:&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp; for(int i=0; i &amp;lt; 1000; i++) {&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; memset(dstmem, 5, MEM_SIZE);&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; //memcpy(dstmem, srcmem, MEM_SIZE);&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp; }&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Replacing memcpy with memset, the result is ~3650 MB/s, which matches the results obtained with an assembly loop containing only "STP" instructions.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Just for comparison, I obtained the following results on an AMD embedded x86 platform (DDR3):&lt;/P&gt;&lt;P&gt;memcpy: 5700 MB/s (3.06x)&lt;/P&gt;&lt;P&gt;memset: 7523 MB/s (2.06x)&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Tue, 10 May 2016 05:26:52 GMT</pubDate>
    <dc:creator>clemenseisserer</dc:creator>
    <dc:date>2016-05-10T05:26:52Z</dc:date>
    <item>
      <title>Low write-throughput to DDR on LS2085ARDB</title>
      <link>https://community.nxp.com/t5/Layerscape/Low-write-throughput-to-DDR-on-LS2085ARDB/m-p/475532#M778</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We are developing algorithms with streaming memory access patterns on the LS2085ARDB and consistently see lower throughput than expected. Testing memcpy() throughput on 10 MB blocks (using the assembly-optimized memcpy implementation of glibc) revealed a shockingly low 2 GB/s on a single A57 core on an otherwise idle system.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We dug further by modifying the glibc memcpy assembly. Specifically, we tried:&lt;/P&gt;&lt;P&gt;* preloading the src and dst memory regions with all possible combinations of prfm&lt;SPAN style="font-family: courier new,courier;"&gt; l1/l2 - ld/st - keep/strm&lt;/SPAN&gt;&amp;nbsp; (+10% throughput)&lt;/P&gt;&lt;P&gt;* using non-temporal loads/stores, ldnp / stnp (no speedup)&lt;/P&gt;&lt;P&gt;* pre-allocating each destination cache line with zeros before issuing stores to its address, using &lt;SPAN style="font-family: courier new,courier;"&gt;dc zva, addressReg&lt;/SPAN&gt;, to avoid store misses that load cache lines from DDR only to have them overwritten anyway&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;... and it seems the write throughput of the A57 cores is the culprit. 
With prefetching I can easily reach 10 GB/s when just loading the data (prfm + ldp); however, issuing only stores (stp) seems to be limited to ~3.5 GB/s.&lt;/P&gt;&lt;P&gt;Worse, even continuously zero-filling cache lines (implicitly evicting other cache lines) peaks at only ~3.85 GB/s:&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;1:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp; dc&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; zva, x6&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp; add&amp;nbsp;&amp;nbsp;&amp;nbsp; x6, x6, #64&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp; subs&amp;nbsp;&amp;nbsp;&amp;nbsp; x2, x2, #64&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp; b.ge&amp;nbsp;&amp;nbsp;&amp;nbsp; 1b&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Is this low throughput to be expected? I am quite surprised that it takes ~4-5 A57 cores issuing nothing but stores to actually saturate the memory bandwidth of the DDR4 interface.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Any ideas or hints regarding the bottleneck in this situation are highly appreciated.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Best regards &amp;amp; thank you in advance, Clemens Eisserer&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Mon, 25 Apr 2016 17:27:23 GMT</pubDate>
      <guid>https://community.nxp.com/t5/Layerscape/Low-write-throughput-to-DDR-on-LS2085ARDB/m-p/475532#M778</guid>
      <dc:creator>clemenseisserer</dc:creator>
      <dc:date>2016-04-25T17:27:23Z</dc:date>
    </item>
    <item>
      <title>Re: Low write-throughput to DDR on LS2085ARDB</title>
      <link>https://community.nxp.com/t5/Layerscape/Low-write-throughput-to-DDR-on-LS2085ARDB/m-p/475533#M779</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hello Clemens Eisserer,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'd like to ask several questions about this:&lt;/P&gt;&lt;P&gt;1. memcpy performance for a 10 MB data size, using a single core, is 2 GB/s. Please confirm.&lt;/P&gt;&lt;P&gt;2. Was A57 write streaming ON or OFF during the test?&lt;/P&gt;&lt;P&gt;3. Was prefetching ON or OFF during the test?&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Lunmin&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 29 Apr 2016 02:46:00 GMT</pubDate>
      <guid>https://community.nxp.com/t5/Layerscape/Low-write-throughput-to-DDR-on-LS2085ARDB/m-p/475533#M779</guid>
      <dc:creator>lunminliang</dc:creator>
      <dc:date>2016-04-29T02:46:00Z</dc:date>
    </item>
    <item>
      <title>Re: Low write-throughput to DDR on LS2085ARDB</title>
      <link>https://community.nxp.com/t5/Layerscape/Low-write-throughput-to-DDR-on-LS2085ARDB/m-p/475534#M780</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;1. Exactly, memcpy for 10 MB memory regions results in 1860 MB/s. The snippet at the bottom takes 5.5 s.&lt;/P&gt;&lt;P&gt;2. / 3. We are using the defaults set when running LS2085A-EAR6; we did not modify the prefetch/stream settings.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks &amp;amp; br, Clemens Eisserer&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Test result:&lt;/P&gt;&lt;P&gt;root@ls2085ardb:~# gcc -std=c99 -O3 memcpy_test.c -o memcpytest&lt;/P&gt;&lt;P&gt;root@ls2085ardb:~# time ./memcpytest&lt;/P&gt;&lt;P&gt;real&amp;nbsp;&amp;nbsp;&amp;nbsp; 0m5.438s&lt;/P&gt;&lt;P&gt;user&amp;nbsp;&amp;nbsp;&amp;nbsp; 0m5.430s&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Test source code:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;#include &amp;lt;stdlib.h&amp;gt;&lt;/P&gt;&lt;P&gt;#include &amp;lt;string.h&amp;gt;&lt;/P&gt;&lt;P&gt;#include &amp;lt;stdint.h&amp;gt;&lt;/P&gt;&lt;P&gt;#include &amp;lt;sys/mman.h&amp;gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;#define MEM_SIZE (10*1024*1024)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;int main(int argc, char** argv) {&lt;/P&gt;&lt;P&gt;&amp;nbsp; uint8_t* srcmem = malloc(MEM_SIZE);&lt;/P&gt;&lt;P&gt;&amp;nbsp; uint8_t* dstmem = malloc(MEM_SIZE);&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp; // lock memory to avoid page faults&lt;/P&gt;&lt;P&gt;&amp;nbsp; mlockall(MCL_CURRENT);&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp; for(int i=0; i &amp;lt; 1000; i++) {&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; memcpy(dstmem, srcmem, MEM_SIZE);&lt;/P&gt;&lt;P&gt;&amp;nbsp; }&lt;/P&gt;&lt;P&gt;}&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 04 May 2016 09:21:05 GMT</pubDate>
      <guid>https://community.nxp.com/t5/Layerscape/Low-write-throughput-to-DDR-on-LS2085ARDB/m-p/475534#M780</guid>
      <dc:creator>clemenseisserer</dc:creator>
      <dc:date>2016-05-04T09:21:05Z</dc:date>
    </item>
    <item>
      <title>Re: Low write-throughput to DDR on LS2085ARDB</title>
      <link>https://community.nxp.com/t5/Layerscape/Low-write-throughput-to-DDR-on-LS2085ARDB/m-p/475535#M781</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;As the real culprit is write throughput, memset is most likely a better memory-bandwidth-sensitive function to illustrate our issue:&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp; for(int i=0; i &amp;lt; 1000; i++) {&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; memset(dstmem, 5, MEM_SIZE);&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; //memcpy(dstmem, srcmem, MEM_SIZE);&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp; }&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Replacing memcpy with memset, the result is ~3650 MB/s, which matches the results obtained with an assembly loop containing only "STP" instructions.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Just for comparison, I obtained the following results on an AMD embedded x86 platform (DDR3):&lt;/P&gt;&lt;P&gt;memcpy: 5700 MB/s (3.06x)&lt;/P&gt;&lt;P&gt;memset: 7523 MB/s (2.06x)&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Tue, 10 May 2016 05:26:52 GMT</pubDate>
      <guid>https://community.nxp.com/t5/Layerscape/Low-write-throughput-to-DDR-on-LS2085ARDB/m-p/475535#M781</guid>
      <dc:creator>clemenseisserer</dc:creator>
      <dc:date>2016-05-10T05:26:52Z</dc:date>
    </item>
    <item>
      <title>Re: Low write-throughput to DDR on LS2085ARDB</title>
      <link>https://community.nxp.com/t5/Layerscape/Low-write-throughput-to-DDR-on-LS2085ARDB/m-p/475536#M782</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hello Clemens Eisserer,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Sorry for the late reply.&lt;/P&gt;&lt;P&gt;Feedback from the performance team is that around 2 GB/s is about the expected performance for memcpy.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;&lt;P&gt;Lunmin&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Mon, 23 May 2016 09:55:40 GMT</pubDate>
      <guid>https://community.nxp.com/t5/Layerscape/Low-write-throughput-to-DDR-on-LS2085ARDB/m-p/475536#M782</guid>
      <dc:creator>lunminliang</dc:creator>
      <dc:date>2016-05-23T09:55:40Z</dc:date>
    </item>
  </channel>
</rss>

