We're currently porting Linux, Android and Windows CE BSPs to our own i.MX6-based board series, which can be equipped with either i.MX6Q, i.MX6D, i.MX6DL or i.MX6S SoCs and with either 32-bit or 64-bit DDR3 RAM in DDR3-800 and DDR3-1066 configurations, connected to a single chip-select. All combinations realized so far (i.MX6Q and i.MX6D with 64-bit DDR3-1066, and i.MX6S with 32-bit DDR3-800) work fine: every memory configuration runs stable, the entire memory can be accessed properly, and Freescale's DDR3 Stress Test tool, our more elaborate custom RAM tests and all three operating systems run without problems. Nevertheless, we can't find any performance difference between the 64-bit and 32-bit DDR3 configurations.
For benchmarking we use hand-tuned assembler memset() and memcpy() functions that use only ARMv7 integer code (no NEON). On Freescale's i.MX53 QSB board and our own i.MX53 board designs with 32-bit DDR3-800 RAM, which we used as a reference for our i.MX6 design, these match the theoretical bandwidth values quite nicely: while DDR3-800 provides a (very) theoretical bandwidth of 3.2 GB/s, the i.MX53 SoC's internal 200 MHz, 64-bit (single-data-rate) AXI bus connecting the Cortex-A8 core to the RAM controller apparently already limits the bandwidth to 1.6 GB/s, and our benchmark actually measures:
- 0.9 GB/s when only a single DDR3 memory bank is involved
- 1.2 GB/s when two different DDR3 memory banks on the same chip-select are involved
- 1.4 GB/s when two DDR3 memory banks on two different chip-selects are used (tested on Freescale's i.MX53 QSB board)
Running the same benchmark on our i.MX6S system with 32-bit DDR3-800, we measure ~10-20% higher performance in all tests, which can probably be explained by the more efficient out-of-order Cortex-A9 core compared to the in-order Cortex-A8 in the i.MX53. (Maybe it is again limited by the SoC-internal AXI bus on the i.MX6S? We can't find any documentation on the internal bus speeds of the i.MX6 series...)
The same benchmark running on a single core of our i.MX6D/Q systems with 64-bit DDR3-1066 RAM shows another ~10-20% performance increase, which seems to be caused by the higher DDR3 clock (533 MHz instead of 400 MHz). Overclocking the i.MX6S system to 32-bit DDR3-1066 or underclocking the i.MX6D/Q system to 64-bit DDR3-800 results in pretty much identical performance values: 32-bit vs. 64-bit doesn't show any difference at all!
Even when we run the benchmark on multiple CPU cores in parallel on the i.MX6D/Q systems, the aggregated bandwidth of the two/four cores only adds up to ~1.8-2.0 GB/s with DDR3-1066.
During all benchmarks:
- IPU, VPU, GPUs and any other bus-masters in the system were turned-off,
- all L1-caches and the L2-cache were running,
- the SCU of the Cortex-A9 MPCore complex was enabled,
- all performance-optimizing features of the Cortex-A9/L2C310 combination (instruction and data prefetching, early BRESP, full line of zero, etc.) were turned on on both sides,
- each A9 core ran its instance of the benchmark on its own separate DDR3 memory bank, untouched by any of the other cores (bank interleaving turned off), so there shouldn't be any thrashing with respect to the page open/close policy of the RAM controller when multiple cores run at the same time. Enabling bank interleaving even showed slightly lower performance,
- the CPU-cores were running at 792 MHz.
- our memcpy()-function is based on ARM's sample code "5. Load-Multiple memory copy with preload" shown here: ARM Information Center
- Implementing the "6. NEON memory copy with preload" sample from the ARM Information Center instead, which according to ARM is the fastest way to copy (at least on a Cortex-A8), didn't show any performance difference at all, i.e. the copy function itself doesn't seem to be the limiting factor here.
- It doesn't seem to be an issue with thrashing in the L2 cache either:
- 4 cores memcpy()-ing in parallel in entirely different areas of the RAM shouldn't cause any thrashing issues in a 16-way L2 cache in the first place, and
- testing with an "MP4 system lockdown" configuration of the L2CC, as described in the L2C310 r3p2 TRM, section "2.3.6 Cache Lockdown (Table 2-15)", to give each A9 core its own private 256 KB slice of the L2 cache even shows ~10% lower performance.
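For reference, ARM's "Load-Multiple memory copy with preload" boils down to LDMIA/STMIA bursts with PLD issued a few cache lines ahead of the read pointer. A rough C analogue (using GCC's __builtin_prefetch in place of PLD; this is an illustration of the pattern, not our actual assembler routine):

```c
#include <stddef.h>
#include <stdint.h>

/* Rough C analogue of ARM's LDM/STM-with-PLD copy loop: move data in
 * 32-byte chunks (one A9 cache line) and prefetch a few lines ahead.
 * Assumes `len` is a multiple of 32 and both pointers are word-aligned. */
void copy_ldm_style(uint32_t *dst, const uint32_t *src, size_t len)
{
    size_t chunks = len / 32;
    while (chunks--) {
        /* Prefetch ~8 cache lines (256 bytes) ahead, like PLD. */
        __builtin_prefetch(src + 64);
        /* Eight word loads followed by eight word stores mimic one
         * LDMIA/STMIA pair of an 8-register load-multiple. */
        uint32_t r0 = src[0], r1 = src[1], r2 = src[2], r3 = src[3];
        uint32_t r4 = src[4], r5 = src[5], r6 = src[6], r7 = src[7];
        dst[0] = r0; dst[1] = r1; dst[2] = r2; dst[3] = r3;
        dst[4] = r4; dst[5] = r5; dst[6] = r6; dst[7] = r7;
        src += 8;
        dst += 8;
    }
}
```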
While we're not expecting to measure the full (theoretical) 64-bit DDR3-1066 bandwidth of 8.5 GB/s, looking at the performance actually measured with 32-bit DDR3-800 we would expect to reach at least ~4.0 GB/s aggregated bandwidth with 4 cores running in parallel on the 64-bit interface.
Apart from the performance, our 64-bit DDR3 configuration is running fine: the entire RAM can be accessed, it passes all RAM tests without problems, and all our operating systems (Linux, Android, Windows CE) can fully use the entire RAM without any issues.
- Is there anything else in the MMDC RAM controller, besides the DSIZ setting in the MDCTL register of MMDC0, that must be configured to see a proper performance benefit of 64-bit vs. 32-bit RAM?
- Can anybody measure any difference between 64-bit and 32-bit DDR3 RAM interfaces on any i.MX6-board?
- Has anybody managed to measure more than 2 GB/s RAM performance?