i.MX6 DDR3 RAM-Performance 32 bit vs. 64 bit interface.


MOW
Contributor IV

Hi all

We're currently porting Linux, Android, and Windows-CE BSPs to our own i.MX6-based board series, which can be equipped with either i.MX6Q, i.MX6D, i.MX6DL or i.MX6S SoCs, and either 32-bit or 64-bit DDR3 RAM in DDR3-800 and DDR3-1066 configurations connected to a single chip-select. All combinations realized so far (i.MX6Q and i.MX6D with 64-bit DDR3-1066, and i.MX6S with 32-bit DDR3-800) work fine: all memory configurations are stable, the entire memory can be accessed properly, and Freescale's DDR3 Stress Test tool runs fine, as do more elaborate custom RAM tests and all three OSes. However, we can't seem to find any performance difference between 64-bit and 32-bit DDR3 configurations.

We're using hand-tuned assembler memset() and memcpy() functions for benchmarking, using only ARMv7 integer code (no NEON). Their results match up quite nicely with theoretical bandwidth values on Freescale's i.MX53 QSB board and our own i.MX53 board designs with 32-bit DDR3-800 RAM, which we used for comparison with our i.MX6 design. While DDR3-800 provides a (very) theoretical bandwidth of 3.2 GB/s, the i.MX53 SoC's internal 200 MHz, 64-bit (single-datarate) AXI-bus connecting the Cortex-A8 core to the RAM-controller apparently already limits the bandwidth to 1.6 GB/s, and our benchmark actually measures:

  • 0.9 GB/s when only a single DDR3 memory bank is involved
  • 1.2 GB/s when two different DDR3 memory banks on the same chip-select are involved
  • 1.4 GB/s when two DDR3 memory banks on two different chip-selects are used (tested on Freescale's i.MX53 QSB board)
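
The two theoretical figures quoted above (3.2 GB/s for 32-bit DDR3-800 and 1.6 GB/s for the 200 MHz, 64-bit AXI bus) follow from transfers per second times bus width. A minimal sketch of that arithmetic in C; the function names are hypothetical and only illustrate the calculation:

```c
#include <stdint.h>

/* Peak transfer rate in MB/s (1 MB = 10^6 bytes).
 * mega_transfers_per_s: data transfers per second, in millions
 *   (DDR3-800 -> 800, a 200 MHz single-data-rate bus -> 200).
 * bus_width_bits: width of the data bus in bits. */
uint32_t peak_mb_per_s(uint32_t mega_transfers_per_s, uint32_t bus_width_bits)
{
    return mega_transfers_per_s * bus_width_bits / 8u;
}

/* Two links in series: the slower one bounds the end-to-end rate. */
uint32_t bottleneck_mb_per_s(uint32_t link_a_mb_per_s, uint32_t link_b_mb_per_s)
{
    return link_a_mb_per_s < link_b_mb_per_s ? link_a_mb_per_s : link_b_mb_per_s;
}
```

With these, `peak_mb_per_s(800, 32)` gives 3200 and `peak_mb_per_s(200, 64)` gives 1600, so the AXI bus would be the bottleneck on i.MX53 under this simple model.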

Running the same benchmark on our i.MX6S-system with 32-bit DDR3-800 we measure for all tests ~10-20% higher performance, which probably can be explained by the more efficient out-of-order Cortex-A9 core compared to the in-order Cortex-A8 in i.MX53. (Maybe also again limited by the SoC-internal AXI-bus on i.MX6S?! Can't find any documentation on the internal bus-speeds in i.MX6-series...)

The same benchmark running on a single core on our i.MX6D/Q-systems with 64-bit DDR3-1066 RAM shows another ~10-20% performance increase, which seems to be caused by the higher DDR3 clock (533 MHz instead of 400 MHz). Overclocking the i.MX6S-system to 32-bit DDR3-1066 or underclocking the i.MX6D/Q-system to 64-bit DDR3-800 results in pretty much identical performance values. 32-bit vs. 64-bit doesn't show a difference at all !!!

Even if we run the same benchmarks on multiple CPU-cores in parallel on the i.MX6D/Q systems the aggregated performance of all two/four cores together only adds up to ~1.8-2.0GB/s bandwidth with DDR3-1066.

During all benchmarks

  • IPU, VPU, GPUs and any other bus-masters in the system were turned-off,
  • all L1-caches and the L2-cache were running,
  • the SCU of the Cortex-A9 MPCore complex was enabled,
  • all performance optimizing features of the Cortex-A9/L2C310-combination (instruction and data prefetching, early BRESP, full line of zero, etc.) were turned-on on both sides,
  • each A9-core ran its instance of the benchmarks on its own separate DDR3 memory bank untouched by any of the other cores (bank-interleaving turned-off), so there shouldn't be any thrashing w.r.t. the page open/close-policy of the RAM-controller with multiple cores running at the same time. Enabling bank-interleaving showed even slightly lower performance,
  • the CPU-cores were running at 792 MHz.
  • our memcpy()-function is based on ARM's sample code "5. Load-Multiple memory copy with preload" shown here: ARM Information Center
  • Implementing the "6. NEON memory copy with preload"-sample from ARM Information Center instead, which according to ARM seems to be the fastest way to copy (at least on a Cortex-A8), didn't show any performance difference at all, i.e. the copy-function itself doesn't seem to be the limiting factor here.
  • Doesn't seem to be an issue with thrashing in the L2-cache either:
    • 4-cores memcpy()-ing in parallel in entirely different areas of the RAM shouldn't cause any thrashing issues in a 16-way L2-cache in the first place and
    • testing with an "MP4 system lockdown"-configuration of the L2CC, as described in L2C310_r3p2-TRM section "2.3.6 Cache Lockdown (Table 2-15)", to give each A9-core its own private 256KB L2-cache shows even ~10% lower performance.
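
For readers who want to reproduce this kind of measurement, here is a minimal sketch of a copy-bandwidth benchmark in C. It is not the hand-tuned assembler routine described above: it assumes the C library's memcpy() and the standard clock() timer (CPU time, which is close to wall time for a single-threaded run), and the function name is hypothetical.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Copies `bytes` bytes `iterations` times and returns the copy rate in
 * MB/s (10^6 payload bytes per second). Returns -1.0 on bad arguments,
 * allocation failure, or if the run was too short to time. */
double memcpy_mb_per_s(size_t bytes, int iterations)
{
    static volatile unsigned sink;   /* keeps the copies observable */
    unsigned checksum = 0;
    unsigned char *src, *dst;
    clock_t start, end;
    double seconds;
    int i;

    if (bytes == 0 || iterations <= 0)
        return -1.0;
    src = malloc(bytes);
    dst = malloc(bytes);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }
    /* Touch both buffers so the pages are really mapped before timing. */
    memset(src, 0xA5, bytes);
    memset(dst, 0x00, bytes);

    start = clock();
    for (i = 0; i < iterations; i++) {
        src[0] = (unsigned char)i;        /* make every copy differ */
        memcpy(dst, src, bytes);
        checksum += dst[0] + dst[bytes - 1];
    }
    end = clock();
    sink = checksum;

    free(src);
    free(dst);
    seconds = (double)(end - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return -1.0;
    return (double)bytes * (double)iterations / seconds / 1e6;
}
```

A call such as `memcpy_mb_per_s(4u * 1024u * 1024u, 32)` copies a 4 MB block 32 times. Note that this reports payload bytes copied per second; the DRAM traffic behind it is at least twice that (one read plus one write per byte).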

While we're not expecting to measure the full (theoretical) 64-bit DDR3-1066 bandwidth of 8.5GB/s, looking at the actually measured performance on 32-bit DDR3-800 we would be expecting to reach at least ~4.0GB/s aggregated bandwidth over 4 cores running in parallel on the 64-bit interface.

Except for the performance our 64-bit DDR3 configuration is running fine: the entire RAM can be accessed, executes any RAM-tests without any problems and all our operating systems (Linux, Android, Windows-CE) can fully use the entire RAM without any issues.

  • Is there anything else in the MMDC RAM-controller, besides the DSIZ-setting in the MDCTL-register of MMDC0, that must be configured to see any proper performance benefit of 64-bit RAM vs. 32-bit RAM?
  • Can anybody measure any difference between 64-bit and 32-bit DDR3 RAM interfaces on any i.MX6-board?
  • Has anybody managed to measure more than 2 GB/s RAM performance?

Kind Regards,

Marc

TomE
Specialist II

> we can't seem to find any performance difference between 64-bit and 32-bit DDR3 configurations.

I wish I'd noticed this when originally posted. I'm not sure if the original participants are going to notice this reply, or be able to run tests or respond.

The problem I've found with these chips is trying to keep the current page "open" so you can pump data into or out of the DDR without having to waste all the clocks and access time selecting another row. A simple memcpy() on a very old CPU without modern complications (as old as a 68000) ping-pongs between the read and write addresses, hitting this delay all the time. Unrolling the loop and reading as many words as you have spare registers is the first step in making this faster. It even makes sense to read huge blocks to SRAM and then write them from there [1].

Anyway... Were you running tests with the L1 and L2 Caches disabled or controlled, or were they in the way and fully active?

As far as I can tell, on a normal system that has been running, the caches will be dirty. You can try to read a block of RAM in order, and try to keep the page open, but the CPU will have to evict a cache line before it can do the read. And a different cache line (at an unrelated address) to do the next read. So it will likely be ping-ponging between the evict-write and new-reads. It may even have to evict a line to perform the write!

So basically, the longest burst read or write you're going to get is a single cache line. Doubling the bus width halves the number of 400MHz DDR clocks in a burst, without reducing all the row setup time. So I'd only expect it to be 10% faster, if that.
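
The "only 10% faster" estimate can be illustrated with a very rough model: each cache-line transfer costs a fixed number of overhead clocks (row activation, CAS latency and so on) plus the clocks spent moving data, and only the data part shrinks when the bus is widened. The sketch below is an assumption-laden illustration, not a model of the MMDC; the function names and the overhead value used in the example are hypothetical.

```c
/* DDR clocks needed to move one cache line: a fixed overhead plus the
 * data clocks. A DDR bus moves two beats of bus_bytes per clock. */
unsigned line_transfer_clocks(unsigned line_bytes, unsigned bus_bytes,
                              unsigned overhead_clocks)
{
    unsigned beats = line_bytes / bus_bytes;
    return overhead_clocks + beats / 2u;
}

/* Speed-up of a 64-bit (8-byte) bus over a 32-bit (4-byte) bus for
 * transfers of line_bytes with the same fixed overhead. */
double width_speedup(unsigned line_bytes, unsigned overhead_clocks)
{
    return (double)line_transfer_clocks(line_bytes, 4u, overhead_clocks) /
           (double)line_transfer_clocks(line_bytes, 8u, overhead_clocks);
}
```

For a 32-byte line, the data phase takes 4 clocks on a 32-bit bus and 2 clocks on a 64-bit bus. With no overhead the speed-up is 2.0, but with a hypothetical overhead of 18 clocks per line it is 22/20 = 1.1, i.e. about 10%.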

There's another problem. When you go to WRITE a word, the CPU usually insists on READING the cache line first. So ignoring cache eviction, the memcpy() usually reduces to read source, read destination, write destination. So it only runs at 2/3 of the theoretical speed. The PPC has a cache control instruction you can use to create a zeroed cache line so it won't perform the second useless read, but as far as I can tell the ARM doesn't have this.

I think the Neon instructions have some tricks where they bypass the L1 cache and access the L2 directly. That's the main reason why NEON based memcpy is faster and worth doing.

Have you tried to measure continuously reading from memory, also continuously writing (i.e. an optimised memset())? I found that the simpler ColdFire chips can burst-write a lot faster than they can burst-read, and both of those a lot faster than they can copy.

I'd also suggest you try a memory copy test where you start by flushing the data caches. Then copy a block of memory that is half of the size of the cache divided by the number of ways. That should burst-read from the DDR and into the outgoing cache lines without writing at all. Then purge the destination cache lines, invalidate the source lines, rinse and repeat. That might get you close to theoretical, and might just show a real advantage in having a 64-bit wide bus. But which cache size, L1 or L2? Try both!
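
As a sketch of the block-size rule in this suggestion (half of the cache size divided by the number of ways), in C. The cache geometries in the example are assumptions: a 32 KB, 4-way L1 data cache per Cortex-A9 core, and the 1 MB, 16-way L2 that matches the "16-way" and "4 x 256KB" figures given earlier in the thread. The function name is hypothetical.

```c
#include <stddef.h>

/* Test block size: half of (cache size / number of ways). */
size_t flush_test_block_bytes(size_t cache_bytes, unsigned ways)
{
    return cache_bytes / ways / 2u;
}
```

Under these assumptions the rule gives 4 KB blocks for the L1 case (`flush_test_block_bytes(32 * 1024, 4)`) and 32 KB blocks for the L2 case (`flush_test_block_bytes(1024 * 1024, 16)`).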

Measure how long the cache operations take. The cache lines probably have to be flushed and invalidated serially, and that might take longer than you'd expect.

That still doesn't give any advantage in the wider bus in any "real-world" use of the CPU. It might illuminate whether the bandwidth limitation is due to an internal bus though.

Note 1:

As a related matter, I've done a lot of work on speeding up memory copies on MCF5235 and MCF5329 Coldfire chips. They are far simpler than the ARM cores, and either don't support, or aren't set up to cache writes, so that simplifies things. I found, surprisingly, that the fastest way to copy from SDRAM to SDRAM was to copy the memory TWICE. That is, to copy 1k blocks from SDRAM to internal SRAM, and then copy from there to the destination. This keeps the RAM pages open and makes the whole operation faster. It is tempting to consider this on the i.MX53 until you actually measure the internal SRAM speed and find it is terrible. It claims to be "one clock access", but doesn't say that's the 133MHz clock and not the 800MHz one. That's still wrong as it takes NINETEEN of those clocks or 170ns to access the OCRAM. I never got an explanation for this.

i.MX53 i.RAM (OCRAM) seems very slow, test code included. 

Tom

TomE
Specialist II

I've just been advised of the following NXP document:

i.MX 6Dual/6Quad and i.MX 6DualPlus/6QuadPlus Applications Processor Comparison
Document Number: EB810
Rev. 0, 03/2016

It details why the "Plus" chips "will be" higher performance than the previous ones. It doesn't contain any benchmark details or numbers.

One item that caught my attention:

Addressed errata ERR003740 (ARM errata number 752271)—to avoid causing data corruption in the
double line fill feature—and ERR003742 (ARM errata number 732672) to enable double cache line fill
operation in L2 cache controller. An abort on the second part of a double line fill can cause data corruption
of the first part.
Before addressing these errata, the double cache line fill operation had to be disabled. In this case, the
access from L2 cache to DRAM would be 32-bytes instead of 64-bytes. When 64-bit DDR3 is used, each
burst access to the DRAM is 64-bytes. By addressing these errata, 100% DRAM efficiency improvement
is achieved during L2 cache read accesses from 64-bit DDR3.

According to the errata, enabling "double cache line fill" can "under very rare circumstances lead to data corruption". For the purpose of investigation, it would still be useful to enable this "double cache line fill" and run the same benchmarks as have been run previously.
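
The numbers in the quoted excerpt can be checked with a small calculation. The sketch below assumes DDR3's usual burst length of 8 and hypothetical function names; it only shows how much of each DRAM burst a given L2 access size can use.

```c
/* Bytes moved by one DRAM burst: bus width in bytes times burst length. */
unsigned dram_burst_bytes(unsigned bus_width_bits, unsigned burst_length)
{
    return bus_width_bits / 8u * burst_length;
}

/* Share of a DRAM burst (in percent) that one access of access_bytes
 * actually uses; accesses of a full burst or more count as 100. */
unsigned burst_utilization_percent(unsigned access_bytes,
                                   unsigned bus_width_bits,
                                   unsigned burst_length)
{
    unsigned burst = dram_burst_bytes(bus_width_bits, burst_length);
    if (access_bytes >= burst)
        return 100u;
    return access_bytes * 100u / burst;
}
```

On a 64-bit bus a burst is 64 bytes, so a 32-byte L2 access uses 50% of it and a 64-byte double line fill uses 100%. On a 32-bit bus a burst is only 32 bytes, so a 32-byte access already fills it. If this model applies, it may be one reason why the 64-bit interface shows so little benefit for CPU traffic on the non-Plus parts.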

Tom

igorpadykov
NXP Employee

Hi Marc

Some public performance data can be found at the link below:

LMbench Benchmarks on i.MX

You can also request some data by creating a ticket.

Your data is close to the data obtained internally. The limitation seems to be caused by RALAT, MMDC arbitration (register MAARCR) and internal NIC-301 arbitration latencies. The RALAT parameter, which is board-dependent, can be decreased on good boards.

Although these performance values don't show a difference between 32-bit and 64-bit, this should not be interpreted so straightforwardly as meaning that there is no sense in using a wider memory bus. In a real OS, where many modules (GPU, VPU, USB, SDMA) act as bus masters, 32-bit vs. 64-bit does show a difference in overall performance.

Best regards

chip

MOW
Contributor IV

Hi Chip

Thanks for your answer.

We're indeed currently testing with different RALAT/WALAT settings and do see performance differences with different settings. Instead of the default RALAT/WALAT-settings of 5/1, which Freescale is using in U-Boot, we can actually use 2/0 for DDR3-800 on our boards, and most of our DDR3-1066 boards (although unfortunately not all) work fine at 3/0. We will be doing some layout-optimizations for our next board-revision, which might allow all of our DDR3-1066 boards to work with 3/0.

I still can't see, that 64-bit DDR3 interface makes any performance difference, though. On the contrary:

  • If I understand the i.MX6 documentation correctly, both the Cortex-A9-MPCore complex and the L2CC are connected to the rest of the system via two 64-bit AXI-busses, each running at half the CPU-core clock. Reference Manual section "12.5.5 L2 Cache and controller (PL310)" mentions that the "L2 cache also utilizes 2x AXI-64 to access the L3 memory or other SoC peripherals in a symmetric way". This leaves some room for speculation, as no proper block diagram seems to be provided, but I guess this is supposed to mean that the MMDC can (and will) be accessed via both ports?! This should give the A9-MPCore complex a memory bandwidth of at least ~6.4 GB/s (with the cores running at 800 MHz), and with all 4 cores of an i.MX6Q running in parallel at the same time, there should be more than enough active bus-masters for the MMDC to choose from. But we still only see ~2.0 GB/s maximum aggregated RAM-bandwidth: with all 4 cores running simultaneously, each one only sees about 500 MB/s RAM-bandwidth even for a DDR3-1066 64-bit interface.
  • If the memory bandwidth usable by the Cortex-A9-MPCore complex is limited by some arbitration and latency issues in the NIC301 and the MMDC, but the MMDC can actually provide more bandwidth with DDR3-1066 64-Bit RAM if multiple masters are active, I'd expect the RALAT/WALAT settings to make hardly any measurable difference (as seen from a CPU-core)?!

But I will try to see which impact activating more additional bus-masters has on the measured performance.

Regards,

Marc

TheAdmiral
NXP Employee

Hi Marc,

I'm going to jump in here with some comments, and this seemed like the best place to add them, since they are going to be along the lines of WALAT/RALAT.

I understand how you came up with a very theoretical bandwidth value of 3.2GB/s. And yes, if you specified upfront to start at a particular physical DDR memory and conducted Reads and Writes only to the same Bank/Row address (different column addresses allowed), you may be able to achieve something close to your theoretical value, assuming that you are also not conducting refreshes (as required by JEDEC).

I'm sure you are going to tell me that the above is obvious and you never were expecting to get to 3.2 GB/s, but I state the above only as a means for pointing out where you are going to get performance improvements.

The first step is in minimizing the length of the data traces. It doesn't matter so much for Writes, but for Reads, you have to account for extra time to complete the round trip. That is essentially what RALAT is doing for you. It gives you extra time to complete the data return trip, from the time the controller releases the byte lane for a read to the time that the DDR has completed sending the data and it has finally reached the processor pins and has been "clocked" in. Setting RALAT = 5 means you are adding five extra clocks to each 8-burst read cycle. So, bringing it down to 3 clock cycles means that you no longer waste the additional 2 clocks. But that only works up to a limit. You can't set RALAT = 2 because of the physical limitation of the layout: You simply have not told the controller to wait long enough to complete a read cycle. In other words, to get down to a point where RALAT = 2 will work for you, you are going to have to modify the layout. If you are using a Tee-Topology and the lengths of your byte lanes closely match the length of your clock trace(s), then WALAT should = 0, and you can save that extra clock for write cases.
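
To get a feel for what those extra clocks cost, here is a deliberately naive model in C. It takes the statement above at face value (RALAT clocks are added to every 8-burst read) and ignores every other latency as well as any overlap between consecutive reads, so the percentages are illustrative only. The function name is hypothetical.

```c
/* Share of a read cycle (in percent) spent transferring data when
 * ralat_clocks idle clocks are added to every burst.
 * A DDR burst of burst_length beats occupies burst_length / 2 clocks. */
unsigned read_data_percent(unsigned burst_length, unsigned ralat_clocks)
{
    unsigned data_clocks = burst_length / 2u;
    return data_clocks * 100u / (data_clocks + ralat_clocks);
}
```

In this model an 8-beat burst takes 4 clocks, so RALAT = 5 leaves 44% of the cycle for data, RALAT = 3 leaves 57%, and RALAT = 2 leaves 66%.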

But that is low hanging fruit: Where else can you save extra clock cycles?

This is where you are going to have to experiment with DDR timing settings, and you are probably going to want to use more reliable DDR devices like Micron, to see if you can push the limits of their Read and Write latencies.

For refreshes, make sure that you are using the minimum JEDEC-required refresh rate, i.e. a refresh interval (tREFI) of 7.8 us.

The other timing parameters that may potentially help you get a performance boost are:

tCL          (CAS Read Latency)
tRFC         (Refresh command to Active or Refresh command time)
tRCD         (Active command to internal read or write delay time)
tRP          (Precharge command period)
tRC          (Active to Active or Refresh command period)
tRAS         (Active to Precharge command period)
tRPA         (Precharge-All command period)
tWR          (Write recovery time)
tCWL         (CAS Write Latency)
tRTP         (Internal Read command to Precharge command delay)
tWTR         (Internal Write to Read command delay)
tRRD         (Active to Active command period)
RTW_SAME     (Read to write delay for same chip select)
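
Most of these parameters are specified in nanoseconds in the DRAM datasheet and have to be rounded up to whole DDR clocks before they can be programmed. A minimal sketch of that conversion in C, working in picoseconds to stay in integer arithmetic; the function name is hypothetical, the example values are the standard DDR3-1066F/G speed-bin figures, and the register encoding the MMDC expects for each field is not covered here:

```c
#include <stdint.h>

/* Rounds a timing given in picoseconds up to whole clock cycles of
 * period tck_ps (also in picoseconds). */
uint32_t timing_to_clocks(uint32_t t_ps, uint32_t tck_ps)
{
    return (t_ps + tck_ps - 1u) / tck_ps;
}
```

For example, 13.125 ns (tRCD or tRP of a DDR3-1066F part) is 7 clocks at tCK = 1.875 ns (DDR3-1066) but only 6 clocks at tCK = 2.5 ns (DDR3-800), while 15 ns (DDR3-1066G) needs 8 clocks at 1.875 ns.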

You are using only one chip select, correct? Two chip selects adds delays, and therefore, lowers performance.

Mostly what you are looking to achieve is to minimize the time it takes to close one Active Bank/Row and open a different one. This is all overhead which takes away from performance.

That is essentially all you are going to be able to do without modifying the test code to limit the number of Active Bank/Row changes required during testing.

I really don't think the AXI pipeline itself is holding you back any.

Cheers,

Mark

rabeeh
Contributor II

Hi,

I'm hitting similar situation but my workload is mostly GPU related.

Below is a dump of mmdc_prof. Total read/write is ~2.4GB/sec and the bus utilization is 30% (hence hinting theoretical max around 8GB/sec).

On the other hand it shows Bus Load = 99% - what is this Bus Load? Is this the internal AXI bus? (in general where can we get source code of mmdc_prof)?

Does this mean that we are Bus Load limited? (hinting AXI)

Performance numbers per unit taken from mmdc_prof -

(UNIT) = (Read) + (Write) = (Total)
CPU    =    61  +     19  =    80
IPU    =   395  +      0  =   395 (refreshing 1080p50)
VPU    =   290  +    146  =   436
GPU    =   763  +    795  =  1558
Total  =  1509  +    960  =  2469 MB/sec

MMDC new Profiling results:
***********************
Total cycles count: 528073967
Busy cycles count: 523154240
Read accesses count: 34949275
Write accesses count: 18807831
Read bytes count: 1564537198
Write bytes count: 1004157153
Avg. Read burst size: 44
Avg. Write burst size: 53
Read: 1492.06 MB/s /  Write: 957.64 MB/s  Total: 2449.70 MB/s
Utilization: 30%
Bus Load: 99%
Bytes Access: 47
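
Several of the derived values in this dump can be reproduced from the raw counters. The sketch below is inferred purely from the numbers shown here, not taken from the mmdc_prof source, so treat the formulas as assumptions; the function names are hypothetical.

```c
#include <stdint.h>

/* Integer percentage of part in whole, truncated; 0 if whole is 0. */
uint32_t percent_of(uint64_t part, uint64_t whole)
{
    return whole ? (uint32_t)(part * 100u / whole) : 0u;
}

/* Average bytes per access, truncated; 0 if there were no accesses. */
uint32_t avg_bytes_per_access(uint64_t bytes, uint64_t accesses)
{
    return accesses ? (uint32_t)(bytes / accesses) : 0u;
}
```

With the counters above, `percent_of(523154240, 528073967)` gives 99, which matches "Bus Load", so that value appears to be the ratio of busy cycles to total cycles. Likewise, read bytes divided by read accesses gives 44, write bytes divided by write accesses gives 53, and total bytes divided by total accesses gives 47, matching the two "Avg. burst size" lines and "Bytes Access".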

vladislavkaluts
Contributor II

Hi,

Did you figure out what "Bus Load" means? I am investigating very strange behavior on our custom board. The results for the CPU are:

./mmdc_prof -m 0x00060001

MMDC new Profiling results:
***********************
Total cycles count: 528037303
Busy cycles count: 39096421
Read accesses count: 2197
Write accesses count: 1517
Read bytes count: 63976
Write bytes count: 45432
Avg. Read burst size: 29
Avg. Write burst size: 29
Read: 0.06 MB/s /  Write: 0.04 MB/s  Total: 0.10 MB/s
Utilization: 0%
Bus Load: 7%
Bytes Access: 29

MMDC new Profiling results:
***********************
Total cycles count: 528039111
Busy cycles count: 39322154
Read accesses count: 7221
Write accesses count: 2494
Read bytes count: 202392
Write bytes count: 77600
Avg. Read burst size: 28
Avg. Write burst size: 31
Read: 0.19 MB/s /  Write: 0.07 MB/s  Total: 0.27 MB/s
Utilization: 0%
Bus Load: 7%
Bytes Access: 28

Do you have any guesses why Bus Load is 7% with only 0.27 MB/s?

for IPU:

./mmdc_prof -m 0x00060004

MMDC new Profiling results:
***********************
Total cycles count: 528047999
Busy cycles count: 40319160
Read accesses count: 489640
Write accesses count: 0
Read bytes count: 31336960
Write bytes count: 0
Avg. Read burst size: 0
Avg. Write burst size: 0
Read: 29.89 MB/s /  Write: 0.00 MB/s  Total: 29.89 MB/s
Utilization: 0%
Bus Load: 0%
Bytes Access: 0

MOW
Contributor IV

In my RAM-benchmarking environment it doesn't seem to be easily possible to cause enough traffic with other bus-masters to see an impact on the RAM-performance measured by the Cortex-A9-MPCore complex. This suggests that there is indeed more bandwidth available than the ARM-cores can see/use, but so far I couldn't impact RAM-performance on a 32-bit DDR3 interface with further bus-masters either (tried with the IPU configured for 1920x1080x32bpp @60Hz).

I've done some experiments with the MMDC profiling registers:

  • Only MADPSRx registers in MMDC0 seem to be running in both 32-bit and 64-bit modes. I assume, this is correct, as MMDC1 is linked with MMDC0 in 64-bit configuration?!
  • All RAM-traffic generated by the Cortex-A9-MPCore complex during my benchmarks seems to be distributed evenly between both ARM_S0 and ARM_S1 AXI IDs. Therefore I suppose the A9-MPCore complex, the SCU and the L2CC are all using both of their AXI channels.

What was the intended purpose of the following (commented-out) piece of code found in Freescale's U-Boot for the i.MX6?

int dram_init(void)
{
    /*
     * Switch PL301_FAST2 to DDR Dual-channel mapping
     * however this block the boot up, temperory redraw
     */
    /*
     * u32 reg = 1;
     * writel(reg, GPV0_BASE_ADDR);
     */
    gd->bd->bi_dram[0].start = PHYS_SDRAM_1;
    gd->bd->bi_dram[0].size = PHYS_SDRAM_1_SIZE;
    return 0;
}

The comment sounds like it might have something to do with the issue but enabling this piece of code does indeed seem to lock up the system. I can't find any documentation on this in the reference manual (only "Registers at offset 0x0-0xffc are reserved for internal use" in the NIC-301 chapter) and otherwise neither U-Boot nor the Linux kernel seem to touch the NIC-301 configuration (although the "NOTE"-section in the NIC-301 Overview claims that "Freescale's board support package" configures the NIC-301).

igorpadykov
NXP Employee

Hi Marc

This code switches PL301 FAST2 to the dual-channel mapping that is specific to LPDDR2. I've attached some info. The GPV is described in the ARM documentation below:

AMBA® Network Interconnect (NIC-301) Technical Reference Manual

ARM Information Center

and in Table 45-1 "GPV ports memory allocations" of

IMX6DQRM i.MX 6Dual/6Quad Applications Processor Reference Manual

Best regards

chip

MOW
Contributor IV

Hi Chip

I had already found in ARM's documentation that the code apparently accesses the remap register in the MX6FAST2 NIC-301, but I couldn't (and still can't) find any documentation on what the RAM-mapping actually looks like when this code is executed, compared to the default mapping.

Would it make any difference for DDR3-RAM (at least in 64-bit mode)? Is it even supposed to work with DDR3 or does the system lock-up, when executing this piece of code, because it is equipped with DDR3 instead of two-channel LPDDR2?

Regards,

Marc

igorpadykov
NXP Employee

Hi Marc

The "two-channel" concept is used only for LPDDR2. From p. 3820 of IMX6DQRM:

"The core is composed of two channels, but both channels are only active in LPDDR2 mode. If DDR3 mode is selected, channel1 is not activated and the MMDC communicates with the system through AXI port0."

You can also look at "DRAM Controller Optimization for i.MX":

http://cache.freescale.com/files/training/doc/ftf/2014/FTF-SDS-F0170.pdf

Best regards

chip

MOW
Contributor IV

Hi Chip

I see. So this setting is useless for DDR3 and this is probably the reason for the lock-up.

Back to my original question: 32-bit vs. 64-bit DDR3 performance.

My tests yesterday with the IPU as additional bus-master seem not to have worked properly: the IPU wasn't really running. But now I have a setup, where the IPU is indeed running on a 1920x1080x32bpp framebuffer requiring a bandwidth of ~0.5 GB/s for screen refresh. Running our benchmarks again on this setup with single- and quad-cores on DDR3-800x32bit and DDR3-1066x64 still shows similar results:

  • When the CPU cores are accessing the same RAM-bank as the IPU, CPU RAM-performance drops pretty much exactly by the same bandwidth required for display refresh.
  • When the CPU cores are accessing different RAM-banks than the IPU, the bandwidth required for the display refresh is mostly hidden.
  • Driving the IPU-bandwidth requirement further up to 1.0 GB/s impacts CPU RAM-performance even when other RAM-banks are used for the benchmarks
  • Aggregated RAM-bandwidth over all active bus-masters (incl. the IPU) in all cases is still 2.0 - 2.5 GB/s maximum.
  • Still no measurable difference between 32-bit and 64-bit DDR3 interface.
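
The ~0.5 GB/s figure for the display refresh can be checked with a short calculation. A minimal sketch in C, assuming the 60 Hz refresh rate mentioned for this framebuffer format earlier in the thread; the function name is hypothetical:

```c
#include <stdint.h>

/* Bytes per second the display controller must read to scan out a
 * framebuffer of the given geometry at the given refresh rate. */
uint64_t scanout_bytes_per_s(uint32_t width, uint32_t height,
                             uint32_t bytes_per_pixel, uint32_t refresh_hz)
{
    return (uint64_t)width * height * bytes_per_pixel * refresh_hz;
}
```

`scanout_bytes_per_s(1920, 1080, 4, 60)` gives 497,664,000 bytes/s, i.e. roughly 0.5 GB/s for a single 32bpp layer.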

Seems in addition to the RAM-bandwidth actually usable by the CPU-cores there is hardly any additional "unused" RAM-bandwidth available to other bus-masters even with a 64-bit DDR3 interface.

Regards,

Marc

MOW
Contributor IV

Probing further with my benchmarks using different setups I must correct my last posting somewhat: I now can actually measure a difference between 32-bit and 64-bit DDR3 interfaces when the IPU is used as additional bus-master but the difference is pretty low (sum of measured bandwidths used by IPU and 4 active ARM-cores running at 792 MHz):

  • DDR3-800 32-bit reaches a maximum of ~2.3 GB/s in total
  • DDR3-800 64-bit reaches a maximum of ~2.5 GB/s in total
  • DDR3-1066 32-bit reaches a maximum of ~2.5 GB/s in total
  • DDR3-1066 64-bit reaches a maximum of ~2.8 GB/s in total

For the 32-bit interfaces actual performance is pretty good compared to the purely theoretical bandwidths of 3.2 or 4.2 GB/s. While it is reassuring to finally see a performance difference between the two interface widths, which implies that the systems are configured more or less properly, the performance benefit of only ~10% is quite disappointing.
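
To put the four totals in relation to their theoretical peaks, a minimal sketch in C (hypothetical function name; the peak is taken as transfers per second times bus width, and the result is truncated to whole percent):

```c
/* Measured bandwidth as a percentage of the theoretical peak of a DDR
 * interface with the given transfer rate (in MT/s) and bus width. */
unsigned efficiency_percent(unsigned measured_mb_per_s,
                            unsigned mega_transfers_per_s,
                            unsigned bus_width_bits)
{
    unsigned peak_mb_per_s = mega_transfers_per_s * bus_width_bits / 8u;
    return measured_mb_per_s * 100u / peak_mb_per_s;
}
```

With the measurements listed above this gives about 71% for DDR3-800 32-bit, 39% for DDR3-800 64-bit, 58% for DDR3-1066 32-bit and 32% for DDR3-1066 64-bit.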

(BTW: Running the CPU-cores at 1.0 or 1.2 GHz does improve RAM bandwidth measured by a single CPU-core, but total measured RAM-bandwidths with 4 cores and 1 IPU running at the same time, as given above, stay pretty much the same.)

igorpadykov
NXP Employee

Hi Marc

I found some modeling data showing performance for 64-bit memcpy (4 MB block): MX6D: 1055.2 MB/s per core, MX6Q: 670.9 MB/s per core.

I do not have an explanation for this, though.

Best regards

chip

MOW
Contributor IV

Hi Chip

This sounds quite similar to what I do and see: we're also performing memcpy() of 4MB blocks and with 2 or 4 cores running the benchmark in parallel, we see quite similar performance values.

Without additional bus-masters running we get the same results on 32-bit RAM as well, though, and with additional bus-masters running, the performance of each core drops correspondingly to "make room" for the additional bandwidth requirements. The total available bandwidth to all active bus-masters (CPU-cores, IPU, GPU, VPU, etc.) seems to be still limited to ~2.8 GB/s in total on DDR3-1066 with 64-bit interface.

This is only ~10% faster than what we can see with a 32-bit interface and therefore in comparison rather disappointing. I had hoped we might have just missed some additional configuration settings necessary to enable proper speed-up by the doubled RAM interface-width?!

Regards,

Marc

igorpadykov
NXP Employee

Hi Marc

The modeling data shows performance for 32-bit memcpy (4 MB block) as 494.4 MB/s per core on MX6Q (670.9 MB/s @ 64-bit).

Best regards

chip

MOW
Contributor IV

Hi Chip

Seems I'm getting slightly faster results, then, that match your modeling data, if I compare my fastest 64-bit measurement against my slowest 32-bit measurement... ;-)

With DDR3-1066F timings (CL7), RALAT/WALAT of 3/0, all 4 CPU-cores running with 1.2 GHz at the same time, each core using its private set of 2 RAM-banks (i.e. each RAM-bank is accessed only by a single core), no other active bus-master and using 4 MB memcpy I get (for each core):

  • 64-Bit RAM-Interface, identical source and destination RAM-banks for the memcpy: 544 MB/s
  • 64-Bit RAM-Interface, different source and destination RAM-banks for the memcpy: 676 MB/s
  • 32-Bit RAM-Interface, identical source and destination RAM-banks for the memcpy: 506 MB/s
  • 32-Bit RAM-Interface, different source and destination RAM-banks for the memcpy: 613 MB/s

64-bit measurements just 10% faster than 32-bit at best...

Otherwise identical setup but additionally enabling an IPU consuming ~1.0 GB/s bandwidth (where two performance numbers are given, the first one is a measurement where at least one CPU-core accesses the same RAM-bank as the IPU and the second value is for all active bus-masters accessing different RAM-banks; as we have only one chip-select and one-rank of DDR3, we only have 8 banks available in total, so measurements with different source/destination banks for memcpy always cause one CPU-core to "collide" with the IPU on one bank).

  • 64-Bit RAM-Interface, identical source and destination RAM-banks for the memcpy: 401/459 MB/s
  • 64-Bit RAM-Interface, different source and destination RAM-banks for the memcpy: 496 MB/s
  • 32-Bit RAM-Interface, identical source and destination RAM-banks for the memcpy: 374/408 MB/s
  • 32-Bit RAM-Interface, different source and destination RAM-banks for the memcpy: 443 MB/s

Again 64-bit measurements just 12% faster than 32-bit at best...

Note that I didn't confirm if the IPU is indeed able to achieve its 1.0 GB/s configuration; I'm just optimistically assuming this here without checking any "underflow" status bits. Adding all the numbers under this (maybe too optimistic) assumption leads to a total overall RAM-bandwidth of 2.7 GB/s for 32-bit and 2.9 GB/s for 64-bit.
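
The percentages and totals quoted in this post can be reproduced as follows. This is a minimal sketch in C with hypothetical function names; like the text above, it assumes that the IPU really consumes 1000 MB/s.

```c
/* Gain of `wide` over `narrow` in percent, rounded to the nearest
 * integer. Expects wide >= narrow and narrow > 0. */
unsigned gain_percent(unsigned wide, unsigned narrow)
{
    return ((wide - narrow) * 100u + narrow / 2u) / narrow;
}

/* Aggregate bandwidth in MB/s: per-core copy rate times the number of
 * cores, plus the bandwidth assumed for the IPU. */
unsigned total_mb_per_s(unsigned per_core_mb_per_s, unsigned cores,
                        unsigned ipu_mb_per_s)
{
    return per_core_mb_per_s * cores + ipu_mb_per_s;
}
```

`gain_percent(676, 613)` gives 10 and `gain_percent(496, 443)` gives 12. `total_mb_per_s(443, 4, 1000)` gives 2772 and `total_mb_per_s(496, 4, 1000)` gives 2984, in line with the rounded-down 2.7 GB/s and 2.9 GB/s. These figures count memcpy payload bytes, as the post does.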

These numbers are ~0.1-0.2 GB/s higher than posted before, because for this benchmark now I am using the absolute fastest configuration, my current DUT is able to achieve. Final configuration (used during my previous benchmarks) will be using RALAT/WALAT of 4/0, CPU-core clock limited to 1.0 GHz, and DDR3-1066G timings (CL8 instead of CL7).

Very good performance for the 32-bit interface, rather poor performance for the 64-bit interface. Looking at these numbers and at the sensitivity of the benchmark measurements to even small timing configuration changes in the MMDC, it seems the performance limit here is indeed the MMDC itself.

But as my numbers seem to match up more or less with your modeling data, I suppose this is about all that the 64-bit DDR3 interface can achieve?!

Regards,

Marc

igorpadykov
NXP Employee

Hi Marc

Probably yes. On one customer board, 98% bus loading was observed with the mmdc_prof tool (LPDDR2 400 MHz 32-bit test data, about 1500 MB/s).

Best regards

chip

MOW
Contributor IV

Hi Chip

While we hadn't expected to see twice the performance with 64-bit DDR3 RAM, we had expected more than just ~10-12% performance improvement. Alas, if that is all the MMDC can do, that's a hard limit, then.

But as you mention a customer with 32-bit LPDDR2: is the performance difference between 2-channel LPDDR2 configuration (each channel 32-bit wide) compared to single-channel LPDDR2 similar to the performance difference between 64-bit and 32-bit DDR3 interfaces? I'm wondering because of the special NIC-301 mapping for 2-channel LPDDR2 and the second MMDC AXI-bus, which is unused in all DDR3 configurations?!

Regards,

Marc

igorpadykov
NXP Employee

Hi Marc

MX6Q LPDDR2 2-channel (2x32 interleaved) performance is close to (though lower than) 64-bit DDR3 at the same frequency.

Best regards

chip
