Hi Marc,
I'm going to jump in here with some comments, and this seemed like the best place to add them, since they are going to be along the lines of WALAT/RALAT.
I understand how you came up with a very theoretical bandwidth value of 3.2 GB/s. And yes, if you specified upfront to start at a particular physical DDR memory address and conducted Reads and Writes only to the same Bank/Row address (different column addresses allowed), you may be able to achieve something close to your theoretical value, assuming that you are also not conducting the refreshes that JEDEC requires.
I'm sure you are going to tell me that the above is obvious and you never were expecting to get to 3.2 GB/s, but I state the above only as a means for pointing out where you are going to get performance improvements.
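For reference, here is a minimal sketch of how a peak figure like that is normally derived: two transfers per clock times the bus width in bytes. The 400 MHz clock and 32-bit bus below are assumed example values that happen to give 3.2 GB/s; they are not taken from this thread, so substitute your own.

```python
def peak_bandwidth_bytes_per_s(clock_hz: float, bus_width_bits: int) -> float:
    """Peak DDR bandwidth: two transfers per clock, bus_width_bits per transfer."""
    return clock_hz * 2 * (bus_width_bits / 8)


# Assumed example values (hypothetical, not from the thread).
ASSUMED_CLOCK_HZ = 400e6
ASSUMED_BUS_WIDTH_BITS = 32

peak = peak_bandwidth_bytes_per_s(ASSUMED_CLOCK_HZ, ASSUMED_BUS_WIDTH_BITS)
print(f"peak = {peak / 1e9:.1f} GB/s")
```

This number ignores refreshes, Bank/Row changes, and read/write turnarounds, which is why it is only an upper bound.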
The first step is in minimizing the length of the data traces. It doesn't matter so much for Writes, but for Reads, you have to account for extra time to complete the round trip. That is essentially what RALAT is doing for you. It gives you extra time to complete the data return trip, from the time the controller releases the byte lane for a read to the time that the DDR has completed sending the data and it has finally reached the processor pins and has been "clocked" in. Setting RALAT = 5 means you are adding five extra clocks to each 8-burst read cycle. So, bringing it down to 3 clock cycles means that you no longer waste the additional 2 clocks. But that only works to a limit. You can't set RALAT = 2 because of the physical limitation of the layout: you simply have not told the controller to wait long enough to complete a read cycle. In other words, to get down to a point where RALAT = 2 will work for you, you are going to have to modify the layout. If you are using a Tee-Topology and the lengths of your byte lanes closely match the length of your clock trace(s), then WALAT should be 0, and you can save that extra clock for write cases.
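To put rough numbers on the RALAT point, here is a deliberately simplified sketch. It assumes each 8-burst read occupies 4 data clocks (8 transfers at 2 per clock) plus RALAT extra clocks, and it ignores CAS latency and every other timing parameter, so treat the result as an illustration of the trend and not as a prediction.

```python
BURST_LENGTH = 8          # 8-burst read
TRANSFERS_PER_CLOCK = 2   # double data rate
DATA_CLOCKS = BURST_LENGTH // TRANSFERS_PER_CLOCK  # clocks spent moving data


def read_burst_clocks(ralat: int) -> int:
    """Clocks per 8-burst read in this simplified model: data clocks + RALAT."""
    return DATA_CLOCKS + ralat


for ralat in (5, 4, 3):
    clocks = read_burst_clocks(ralat)
    share = DATA_CLOCKS / clocks
    print(f"RALAT={ralat}: {clocks} clocks per burst, {share:.0%} spent on data")
```

Under these assumptions, going from RALAT = 5 to RALAT = 3 removes 2 of the 9 clocks per read burst.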
But that is low hanging fruit: Where else can you save extra clock cycles?
This is where you are going to have to experiment with DDR timing settings, and you are probably going to want to use more reliable DDR devices like Micron, to see if you can push the limits of their Read and Write latencies.
For refreshes, make sure that you are refreshing no more often than JEDEC requires, which is an average refresh interval (tREFI) of 7.8 us at normal operating temperatures.
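As a rough sketch of what refreshes cost, the fraction of time the device is busy refreshing is approximately tRFC divided by the refresh interval. Both values below are assumptions for illustration: take tRFC from your DDR datasheet and use the refresh interval you actually program.

```python
def refresh_overhead(trfc_ns: float, refresh_interval_ns: float) -> float:
    """Approximate fraction of time lost to refresh: tRFC / refresh interval."""
    return trfc_ns / refresh_interval_ns


# Assumed example values (hypothetical); check your datasheet and settings.
ASSUMED_TRFC_NS = 260.0
ASSUMED_REFRESH_INTERVAL_NS = 7800.0

overhead = refresh_overhead(ASSUMED_TRFC_NS, ASSUMED_REFRESH_INTERVAL_NS)
print(f"refresh overhead = {overhead:.1%}")

# Refreshing twice as often doubles the cost.
print(f"at half the interval = {refresh_overhead(ASSUMED_TRFC_NS, ASSUMED_REFRESH_INTERVAL_NS / 2):.1%}")
```

With these assumed numbers, refresh costs a few percent of the available time, so it matters less than Bank/Row changes but is not free.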
The other timing parameters that may potentially help you get a performance boost are:
tCL (CAS Read Latency)
tRFC (Refresh Command to Active or Refresh command time)
tRCD (Active command to internal read or write delay time)
tRP (Precharge command period)
tRC (Active to Active or Refresh Command period)
tRAS (Active to Precharge Command period)
tRPA (Precharge-All command period)
tWR (Write recovery time)
tCWL (CAS Write Latency)
tRTP (Internal Read command to Precharge command delay)
tWTR (Internal WRITE to READ command delay)
tRRD (Active to Active command period)
RTW_SAME (Read to write delay for same chip select)
You are using only one chip select, correct? Two chip selects add delays and therefore lower performance.
Mostly what you are looking to achieve is to minimize the time it takes to close one Active Bank/Row and open a different one. This is all overhead which takes away from performance.
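A simplified sketch of that overhead: a read that hits the already open Row only waits tCL, while a read to a different Row in the same Bank first has to precharge (tRP) and activate (tRCD). The clock counts below are assumed example values, not settings from this thread, and the model assumes tRAS has already been satisfied.

```python
def row_hit_read_clocks(tcl: int) -> int:
    """Clocks from Read command to first data when the Row is already open."""
    return tcl


def row_miss_read_clocks(trp: int, trcd: int, tcl: int) -> int:
    """Clocks to first data when another Row in the same Bank must be closed first."""
    return trp + trcd + tcl


# Assumed example timings in clocks (hypothetical); use your datasheet values.
ASSUMED_TRP, ASSUMED_TRCD, ASSUMED_TCL = 6, 6, 6

hit = row_hit_read_clocks(ASSUMED_TCL)
miss = row_miss_read_clocks(ASSUMED_TRP, ASSUMED_TRCD, ASSUMED_TCL)
print(f"row hit: {hit} clocks, row miss: {miss} clocks, penalty: {miss - hit} clocks")
```

Every clock you can safely trim from tRP and tRCD comes straight off that penalty, which is why those parameters are worth experimenting with.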
That is essentially all you are going to be able to do without modifying the test code to limit the number of Active Bank/Row changes required during testing.
I really don't think the AXI pipeline itself is holding you back any.
Cheers,
Mark