External SDram read consistently slower than write

lpcware · ‎06-15-2016

Content originally posted in LPCWare by fguerzoni on Wed Oct 08 16:10:34 MST 2014
Hi,
I have a strange issue with an LPC4357 + 256M-AS4C16M16S (16x16x4) SDRAM on a custom board.
When using GPDMA:
- Writes from local ram (any bank) to external ram are at 185MByte/sec
- Reads from external ram to local ram (any bank) are at 75MByte/sec
Theoretical speed limit is 204MByte/sec (ram clock is 102MHz and bus is 16bit)
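
As a quick check of the figures above, here is a minimal sketch of the arithmetic. The function names are made up for illustration; only the 102MHz clock, the 16-bit bus and the measured 185 and 75 MByte/sec come from the post.

```c
#include <assert.h>

/* Peak SDRAM bandwidth in MByte/sec: one bus-wide transfer per clock. */
double peak_mbytes_per_s(double clock_mhz, unsigned bus_bits)
{
    assert(bus_bits % 8 == 0);
    return clock_mhz * (bus_bits / 8);
}

/* Measured rate as a percentage of the peak rate. */
double bus_efficiency_pct(double measured_mbytes_per_s, double peak)
{
    assert(peak > 0.0);
    return 100.0 * measured_mbytes_per_s / peak;
}
```

With peak_mbytes_per_s(102.0, 16) = 204, the writes reach roughly 91% of the limit and the reads roughly 37%.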

When using a direct memcpy things go more slowly, but writes still run at double the read speed.

There are no data errors at all, but I think I'm missing something big, because reads should be at least as fast as writes.

I double-checked the configuration and the application code without result.
Thanks in advance for any suggestion.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by eamonnh on Tue Oct 14 13:44:47 MST 2014
Hi Mike,
Thanks for your reply. I think your comment about the x32 vs x16 may well explain why I'm not seeing the extra bandwidth I was expecting.
Your idea of trying 2 simultaneous reads is very good. I'll run a test and post back if there is any gain.
Thanks - Eamonn

lpcware · ‎06-15-2016

Content originally posted in LPCWare by mch0 on Tue Oct 14 11:46:28 MST 2014
Hi eamonn

I'm interested in numbers, too :)
Unfortunately I believe right now that the gain from x16 to x32 is far less than a factor of two.
Hopefully I will stand corrected, we'll see.

As far as I understand the UM, the EMC always generates burst accesses. If so, the gain of x32 vs. x16 is only 4 EMC clocks, but the overhead (burst setup + gap) remains constant.
That's why I'm pessimistic.
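
To put rough numbers on that argument, here is a small model, not a measurement. It assumes each burst moves 16 bytes (8 beats on x16, 4 beats on x32, which matches the "4 EMC clocks" gain), that the overhead is a fixed 14 clocks, and that both widths run at 102MHz. The 14 clocks are only an estimate: 16 bytes x 102MHz / 75MByte/sec is about 21.8 clocks per burst for fguerzoni's reads, 8 of which carry data. The function name is hypothetical.

```c
#include <assert.h>

/* Modelled read rate in MByte/sec: one burst plus a fixed overhead. */
double modelled_read_rate(double clock_mhz, unsigned bus_bits,
                          unsigned block_bytes, unsigned overhead_clocks)
{
    unsigned bus_bytes = bus_bits / 8;
    assert(bus_bytes > 0 && block_bytes % bus_bytes == 0);

    /* Clocks that actually carry data for one block. */
    unsigned data_clocks = block_bytes / bus_bytes;

    /* MHz * bytes / clocks gives MByte/sec. */
    return clock_mhz * block_bytes / (double)(data_clocks + overhead_clocks);
}
```

Under these assumptions modelled_read_rate(102.0, 16, 16, 14) is about 74MByte/sec and modelled_read_rate(102.0, 32, 16, 14) is about 91MByte/sec, so the wider bus would buy roughly 22%, not a factor of two.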

What I would like to try is whether you can get rid of / shorten the gap ("dead cycles").
Can you try to set up two bus masters that will generate read accesses at the same time?
The idea is to see whether the EMC can generate back-to-back read bursts at all.
As you seem to have a LA connected to the bus, you could see a difference easily.

For a test maybe you could do this:
Set up two DMA transfers of 1000 transactions (32bit) each, one starting at an address like 0x20000000, the other at, say, 0x20000080.
Then start both channels at the same time.
My idea is this:
Both transfers will hit the same page (=active row) for half the time. While the EMC reads in one burst the other channel already generates read accesses. This second channel will stall, but at least the read access could be queued in the EMC.
So when the first burst terminates it could start the second one immediately.
After that burst the roles are reversed and the show could go on w/o gaps.

If the gaps are still present I think we can pretty much give up on trying to enhance read performance.
If the gaps vanish, there is hope :)

Mike

lpcware · ‎06-15-2016

Content originally posted in LPCWare by eamonnh on Tue Oct 14 11:20:29 MST 2014
Hi fguerzoni,
I have struggled with this issue also. I have a 32-bit SDRAM which is reaching 282 MB/sec write and only 101 MB/sec read. We can see some extra dead cycles on the EMC bus between each read access which seems to slow it down quite a bit.
I'm interested that you are achieving 75MB/sec with a 16-bit bus, which suggests that I may be able to achieve closer to 150MB/sec with a 32-bit databus.
Would you be kind enough to post your EMC configuration? I'd really like to see if I can improve my 32-bit read above the 101MB/sec I'm achieving. The part I'm using is an IS42S32160C.

Here's a snippet of my EMC config:
// Name | nS | Clks | NXP Description | ISSI Description
emc->DYNAMICRP   = tRP-1;  // tRP  | 20 | 2 | Precharge Command
emc->DYNAMICRAS  = tRAS-1; // tRAS | 48 | 5 | MemoryActive to PrechargeCommand
emc->DYNAMICSREX = tSRX-1; // tSRX | 70 | 7 | Self Refresh Exit time. | tSRX Exit Self Refresh
emc->DYNAMICAPR  = tAPR-1; // tAPR | ?  | 1 | LastDataOut to Active
emc->DYNAMICDAL  = tDAL;   // tDAL | ?  | 4 | DataIn to Active
emc->DYNAMICWR   = tWR-1;  // tWR  | -  | 2 | WriteRecovery
emc->DYNAMICRC   = tRC-1;  // tRC  | 70 | 7 | ActiveCommandPeriod | tRC Row cycle time
emc->DYNAMICRFC  = tRFC-1; // tRFC | ?  |   | AutoRefreshPeriod
emc->DYNAMICXSR  = tXSR-1; // tXSR | 70 | 7 | ExitSelfRefresh | tSRX Exit Self Refresh
emc->DYNAMICRRD  = tRRD-1; // tRRD | 15 | 2 | Active Bank A B Time | tRRD Row activate to row activate delay
emc->DYNAMICMRD  = tMRD-1; // tMRD | -  | 2 | LoadModeReg to ActiveCommand | tMRS Mode Register Set cycle time.
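
For anyone reproducing values like tRP or tRAS, here is a sketch of the usual ns-to-clocks conversion. The helper names are hypothetical and the 100MHz EMC clock is an assumption: the post does not state the clock, but the ns and clock figures in the comments appear to read 20 ns as 2 clocks, 48 ns as 5, 70 ns as 7 and 15 ns as 2, which fits a 10 ns period. The "-1" mirrors the snippet, where most registers seem to take the clock count minus one.

```c
#include <assert.h>
#include <stdint.h>

/* Round a datasheet time in ns up to whole EMC clocks. */
uint32_t ns_to_clocks(uint32_t t_ns, uint32_t emc_clock_hz)
{
    /* clocks = ceil(t_ns * f / 1e9), done in 64-bit integer math */
    uint64_t num = (uint64_t)t_ns * emc_clock_hz;
    return (uint32_t)((num + 999999999ULL) / 1000000000ULL);
}

/* Value for a register that encodes "n + 1 clock cycles". */
uint32_t emc_reg_n_plus_1(uint32_t t_ns, uint32_t emc_clock_hz)
{
    uint32_t clocks = ns_to_clocks(t_ns, emc_clock_hz);
    return clocks > 0 ? clocks - 1 : 0;
}
```

Note that rounding up matters near a boundary: at 102MHz the same 20 ns comes out as 3 clocks rather than 2.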

I've also tried both RBC and BRC modes, without any apparent gain.

Thanks for bringing up this topic.
Kind regards
Eamonn

lpcware · ‎06-15-2016

Content originally posted in LPCWare by mch0 on Thu Oct 09 06:23:17 MST 2014
Hi,

I think with memcpy() the same effect is at work.
Writes to the SDRAM are combined into a buffer, then executed with a burst transaction.
Since there are four buffers, the memcpy() can run close to maximum speed.

Reads from the SDRAM are again prone to the gaps between bursts, since the EMC does not (seem to) prefetch data beyond a burst.
So most probably 3 of the 4 buffers will be unused (*) for reads, while the one in use must be emptied by the memcpy() before the next transaction is initiated by the EMC.
I thought of trying to "trick" the EMC into a prefetch by placing every 4th read "out of order" to generate an early request for the next burst, but I am unfortunately rather sure that this won't work with the M4 core (in order execution).

Another idea, if your application allows for that, would be to tightly couple program execution of the two cores on the AHB (M4 and M0) and/or the GPDMA to create interleaved read accesses. In that case the EMC would see the request for the next "burst block" earlier and could perhaps reduce the gap. A danger here is a premature end of burst.
As a last resort I would try that but it is certainly not easy to implement reliably.

(*) The EMC may use all four buffers to create a cache, but this does not help in your case of a linear read.

About my own results: The board version with SDRAM isn't even laid out yet, so I will not be able to deliver any real numbers for the next 4 weeks :(

lpcware · ‎06-15-2016

Content originally posted in LPCWare by fguerzoni on Thu Oct 09 05:35:36 MST 2014
Hi Mike,
thank you for quick and detailed reply.
Yes, I'm very satisfied with the write speed. Not so much with the read speed, because it's below my requirements.
And yes, I can do nothing to change that result in terms of RAM and EMC settings. Performance is highly repeatable.
I preferred not to use a 32-bit bus because of the PCB design hell around line lengths and impedance. But I'm very interested in your future results, so please share when ready.
How would you explain the performance difference when the MCU does the data transfer (memcpy)? Writes run at double the speed of reads.
Regards
Filippo

lpcware · ‎06-15-2016

Content originally posted in LPCWare by mch0 on Thu Oct 09 03:49:52 MST 2014
Hi,

I think this might be a systematic problem, nothing you can correct or avoid.
There are several buffers (GPDMA, EMC, to some extent the active row in the SDRAM) that help a lot for writes but much less for reads.

For writes the much faster GPDMA (double clock, double width) can refill the EMC buffer while the last data is written out (burst of 8) to the active row.
Since the EMC knows already the address of the next write and also has some data available, it can generate a back-to-back write burst.

For reads the GPDMA will most probably only generate the read request for the next burst when it has read (and maybe even stored) the last word of the previous burst.
Therefore the EMC must leave a gap (in clock cycles) between the two read bursts, it just doesn't know earlier about the new request. Maybe it even closes the row, which would prolong the gap even more.
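
The size of that gap can be estimated from the measured rates. This is only a sketch: it assumes every access is a burst of 8 on the 16-bit bus at 102MHz (the figures from the first post), and the function names are made up for illustration.

```c
#include <assert.h>

/* Average EMC clocks spent per burst, from a measured rate in MByte/sec. */
double clocks_per_burst(double clock_mhz, unsigned burst_bytes,
                        double rate_mbytes_per_s)
{
    assert(rate_mbytes_per_s > 0.0);
    return clock_mhz * burst_bytes / rate_mbytes_per_s;
}

/* Clocks per burst that carry no data (setup plus gap). */
double idle_clocks_per_burst(double clock_mhz, unsigned bus_bits,
                             unsigned burst_len, double rate_mbytes_per_s)
{
    unsigned burst_bytes = burst_len * (bus_bits / 8);
    return clocks_per_burst(clock_mhz, burst_bytes, rate_mbytes_per_s)
           - burst_len;
}
```

With 185MByte/sec this gives under one idle clock per write burst, which fits back-to-back bursts. With 75MByte/sec it gives close to 14 idle clocks per read burst.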

This is highly dependent on the internal strategy of the EMC, of which we know close to nothing - the UM is not very informative here :/

But the underlying problem is still present, regardless of the relative "smartness" or "dumbness" of the EMC.

This is just my opinion, I don't have any more information than the UM provides.
I'll be interested in further results, if you have some, however.
Our next custom board shall also include an SDRAM, we'll try to get a 2Mx32 working at up to 200MHz (AS4C2M32S-5). I'm not sure we can get there, though.

Anyway, the write rate you achieve is very good, I think.

Mike

LPC43xx