Troubleshooting DDR2 RAM problems

ynaught · ‎06-02-2015

I have a prototype design on my desk which occasionally (10 - 20% of the time) fails to initialize its DDR RAM on power-up.

Our boot code, as a matter of its power-up sequence, relies on the MCF5441x internal SRAM for its variable and stack space, in order to be able to bring up the DDR2 RAM and test it without worries.

The first test in our "Fast RAM Test" writes the value 1 to the first longword in RAM, then reads it back. It then repeats the test on the same memory location with the value 2, and so on, shifting the bit left each time.
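For readers who want to reproduce the test described above, here is a minimal sketch of a walking-ones check on a single longword. The function name `walking_ones_test` and its return convention are made up for illustration and are not the poster's actual code. It also assumes the location is read back from the RAM itself (data cache off or the region uncached).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Walking-ones test on a single 32-bit location (hypothetical helper).
 * Returns -1 on success, or the index (0..31) of the first bit
 * whose pattern did not read back correctly. */
static int walking_ones_test(volatile uint32_t *addr)
{
    assert(addr != NULL);
    for (int bit = 0; bit < 32; bit++) {
        uint32_t pattern = (uint32_t)1 << bit;
        *addr = pattern;            /* write 1, 2, 4, ... 0x80000000 */
        if (*addr != pattern)       /* read back from the same location */
            return bit;
    }
    return -1;
}
```

Returning the failing bit index rather than a plain pass/fail may help tell a stuck data line apart from a total failure, where bit 0 already fails.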

When the module fails on boot, the very first RAM test fails. Now, when the test fails, the code displays its error on the LCD then enters a tight loop until the [external] watchdog chip asserts the hardware /RESET line. The RAM always works correctly on the 2nd boot.

Earlier today, I added a loop at the end of my DRAM init function to wait for the "DRAM initialization complete" bit in "DDR_CR27". The bit appears to be set in every case, even when the RAM later fails its test.

void hardware_init_sdram( void) {
//#Select DDR 2x clock to PLL VCO
(*(vuint16*)(0xEC09001A)) = 0xa002;         // MISCCR2 P.10-3 10-11

//#Enable clocks for DDR Controller
(*(vuint8*)(0xFC04002D)) = 0x2E;      // PPMCR0 P.9-1, 9-4. Enable clock x2E (46), the SDRAM. Already done above.

//#Configure DDR2 drive strength 1.8V
//(*(vuint8*)(0xEC094060)) = 0x01; // MSCR_SDRAMC p.15-13, 15-24 Full strength
(*(vuint8*)(0xEC094060)) = 0x00; // MSCR_SDRAMC p.15-13, 15-24 Half strength

//   Wait( 100); // Uncomment this and ALL HELL BREAKS LOOSE. (don't do it)

// Following was suggested by Rocky to try to thwart the occasional RAM-fail-caused
// double-boot. It *might* have helped, but did not cure:
FastTimer( 100); // Already "tuned" because LCD is lit up.

(*(vuint32*)(0xFC0B8180)) = 0x00000000; // RCR
(*(vuint32*)(0xFC0B8180)) = 0x40000000; // RCR
(*(vuint32*)(0xFC0B81AC)) = 0x01030203; // PADCR

(*(vuint32*)(0xFC0B8000)) = 0x01010101; // CR00
(*(vuint32*)(0xFC0B8004)) = 0x00000101; // CR01
(*(vuint32*)(0xFC0B8008)) = 0x01010100; // CR02
(*(vuint32*)(0xFC0B800C)) = 0x01010000; // CR03
(*(vuint32*)(0xFC0B8010)) = 0x00010101; // CR04
(*(vuint32*)(0xFC0B8018)) = 0x00010100; // CR06
(*(vuint32*)(0xFC0B801C)) = 0x00000001; // CR07
(*(vuint32*)(0xFC0B8020)) = 0x01000001; // CR08
(*(vuint32*)(0xFC0B8024)) = 0x00000100; // CR09
(*(vuint32*)(0xFC0B8028)) = 0x00010001; // CR10
(*(vuint32*)(0xFC0B802C)) = 0x00000200; // CR11
(*(vuint32*)(0xFC0B8030)) = 0x01000002; // CR12
(*(vuint32*)(0xFC0B8034)) = 0x00000000; // CR13 *
(*(vuint32*)(0xFC0B8038)) = 0x00000100; // CR14
(*(vuint32*)(0xFC0B803C)) = 0x02000100; // CR15
(*(vuint32*)(0xFC0B8040)) = 0x02000407; // CR16
(*(vuint32*)(0xFC0B8044)) = 0x02030007; // CR17
(*(vuint32*)(0xFC0B8048)) = 0x02000100; // CR18
(*(vuint32*)(0xFC0B804C)) = 0x0A030203; // CR19
(*(vuint32*)(0xFC0B8050)) = 0x00020708; // CR20
(*(vuint32*)(0xFC0B8054)) = 0x00050008; // CR21
(*(vuint32*)(0xFC0B8058)) = 0x04030002; // CR22
(*(vuint32*)(0xFC0B805C)) = 0x00000004; // CR23
(*(vuint32*)(0xFC0B8060)) = 0x020A0000; // CR24
(*(vuint32*)(0xFC0B8064)) = 0x0c00000e; // CR25
(*(vuint32*)(0xFC0B8068)) = 0x00002004; // CR26
(*(vuint32*)(0xFC0B806C)) = 0x00000000; // CR27 *
(*(vuint32*)(0xFC0B8070)) = 0x00100010; // CR28
(*(vuint32*)(0xFC0B8074)) = 0x00100010; // CR29
(*(vuint32*)(0xFC0B8078)) = 0x00000000; // CR30 *
(*(vuint32*)(0xFC0B807C)) = 0x07990000; // CR31

(*(vuint32*)(0xFC0B80A0)) = 0x00000000; // CR40
(*(vuint32*)(0xFC0B80A4)) = 0x00000064; // CR41
(*(vuint32*)(0xFC0B80A8)) = 0x44520002; // CR42
(*(vuint32*)(0xFC0B80AC)) = 0x00C80023; // CR43

(*(vuint32*)(0xFC0B80B4)) = 0x0000c350; // CR45

(*(vuint32*)(0xFC0B80E0)) = 0x04000000; // CR56
(*(vuint32*)(0xFC0B80E4)) = 0x03000304; // CR57
(*(vuint32*)(0xFC0B80E8)) = 0x40040000; // CR58
(*(vuint32*)(0xFC0B80EC)) = 0xC0004004; // CR59
(*(vuint32*)(0xFC0B80F0)) = 0x0642C000; // CR60
(*(vuint32*)(0xFC0B80F4)) = 0x00000642; // CR61

//    asm( "    .balignw    4, 0x51FC"); // Pad to align with TPF (TRAPF)
(*(vuint32*)(0xFC0B8024)) = 0x01000100;

while( !(MCF_DDR_CR27 & MCF_DDR_CR27_INTSTATUS_DRAM_INIT ) )
    Wait( 1);
Wait(100);
}
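
The wait loop at the end of the function above has no exit if the bit never sets. A bounded variant could look like the sketch below. The helper name `wait_for_bits` and the poll count are assumptions, not part of the original code. Since the poster reports that the bit is always set, this only guards against a hang and is not expected to cure the failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Poll *reg until any bit in mask is set, giving up after max_polls reads.
 * Returns true if the bit was seen, false on timeout (hypothetical helper). */
static bool wait_for_bits(volatile const uint32_t *reg, uint32_t mask,
                          uint32_t max_polls)
{
    assert(reg != NULL);
    while (max_polls--) {
        if (*reg & mask)
            return true;
        /* A short delay, such as the Wait(1) used in the listing,
         * could be placed here on the target. */
    }
    return false;
}
```

On the target, a call such as `wait_for_bits(&MCF_DDR_CR27, MCF_DDR_CR27_INTSTATUS_DRAM_INIT, 1000)` might replace the open-ended loop. This assumes `MCF_DDR_CR27` expands to a dereferenced volatile 32-bit register as in the listing. A pointer cast may be needed, and the count of 1000 is arbitrary.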

Any suggestions for testing would be appreciated!

--Adam

ynaught · ‎11-18-2015

Our hardware engineer had me try setting MISCCR2[DCCBYP] = 1. Not sure why, but it fixed this problem...
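
The post does not show the code change, so the following is only a sketch of setting one bit in a 16-bit register such as MISCCR2 without disturbing the others. The helper name `reg16_set_bits` is hypothetical. The bit mask for DCCBYP is deliberately left as a parameter and has to be taken from the MISCCR2 description in the Reference Manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Set the bits in mask without disturbing the other bits of a
 * 16-bit register. Returns the value written (hypothetical helper). */
static uint16_t reg16_set_bits(volatile uint16_t *reg, uint16_t mask)
{
    assert(reg != NULL);
    uint16_t value = (uint16_t)(*reg | mask);   /* read-modify-write */
    *reg = value;
    return value;
}
```

A read-modify-write keeps whatever the init code already wrote to the register (0xa002 in the listing above). Folding the extra bit into that constant would work equally well.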

TomE · ‎11-18-2015

> Our hardware engineer had me try setting MISCCR2[DCCBYP] = 1. Not sure why, but it fixed this problem...

I'm glad you found how to fix it. Now I'm looking at the manual to try and work out why.

The Reference Manual documents the "DCC" as ... it doesn't document it at all.

It shows up in "Figure 8-1. Device Clock Connections", and "MISCCR2[DCCBYP]" gets a mention saying "0 DCC output is a duty-cycle corrected version of its input clock" and the "1" bypasses it. There's nothing else in the manual saying what it is, what it is for or why you might want to use it.

Except for "21.6 Initialization/Application Information" in the DDR chapter, which says "Issue write register commands to configure the DRAM protocols and the settings for the DCC." What DCC settings? There aren't any!

The Data Sheet lists the VCO Frequency as 240-500MHz, but also lists the DCC frequency as 300-500MHz, with the note "Required only for DDR2 memory". According to that the DCC is required for DDR2, but you've found it stops your DDR2 from working.

The DCC features in the Errata in "SECF205: Bus masters may fetch corrupt data/instructions from DDR2 memory.". That doesn't look like your problem, but it is interesting that part of the fix for this problem is to disable the DCC.

The only other mention of "DCC" in this forum is when I was complaining about this CPU's documentation previously and pasted in the above line:

https://community.freescale.com/message/89250#89250

So either nobody else is having problems with this or nobody else is using the chip, or nobody is running it at the frequencies that you are. Or something else.

Maybe the DCC is used in some other chip, some other MCF or maybe an MPC chip and it may be documented there. Searching for "DCCBYP" only finds it in the MCF5441x.

One of the QorIQ chips (T1040, see the Reference Manual) seems to have a "duty cycle corrector":

https://community.freescale.com/thread/359843

That manual mentions that it exists, but doesn't say anything more about it.

Maybe Freescale couldn't document it until the Patent was released? So maybe we have to read the Patent.

"Duty cycle corrector and duty cycle correction method" and a previous "Digital clock frequency doubler":

http://www.google.gg/patents/US7132863

http://www.freepatentsonline.com/8552778.html

http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=5642614&url=http%3A%2F%2Fieeexplore.i...

I'd guess that in your unit the DCC was "dithering" and adding different amounts of delay in an attempt to adjust the duty cycle. This was probably enough to seriously affect the DDR data timing. Then again, your DDR timing might be extremely marginal and the DCC just pushed it over the edge. Your DDR may be more stable if you changed the timing so it could tolerate whatever the DCC is doing.

Tom

TomE · ‎06-02-2015

Yes, we had that in conjunction with a worse problem that was fixed by changing the Drive Strength.

https://community.freescale.com/message/66171#66171

>> Possibly related, after initialising the SDRAM we have to perform a few "dummy reads". If we don't, we can copy code from FLASH to RAM, but the first burst-read after the writes will sometimes read garbage. Nothing in the data sheets saying that is needed either.

So after initialising the DRAM we do this:

/* Do a read from RAM */
/* NOTE! This is a hack until the issue is better understood!
   Without this read here, some boards read corrupt values out of RAM
   on the first burst read. If the first burst read is when the code is
   being executed from RAM then corrupt instructions are read and an
   illegal instruction occurs. */
move.l #0x10, %d0 /* Cache line length */
move.l #0x40010000, %a0
move.l #0x40040000, %a1 /* Destination NOT in the boot itself */
move.l (%a0), (%a1)
add.l %d0, %a0
add.l %d0, %a1
move.l (%a0), (%a1)
add.l %d0, %a0
add.l %d0, %a1
move.l (%a0), (%a1)
add.l %d0, %a0
add.l %d0, %a1
move.l (%a0), (%a1)
add.l %d0, %a0
add.l %d0, %a1

We never found out why this is needed. It may be that the DRAM chips need to be "exercised" to get their internal state machines or buffers working properly. I've heard a rumour that some RAM chips use "dynamic flip-flops" made with capacitors that need to be "charged up" before they work. I've also seen a case (a long time ago with very old DRAM chips) where the output levels from the data buffers would "sag, then lock up" if the read cycle was too long.

It is likely this isn't a problem when DRAM chips are used in PCs because the bootstraps have a lot more work to do, or run dummy DRAM cycles (or purges or refresh cycles or some tests) that exercise the RAM before it gets used. The problem has therefore never been seen on PCs, so it has never been fixed.

Tom

ynaught · ‎06-03-2015

Your suggestions both sounded so plausible that I thought I might be done with this today!

Alas, "exercising" the RAM did not seem to help, even when I changed it to access fifty times (more is better, right?):

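void RAMCalisthenics( void *RAMStart )
{
// Exercise regime, recommended by Tom Evans (Freescale Forums) to get RAM past potential
// errors on startup:
// volatile keeps the compiler from dropping or merging the accesses. The byte offsets
// go through a char pointer because arithmetic on void * is a GNU extension.
volatile unsigned long *pSrc, *pDst;
int i;

i = 50;
pSrc = (volatile unsigned long *)((char *)RAMStart + 0x10000);
pDst = (volatile unsigned long *)((char *)RAMStart + 0x40000);

while( i--) {
    *pDst = *pSrc;
    pSrc += 4; // 4 longwords = 16 bytes, one cache line
    pDst += 4;
}
}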

Regarding the pin drive strength, we were already using the lowest drive strength setting (zero):

MCF_PAD_MSCR_SDRAMC = MCF_PAD_MSCR_SDRAMC_MSC_HS_DDR2; // HALF STRENGTH 1.8V DDR2 p.15-24

Just for grins, I also tried the High strength (1):

MCF_PAD_MSCR_SDRAMC = MCF_PAD_MSCR_SDRAMC_MSC_FS_DDR2; // FULL STRENGTH 1.8V DDR2 p.15-24

But that didn't fix it either. The other two values for the drive strength for SDRAMC, 2 and 3, are listed as "Reserved" in the user manual.

TomE · ‎06-03-2015

> Your suggestions both sounded so plausible that I thought I might be done with this today!

Sorry that this wasn't your problem. It was the most likely given your description.

But there's never "just one bug". There's at least THREE with DRAM on the one we were using (MCF5329).

QUICK FIX 1: But first, how about just running the DDRAM Controller and DDRAM initialisation sequence TWICE. It didn't work the first time, so why not just run it a second time, every time? That might fix the problem without you having to understand it.

QUICK FIX 2: Try "fast and slow" power ramps. For slow ones, turn the power supply on at the wall. For fast ones, plug it into the board (or use different power supplies). See if you can get a correlation between power ramp speed and the board working or failing. Also test "brown outs". Turn it off and on for times between "really fast" and 10 seconds and see if there's a correlation.
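
QUICK FIX 1 above suggests running the init sequence twice unconditionally. A slightly different variant, sketched below, re-runs it only when the RAM test fails. All names here are hypothetical, and the `fake_*` functions are host-side stand-ins for the real init and test routines. Whether the DDR controller tolerates being re-initialised while already started is an assumption that would need checking on the hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*init_fn)(void);
typedef bool (*test_fn)(void);

/* Run init, then test; repeat until the test passes or attempts run out.
 * Returns the number of attempts used, or 0 if every attempt failed. */
static int init_with_retry(init_fn init, test_fn test, int max_attempts)
{
    assert(init != NULL && test != NULL && max_attempts > 0);
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        init();
        if (test())
            return attempt;
    }
    return 0;
}

/* Host-side stand-ins: a simulated board whose RAM only works
 * after the second initialisation. */
static int g_init_calls;
static void fake_init(void) { g_init_calls++; }
static bool fake_test(void) { return g_init_calls >= 2; }
```

On the target, `hardware_init_sdram` and the fast RAM test would take the place of the stand-ins, which could avoid the watchdog reset and second boot when the first attempt fails.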

Now for the long-winded ones.

The next one you've already done - the Drive Strength. That can be a problem with these chips, but it causes different problems, and not just "failed to boot".

So now we move to the nasty and horrible ones. This happened to us with an MCF5329. That's quite a different chip to the MCF5441x, but it might have similar problems.

There is a generic problem with a lot of DRAM controllers, and it comes down to this. What happens if the CPU is reset half-way through a DDR read cycle? If the Reset stops the DDR controller dead, then the DRAM is left in the middle of a read cycle, still driving the bus. When the CPU starts up again, the DRAM is still driving the bus, and will keep doing so until it has been properly initialised (to get it out of the read cycle and then to reset it). That's usually OK if there's nothing else on that data bus, but on the MCF5329 these pins are shared with the external FLASH the chip is trying to boot from. This is SECF045 in the MCF5329 Errata. I suggest you read it for more background. Freescale's workaround is to halve the performance of the DRAM by changing from 32-bit to 16-bit and to do likewise with the FLASH, running a "Split Bus" configuration. That halves the system performance.

Your chip doesn't share the FlexBus and DDR data bus lines so you won't get a lockup, but some weird startup state may be messing up your DDRAM init.

There is a different specific problem with the MCF5329, detailed here, keyword "undeterministic":

https://community.freescale.com/message/84836

I also have a 41-page report detailing this problem. No, don't ask. :-)

The DDRAM controller in the MCF5441x may not have this problem, but the following details should show you what to look for in case it does.

The DDR Controller pins can start up "random" until they get clocked out of those states and into legal ones. The "Reset" signal doesn't fix this. The DRAM controller isn't reset by the Reset signal. These "Random" initial states can generate a command to the DRAM that makes that chip drive the bus. You've now got the same "locked bus" problem as with SECF045. This is documented as SECF196. The recommended workaround for this is to BUFFER some of the DRAM control signals. That isn't very easy as it changes the DRAM signal timing. Freescale also made a mask change to pull the CKE signal high on power-up to try and mitigate SECF045, and that change affects the SECF196 problem. Freescale then released EB740 to provide more details.

http://cache.freescale.com/files/32bit/doc/eng_bulletin/EB740.pdf

This problem happens more often on a "brown out" rather than a stone-cold power-on. But our devices are commonly switched off and on again rapidly, so we have to deal with this.

The other fix is to make sure that the system clock is being supplied BEFORE some of the power supplies ramp up. Unfortunately the internal oscillator isn't guaranteed to run until the voltages are above the level when the signals start driving bad levels, but we crossed our fingers and made a minor circuit modification that worked for us.

We managed to demonstrate this problem on Freescale's EVB, and we found something interesting. The EVB is powered by a "Wall Wart". If you switch the Wart on at the wall the board never has a problem. This is because the "Wart" turns on really slowly, the power supplies ramp up slowly, the oscillator starts before the DDRAM controller goes crazy and it all starts OK. If we powered the EVB up by plugging the cable from the Wart into the board, the supplies ramp up fast and it gives the problem. But how would most people test the EVB? From the power switch on the wall, so they'd never see the problem. Now build the chip into something with a rapid switch-on (like in a car) and you have problems.

What we did was to change our hardware design to use a slow (30ms) power supply ramp on startup. We also added some pulldown resistors on SD_CLK and SD_CKE and on the power supply rails to discharge them faster.

Tom
