DDR3 Intermittent Problem

johnfielden · ‎01-09-2014

We are seeing an intermittent issue with the DDR3 interface on our boards. Our design is similar to the Phytec implementation in that we use the same family of Micron DDR3 parts and use the simplest termination scheme (series resistors on the address, clock and control lines). We boot from QSPI and run from internal memory, so we aren't seeing any bootup issues currently. But since this board and processor are new to us, we've been running an extensive memory test over the DDR3 on each power up to verify that all is well.

The memory test is home-brewed and performs three separate tests: a data bit test (barber pole of the data bits), an address bit test (barber pole of the address bits), and random data patterns over the entire memory.
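As a rough illustration of the three tests described above, here is a minimal Python sketch that runs against a simulated memory (a plain list of words). It assumes "barber pole" means a walking-one pattern and a 16-bit data bus; the function names are hypothetical, and the real test would use direct memory accesses on the target rather than a Python list.

```python
import random

WORD_BITS = 16  # assumption: 16-bit data bus (two 8-bit data lanes)


def data_bit_test(mem, index=0, width=WORD_BITS):
    """Walk a single 1 across the data bus at one location.

    Returns the list of bit numbers that did not read back correctly.
    """
    failures = []
    for bit in range(width):
        pattern = 1 << bit
        mem[index] = pattern
        if mem[index] != pattern:
            failures.append(bit)
    return failures


def address_bit_test(mem):
    """Write a unique marker at offset 0 and at each power-of-two offset,
    then read them all back. A stuck or shorted address line makes two
    offsets alias, so one marker overwrites another.

    Returns the list of offsets whose marker was lost.
    """
    size = len(mem)
    offsets = [0] + [1 << b for b in range(size.bit_length()) if (1 << b) < size]
    for n, off in enumerate(offsets):
        mem[off] = n + 1
    return [off for n, off in enumerate(offsets) if mem[off] != n + 1]


def random_pattern_test(mem, seed=1234, width=WORD_BITS):
    """Fill the whole memory with seeded pseudo-random words, then verify.

    Returns the list of indices that did not read back correctly.
    """
    rng = random.Random(seed)
    expected = [rng.getrandbits(width) for _ in range(len(mem))]
    for i, value in enumerate(expected):
        mem[i] = value
    return [i for i, value in enumerate(expected) if mem[i] != value]
```

On a healthy simulated memory such as `mem = [0] * 1024`, all three functions return an empty list.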

We see three different results. First, an immediate failure: the data bit test fails at the first access at 0x8000000. Currently our software is such that when we get a data bit test failure the test stops running.

Secondly, we see a case where there is an occasional error during the random data test. In this case, our software will continue to run the test over and over. Typically we see zero or one failure per pass. The failures we see are never on the same data bit or at the same address. When we see a failure, our software immediately re-reads the same address, but we never see an error on the second read.

Third, no failures ever. When the board is in this "state" we have let it continue to run (testing the memory) continuously over and over for days. These runs have occurred at room temp, at a rapid temperature ramp to +65C, steadily at +65C for several days, a rapid temperature ramp to minus 20C, and steadily at minus 20C for several days. A fairly punishing routine with no failures.
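The re-read check from the second case can be sketched as follows. This is only an illustration in Python with hypothetical names (`classify_read`, `read_word`); it assumes the test can read a word through a callable. A mismatch that disappears on an immediate re-read suggests the stored data was correct and the first read was disturbed, whereas a mismatch that repeats suggests the data in memory is wrong.

```python
def classify_read(read_word, index, expected):
    """Read one word and, on a mismatch, immediately re-read it.

    read_word is a hypothetical callable that returns the word at index.
    Returns "pass", "transient" (second read is correct) or
    "persistent" (second read is still wrong).
    """
    first = read_word(index)
    if first == expected:
        return "pass"
    second = read_word(index)
    if second == expected:
        return "transient"
    return "persistent"
```

Logging the classification along with the address and the failing bits (`first ^ expected`) makes it easier to see whether errors cluster on a lane or an address range.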

We don't do any sort of warm start. Each of these tests is done with a cold start (power applied to the board). We are using the DDR3 setup parameters from the Tower board MQX setup. As it turns out, these values are the same as Phytec's parameters for the same Micron memory family that we are using.

The only difference between these three testing result cases is a power cycle. The first result (the immediate fail) seems to happen a lot after power has been applied for the first time after a minute or more of being unpowered. The other two occur mostly on second or third power up attempts. Most of the time it works flawlessly.

We had been looking very closely at the power applied to the part to see if we could find an issue. What we see is similar to what is sketched in the attached image. I sketched this because of the huge differences in time scale. From power cycle to power cycle, there is no discernible difference in the way 3.3V and 1.2V to the Vybrid come up. The resets and DDR_1.5V come up with no notable differences at the ms time scale.

After looking at the posts https://community.freescale.com/message/336513#336513, it appears that our problem may not be related to power sequencing but to some other DDR3 settings.

Any suggestions of what to look at next?

naoumgitnik · ‎01-11-2014

Hello John,

BTW, from the older thread you mentioned, it is not clear whether the problem similar to yours was resolved by the setting our apps person proposed.

Since your design is based on Phytec's, have you tried to contact them, specifically Russell Robinson Jr. from that thread (also a Community member)?

Sincerely yours, Naoum Gitnik.

johnfielden · ‎01-13-2014

We will try the new setting and see. We weren't sure what version of Vybrid we had on the board. It appears to be a 1.1, but there is a second part number that is X'd out across the face of the part, so there is some question.

johnfielden · ‎01-15-2014

The new setting did not help.

naoumgitnik · ‎01-15-2014

Hello John,

Before we start digging deeper, it makes sense to clarify several issues:

1. Does your test run flawlessly on any other trusted platform - ours or Phytec's?
2. How does your DDR3 interface section differ from ours, e.g. in the decoupling scheme? No need to send the entire schematic; just describe the differences, please, if any.
3. Is the DDR3 interface's layout based on any trusted example?

Sincerely yours, Naoum Gitnik.

johnfielden · ‎01-15-2014

1. The test runs flawlessly on the Tower. Booting from a pod or QSPI. I don't recall if we ran the same test on the Phytec, I think so, but that was before Christmas, so I'll have to verify that.

2. We are using the same decoupling scheme that the Phytec board uses, with one difference. Phytec uses 22 ohm series terminations on their board. We use 10 ohm, because the Micron FAE strongly suggested that 22 ohms is the wrong value. Since our tests run flawlessly on certain power up cycles, I have not considered the 10 ohm value to be wrong, but we are running low on options so I am willing to swap them out for 22 ohm resistors. The issue thus far seems to be related to the power up sequence. Power it up once, fails, power cycle, passes, power cycle again, fails.

We did just run through the entire DDR3 calibration process, and came up with the same values used on the Tower board.

The SW engineer did note during that calibration process that the default tower DDR register setup seems to include bit settings that are not documented. The Reference manual shows certain bits as reserved and always read as zero, but they are being set by the default DDR register settings, they don't read back as zero, and we do see that they affect the DDR memory test when cleared. The undocumented register bits are troubling to us.

3. The layout was not based on any prior design, but rules were established that meet the layout rules outlined in a Freescale presentation I was sent by an FAE. Those rules were reviewed and deemed acceptable by our local FAE.

Note that when the board powers up in the "good" state, the memory test runs over and over for days with no failures as the board is subjected to temperature changes between -20 and +65C. This makes me think that the layout is reasonably good.

Another thing to note. If we use the JTAG pod to load the application and run, the test passes. We only see the intermittent issue when booting from QSPI. Which again sounds like a power up issue.

naoumgitnik · ‎01-15-2014

Hello John,

Thanks for the information, it is indeed important!

Based on it, it also looks to me like a power-up related issue.

IMO, it makes sense to closely watch the behavior of the involved power rails during power-up, comparing the patterns when the problem occurs with those when it does not, and comparing both with our Tower board, which has no such problem.

Regarding Phytec's design - it is hard to comment on something not ours... but, at least to compare apples to apples and/or lower the number of variables, IMO, it makes sense to run your test on this board as well.

Sincerely yours, Naoum Gitnik.

johnfielden · ‎01-15-2014

We've been looking closely at the Tower board's power rails, compared with our board.

The 3.3V, and 1.2V look the same. The 2.5V and 1.1V, which we have no control over, look the same. The 1.5V looks the same. We're perplexed where else to look.

One difference is the AFE pins. We are not using the Video DAC, so we have the 3.3V AFE and the 1.2V AFE grounded through zero ohm resistors. We will white wire them to the appropriate rail and retry it.

Another thing to note: once the board has been powered up, if we manually reset it, it seems to always pass. We are still testing this, but it again indicates a power up/reset issue.

We note that on the Tower board, only the Vybrid is exerting reset, and only for around 500 us after 3.3V rises. We have the same thing on our board.

naoumgitnik · ‎01-15-2014

Hello John,

Yes, please, "un-GND" 3.3V AFE and 1.2V AFE and connect them to the relevant digital power rails. Even if it does not help, at least it does no harm ;-).
Regarding the reset duration - try extending it; you have to know how long it takes QSPI to it initialize to properly communicate with Vybrid during boot-up and keep Vybrid reset for long enough. (E.g., we bumped into this problem while booting from some SD cards, so the new Vybrid Tower board design takes this aspect into account.)

Sincerely yours, Naoum Gitnik.

johnfielden · ‎01-21-2014

Sorry, you may have mis-typed something. What should we be waiting for on the QSPI? The time it takes to it initialize? Isn't the initialization driven by the Vybrid? Or, is this some minimal time that the QSPI needs before we can start accessing it?

naoumgitnik · ‎01-21-2014

Hello John,

Yes, I mean "some minimal time that the QSPI needs before we can start accessing it".

See the thread "Power-up timing of some SDHC cards" for SDHC cards and USB memory sticks; the same requirement might be applicable to QSPI as well.

Sincerely yours, Naoum Gitnik.

johnfielden · ‎01-22-2014

We've had some developments since our last exchange.

First off, we have not done anything about the 3.3V AFE or 1.8V AFE yet. They are both still grounded. Also, we didn't look into the QSPI minimal time yet.

Instead, we went for broke and created a fully bootable version of Nucleus (the RTOS we are using) that has been altered to fully match the IO and other settings for our board. To avoid any QSPI issues, we are booting in low speed SPI mode. The only application running under Nucleus is our memory test. The test was altered to log the values of PHY11, PHY27 and PHY43 with each run of the test. To be clear, the registers are read and logged after a failure or a successful test.

Previously we were booting up Nucleus and overriding the initial board settings in our application. The DDR was not being initialized until we had entered our settings, so our settings should have been in use anyway. But now the initial settings don't have to be updated.

We made a version of this for the Tower and for our target board. The good news is that after a day of testing we've yet to see a DDR failure on our target board.

The next question is how to interpret the PHY registers we are reading. For either the Tower or our board, we see non-zero DLL_UNLOCK_VALUE and LOCK fields. This indicates that the associated DLL is becoming unlocked and re-locking over time for both boards. The values of DLL_UNLOCK_VALUE and DLL_LOCK_VALUE are generally very similar, but do vary over time for both boards.

The only other logged difference between the Tower and our board is that occasionally we do read the lock bit as unlocked. We are writing a script to parse our log files, but my eyeball tells me that PHY43 reports an unlock occasionally. Since the PHY read occurs after the test has run, I'm not sure whether it is telling us that the DLL was unlocked during the test or not.
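A log-parsing script of the kind mentioned above might look like the following minimal Python sketch. Everything about the log format is an assumption: it presumes each run writes one line containing entries such as `PHY43=0x00000001`, and `LOCK_MASK` is a hypothetical placeholder, since the actual position of the lock bit must be taken from the reference manual.

```python
import re
from collections import Counter

# Hypothetical placeholder: replace with the real lock-bit mask from the
# reference manual. Here a set bit is assumed to mean "locked".
LOCK_MASK = 0x1

# Assumed log entry format: NAME=0xHEX, e.g. "PHY43=0x00000001"
ENTRY = re.compile(r"(PHY\d+)=0x([0-9A-Fa-f]+)")


def count_unlocks(lines, lock_mask=LOCK_MASK):
    """Count, per PHY register, how many log lines show the lock bit clear."""
    unlocked = Counter()
    for line in lines:
        for name, hex_value in ENTRY.findall(line):
            if (int(hex_value, 16) & lock_mask) == 0:
                unlocked[name] += 1
    return unlocked


def unlock_rate(lines, name, lock_mask=LOCK_MASK):
    """Fraction of logged reads of one register that show the lock bit clear."""
    values = [int(v, 16) for line in lines for n, v in ENTRY.findall(line) if n == name]
    if not values:
        return 0.0
    return sum(1 for v in values if (v & lock_mask) == 0) / len(values)
```

Comparing `unlock_rate` for the same register across the Tower, the Phytec board and the target board would turn the "eyeball" impression into a number per board.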

Should we be concerned? Is it normal for the lock to come and go? Which calibration parameter is most likely to make the lock better?

We have other PHY questions too. The reference manual says that PHY 11 is for data slice 0, PHY 27 is for data slice 1, and PHY 43 is for data slice 2. I'm confused as to what is meant by slice. This isn't the same as channels, as there are only 2 data channels (8 bits each). So, what is a slice referring to? If I see an unlock bit in PHY43, what does that mean versus an unlock in the other two? Are there 3 DLLs at play? What is the function of each DLL? I'm guessing that each data channel has its own DLL, so what is the third one for?

Thanks,

John

naoumgitnik · ‎01-22-2014

Hello John,

I discussed your problem with my colleagues, and below are their comments based on your data:

1. The locking and unlocking of the DLL is normal.

2. Power-up aspects:

Maybe you are trying to initialize the memory on your board too quickly after powering it up? There are specs for how long after power is applied to the memory before you can do anything with it.
The fact that it is related to power-up makes it seem like in general you are very close.

3. You might need to tune the DDR settings for your board differently from ours. Even if you are using the exact same memory, differences in the board design can mean that some of the DDR settings need to be adjusted to center them as best as possible for your board. Timing should be verified on the receive side - then it includes the signal distortion caused by the board.

Sincerely yours, Naoum Gitnik.

jiri-b36968 · ‎01-23-2014

Hello John,

Naoum asked me to help.

1. The DLL is designed to compensate for temperature and voltage drift in the DDR circuit. It adjusts the clk and dqs lines. The DLL_LOCK_VALUE bit says whether the DLL is locked right now. DLL_UNLOCK_VALUE says that the DLL has become unlocked after being locked. Unlocking is normal, but it indicates that the DDR lines have high jitter. I would expect it to happen more frequently after a cold start on your module. You can set starting values in the registers. If they are set correctly, the probability of an unlock will be lower.

2. PHY settings. Did you modify DDRMC_CR154? It needs to be 0x682C0000 (before it was 0x68200000). This is important for ZQ calibration. Please check all PHY settings and whether all pins are set correctly.

3. Pin settings in the IO mux. Please check the IOMUXC_DDR_ registers to see whether the pins are set correctly (CLK differential, A and D lines CMOS input).
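For the DDRMC_CR154 change in item 2, a quick way to see exactly which bits differ between the old and new values is to XOR them. The small Python sketch below does only that arithmetic; the helper name `changed_bits` is hypothetical, and what those bits mean must be taken from the reference manual.

```python
OLD_CR154 = 0x68200000  # previous value quoted in the thread
NEW_CR154 = 0x682C0000  # updated value quoted in the thread


def changed_bits(old, new, width=32):
    """Return the bit positions that differ between two register values."""
    diff = old ^ new
    return [bit for bit in range(width) if (diff >> bit) & 1]


print(hex(OLD_CR154 ^ NEW_CR154))        # 0xc0000
print(changed_bits(OLD_CR154, NEW_CR154))  # [18, 19]
```

The same helper can be used to compare a value read back from the register with the value that was written, to confirm the setting actually took effect.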

/Jiri

johnfielden · ‎01-23-2014

Hi, thanks for your response.

We are using the updated value for DDRMC_CR154 (the value is being set to 0x682C0000).

I will verify the IOMUX settings, but we are using whatever was part of the Nucleus initialization code for those settings.

We see the Tower board DLL_LOCK_VALUE and history changing each time we run the test, leading us to believe that the Tower board is also losing lock at some point during the memory test. We don't see the actual "lock bit" unlocked with the Tower. This is confusing us, as it appears that the Tower is becoming unlocked, just not when we are reading the register.

For our board, we see the same thing happening to the DLL_LOCK_VALUE and history, but we do occasionally see the lock bit unlocked too. We're not sure what this means.

Are both boards coming unlocked during the test? Why do we actually read the unlock on our board, but not on the tower?

We have followed Freescale's calibration procedure, and came up with exactly the same values as the Tower is using. We calibrated the Tower too, just to verify those settings. We have tried running our tests with modified calibration values, but anything other than the standard set appears to cause errors or cause the lock bit to be unlocked more often.

John

naoumgitnik · ‎01-23-2014

Hello John,

Regarding the DLL locking: the fact that you seem to be able to catch the unlock state on your board could be indicating that the DLL is having a harder time on your board for some reason.

How good is your DDR layout and stackup?

Trace impedance value?

Board material high-frequency attenuation?

Termination scheme?

Have you done any signal integrity modeling of your board design?

-- I'm thinking reflections, ringing, or cross talk on the lines might cause it. You have to verify it!

It is still unclear if you tried your test on the Phytec board (the one you copied); if you see the same problem, unlike on our board, then your design inherited its problems from there.

And it is still unclear if the problem is not in the amount of time required for DDR initialization.

John,

Remote debugging is quite difficult by itself. In this case a systematic approach is the key, which I am trying to apply by pointing at different aspects and trying to lower the "number of variables in the equation". I am afraid, however, we are still jumping from side to side without really prioritizing possible root causes...

Sincerely, Naoum Gitnik.

johnfielden · ‎01-28-2014

I used the Freescale guidelines in the "DDR3 Routing Guide: Vybrid" V0.0 presentation.

The single ended traces are 50 ohms, and the differentials are 100 ohms. Impedances were verified by the board supplier. The stackup is clean with all of the signals sandwiched between reference planes. No other signals were allowed in the area, and everything was trace length matched per the guidelines.

I attempted a Hyperlynx simulation. But, I ran into an issue where either Freescale's model, or Micron's model had an error. Both Freescale and Micron pointed a finger at the other and I ran out of time to have my boards made. I could return to the sim to see if I can get it running.

Our termination scheme matches the Phytec board, which is slightly different than the Tower. Micron approved my termination scheme, so I think I'm ok. We did go with 10 ohm series resistors instead of the 22 ohm that Phytec uses (per Micron's suggestion). We are modding a board to use 22 ohms, but when we ran our memory test on the Phytec board we got the same results. No data errors with our new load, but DLL unlocking in the past, and the LOCK bit showing unlocked occasionally.

We have been running the Tower board for several days and have seen the LOCK bit unlocked. Just not as often as on our board. The Tower still shows evidence of past DLL unlocks, that change over time.

The real question is, do I have an issue? Is this normal behavior? As long as I never see data errors, is the fact that the DLLs are going unlocked an issue? The updated DLL variables are always within 1 of the newly adopted value reported in the PHY registers.

I will check the board sim on the next roll, but that depends on the models working.

naoumgitnik · ‎01-29-2014

Dear John,

If the only issue is with the LOCK bit, then we already confirmed it is normal.

Otherwise, with your detailed but still partial and not really systematic reply, I am back (sorry...) to:

"...to lower the "number of variables in the equation"" (equivalent to "compare apples to apples" in this case) - "It is still unclear if you tried your test on the Phytec board (the one you copied); if you see the same problem, unlike on our board, then your design inherited its problems from there."
"And it is still unclear if the problem is not in the amount of time required for DDR initialization."

Unfortunately, somehow the solution is not converging:

If there is no way to verify the above, please, let me know why.
If you have your own debugging plan, not really related to mine, that's also OK, but then I have no idea how I will be able to help you.

Sincerely, Naoum Gitnik.

johnfielden · ‎01-29-2014

To be clear, the only thing still bothering us is the LOCK bit. We have not seen a data error lately. We are doing a round of automated power cycling and memory testing over temp. We have seen no errors at room or at +60C. We can't test cold as we don't have a board that reliably boots at -20C (I have posted another issue for that problem). But, so far I think we are in much better shape than when we started this thread.

We did run the test on the Phytec board. We found the frequency of LOCK bit being unlocked to be much higher on the sample of 1 we tested. No data errors, just the DLL unlocking.

I will talk to the SW folks about delaying the start of the DDR initialization after power on to see if that has any bearing.

The debugging plan is stalled, as we weren't sure if we still have a problem or not. I'm thinking we don't, but we want to test the thing over temp to be sure.

naoumgitnik · ‎01-29-2014

Dear John,

I am really glad you are making progress!

Is it clear now what was making the extensive DDR test fail in the past?

Sincerely, Naoum Gitnik.

johnfielden · ‎01-30-2014

Not clear. Definitely something is different with the SW load. The DDR controller is getting the correct value earlier in the boot cycle than before. That's the only difference that I can see. We will continue temp testing at room temp and at 60C. Also, I will try again to run the signal integrity simulation for the next roll of the PWB.

VF6xx