We are seeing an intermittent issue with the DDR3 interface on our boards. Our design is similar to the Phytec implementation in that we use the same family of Micron DDR3 parts and the simplest termination scheme (series resistors on the address, clock, and control lines). We boot from QSPI and run from internal memory, so we aren't currently seeing any boot-up issues. However, since this board and processor are new to us, we've been running an extensive memory test over the DDR3 on each power-up to verify that all is well.
The memory test is home-brewed and performs three separate tests: a data bit test (barber pole of the data bits), an address bit test (barber pole of the address bits), and random data patterns over the entire memory.
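As a rough illustration of the first two tests (not our actual test code), here is a minimal sketch in C. It assumes "barber pole" means a walking-ones pattern, a 32-bit wide access, and a power-of-two region size; the function names `data_bit_test` and `address_bit_test` are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Data bit test: walk a single 1 across the data bus at one address.
 * Returns 0 on success, or the first pattern that read back wrong. */
uint32_t data_bit_test(volatile uint32_t *addr)
{
    for (uint32_t pattern = 1u; pattern != 0u; pattern <<= 1) {
        *addr = pattern;
        if (*addr != pattern) {
            return pattern;
        }
    }
    return 0u;
}

/* Address bit test: exercise each address line via power-of-two word offsets.
 * num_words is assumed to be a power of two.
 * Returns NULL on success, or the address where a fault was detected. */
volatile uint32_t *address_bit_test(volatile uint32_t *base, size_t num_words)
{
    const uint32_t pattern = 0xAAAAAAAAu;
    const uint32_t antipattern = 0x55555555u;
    const size_t mask = num_words - 1u;
    size_t off;
    size_t test;

    /* Write the default pattern at every power-of-two offset. */
    for (off = 1u; (off & mask) != 0u; off <<= 1) {
        base[off] = pattern;
    }

    /* Address bits stuck high: a write to offset 0 must not alias elsewhere. */
    base[0] = antipattern;
    for (off = 1u; (off & mask) != 0u; off <<= 1) {
        if (base[off] != pattern) {
            return &base[off];
        }
    }
    base[0] = pattern;

    /* Address bits stuck low or shorted: a write to one power-of-two
     * offset must not disturb offset 0 or any other power-of-two offset. */
    for (test = 1u; (test & mask) != 0u; test <<= 1) {
        base[test] = antipattern;
        if (base[0] != pattern) {
            return &base[test];
        }
        for (off = 1u; (off & mask) != 0u; off <<= 1) {
            if (off != test && base[off] != pattern) {
                return &base[test];
            }
        }
        base[test] = pattern;
    }
    return NULL;
}
```

On the target, `addr` and `base` would point at the start of the DDR3 region and `num_words` would cover the whole device; on a host, the same functions can be run against an ordinary array to sanity-check the logic.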
We see three different results. First, an immediate failure: the data bit test fails at the first access, at 0x8000000. Currently, when we get a data bit test failure, our software stops running the test.
Second, we see a case where there is an occasional error during the random data test. In this case, our software continues to run the test over and over. Typically we see zero or one failure per pass. The failures are never on the same data bit or at the same address. When we see a failure, our software immediately re-reads the same address, and we never see an error on the second read.
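To make the re-read behavior concrete, here is a hedged sketch of a random data pass that re-reads on a mismatch and counts the two outcomes separately. It is not our actual code: the xorshift32 generator, the `random_test_result` struct, and the function name `random_data_test` are assumptions for illustration. A mismatch that clears on the re-read suggests the stored data was correct and the error occurred on the read itself; a mismatch that persists suggests the data in memory is wrong.

```c
#include <stddef.h>
#include <stdint.h>

struct random_test_result {
    size_t transient_errors;  /* first read wrong, immediate re-read correct */
    size_t persistent_errors; /* first read and re-read both wrong */
};

/* Small xorshift32 PRNG so the write and verify passes can replay
 * the same sequence from the same seed. State must be nonzero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

struct random_test_result random_data_test(volatile uint32_t *base,
                                           size_t num_words,
                                           uint32_t seed)
{
    struct random_test_result result = {0u, 0u};
    const uint32_t start = (seed != 0u) ? seed : 1u; /* avoid the all-zero state */
    uint32_t state = start;
    size_t i;

    /* Write pass: fill the region with the pseudo-random sequence. */
    for (i = 0u; i < num_words; i++) {
        base[i] = xorshift32(&state);
    }

    /* Verify pass: replay the sequence and compare. */
    state = start;
    for (i = 0u; i < num_words; i++) {
        const uint32_t expected = xorshift32(&state);
        if (base[i] != expected) {
            /* Re-read the same address once to classify the failure. */
            if (base[i] == expected) {
                result.transient_errors++;
            } else {
                result.persistent_errors++;
            }
        }
    }
    return result;
}
```

In the second case described above, every failure would land in `transient_errors`, since the second read has never been wrong.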
Third, no failures ever. When the board is in this "state", we have let it run the memory test continuously for days. These runs have occurred at room temperature, during a rapid temperature ramp to +65C, steadily at +65C for several days, during a rapid temperature ramp to -20C, and steadily at -20C for several days. A fairly punishing routine with no failures.
We don't do any sort of warm start. Each of these tests is done from a cold start (power applied to the board). We are using the DDR3 setup parameters from the Tower board MQX setup. As it turns out, these values are the same as Phytec's parameters for the same Micron memory family that we are using.
The only difference between these three result cases is a power cycle. The first result (the immediate failure) seems to happen often on the first power-up after the board has been unpowered for a minute or more. The other two occur mostly on the second or third power-up attempt. Most of the time it works flawlessly.
We have been looking very closely at the power applied to the part to see if we could find an issue. What we see is similar to what is sketched in the attached image. I sketched this because of the huge differences in time scale. From power cycle to power cycle, there is no discernible difference in the way 3.3V and 1.2V to the Vybrid come up. The resets and DDR_1.5V come up with no notable differences at the ms time scale.
After looking at the posts https://community.freescale.com/message/336513#336513, it appears that our problem may not be related to power sequencing but to some other DDR3 settings.
Any suggestions of what to look at next?