I have run into a serious issue: the DDR stress test fails when the DDR temperature rises to 80 degrees.
The DDR we use is the Micron MT41J128M16JT-125 AIT:K, which is an automotive part.
The default DDR in the stress test tool is the Micron MT41J128M16-187E.
I used the same <MX53_TO2_DDR3_LCB.inc> file for the stress test. It passes at room temperature but fails at high temperature.
Please see the log below:
DQS calibration succeed.
RD calibration succeed.
WR calibration succeed.
t0: memcpy11 SSN test
t1: memcpy8 SSN test
Address of test1 failure: 0xC9072594
Data was: 0x00040080
But pattern was: 0x00040000
I've attached the DDR datasheet.
Could you give me some advice? Thanks a lot.
Original Attachment has been moved to: MX53_TO2_DDR3_LCB_SMD_ARDb.inc.zip
If heating up the part causes more failures, then the drive strength is too low and
the signals may not rise/fall fast enough. I suggest running the
DDR test to find settings optimized for this specific board.
DDR_STRESS_TESTER_FOR_MX51_MX53 : DDR Stress tester kit for the i.MX51 and i.MX53.
Thanks for your reply.
The result is from the DDR stress test run with the "DDR Stress tester kit" downloaded from the Freescale site you pointed to.
I didn't modify the settings in <MX53_TO2_DDR3_LCB_SMD_ARDb.inc>; I just used the original ones.
You told me to find settings optimized for this specific board. Could you give me more detailed steps for doing such an optimization?
How do I find the best settings for this specific board? Should I use the settings printed by the tester kit, like below:
We had this problem in spades a while back. It took us four months to get a fix, mainly getting the testing reliable and collecting data.
The problem was the CPU temperature for us, and not the DDR temperature.
With the Stress Test, you measure the maximum frequency the memory will run at, at different temperatures. For a nominal 400MHz memory clock, you've got a good board if it will run at 500MHz at every temperature.
Some of our boards failed at lower clock speeds at lower temperatures near -10C and at higher temperatures approaching 100C (CPU die temperature).
Weirdly, the "sensitivity" of the boards depended on their serial numbers, and as the numbers rose, successive boards got worse, then better, then worse in a cyclic manner. The only thing we could explain this with was that the boards are made on a large panel, serialised left-to-right and top-to-bottom and the PCB characteristics (thickness, impedance) changed in a regular way across the width of the panel. Where the panel had a different impedance, the boards were more sensitive to temperature.
One problem we had was that there are at least THREE "Rated Temperatures". There's the ambient temperature, the plastic CPU case temperature and the CPU Die temperature. The Automotive chips (i.MX53xA) have a "nominal maximum" junction temperature of 105C. The i.MX53xD chips are 95C.
The junction-to-case thermal resistance is 4C/W, but the "Junction to Ambient, Natural Convection" figure is 23C/W. So the maximum Ambient for the Commercial "D" chips is 72C (95C minus 23C, which implies about 1W of dissipation).
The only way to measure this properly is to get a good thermal connection between the top case and a large aluminium heatsink and measure the temperature with a thermocouple on the chip under the heatsink. Then the CPU should be +4C/W from there if you're cooling the heatsink, and -4C/W if you're "heating the heatsink" to run the chip hot at a lower ambient temperature.
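As a rough sketch of the arithmetic above: the two thermal resistances are the figures quoted in this post, while the 1W dissipation is only an assumption inferred from the 72C result, and the function names are made up for illustration. The second function assumes heat is flowing from the die out through the case (the cooled-heatsink situation).

```c
#include <assert.h>

/* Thermal resistances quoted above for the i.MX53, in degrees C per watt. */
#define THETA_JC 4.0   /* junction to case */
#define THETA_JA 23.0  /* junction to ambient, natural convection */

/* Highest ambient temperature that keeps the die at or below tj_max_c.
 * power_w is the dissipated power; 1W is an assumed figure, not a spec. */
double max_ambient_c(double tj_max_c, double power_w)
{
    assert(power_w >= 0.0);
    return tj_max_c - THETA_JA * power_w;
}

/* Die temperature estimated from a thermocouple reading on the case,
 * assuming heat flows from the die out through the case. */
double junction_from_case_c(double t_case_c, double power_w)
{
    assert(power_w >= 0.0);
    return t_case_c + THETA_JC * power_w;
}
```

With 1W assumed, max_ambient_c(95.0, 1.0) gives 72 for the "D" parts and max_ambient_c(105.0, 1.0) gives 82 for the Automotive parts. Measure your own board's power before trusting either number.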
In our case, changing the DDR "Drive Strength" (commonly referred to as "DSE") from "5" to "6" raised the margins at low and high temperatures on the sensitive boards, and didn't affect the good ones.
There are TWENTY TWO different registers for setting the drive strength for the DDR alone. There are 22 different groups of signals, and you can have them all set to different values depending on the board layout, the number of chips and so on.
We're using RedBoot and the drive strength definitions are in the file "src/packages/hal/arm/mx53/xxxx/current/include/hal_platform_setup.h" and look like this:
#define DDR_SEL_VAL 0
#define DSE_VAL 6
#define ODT_VAL 2
#define DDR_SEL_SHIFT 25
#define ODT_SHIFT 22
#define DSE_SHIFT 19
#define DDR_INPUT_SHIFT 9
#define HYS_SHIFT 8
#define PKE_SHIFT 7
#define PUE_SHIFT 6
#define PUS_SHIFT 4
#define DDR_SEL_MASK (DDR_SEL_VAL << DDR_SEL_SHIFT)
#define DSE_MASK (DSE_VAL << DSE_SHIFT)
#define ODT_MASK (ODT_VAL << ODT_SHIFT)
#define DQM_VAL DSE_MASK
#define SDQS_VAL (ODT_MASK | DSE_MASK | (1 << PUE_SHIFT))
#define SDODT_VAL (DSE_MASK | (0 << PKE_SHIFT) | (1 << PUE_SHIFT) | (0 << PUS_SHIFT))
#define SDCLK_VAL DSE_MASK
#define SDCKE_VAL ((1 << PKE_SHIFT) | (1 << PUE_SHIFT) | (0 << PUS_SHIFT))
MXC_DCD_ITEM(0x53fa8724, DDR_SEL_MASK) /* DDR_TYPE: DDR3 */
MXC_DCD_ITEM(0x53fa86f4, 0 << DDR_INPUT_SHIFT) /* DDRMODE_CTL */
MXC_DCD_ITEM(0x53fa8714, 0 << DDR_INPUT_SHIFT) /* GRP_DDRMODE */
MXC_DCD_ITEM(0x53fa86fc, 1 << PKE_SHIFT) /* GRP_DDRPKE */
MXC_DCD_ITEM(0x53fa8710, 0 << HYS_SHIFT) /* GRP_DDRHYS */
MXC_DCD_ITEM(0x53fa8708, 1 << PUE_SHIFT) /* GRP_DDRPK */
MXC_DCD_ITEM(0x53fa8584, DQM_VAL) /* DQM0 */
MXC_DCD_ITEM(0x53fa8594, DQM_VAL) /* DQM1 */
MXC_DCD_ITEM(0x53fa8560, DQM_VAL) /* DQM2 */
MXC_DCD_ITEM(0x53fa8554, DQM_VAL) /* DQM3 */
MXC_DCD_ITEM(0x53fa857c, SDQS_VAL) /* SDQS0 */
MXC_DCD_ITEM(0x53fa8590, SDQS_VAL) /* SDQS1 */
MXC_DCD_ITEM(0x53fa8568, SDQS_VAL) /* SDQS2 */
MXC_DCD_ITEM(0x53fa8558, SDQS_VAL) /* SDQS3 */
MXC_DCD_ITEM(0x53fa8580, SDODT_VAL) /* SDODT0 */
MXC_DCD_ITEM(0x53fa8578, SDCLK_VAL) /* SDCLK0 */
MXC_DCD_ITEM(0x53fa8564, SDODT_VAL) /* SDODT1 */
MXC_DCD_ITEM(0x53fa8570, SDCLK_VAL) /* SDCLK1 */
MXC_DCD_ITEM(0x53fa858c, SDCKE_VAL) /* SDCKE0 */
MXC_DCD_ITEM(0x53fa855c, SDCKE_VAL) /* SDCKE1 */
MXC_DCD_ITEM(0x53fa8574, DSE_MASK) /* DRAM_CAS */
MXC_DCD_ITEM(0x53fa8588, DSE_MASK) /* DRAM_RAS */
MXC_DCD_ITEM(0x53fa86f0, DSE_MASK) /* GRP_ADDDS */
MXC_DCD_ITEM(0x53fa8720, DSE_MASK) /* GRP_CTLDS */
MXC_DCD_ITEM(0x53fa8718, DSE_MASK) /* GRP_B0DS */
MXC_DCD_ITEM(0x53fa871c, DSE_MASK) /* GRP_B1DS */
MXC_DCD_ITEM(0x53fa8728, DSE_MASK) /* GRP_B2DS */
MXC_DCD_ITEM(0x53fa872c, DSE_MASK) /* GRP_B3DS */
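To see what those macros actually put in the registers, here is a small sketch that composes the same pad values in plain C. The shift constants are copied from the header above. pad_value() and its parameter names are only for illustration and are not part of RedBoot, and the 3-bit field widths are an assumption inferred from the spacing of the shifts.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions copied from the hal_platform_setup.h excerpt above. */
#define ODT_SHIFT 22
#define DSE_SHIFT 19
#define PKE_SHIFT 7
#define PUE_SHIFT 6

/* Compose a pad control value from a drive strength and ODT setting
 * (assumed to be 3-bit fields, 0..7) and the PKE/PUE bits (0 or 1). */
uint32_t pad_value(unsigned dse, unsigned odt, unsigned pke, unsigned pue)
{
    assert(dse <= 7 && odt <= 7 && pke <= 1 && pue <= 1);
    return ((uint32_t)odt << ODT_SHIFT) | ((uint32_t)dse << DSE_SHIFT) |
           ((uint32_t)pke << PKE_SHIFT) | ((uint32_t)pue << PUE_SHIFT);
}
```

For example, pad_value(6, 2, 0, 1) reproduces SDQS_VAL as 0x00B00040, and the same call with a drive strength of 5 gives 0x00A80040. pad_value(6, 0, 0, 0) is the bare DSE_MASK, 0x00300000.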
If you're using U-Boot then it will have a similar file somewhere in the source tree, looking very much like the above (Edit - "lowlevel_init.S" under "boards/*" somewhere).
The way to test this is to add "memory set" commands to your booting script to set the above hex locations to different values. That way you can fairly easily test different drive strengths by "simply" changing the boot script rather than having to compile and load a different bootstrap.
You'll probably have to burn the new values into a bootstrap for a proper final test. The memory calibration is performed by the bootstrap with its burned-in drive strength values, and the calibrations might change when you change the drive strength, so testing with the script does things in the wrong order.
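One way to avoid typing twenty register writes by hand is to generate them. The sketch below is only an illustration: the addresses and the non-DSE bits come from the listing above, but the helper names are made up, and the output uses U-Boot's "mw.l address value" form as an assumed example. RedBoot users would substitute its own memory-write command. SDCKE0/1 are left out because the listing gives them no drive strength bits.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define DSE_SHIFT 19
#define ODT_PUE  ((UINT32_C(2) << 22) | (UINT32_C(1) << 6)) /* SDQS_VAL without DSE */
#define PUE_ONLY (UINT32_C(1) << 6)                         /* SDODT_VAL without DSE */

struct pad {
    uint32_t addr;       /* register address from the listing above */
    uint32_t other_bits; /* everything in the value except the DSE field */
};

static const struct pad pads[] = {
    {0x53fa8584, 0}, {0x53fa8594, 0},               /* DQM0, DQM1 */
    {0x53fa8560, 0}, {0x53fa8554, 0},               /* DQM2, DQM3 */
    {0x53fa857c, ODT_PUE}, {0x53fa8590, ODT_PUE},   /* SDQS0, SDQS1 */
    {0x53fa8568, ODT_PUE}, {0x53fa8558, ODT_PUE},   /* SDQS2, SDQS3 */
    {0x53fa8580, PUE_ONLY}, {0x53fa8564, PUE_ONLY}, /* SDODT0, SDODT1 */
    {0x53fa8578, 0}, {0x53fa8570, 0},               /* SDCLK0, SDCLK1 */
    {0x53fa8574, 0}, {0x53fa8588, 0},               /* DRAM_CAS, DRAM_RAS */
    {0x53fa86f0, 0}, {0x53fa8720, 0},               /* GRP_ADDDS, GRP_CTLDS */
    {0x53fa8718, 0}, {0x53fa871c, 0},               /* GRP_B0DS, GRP_B1DS */
    {0x53fa8728, 0}, {0x53fa872c, 0},               /* GRP_B2DS, GRP_B3DS */
};

#define NUM_PADS (sizeof pads / sizeof pads[0])

/* Full register value for one pad at the requested drive strength. */
uint32_t pad_write_value(const struct pad *p, unsigned dse)
{
    assert(p != NULL && dse <= 7);
    return p->other_bits | ((uint32_t)dse << DSE_SHIFT);
}

/* Format one memory-write command into buf; returns snprintf's length. */
int format_write(char *buf, size_t len, const struct pad *p, unsigned dse)
{
    return snprintf(buf, len, "mw.l 0x%08" PRIx32 " 0x%08" PRIx32,
                    p->addr, pad_write_value(p, dse));
}

/* Print one write per pad, ready to paste into a boot script. */
void print_script(unsigned dse)
{
    char line[64];
    for (size_t i = 0; i < NUM_PADS; i++) {
        format_write(line, sizeof line, &pads[i], dse);
        puts(line);
    }
}
```

Calling print_script(5), print_script(6) and print_script(7) gives three script fragments to swap between temperature runs. The point about calibration order still applies: these writes happen after the bootstrap has already calibrated with the burned-in values.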
BTW, the Reference Manual is misleading with the Drive Strengths. "Table 43-2. DDR Output Driver Average Impedance" says "5" and "6" are the same resistance for DDR3. They're not. I had to physically measure them to prove they were different and there's a cut-and-paste bug in the table. Changing from "5" to "6" fixed our temperature problem.
In case it isn't obvious, the DDR Calibration procedure doesn't measure or set the Drive Strengths. The Stress Test doesn't change these either. You have to change these yourself and then use the Stress Test to see if you've got it right over temperature.
We have a large (noisy and power hungry) Temperature Test chamber here, that can test from -40C to +180C. I needed weeks of testing to gather the data I needed.
Here's how complicated it gets. The following shows the "best" and "worst" boards that we found during limited testing.
"A2" and "A6" were the last bytes of the hex Serial Number of the boards.
The light blue and purple lines show the temperature response of the "good board" (Serial A2) with the Drive Strength set to "5" and "7". At all temperatures it ran the stress test above 460MHz memory Clock speed. Setting it to "7" made it fail at a lower clock rate, but still well above the 400MHz it normally runs at.
The Red and Green lines show the failure temperatures of the "bad board" (Serial A6). It ran better at a Drive Strength of "7" than the other board did at "5", but look at what it did when set to a drive strength of "5". It failed at 420MHz at 5C and at 400MHz at 110C.
Before we ran these tests, the default setting of the board as designed was "5". We were able to prove that a setting of "6" improved the "bad boards" without dropping the performance of the "good boards" over much.
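To put numbers on "margin", here is a minimal sketch that turns stress-test failure frequencies into a percentage above the nominal 400MHz clock and picks the worst point in a temperature sweep. The struct and function names are hypothetical. The only data used in the example is the pair of points quoted above for board A6 at a drive strength of "5".

```c
#include <assert.h>
#include <stddef.h>

#define NOMINAL_MHZ 400.0

/* One point from a temperature sweep: where the stress test first failed. */
struct sweep_point {
    double temp_c;
    double fail_mhz;
};

/* Margin above the nominal clock, in percent. */
double margin_pct(double fail_mhz)
{
    return (fail_mhz - NOMINAL_MHZ) * 100.0 / NOMINAL_MHZ;
}

/* Smallest margin found anywhere in the sweep. */
double worst_margin_pct(const struct sweep_point *pts, size_t n)
{
    assert(pts != NULL && n > 0);
    double worst = margin_pct(pts[0].fail_mhz);
    for (size_t i = 1; i < n; i++) {
        double m = margin_pct(pts[i].fail_mhz);
        if (m < worst)
            worst = m;
    }
    return worst;
}
```

For A6 at "5", the points {5C, 420MHz} and {110C, 400MHz} give margins of 5% and 0%, so the worst case is no margin at all. The "runs at 500MHz at every temperature" rule of thumb corresponds to 25%, and the 460MHz floor of the good board to 15%.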
If you don't have a Temperature Chamber, and haven't tested a large sample of your boards over the full operational range then you don't know what they'll fail at.
I have to admit that I have not read the full thread, so sorry if I'm wrong here.
There was a discussion on drive strength change over temperature.
Usually the controller calibrates this out with a periodic drive strength calibration. I have seen older Freescale processors doing this, so I would be surprised if the i.MX6 did not.
So if you suspect that signal integrity (aka drive strength) is the issue, check whether this feature is available and used.
To verify that the drive strength is OK in the worst case, a signal integrity compliance check at high temperature is required.
Second, I would expect a large number of failures if it were a signal integrity issue. Above I have seen one address and a single DQ line.
This is usually an indication that some cells in the DRAM array are causing the issue.
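The single-bit observation can be checked directly from the log by XOR-ing the value read back with the expected pattern. This is just a sketch with made-up helper names. How a data bit maps to a physical DQ pin depends on the board routing, so the bit index is not necessarily the pin number.

```c
#include <assert.h>
#include <stdint.h>

/* Bits that differ between the value read back and the expected pattern. */
uint32_t flipped_bits(uint32_t read_back, uint32_t expected)
{
    return read_back ^ expected;
}

/* Number of set bits in v. */
int count_bits(uint32_t v)
{
    int n = 0;
    while (v != 0) {
        v &= v - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Index of the lowest set bit; the caller must pass a non-zero value. */
int lowest_bit(uint32_t v)
{
    assert(v != 0);
    int i = 0;
    while ((v & 1u) == 0) {
        v >>= 1;
        i++;
    }
    return i;
}
```

For the failure in the log, flipped_bits(0x00040080, 0x00040000) is 0x00000080: one bit, data bit 7, read as 1 when 0 was written.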
So even if you have automotive DRAMs, you need to make sure you set the right refresh rates. When running into temperature issues this is the first thing I look at.
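As a rough illustration of the refresh point: to my understanding the JEDEC DDR3 convention is an average refresh interval of 7.8us up to a case temperature of 85C and half that above it. The sketch below encodes only that rule and converts it to memory clock cycles. The function names are hypothetical, and the exact thresholds for a given part should be taken from its datasheet, not from this code.

```c
#include <assert.h>

/* Average refresh interval (tREFI) in ns, assuming the usual DDR3 rule:
 * 7.8us up to 85C case temperature, 3.9us above that. */
unsigned trefi_ns(int t_case_c)
{
    return t_case_c <= 85 ? 7800u : 3900u;
}

/* The same interval expressed in memory clock cycles. */
unsigned long trefi_cycles(unsigned trefi, unsigned clk_mhz)
{
    assert(clk_mhz > 0);
    return (unsigned long)trefi * clk_mhz / 1000u;
}
```

At a 400MHz memory clock that is 3120 cycles for the normal range and 1560 cycles for the extended range. Note that 80C is still inside the normal range under this rule, so a refresh problem at that temperature would point to the programmed interval being too long to begin with.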
> I have to admit that I have not read the full thread, so sorry if I'm wrong here.
I don't think I can condense it into 140 characters for you
> Usually controller calibrate this out by a periodic drives strength calibration.
As I detailed above "There are TWENTY TWO different registers for setting the drive strength for the DDR alone." These aren't in the DDR controller.
The controller does support "ZQ Calibration", which is detailed here:
The above document says:
The ZQ calibration short (ZQCS) requires 64 clocks to complete so it is used periodically
when the DRAM is idle to perform calibrations to account for minor variations in voltage and temperature.
I haven't found anything in the manuals that says that the ESDCTL runs these calibrations itself, so it looks like it has to be actively commanded by the software running on the chip. In my reading, this calibration (and only SOME of the other ones) are run once by the Bootstrap, and then are never run again by any of the versions of Linux until the next boot. There aren't any "drivers" supplied that periodically calibrate the memory. If the SOC temperature is 0C on boot and 60C a few hours later, then good luck, as the memory system is on its own. If it corrupts, crashes and reboots it may recalibrate at the new temperature and may run a bit better.
It looks like you're meant to run the "Memory Stress Test", come up with a set of fixed calibration values for one board and then hard-code them.
In fact the code for our board used to run the "Write Levelling" pass, but then the manufacturer changed to a different RAM chip with a different interpretation of that test that their board didn't support, so they simply deleted that calibration pass and reverted to "fixed values".
For instance, as far as I can tell, here's the once-off DDR3 controller setup for the i.MX53 ARD board in U-Boot:
Edit: ... I'm trying to paste some code in here, but the "automatics" keeps turning it into a spreadsheet or removes the line feeds completely. This shouldn't be this HARD.
/* ESDCTL */
REG_LD_AND_STR_OP(25, 0x088, 0x35343535)
REG_LD_AND_STR_OP(26, 0x090, 0x4d444c44)
REG_LD_AND_STR_OP(27, 0x07c, 0x01370138)
REG_LD_AND_STR_OP(28, 0x080, 0x013b013c)
REG_LD_AND_STR_OP(29, 0x0f8, 0x00000800)
REG_LD_AND_STR_OP(30, 0x018, 0x00001740)
REG_LD_AND_STR_OP(31, 0x000, 0xc3190000)
REG_LD_AND_STR_OP(32, 0x00c, 0x9f5152e3)
REG_LD_AND_STR_OP(33, 0x010, 0xb68e8a63)
REG_LD_AND_STR_OP(34, 0x014, 0x01ff00db)
REG_LD_AND_STR_OP(35, 0x02c, 0x000026d2)
REG_LD_AND_STR_OP(36, 0x030, 0x009f0e21)
REG_LD_AND_STR_OP(37, 0x008, 0x12273030)
REG_LD_AND_STR_OP(38, 0x004, 0x0002002d)
REG_LD_AND_STR_OP(39, 0x01c, 0x00008032)
REG_LD_AND_STR_OP(40, 0x01c, 0x00008033)
REG_LD_AND_STR_OP(41, 0x01c, 0x00028031)
REG_LD_AND_STR_OP(42, 0x01c, 0x052080b0)
REG_LD_AND_STR_OP(43, 0x01c, 0x04008040)
REG_LD_AND_STR_OP(44, 0x01c, 0x0000803a)
REG_LD_AND_STR_OP(45, 0x01c, 0x0000803b)
REG_LD_AND_STR_OP(46, 0x01c, 0x00028039)
REG_LD_AND_STR_OP(47, 0x01c, 0x05208138)
REG_LD_AND_STR_OP(48, 0x01c, 0x04008048)
REG_LD_AND_STR_OP(49, 0x020, 0x00005800)
REG_LD_AND_STR_OP(50, 0x040, 0x04b80003)
REG_LD_AND_STR_OP(51, 0x058, 0x00022227)
REG_LD_AND_STR_OP(52, 0x01C, 0x00000000)
It is similar to, but different from, the "MXC_DCD" based "code" we're using with an i.MX53 running from NAND.
It isn't easy to understand, read or validate.