I answered some of your questions; please see inline.
asimzaidi wrote:
Hi Ofer
Thanks for your responses. We still need to understand why your boards pass the FSL DDR stress test yet fail in your application. We had an internal meeting to discuss the issues you are encountering and came up with some further questions/experiments.
Memory Testing
- · The failure you posted below is strange: it states that the complete background pattern word was incorrect. Does this indicate that the DDR was reading all 0’s instead of all F’s, and vice versa, for multiple consecutive addresses? Is this consistently reproducible, and which memory test (Bit Flip or other) is reporting this?
o If the entire word is wrong or random, this may indicate an issue with the address and/or command signals.
YES, this type of failure recurs (but not in every run). It happens when running "Solid Bits" and "Bit Flip".
I also noticed that when it happens, it happens in a burst of 4 or 8 failures, for example:
Loop 17/1000:
Stuck Address : ok
Random Value : ok
Compare XOR : ok
Compare SUB : ok
Compare MUL : ok
Compare DIV : ok
Compare OR : ok
Compare AND : ok
Sequential Increment: ok
Solid Bits : ok
Block Sequential : ok
Checkerboard : ok
Bit Spread : ok
Bit Flip : testing 147FAILURE: 0xfffbffff != 0x00040000 at offset 0x00f8d440.
FAILURE: 0x00040000 != 0xfffbffff at offset 0x00f8d444.
FAILURE: 0xfffbffff != 0x00040000 at offset 0x00f8d448.
FAILURE: 0x00040000 != 0xfffbffff at offset 0x00f8d44c.
Walking Ones : ok
Walking Zeroes : ok
8-bit Writes : ok
16-bit Writes : ok
Loop 18/1000:
Stuck Address : ok
Random Value : ok
Compare XOR : ok
Compare SUB : ok
Compare MUL : ok
Compare DIV : ok
Compare OR : ok
Compare AND : ok
Sequential Increment: ok
Solid Bits : testing 32FAILURE: 0x00000000 != 0xffffffff at offset 0x00b60f9c.
FAILURE: 0xffffffff != 0x00000000 at offset 0x00b60fa0.
FAILURE: 0x00000000 != 0xffffffff at offset 0x00b60fa4.
FAILURE: 0xffffffff != 0x00000000 at offset 0x00b60fa8.
Block Sequential : ok
Checkerboard : ok
Bit Spread : ok
Bit Flip : ok
Walking Ones : ok
Walking Zeroes : ok
8-bit Writes : ok
16-bit Writes : ok
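To make the pattern in the log above easier to see, here is a minimal sketch of a helper that XORs the two values from a FAILURE line and counts how many bits differ. It is my own illustration, not part of memtester, and the function and enum names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Classification of one 32-bit miscompare. */
enum miscompare_kind {
    MISCOMPARE_NONE,       /* values are equal */
    MISCOMPARE_SINGLE_BIT, /* exactly one bit differs */
    MISCOMPARE_MULTI_BIT,  /* several, but not all, bits differ */
    MISCOMPARE_INVERTED    /* all 32 bits differ: one word is the complement of the other */
};

/* Count set bits without relying on compiler builtins. */
unsigned popcount32(uint32_t v)
{
    unsigned n = 0;

    while (v) {
        v &= v - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Classify a miscompare by the number of differing bits. */
enum miscompare_kind classify_miscompare(uint32_t a, uint32_t b)
{
    unsigned diff_bits = popcount32(a ^ b);

    if (diff_bits == 0)
        return MISCOMPARE_NONE;
    if (diff_bits == 1)
        return MISCOMPARE_SINGLE_BIT;
    if (diff_bits == 32)
        return MISCOMPARE_INVERTED;
    return MISCOMPARE_MULTI_BIT;
}

/* Bytes covered by a run of consecutive failing 32-bit word offsets. */
uint32_t burst_span_bytes(uint32_t first_offset, uint32_t last_offset)
{
    assert(last_offset >= first_offset);
    return last_offset - first_offset + (uint32_t)sizeof(uint32_t);
}
```

For the failures above, 0xfffbffff ^ 0x00040000 and 0x00000000 ^ 0xffffffff both give 0xffffffff, so each failing word is the full complement of the other value rather than a single flipped bit, and each burst of four failures covers 16 consecutive bytes.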
- · You have previously confirmed that the DDR settings in the stress test initialization and UBOOT are the same. Can you please read out the MMDC registers after your DDR initialization in UBOOT?
Do you mean: dump the relevant registers from the u-boot prompt?
What I did previously was to dump the relevant registers from the kernel, using memtool, after the system was up. Isn't that even better?
(Because that rules out the possibility that the kernel does something wrong.)
o We would like to confirm whether the registers you programmed are correct and match the DDR stress test initialization.
o Another similar experiment would be to run the FSL DDR stress test after UBOOT has initialized by attaching with JTAG.
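The register comparison discussed above could be done by hand, but a small helper makes it less error-prone. The following is only a sketch under my own assumptions: both dumps are captured in the same register order, and the struct and function names are hypothetical, not taken from memtool or the stress test tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One register read-back: physical address and the 32-bit value found there. */
struct reg_sample {
    uint32_t addr;
    uint32_t value;
};

/*
 * Compare two dumps taken in the same register order.
 * Indices of differing entries are written to mismatch_idx (at most max_idx
 * of them); the return value is the total number of differing entries.
 */
size_t compare_reg_dumps(const struct reg_sample *ref,
                         const struct reg_sample *dut,
                         size_t count,
                         size_t *mismatch_idx, size_t max_idx)
{
    size_t mismatches = 0;

    assert(ref != NULL && dut != NULL);
    assert(mismatch_idx != NULL || max_idx == 0);

    for (size_t i = 0; i < count; i++) {
        /* A differing address means the dumps are not aligned; flag it too. */
        if (ref[i].addr != dut[i].addr || ref[i].value != dut[i].value) {
            if (mismatches < max_idx)
                mismatch_idx[mismatches] = i;
            mismatches++;
        }
    }
    return mismatches;
}
```

Here ref would hold the values from the stress test initialization script and dut the values read back in u-boot or in the kernel; a return value of 0 means the two sets match.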
- · Byte-wise failures are usually indicative of a problem with the DQS signals: either the DQS signals have too slow a rise/fall time, or there are glitches or over/undershoots (signal integrity issues).
o It will be helpful to test over temperature per above to assist in narrowing down the issue with the DQS signals
o Ideally we recommend using calibration values which are a mean of multiple boards and temperatures
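The averaging suggested above might be sketched as follows. This assumes the calibration results have already been unpacked into one linear delay value per byte lane; the lane count and all names are hypothetical, and the packing of the real calibration registers is not modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BYTE_LANES 4 /* hypothetical: byte lanes per calibrated channel */

/*
 * Compute the rounded integer mean of each lane's delay value across
 * n_samples calibration runs (different boards and/or temperatures).
 */
void mean_calibration(const uint8_t samples[][NUM_BYTE_LANES],
                      size_t n_samples,
                      uint8_t mean_out[NUM_BYTE_LANES])
{
    assert(n_samples > 0);

    for (size_t lane = 0; lane < NUM_BYTE_LANES; lane++) {
        size_t sum = 0;

        for (size_t i = 0; i < n_samples; i++)
            sum += samples[i][lane];

        /* Adding half the divisor rounds to nearest instead of truncating. */
        mean_out[lane] = (uint8_t)((sum + n_samples / 2) / n_samples);
    }
}
```

Each row of samples would be one calibration run; the resulting per-lane means would then be packed back into the register format before being used in the initialization script.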
- · Do you see similar behavior/failures using both fixed and interleaved modes?
Since we understood that we can't use 64-bit for LPDDR2, we switched to Interleaving mode, and haven't tried Fixed mode.
Clocking
- · Is your system changing the DDR frequency or does the system boot up and stay at 528 MHz?
No, u-boot sets the DDR frequency, and the kernel doesn't modify MXC_CCM_CBCMR and MXC_CCM_CBCDR.
I used this patch to achieve that in the 3.0.35_4.1.0 kernel:
diff --git a/arch/arm/mach-mx6/clock.c b/arch/arm/mach-mx6/clock.c
index 48d3999..9067bbe 100644
--- a/arch/arm/mach-mx6/clock.c
+++ b/arch/arm/mach-mx6/clock.c
@@ -1390,12 +1390,12 @@ static int _clk_periph_set_parent(struct clk *clk, struct clk *parent)
         reg = __raw_readl(MXC_CCM_CBCMR);
         reg &= ~MXC_CCM_CBCMR_PRE_PERIPH_CLK_SEL_MASK;
         reg |= mux << MXC_CCM_CBCMR_PRE_PERIPH_CLK_SEL_OFFSET;
-        __raw_writel(reg, MXC_CCM_CBCMR);
+/*      __raw_writel(reg, MXC_CCM_CBCMR); */
         /* Set the periph_clk_sel multiplexer. */
         reg = __raw_readl(MXC_CCM_CBCDR);
         reg &= ~MXC_CCM_CBCDR_PERIPH_CLK_SEL;
-        __raw_writel(reg, MXC_CCM_CBCDR);
+/*      __raw_writel(reg, MXC_CCM_CBCDR); */
     } else {
         reg = __raw_readl(MXC_CCM_CBCDR);
         /* Set the periph_clk2_podf divider to divide by 1. */
- · You had previously stated some issues in setting the DDR clock. Please refer to the following thread:
- · If modifying the DDR clock, are you changing any other system clocks?
No. I'm using this for 480MHz:
MXC_DCD_ITEM(1, CCM_BASE_ADDR + 0x14, 0x2018D00) // 480MHz
MXC_DCD_ITEM(2, CCM_BASE_ADDR + 0x18, 0x20324) // 480MHz
And keep the default for 528MHz.
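For anyone who wants to check which mux settings these two values select, here is a rough sketch that extracts the fields touched by the patch above. The bit positions (pre_periph_clk_sel in CBCMR bits 19:18, periph_clk_sel in CBCDR bit 25) and the mapping of CCM offsets 0x14/0x18 to CBCDR/CBCMR are my assumptions and should be checked against crm_regs.h and the reference manual; the local macro and function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field positions; verify against the BSP headers before relying on them. */
#define CBCMR_PRE_PERIPH_CLK_SEL_OFFSET 18u
#define CBCMR_PRE_PERIPH_CLK_SEL_MASK   (0x3u << CBCMR_PRE_PERIPH_CLK_SEL_OFFSET)
#define CBCDR_PERIPH_CLK_SEL            (1u << 25)

/* Extract the 2-bit pre_periph_clk_sel mux value from a CBCMR value. */
unsigned cbcmr_pre_periph_clk_sel(uint32_t cbcmr)
{
    return (cbcmr & CBCMR_PRE_PERIPH_CLK_SEL_MASK) >> CBCMR_PRE_PERIPH_CLK_SEL_OFFSET;
}

/* Return 1 if the periph_clk_sel bit is set in a CBCDR value, else 0. */
int cbcdr_periph_clk_sel(uint32_t cbcdr)
{
    return (cbcdr & CBCDR_PERIPH_CLK_SEL) != 0;
}

/*
 * Same clear-then-set sequence the kernel code performs before the
 * (now commented-out) write, applied to a plain value.
 */
uint32_t cbcmr_with_pre_periph_clk_sel(uint32_t cbcmr, unsigned mux)
{
    assert(mux <= 3u);
    cbcmr &= ~CBCMR_PRE_PERIPH_CLK_SEL_MASK;
    cbcmr |= (uint32_t)mux << CBCMR_PRE_PERIPH_CLK_SEL_OFFSET;
    return cbcmr;
}
```

Under those assumptions, 0x2018D00 has the periph_clk_sel bit set and 0x20324 has pre_periph_clk_sel equal to 0. Reading the same two registers back with memtool after boot and decoding them the same way is one way to confirm the kernel left them untouched.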
HW Checks
- · To rule out any possible HW issues, we would like to ensure that the power supply and decoupling network on your board are correct. We are in the process of reviewing your provided design files as well.
- · Can you confirm whether your board design meets the FSL decoupling requirements for the VDD_SOC and other domains, as outlined in the i.MX6 HW user's guide?
o We have seen that a poor power delivery network can cause issues when stressing the part with higher instantaneous current requirements.
YES, we reviewed our design, and it looks good. (Re: Re: ORCAM IPU/LPDDR2 Issues).
- · Could you try increasing the voltage on the VDD_SOC domain, as well as on 1V8 and LPDDR_1V2_DDR, to see if this has any impact?
We increased VDD_SOC to 1.375 V and saw no effect. We didn't change 1V8 or LPDDR_1V2_DDR.
- · Do you have any boards using a different memory vendor, just to rule out any DDR memory issues?
Yes, we have some, but in the past we got poor results with them. We will try them again.
Errata Check
Can you please confirm that the BSP/kernel you are using has the patch for the following issue:
ERR003740 ARM/PL310: 752271—Double linefill feature can cause data corruption: only workaround to this erratum is to disable the double linefill feature.
/*
 * The L2 cache controller(PL310) version on the i.MX6D/Q is r3p1-50rel0
 * The L2 cache controller(PL310) version on the i.MX6DL/SOLO/SL is r3p2
 * But according to ARM PL310 errata: 752271
 * ID: 752271: Double linefill feature can cause data corruption
 * Fault Status: Present in: r3p0, r3p1, r3p1-50rel0. Fixed in r3p2
 * Workaround: The only workaround to this erratum is to disable the
 * double linefill feature. This is the default behavior.
 */
if (!cpu_is_mx6q())
    val |= 0x40800000;
writel(val, IO_ADDRESS(L2_BASE_ADDR + L2X0_PREFETCH_CTRL));

val = readl(IO_ADDRESS(L2_BASE_ADDR + L2X0_POWER_CTRL));
val |= L2X0_DYNAMIC_CLK_GATING_EN;
val |= L2X0_STNDBY_MODE_EN;
writel(val, IO_ADDRESS(L2_BASE_ADDR + L2X0_POWER_CTRL));

l2x0_init(IO_ADDRESS(L2_BASE_ADDR), 0x0, ~0x00000000);
Confirmed - I found the marked code in arch/arm/mach-mx6/mm.c
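Beyond finding the code, the register value itself could be checked at run time. The sketch below decodes a prefetch control value read back from the PL310 and tells whether the erratum condition applies. It assumes that bit 30 is the double linefill enable and bit 23 the incrementing double linefill enable (which matches the 0x40800000 constant in the code above); the affected revisions are taken from the comment in mm.c, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed PL310 prefetch control bits; 0x40800000 is the OR of these two. */
#define PL310_PREFETCH_DBL_LINEFILL      (1u << 30)
#define PL310_PREFETCH_DBL_LINEFILL_INCR (1u << 23)

/* Return 1 if the double linefill feature is enabled in the given value. */
int double_linefill_enabled(uint32_t prefetch_ctrl)
{
    return (prefetch_ctrl & PL310_PREFETCH_DBL_LINEFILL) != 0;
}

/*
 * Return 1 if erratum 752271 can be hit: the PL310 revision is one of those
 * listed as affected in the mm.c comment AND double linefill is enabled.
 */
int erratum_752271_exposed(const char *pl310_rev, uint32_t prefetch_ctrl)
{
    static const char *const affected[] = { "r3p0", "r3p1", "r3p1-50rel0" };

    assert(pl310_rev != NULL);

    for (size_t i = 0; i < sizeof(affected) / sizeof(affected[0]); i++) {
        if (strcmp(pl310_rev, affected[i]) == 0)
            return double_linefill_enabled(prefetch_ctrl);
    }
    return 0; /* e.g. r3p2, where the erratum is fixed */
}
```

On the i.MX6D/Q (r3p1-50rel0) the value read back from L2X0_PREFETCH_CTRL should therefore have bit 30 clear, while on r3p2 parts having it set is fine.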
We realize that this is a slow and painful debug exercise but we are hopeful that we will discover the root cause of your instabilities.
Regards
Asim
Ofer F.