How to enable and program a periodic patrol scrub on layerscape?

tracysmith · ‎12-06-2018

There are five related questions for the layerscape SoC and DDR memory controller support team and the SDK team.

1) Will the DDR controller or SOC do an automatic patrol scrub for single bit errors in the hardware itself without software intervention?

2) If not, where and how should I enable the periodic patrol scrub feature in software? Please provide a detailed description, not a one line answer.

Layerscape SoCs have the feature to fix any detected single-bit errors. It is not part of the EDAC driver to correct single bit errors. The error is still counted so the EDAC driver can "see" this error and report the error. You can refer to SoC reference manual.

3) This raises the question, does the SoC or memory controller do a periodic memory scrub without any software intervention to fix single bit errors, and if so, how does the SoC or memory controller correct the single bit errors without doing a periodic memory scrub?

The patrol scrub is a feature of the memory controller and I need to know a) whether I need to enable it or it is done automatically by the controller, b) what I need to do to enable periodic patrol scrubs if software intervention is needed, and c) where and how it should be enabled. How and where do I program the controller to enable periodic patrol scrubs for the layerscape product? Can and should it be done in the layerscape EDAC driver?

4) I have the layerscape EDAC driver but it does not do a patrol scrub. Some EDAC drivers do patrol scrubs, should the patrol scrub be enabled and programmed in the layerscape EDAC driver? Why/or why not?

Assume a standard LS1043ARDB architecture with a DDR4, same IFS/DDR memory controller as the LS1043ARDB, and 4 ECC lines have been correctly added to the board for error correction and validated as working properly. The EDAC driver catches correctable single-bit and uncorrectable multi-bit errors, and they are correctly reported by edac-util. NOTE: All support needed for ECC has been added and ECC is enabled.

5) Finally, I see the ECC_FIX_EN bit. Should this be enabled by the layerscape EDAC driver to do periodic patrol scrubs? This is the last piece of this effort and I need to make sure the periodic scrub is set up correctly to fix any memory errors.

ECC_FIX_EN
ECC fixing enable.
The DDR controller supports ECC fixing in memory. In this mode, the DDR controller will automatically fix
any detected single-bit errors by issuing a new transaction to read the address with the failing bit,
correcting the bit, and writing the data back to memory. The single-bit error will still be counted in the
ERR_SBE register for this case, but the controller will automatically fix the error. Note that during the
'read back', the single-bit error will not be double counted in the ERR_SBE register. In addition, the DDR
controller will periodically issue a read to memory at the interval defined by ECC_SCRUB_INT. If a
single-bit error is detected during a periodic read, it will be fixed. In this case, the error will be reported as
an SSBE in the ERR_SBE register. If a multi-bit error is detected, then it will be reported in the
ERR_DETECT register. Also note that if a subsequent single-bit error is detected at the same address
while a first error is being fixed, then the second error will not be reported. Also, after a first SBE is
detected, no other SBEs will be fixed until the first SBE has been fixed in memory. This bit should only be
set if DDR_SDRAM_CFG[ECC_EN] is also set.
NOTE: Scrubbing cannot be enabled until after the controller has cleared
DDR_SDRAM_CFG_2[D_INIT].
0b - ECC scrubbing is disabled.
1b - ECC scrubbing is enabled.
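The two preconditions in the description above (ECC_EN set, D_INIT cleared) can be sketched as a small check. This is only an illustration under stated assumptions: the two mask values are what I recall from U-Boot's fsl_ddr_sdram.h and should be verified against your tree, and ecc_scrub_can_be_enabled() is a hypothetical helper name, not an existing U-Boot function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed mask values (recalled from U-Boot's fsl_ddr_sdram.h; verify them). */
#define SDRAM_CFG_ECC_EN   0x20000000u  /* DDR_SDRAM_CFG[ECC_EN]   */
#define SDRAM_CFG2_D_INIT  0x00000010u  /* DDR_SDRAM_CFG_2[D_INIT] */

/*
 * Hypothetical helper: returns true when the raw register values satisfy
 * both conditions quoted above, i.e. ECC is enabled and the controller
 * has already cleared D_INIT.
 */
bool ecc_scrub_can_be_enabled(uint32_t sdram_cfg, uint32_t sdram_cfg_2)
{
    bool ecc_on    = (sdram_cfg & SDRAM_CFG_ECC_EN) != 0;
    bool init_done = (sdram_cfg_2 & SDRAM_CFG2_D_INIT) == 0;

    return ecc_on && init_done;
}
```

With those assumed masks, a DDR_SDRAM_CFG_2 value of 0x00401140 has D_INIT clear, so the check passes as long as ECC_EN is set in DDR_SDRAM_CFG.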

Here are the reasons for doing a periodic scrub:

There are multiple layers to the problem of ECC.

First layer, there is the immediate 'correction' of a flipped bit.

This does not 'fix' the source of the error but corrects the flipped bit for use by the processor.

Most bit flips will be due to either a transitory noise problem on the bus, which will not be associated with any given memory cell, OR it will be due to a cosmic-ray induced bit flip in the memory cell which will stay 'flipped' until the location has been written to again.

The safe action is to write the ECC corrected data back to the same 'error' location in memory. Does the layerscape memory controller or SoC do this and how without software intervention?

Second layer, there is the risk of a double bit flip in memory.

Statistically this is very rare, but the odds significantly increase that a double bit flip will occur in a single word when a single bit flip goes uncorrected, giving more time for another cosmic ray induced bit flip to occur in that word.

The layerscape memory controller can only detect a bit-flip when a given location is read, correct? This is different from normal DRAM refresh routines. If a location is not normally read, it can go 'unserviced' indefinitely, allowing multiple bit flips to accumulate. Therefore a periodic patrol scrub is needed.

By periodically (once a day should be more than sufficient overkill) reading each location in the DRAM and writing that same (automatically ECC corrected if correction was needed) value back into the DRAM, we drastically reduce the potential for an uncorrectable multiple bit error to accumulate in any given word in memory.
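The read, correct, write-back idea can be illustrated with a toy software model. The sketch below uses a Hamming(7,4) code on byte-sized words purely for illustration; it is not the ECC code the DDR controller uses, and all function names are made up for this example. It shows that a scrub pass repairs a single flipped bit, while two flips in the same word are "corrected" into a different, wrong codeword, which is the case the periodic scrub is meant to make unlikely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Codeword position pos (1..7) is stored at bit (pos - 1) of the byte. */
static unsigned cw_bit(uint8_t cw, unsigned pos)
{
    return (cw >> (pos - 1)) & 1u;
}

/* Encode a 4-bit value; parity bits sit at positions 1, 2 and 4. */
uint8_t hamming74_encode(uint8_t nibble)
{
    assert(nibble < 16);
    unsigned d0 = nibble & 1u, d1 = (nibble >> 1) & 1u;
    unsigned d2 = (nibble >> 2) & 1u, d3 = (nibble >> 3) & 1u;
    unsigned p1 = d0 ^ d1 ^ d3;  /* covers positions 3, 5, 7 */
    unsigned p2 = d0 ^ d2 ^ d3;  /* covers positions 3, 6, 7 */
    unsigned p4 = d1 ^ d2 ^ d3;  /* covers positions 5, 6, 7 */

    return (uint8_t)(p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) |
                     (d1 << 4) | (d2 << 5) | (d3 << 6));
}

/* 0 for a clean word, otherwise the position (1..7) of a single flipped bit. */
unsigned hamming74_syndrome(uint8_t cw)
{
    unsigned s1 = cw_bit(cw, 1) ^ cw_bit(cw, 3) ^ cw_bit(cw, 5) ^ cw_bit(cw, 7);
    unsigned s2 = cw_bit(cw, 2) ^ cw_bit(cw, 3) ^ cw_bit(cw, 6) ^ cw_bit(cw, 7);
    unsigned s4 = cw_bit(cw, 4) ^ cw_bit(cw, 5) ^ cw_bit(cw, 6) ^ cw_bit(cw, 7);

    return s1 | (s2 << 1) | (s4 << 2);
}

/*
 * One patrol pass: read every word and write back the corrected value
 * when the syndrome is non-zero. Returns the number of words rewritten.
 */
size_t scrub_pass(uint8_t *mem, size_t n)
{
    size_t fixed = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned syn = hamming74_syndrome(mem[i]);

        if (syn != 0) {
            mem[i] ^= (uint8_t)(1u << (syn - 1));
            fixed++;
        }
    }
    return fixed;
}
```

A real SECDED code would flag the double flip as uncorrectable instead of miscorrecting it, but the conclusion is the same: the scrub has to visit a word before the second flip lands.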

Third layer, there is how the memory controller handles UE errors. My understanding is that the layerscape memory controller, can detect if it is a single bit (correctable) error or a multi-bit error that is not correctable. Is this the case?

An uncorrectable error in the data or the software will have consequences ranging from negligible to critical. The hardware can't tell whether it is critical, so it must assume it is. Should the software panic on uncorrectable errors, simply limp along hoping nothing is corrupted, or do a graceful reset on an uncorrectable error?

tracysmith · ‎12-14-2018

https://community.nxp.com/community/training/ftf-2015-training-presentations/projects/ddr-and-hardwa... NXP DDR Forum

DDR Initialization questions:

1. Should I set ECC_FIX_EN in set_ddr_sdram_cfg_3(), or is this too early in the DDR initialization process?

2. Where in the DDR initialization code in uboot will the hardware clear D_INIT?

Bulat · ‎12-18-2018

1. It is too early.

2. For DDR4, software polls for D_INIT to be cleared in fsl_ddr_gen4.c.

tracysmith · ‎12-21-2018

In fsl_ddr_gen4.c D_INIT is cleared, still unable to set CFG3 register in fsl_ddr_gen4.c. Why? As long as D_INIT is cleared, ECC_FIX_EN and ECC_SCRUB_INT should be able to be set.

I print the CFG2 register and it is 0x00401140, I then set the CFG3 register to 0x00000048. When I read back the CFG3 register after setting it, it is 0x00000000. NXP has never tested the periodic patrol scrub feature of the controller. Do you know if it is even functioning? Why can the CFG3 not be set when D_INIT is cleared?

Bulat · ‎12-21-2018

Can you try to write 0x48000000 to the CFG3 instead of 0x00000048?
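One way to read the difference between 0x00000048 and 0x48000000: if the register fields are numbered with bit 0 as the most significant bit, then a low bit number lands at the top of the 32-bit word. The sketch below builds the value from field positions. The positions used (ECC_FIX_EN at bit 1, ECC_SCRUB_INT at bits 4 to 7) are an assumption inferred from the 0x48000000 value in this thread, not copied from the reference manual, and the helper and macro names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Mask for a field spanning bits [first, last] of a 32-bit register,
 * where bit 0 is the MOST significant bit (MSB-0 numbering).
 */
uint32_t msb0_mask(unsigned first, unsigned last)
{
    assert(first <= last && last <= 31);
    unsigned width = last - first + 1;
    uint32_t ones = (width == 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);

    return ones << (31 - last);
}

/* Place a field value into MSB-0 bits [first, last]. */
uint32_t msb0_field(unsigned first, unsigned last, uint32_t value)
{
    return (value << (31 - last)) & msb0_mask(first, last);
}

/* Assumed layout, inferred from this thread; check the reference manual. */
#define CFG3_ECC_FIX_EN        msb0_field(1, 1, 1u)
#define CFG3_ECC_SCRUB_INT(v)  msb0_field(4, 7, (v))
```

Under that assumption, CFG3_ECC_FIX_EN | CFG3_ECC_SCRUB_INT(0x8) evaluates to 0x48000000, and a single bit at MSB-0 position 15 is the same as 1 << 16.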

tracysmith · ‎12-22-2018

=> md 0x1080260
01080260: 00000048 00000000 00000000 00000000 H...............

The register is now getting set and I need to test. You wrote, "there are no flags indicating process flow. Possible way is to inject errors into some memory locations and check those location after the scrubbing period is over."

How would we inject errors in memory that are not automatically corrected by the memory controller? Once injected, they are immediately corrected by the controller and correction doesn't wait for a periodic scrub to get corrected. How can we get around this so we have a correctable error when the periodic scrub does the scrub?

tracysmith · ‎12-14-2018

Assuming that set_ddr_sdram_cfg_3() is clearing the register with the bitwise AND and the shift, then the following should set a scrub interval of 128 refresh periods and enable ECC_FIX_EN for the periodic scrub. Will this work correctly? If so, I then need to find where DDR_SDRAM_CFG_2[D_INIT] is cleared.

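static void set_ddr_sdram_cfg_3(fsl_ddr_cfg_regs_t *ddr,
                                const memctl_options_t *popts)
{
    int rd_pre;

    rd_pre = popts->quad_rank_present ? 1 : 0;

    /* Plain assignment: every bit except the rd_pre bit starts out as zero. */
    ddr->ddr_sdram_cfg_3 = (rd_pre & 0x1) << 16;

    /*
     * OR in the scrub bits so the rd_pre bit set above is preserved.
     * Value as posted; a later reply in this thread suggests 0x48000000.
     */
    if (popts->ecc_fix_en)
        ddr->ddr_sdram_cfg_3 |= 0x00000048;

    debug("FSLDDR: ddr_sdram_cfg_3 = 0x%08x\n", ddr->ddr_sdram_cfg_3);
}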

Bulat · ‎12-07-2018

Yes, the DDR controller of the layerscape devices is able to provide an automatic patrol scrub for single-bit errors in hardware without software intervention. As you correctly mentioned in question 5), that feature is enabled by ECC_FIX_EN: the controller periodically reads the full memory using an internal address counter and corrects any single-bit errors it detects. The service period is chosen by ECC_SCRUB_INT; the minimum is equal to one refresh period. It can be enabled in U-Boot as soon as the D_INIT bit has been cleared by hardware during DDR initialization.

Regards,

Bulat

tracysmith · ‎12-07-2018

Three additional questions. I want to set the periodic scrub to scrub once a day and I need to validate that the scrub occurs once a day.

1) The refresh period for the value of ECC_SCRUB_INT should be what value for a once a day periodic scrub?

2) How do I validate the periodic scrub is scrubbing once a day once set?

3) The memory controller manages the scrub and how often it scrubs once it is enabled and once ECC_SCRUB_INT is enabled and nothing else is required by the software or the EDAC driver, correct?

Bulat · ‎12-10-2018

1) The refresh period for the value of ECC_SCRUB_INT should be what value for a once a day periodic scrub?

This depends on the memory size (mem_size) and the refresh period (refresh_period). The following formula can be used to calculate the number of refresh periods between two scrubbing services:

n = day_time x 64 / mem_size / refresh_period

Since ECC_SCRUB_INT can only be set to specific values (corresponding to powers of 2), you need to round the resulting 'n' to the nearest possible value. For example, if n = 165, then it should be rounded to either 128 or 256, for which the ECC_SCRUB_INT values are 4b1000 and 4b1001 respectively. In the first case scrubbing the entire memory takes less than a day; in the second case, more than a day.
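The calculation and rounding described above can be sketched as follows. The function names are hypothetical. The field encoding in scrub_int_field() is an extrapolation from the two data points given in this thread (128 refresh periods is 4b1000, 256 is 4b1001); the valid range and the remaining encodings have to be checked in the reference manual.

```c
#include <assert.h>
#include <stdint.h>

/* n = day_time x 64 / mem_size / refresh_period, as given in the reply. */
double scrub_interval_refreshes(double period_s, double mem_size_bytes,
                                double refresh_period_s)
{
    assert(mem_size_bytes > 0.0 && refresh_period_s > 0.0);
    return period_s * 64.0 / mem_size_bytes / refresh_period_s;
}

/* Largest power of two that is <= n (full pass finishes early). */
uint32_t pow2_floor(double n)
{
    assert(n >= 1.0 && n < 2147483648.0);
    uint32_t p = 1;

    while ((double)p * 2.0 <= n)
        p *= 2;
    return p;
}

/* Smallest power of two that is >= n (full pass finishes late). */
uint32_t pow2_ceil(double n)
{
    uint32_t p = pow2_floor(n);

    return ((double)p < n) ? p * 2 : p;
}

/*
 * ASSUMPTION: field = log2(interval) + 1, fitted only to the two
 * encodings quoted in this thread. Expects a power-of-two interval.
 */
unsigned scrub_int_field(uint32_t interval)
{
    unsigned field = 1;

    while (interval > 1) {
        interval >>= 1;
        field++;
    }
    return field;
}
```

For the n = 165 example, pow2_floor() gives 128 and pow2_ceil() gives 256, which map to 8 (4b1000) and 9 (4b1001) under the assumed encoding.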

2) How do I validate the periodic scrub is scrubbing once a day once set?

I am not sure what "validate" means. There are no HW flags indicating that the entire memory has been scrubbed.

3) The memory controller manages the scrub and how often it scrubs once it is enabled and once ECC_SCRUB_INT is enabled and nothing else is required by the software or the EDAC driver, correct?

Correct.

tracysmith · ‎12-10-2018

Also, in the formula:

n = day_time x 64 / mem_size / refresh_period and n is rounded up to the next power of 2.

1. What is day_time? Is this the time of day we want the scrub to occur like 17:00, or some n hours cycle of time, or what is referred to by day_time.

The mem_size for the DDR4 is 64 GBytes. The refresh period is what is suggested by the hardware reference manual of 7.8 microseconds to complete the scrub in 2.35 hours.

2. I don’t understand the “ECC_SCRUB_INT values are 4b1000 or 4b1001?” Is this the n value plus setting or clearing the last bit?

Bulat · ‎12-11-2018

1. day_time is the 'size' of the day; your question was "...the scrub occurs once a day". Let's consider that day_time is 24 hours, or 86400 seconds.

2. “ECC_SCRUB_INT values are 4b1000 or 4b1001” Please refer to the Ref Manual of any Layerscape processor; it explains the possible values of the ECC_SCRUB_INT field. For example, the value 4b1000 corresponds to 128 refresh periods.
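As a rough cross-check of the numbers in the question above, the time for one full pass can be computed from the same formula rearranged. This is a sketch under assumptions: the factor 64 is read as bytes covered per periodic read, 64 GBytes is taken as 2^36 bytes, and scrub_pass_seconds() is a hypothetical name.

```c
#include <assert.h>

/*
 * Seconds for one full scrub pass: (mem_size / 64) periodic reads,
 * with one read issued every `interval_refreshes` refresh periods.
 */
double scrub_pass_seconds(double mem_size_bytes, double refresh_period_s,
                          double interval_refreshes)
{
    assert(mem_size_bytes > 0.0 && refresh_period_s > 0.0);
    assert(interval_refreshes >= 1.0);
    return mem_size_bytes / 64.0 * refresh_period_s * interval_refreshes;
}
```

With 64 GBytes, a 7.8 microsecond refresh period and an interval of 1, this gives about 8375 seconds, or roughly 2.3 hours, which is close to the 2.35 hours quoted. For a 24 hour target the formula gives n of about 10.3, so the nearest choices would be 8 (about 18.6 hours per pass) or 16 (about 37.2 hours per pass).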

tracysmith · ‎12-10-2018

I am not sure what "validate" means. There are no HW flags indicating that the entire memory has been scrubbed.

I mean how can I verify or check or test that the scrub is taking place after I have enabled the scrub and ECC_SCRUB_INT is set to 4b1000?

Bulat · ‎12-11-2018

As I wrote, there are no flags indicating process flow. Possible way is to inject errors into some memory locations and check those location after the scrubbing period is over.