There are five related questions for the Layerscape SoC and DDR memory controller support team and the SDK team.
1) Will the DDR controller or SoC do an automatic patrol scrub for single-bit errors in the hardware itself, without software intervention?
2) If not, where and how should I enable the periodic patrol scrub feature in software? Please provide a detailed description, not a one line answer.
Layerscape SoCs have the feature to fix any detected single-bit errors. Correcting single-bit errors is not part of the EDAC driver. The error is still counted, so the EDAC driver can "see" it and report it. You can refer to the SoC reference manual.
3) This raises the question: does the SoC or memory controller do a periodic memory scrub without any software intervention to fix single-bit errors, and if not, how does the SoC or memory controller correct single-bit errors without doing a periodic memory scrub?
The patrol scrub is a feature of the memory controller, and I need to know a) whether I need to enable it or whether the controller does it automatically, b) what I need to do to enable periodic patrol scrubs if software intervention is needed, and c) where it should be enabled and how. How do I program the controller, and where do I program it, to enable periodic patrol scrubs for the Layerscape product? Can and should it be done in the Layerscape EDAC driver?
4) I have the Layerscape EDAC driver, but it does not do a patrol scrub. Some EDAC drivers do patrol scrubs. Should the patrol scrub be enabled and programmed in the Layerscape EDAC driver? Why or why not?
Assume a standard LS1043ARDB architecture with DDR4, the same IFS/DDR memory controller as the LS1043ARDB, and 4 ECC lines that have been correctly added to the board for error correction and validated as working properly. I can catch correctable single-bit and uncorrectable multi-bit errors using the EDAC driver, and they are correctly reported by edac-util and the EDAC driver. NOTE: All support needed for ECC has been added and ECC is enabled.
5) Finally, I see the ECC_FIX_EN bit. Should this be enabled by the Layerscape EDAC driver to do periodic patrol scrubs? This is the last piece of this effort, and I need to make sure I correctly do a periodic scrub to fix any memory errors.
ECC fixing enable.
The DDR controller supports ECC fixing in memory. In this mode, the DDR controller will automatically fix
any detected single-bit errors by issuing a new transaction to read the address with the failing bit,
correcting the bit, and writing the data back to memory. The single-bit error will still be counted in the
ERR_SBE register for this case, but the controller will automatically fix the error. Note that during the
'read back', the single-bit error will not be double counted in the ERR_SBE register. In addition, the DDR
controller will periodically issue a read to memory at the interval defined by ECC_SCRUB_INT. If a
single-bit error is detected during a periodic read, it will be fixed. In this case, the error will be reported as
an SSBE in the ERR_SBE register. If a multi-bit error is detected, then it will be reported in the
ERR_DETECT register. Also note that if a subsequent single-bit error is detected at the same address
while a first error is being fixed, then the second error will not be reported. Also, after a first SBE is
detected, no other SBEs will be fixed until the first SBE has been fixed in memory. This bit should only be
set if DDR_SDRAM_CFG[ECC_EN] is also set.
NOTE: Scrubbing cannot be enabled until after the controller has cleared
0b - ECC scrubbing is disabled.
1b - ECC scrubbing is enabled.
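To make questions 2 and 5 concrete, this is roughly the sequence I have in mind, written as a minimal C sketch against a simulated register block rather than real hardware. The struct layout, the bit masks, and the function name `ddr_enable_ecc_fix` are placeholders I made up for illustration; the real register offsets and bit positions would have to come from the SoC reference manual. The sketch encodes only the one rule from the description above: ECC_FIX_EN should be set only if DDR_SDRAM_CFG[ECC_EN] is also set. It does not model the ordering constraint in the NOTE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated register block. The layout and all masks below are
 * PLACEHOLDERS for illustration, not values from the reference manual. */
struct ddr_regs_sim {
    uint32_t sdram_cfg; /* stands in for DDR_SDRAM_CFG (holds ECC_EN) */
    uint32_t scrub_cfg; /* stands in for the register holding ECC_FIX_EN
                           and ECC_SCRUB_INT */
};

#define ECC_EN_MASK         (1u << 29)  /* hypothetical bit position */
#define ECC_FIX_EN_MASK     (1u << 31)  /* hypothetical bit position */
#define ECC_SCRUB_INT_SHIFT 0           /* hypothetical field position */
#define ECC_SCRUB_INT_MASK  (0xFu << ECC_SCRUB_INT_SHIFT)

/* Set ECC_FIX_EN and the scrub interval, but only if ECC is already
 * enabled. Returns 0 on success, -1 if ECC_EN is not set (in which case
 * nothing is written). */
int ddr_enable_ecc_fix(struct ddr_regs_sim *regs, uint32_t scrub_int)
{
    uint32_t v;

    assert(regs != NULL);

    if (!(regs->sdram_cfg & ECC_EN_MASK))
        return -1;

    /* Read-modify-write so unrelated bits are preserved. */
    v = regs->scrub_cfg;
    v &= ~ECC_SCRUB_INT_MASK;
    v |= (scrub_int << ECC_SCRUB_INT_SHIFT) & ECC_SCRUB_INT_MASK;
    v |= ECC_FIX_EN_MASK;
    regs->scrub_cfg = v;

    return 0;
}
```

Whether a write like this belongs in the boot loader's DDR initialization or in the EDAC driver's probe path is exactly what I am asking.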
Here are the reasons for doing a periodic scrub:
There are multiple layers to the problem of ECC.
First layer, there is the immediate 'correction' of a flipped bit.
This does not 'fix' the source of the error but corrects the flipped bit for use by the processor.
Most bit flips will be due to either a transitory noise problem on the bus, which will not be associated with any given memory cell, OR it will be due to a cosmic-ray induced bit flip in the memory cell which will stay 'flipped' until the location has been written to again.
The safe action is to write the ECC-corrected data back to the same 'error' location in memory. Does the Layerscape memory controller or SoC do this, and if so, how does it do it without software intervention?
Second layer, there is the risk of a double bit flip in memory.
Statistically this is very rare, but the odds significantly increase that a double bit flip will occur in a single word when a single bit flip goes uncorrected, giving more time for another cosmic ray induced bit flip to occur in that word.
The Layerscape memory controller can only detect a bit flip when a given location is read, correct? This is different from normal DRAM refresh routines. If a location is not normally read, it can go 'unserviced' indefinitely, allowing multiple bit flips to accumulate. Therefore a periodic patrol scrub is needed.
By periodically (once a day should be more than sufficient) reading each location in the DRAM and writing that same value (automatically ECC-corrected if correction was needed) back into the DRAM, we drastically reduce the potential for an uncorrectable multiple-bit error to accumulate in any given word in memory.
Third layer, there is how the memory controller handles uncorrectable errors (UEs). My understanding is that the Layerscape memory controller can detect whether an error is a single-bit (correctable) error or a multi-bit error that is not correctable. Is this the case?
An uncorrectable error in the data or the software will have consequences ranging from negligible to critical. The hardware cannot tell whether the error is critical, so it must assume it is. Should the software panic on uncorrectable errors, simply limp along hoping nothing is corrupted, or do a graceful reset?
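If the controller cannot do this on its own, the software fallback I picture looks like the minimal C sketch below. It is only an illustration run on an ordinary buffer: the function names `scrub_words` and `words_per_tick` are mine, and a real scrub would have to operate on ECC-protected physical memory and make each read and write-back atomic with respect to other writers, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read every 64-bit word and write the same value back. On ECC memory the
 * read would return corrected data, so the write-back would repair a
 * flipped cell. Returns the number of words touched.
 * NOTE: not atomic; a concurrent writer could be overwritten. */
size_t scrub_words(volatile uint64_t *base, size_t nwords)
{
    size_t i;

    assert(base != NULL || nwords == 0);

    for (i = 0; i < nwords; i++) {
        uint64_t v = base[i]; /* the read triggers ECC check/correction */
        base[i] = v;          /* write the (corrected) value back */
    }
    return nwords;
}

/* Pacing helper: how many words to scrub per timer tick so that one full
 * pass over total_words finishes within ticks_per_pass ticks (rounded up).
 * With ticks_per_pass == 0 the whole region is scrubbed in one go. */
size_t words_per_tick(size_t total_words, size_t ticks_per_pass)
{
    if (ticks_per_pass == 0)
        return total_words;
    return total_words / ticks_per_pass +
           (total_words % ticks_per_pass != 0);
}
```

For example, with one tick per second and a once-a-day pass, `ticks_per_pass` would be 86400 and each tick would scrub a small slice of memory instead of stalling the system for a full pass.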
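To frame the three options, here is a small C sketch of the policy decision as I see it. All of the type and function names are hypothetical and not taken from the EDAC driver or the SDK. As far as I know, the Linux EDAC core already has a panic-on-UE switch (`edac_mc_panic_on_ue`), so part of my question is whether that is the recommended setting for this platform.

```c
#include <assert.h>

/* Hypothetical classification and policy types, for illustration only. */
enum mem_err   { MEM_ERR_CE, MEM_ERR_UE };
enum ue_policy { UE_POLICY_LOG, UE_POLICY_RESET, UE_POLICY_PANIC };
enum action {
    ACT_COUNT_ONLY,       /* correctable: count it and carry on */
    ACT_LOG_AND_CONTINUE, /* "limp along" */
    ACT_GRACEFUL_RESET,
    ACT_PANIC
};

/* Map an error class and the configured UE policy to an action.
 * Correctable errors never escalate; an unknown policy is treated as
 * the most conservative choice. */
enum action decide_action(enum mem_err err, enum ue_policy policy)
{
    assert(err == MEM_ERR_CE || err == MEM_ERR_UE);

    if (err == MEM_ERR_CE)
        return ACT_COUNT_ONLY;

    switch (policy) {
    case UE_POLICY_LOG:
        return ACT_LOG_AND_CONTINUE;
    case UE_POLICY_RESET:
        return ACT_GRACEFUL_RESET;
    case UE_POLICY_PANIC:
        return ACT_PANIC;
    }
    return ACT_PANIC; /* unknown policy: assume the error is critical */
}
```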