How to fix ECC of DDR4 training address after warm boot?

ruudpendavingh
Contributor II

Hi,

I have a custom board with an LS1046A, 4 GB of DDR4 ECC memory, and U-Boot. I found a way to perform a 'warm' reboot (preserving DRAM contents) by setting the FRC_SR bit of the DDR_SDRAM_CFG_2 register while executing from internal SRAM. My U-Boot code implements the is_warm_boot() function to distinguish between warm and cold boots. I do not have SPD; the DDR controller configuration was created using NXP CodeWarrior.

There are no issues after a cold boot (memory fully initialized by the DDR controller); stressapptest runs for days on Linux without problems.

After a warm boot, all memory appears to be preserved except the (128-byte?) area where the DDR training occurs. In the fsl_ddr_gen4.c source file, the DDR training address for a warm boot is set to CONFIG_SYS_SDRAM_BASE (defined as 0x8000_0000):

#ifdef CONFIG_DEEP_SLEEP
    if (is_warm_boot()) {
        ddr_out32(&ddr->sdram_cfg_2,
              regs->ddr_sdram_cfg_2 & ~SDRAM_CFG2_D_INIT);
        ddr_out32(&ddr->init_addr, CONFIG_SYS_SDRAM_BASE);
        ddr_out32(&ddr->init_ext_addr, DDR_INIT_ADDR_EXT_UIA);

        /* DRAM VRef will not be trained */
        ddr_out32(&ddr->ddr_cdr2,
              regs->ddr_cdr2 & ~DDR_CDR2_VREF_TRAIN_EN);
    } else
#endif

Apparently this means that training is performed on CPU physical address 0x8_8000_0000. Reading from a location between 0x8_8000_0000 and 0x8_8000_007F will cause an error, but the area above can be accessed:

=> mw.q 0x880000080 deadc0dec0ffabba
=> md.q 0x880000080 10
880000080: deadc0dec0ffabba 0000000000000000    ................
880000090: 0000000000000000 0000000000000000    ................
8800000a0: 0000000000000000 0000000000000000    ................
8800000b0: 0000000000000000 0000000000000000    ................
8800000c0: 0000000000000000 0000000000000000    ................
8800000d0: 0000000000000000 0000000000000000    ................
8800000e0: 0000000000000000 0000000000000000    ................
8800000f0: 0000000000000000 0000000000000000    ................

I attempted to fix the ECC of the training area using 64-bit writes:

=> mw.q 0x880000000 0 10

But a read after that will fail:

=> md.q 0x880000000
880000000:"Synchronous Abort" handler, esr 0x96000210
elr: 0000000040145994 lr : 00000000401457fc (reloc)
elr: 00000000fbdbb994 lr : 00000000fbdbb7fc
x0 : 0000000000000010 x1 : 000000000000003a
x2 : 0000000000000020 x3 : 0000000000000001
x4 : 00000000fbc6aed8 x5 : 0000000000000009
x6 : 0000000000000021 x7 : 00000000fffffffd
x8 : 0000000000000038 x9 : 000000000000000c
x10: 00000000fbdc85b8 x11: 000000000000000f
x12: 0000000000000004 x13: 00000000fbc6b3d0
x14: 00000000fbc6b6d8 x15: 00000000fbc6b058
x16: 0000000000000000 x17: 00000000ffffffff
x18: 00000000fbc6dd78 x19: 0000000000000011
x20: 0000000000000002 x21: 00000000fbc6b3c8
x22: 0000000000000002 x23: 0000000000000008
x24: 0000000000000008 x25: 00000000fbdd0740
x26: 0000000880000000 x27: 0000000000000010
x28: 0000000000000000 x29: 00000000fbc6b330

How can I fix the ECC of the training area after a warm boot?

Is the DDR_INIT_ADDR the DDR address or a physical CPU address?

Regards,

Ruud

ruudpendavingh
Contributor II

When I disable the dcache after a warm reboot, U-Boot no longer crashes when reading the DDR training addresses:

=> dcache flush
=> dcache off
=> dcache flush
=> md.q 0x880000000 20
880000000: 5757575757575757 0000000000000000    WWWWWWWW........
880000010: 5757575757575757 0000000000000000    WWWWWWWW........
880000020: 0000000000000000 575757575757575f    ........_WWWWWWW
880000030: 0000000000000000 5757575757575757    ........WWWWWWWW
880000040: fcf8f8f8f8f8f8f8 a8a8a8a8a8a8a8a8    ................
880000050: a8a8a8a8a8a8a8a8 ffffffffffffffff    ................
880000060: ffffffffffffffff a8a8a8a8a8a8a8a8    ................
880000070: ffffffffffffffff a8a8a8a8a8a8a8a8    ................
880000080: 0000000000000000 0000000000000000    ................
880000090: 0000000000000000 0000000000000000    ................
8800000a0: 0000000000000000 0000000000000000    ................
8800000b0: 0000000000000000 0000000000000000    ................
8800000c0: 0000000000000000 0000000000000000    ................
8800000d0: 0000000000000000 0000000000000000    ................
8800000e0: 0000000000000000 0000000000000000    ................
8800000f0: 0000000000000000 0000000000000000    ................

But writes to the affected addresses are "partly ignored":

=> mw.q 0x880000000 1234567890abcdef 20
=> md.q 0x880000000 20
880000000: 5757575757575757 0000000000000000    WWWWWWWW........
880000010: 5757575757575757 0000000000000000    WWWWWWWW........
880000020: 0000000000000000 1234567890abcdef    ............xV4.
880000030: 0000000000000000 5757575757575757    ........WWWWWWWW
880000040: 1234567890abcdef a8a8a8a8a8a8a8a8    ....xV4.........
880000050: a8a8a8a8a8a8a8a8 ffffffffffffffff    ................
880000060: ffffffffffffffff a8a8a8a8a8a8a8a8    ................
880000070: ffffffffffffffff a8a8a8a8a8a8a8a8    ................
880000080: 1234567890abcdef 1234567890abcdef    ....xV4.....xV4.
880000090: 1234567890abcdef 1234567890abcdef    ....xV4.....xV4.
8800000a0: 1234567890abcdef 1234567890abcdef    ....xV4.....xV4.
8800000b0: 1234567890abcdef 1234567890abcdef    ....xV4.....xV4.
8800000c0: 1234567890abcdef 1234567890abcdef    ....xV4.....xV4.
8800000d0: 1234567890abcdef 1234567890abcdef    ....xV4.....xV4.
8800000e0: 1234567890abcdef 1234567890abcdef    ....xV4.....xV4.
8800000f0: 1234567890abcdef 1234567890abcdef    ....xV4.....xV4.

What's happening here?

Regards,

Ruud

ufedor
NXP Employee

Please provide a U-Boot dump of the initialized DDR controller registers as a text attachment.

ruudpendavingh
Contributor II

Attached is the DDR register dump after a warm boot.

The DDR type is Micron MT40A512M8RH-083E.

Regards,

Ruud

ufedor
NXP Employee

In the provided dump:

01080140: 00000000 00000000 00000080 00000080

which corresponds to the observed DDR controller behaviour, i.e.:

DDR_INIT_ADDR = 0x80000000

DDR_INIT_EXT_ADDR = 0x80000000 - UIA=1 (use the initialization address programmed in DDR_INIT_ADDR and DDR_INIT_EXT_ADDR)

The area of 0x80 bytes at the initialization address is used for the DDR controller calibration.

Data in this area is destroyed during warm boot and correct ECC is not generated.

The "1234567890abcdef" values that appear within the first 0x80 bytes in your experiment correspond to two 64-bit words that, by chance, have single-bit errors, which the DDR controller corrects. All other 64-bit words have multi-bit errors.

The initial "Synchronous Abort" is generated because the DDR controller detects a multi-bit memory error when a read of the whole cache line is attempted as a result of a single 64-bit word write into that cache line.
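
To make the cache-line point concrete, here is a minimal sketch of the address arithmetic. It assumes a 64-byte cache line (the usual value for the Cortex-A72 cores in the LS1046A; please check it for your configuration). The helper names are hypothetical and do not come from U-Boot.

```c
#include <stdint.h>

/* Assumption: 64-byte data cache line (typical for Cortex-A72). */
#define CACHE_LINE_SIZE 64u

/* Base address of the cache line that contains addr. */
static uint64_t cache_line_base(uint64_t addr)
{
    return addr & ~(uint64_t)(CACHE_LINE_SIZE - 1);
}

/* Number of cache lines touched by the range [start, start + len), len > 0. */
static unsigned cache_lines_spanned(uint64_t start, uint64_t len)
{
    uint64_t first = cache_line_base(start);
    uint64_t last = cache_line_base(start + len - 1);

    return (unsigned)((last - first) / CACHE_LINE_SIZE) + 1;
}
```

Under that assumption, the 0x80-byte area at 0x8_8000_0000 covers two cache lines, so a single 64-bit write anywhere inside it can first pull in a full line that still carries bad ECC.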

Try setting the MBED and SBED bits in the ERR_DISABLE register before rewriting the initialization area.
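
A rough sketch of that sequence follows; it is not tested on hardware. The bit masks for MBED and SBED are assumptions that should be checked against the LS1046A reference manual. The function name is hypothetical, restoring the previous ERR_DISABLE value afterwards is my own addition, and the register is treated as a plain host-order word (in U-Boot you would go through ddr_in32()/ddr_out32() instead).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed ERR_DISABLE bit masks; verify against the reference manual. */
#define ERR_DISABLE_MBED 0x00000008u /* multiple-bit ECC error disable */
#define ERR_DISABLE_SBED 0x00000004u /* single-bit ECC error disable */

/* Size of the area used for calibration at the initialization address. */
#define TRAINING_AREA_SIZE 0x80u

/*
 * Rewrite the training area while ECC error reporting is masked.
 * err_disable: the ERR_DISABLE register (plain host-order word here)
 * area:        start of the training area
 * bytes:       size of the area, a multiple of 8
 */
static void rewrite_training_area(volatile uint32_t *err_disable,
                                  volatile uint64_t *area, size_t bytes)
{
    uint32_t saved = *err_disable;
    size_t i;

    assert(bytes % sizeof(uint64_t) == 0);

    /* Mask single-bit and multi-bit ECC errors. */
    *err_disable = saved | ERR_DISABLE_MBED | ERR_DISABLE_SBED;

    /* Full 64-bit writes so that fresh ECC is generated for every word. */
    for (i = 0; i < bytes / sizeof(uint64_t); i++)
        area[i] = 0;

    /* On real hardware, flush the data cache here before unmasking. */

    /* Restore the previous error-reporting configuration (my assumption). */
    *err_disable = saved;
}
```

In U-Boot this would be called once after a warm boot, with the training area address and TRAINING_AREA_SIZE, before anything else reads that range.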

ruudpendavingh
Contributor II

Thanks, that works!
