How to check that a DDR4 single-bit error has been corrected on LX2080A

RoseWen · ‎02-18-2024

Hi,

When I test ECC error injection with the EDAC driver on an LX2080A board, it prints the following debug information:

1) Error Detect Register
2) Fault Data bit
3) Expected Data and ECC
4) Captured Data and ECC
5) Error address
6) PFN

The ce_count keeps increasing until I manually disable the single-bit error interrupt. (I asked in another question why the EDAC interrupt keeps triggering even though error injection is disabled, but have had no reply yet.)

But ce_count is the "correctable" error count. I want to know how to check that the single-bit error "has been corrected".

Also, how can I read the DDR data at the error address in Linux? (Example: Err addr: 0x3e0bfff00)

Please refer to the attached test log.

Thanks,

Rose

yipingwang · ‎02-26-2024

The ECC in the LX2080A is SECDED, meaning single-bit error correction and double-bit error detection. Single-bit errors increment the SBEC (single-bit error counter) until it reaches the SBET (single-bit error threshold). Once the threshold is reached, the SBE flag in the ERR_DETECT register is set. You can detect an SBE by checking either the SBEC or the ERR_DETECT register.
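From Linux, the EDAC core also publishes per-controller counters in sysfs, and the ce_count from the question is one of them. The following is a minimal sketch for reading them, assuming the standard EDAC sysfs layout and that the DDR controller is registered as mc0 (the index is an assumption, not something stated in this thread).

```python
from pathlib import Path

# Standard Linux EDAC sysfs location. "mc0" is an assumption about which
# memory controller index the board registers.
EDAC_MC_DIR = Path("/sys/devices/system/edac/mc/mc0")


def read_edac_counts(mc_dir=EDAC_MC_DIR):
    """Return (ce_count, ue_count) for one EDAC memory controller."""
    mc_dir = Path(mc_dir)
    ce = int((mc_dir / "ce_count").read_text().strip())  # correctable errors
    ue = int((mc_dir / "ue_count").read_text().strip())  # uncorrectable errors
    return ce, ue


# Only touch sysfs when the directory actually exists on this machine.
if __name__ == "__main__" and EDAC_MC_DIR.is_dir():
    ce, ue = read_edac_counts()
    print(f"correctable errors:   {ce}")
    print(f"uncorrectable errors: {ue}")
```

A rising ce_count with ue_count unchanged means the events were reported as single-bit errors. That is a reading of the counters only; it does not by itself confirm what data was returned to the CPU.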

For multi-bit errors, only detection of any two-bit flip is guaranteed, and the MBE flag in ERR_DETECT is set. If you flip more than two bits in your test, the ECC may or may not detect it.
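To illustrate the SECDED behaviour described above, here is a toy sketch using an extended Hamming (8,4) code. It is not the code used by the LX2080A DDR controller, which protects 64-bit words; the small code and the names `encode`, `decode`, and `flip` are chosen only for illustration.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    code = [0] * 8  # index 0: overall parity, indices 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for pos in positions:
        out[pos] ^= 1
    return out


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    overall = sum(code) % 2          # 0 while the overall parity still holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume one flipped bit and correct it.
        # A syndrome of 0 here points at the overall parity bit itself.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with intact overall parity: even number of flips.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
print(decode(flip(word, 5)))        # one flip: corrected, data intact
print(decode(flip(word, 2, 6)))     # two flips: detected, not corrected
print(decode(flip(word, 1, 2, 3)))  # three flips: reported as corrected, data wrong
```

The last case shows why more than two flipped bits are not guaranteed to be detected: the decoder sees an odd number of flips, treats it as a single-bit error, and returns wrong data while reporting a correction.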

RoseWen · ‎02-26-2024

Hi,

The information you mentioned can be found in the LX2080A datasheet.

It doesn't answer my question at all.

My question is: how do I check that the single-bit error has been corrected?

How do I read the error address 0x3e0bfff00? Is it a physical address or a virtual address?

Rose

yipingwang · ‎02-26-2024

Investigating

yipingwang · ‎02-28-2024

The starting address from the ECC (memory controller) point of view is the value listed in the BNDS registers. From the SoC point of view, the DDR memory space starts at 0x8000_0000, so you need to add 0x8000_0000 to the address you read from the ECC register to get the failing address.

RoseWen · ‎02-28-2024

Hi

How do I add 0x8000_0000 to the address read from the ECC register?

Rose

yipingwang · ‎03-03-2024

0x3_e0bf_ff00 + 0x8000_0000 = 0x4_60bf_ff00

RoseWen · ‎03-03-2024

Hi

Reading this address via devmem causes a kernel panic, as shown below:

-----------------------------------------------------------------------------

root@localhost:/dni# devmem 0x460bfff00
[ 78.099537] SError Interrupt on CPU0, code 0xbf000002 -- SError
[ 78.099539] CPU: 0 PID: 2997 Comm: devmem Tainted: G O 5.10.35 #1
[ 78.099541] Hardware name: NXP Layerscape LX2160ARDB (DT)
[ 78.099543] pstate: 60000000 (nZCv daif -PAN -UAO -TCO BTYPE=--)
[ 78.099545] pc : 0000aaaae2190d04
[ 78.099546] lr : 0000aaaae2190c1c
[ 78.099548] sp : 0000ffffcacfee70
[ 78.099549] x29: 0000ffffcacfee70 x28: 0000000000000000
[ 78.099554] x27: 0000000000000000 x26: 0000000000000000
[ 78.099558] x25: 0000000000000000 x24: 0000000000000000
[ 78.099562] x23: 0000000000000000 x22: 0000000000000000
[ 78.099566] x21: 0000aaaae21908f0 x20: 0000000000000000
[ 78.099570] x19: 0000aaaae2190df0 x18: 0000000073516240
[ 78.099574] x17: 0000ffffad5aec50 x16: 0000aaaae21a1f98
[ 78.099577] x15: 000000006fffff47 x14: 0000000000000000
[ 78.099581] x13: 0000000000000000 x12: 0000ffffad4e5208
[ 78.099585] x11: ffffffffffffffff x10: 0000000000000010
[ 78.099589] x9 : 000000000000000f x8 : 00000000000000de
[ 78.099593] x7 : 0000ffffad5f38a8 x6 : 0000000000001000
[ 78.099597] x5 : 0000000460bff000 x4 : 0000000000000003
[ 78.099601] x3 : 0000000000000001 x2 : 0000000000000001
[ 78.099604] x1 : 0000ffffad68d000 x0 : 0000000000000008
[ 78.099609] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 78.099611] CPU: 0 PID: 2997 Comm: devmem Tainted: G O 5.10.35 #1
[ 78.099612] Hardware name: NXP Layerscape LX2160ARDB (DT)
[ 78.099614] Call trace:
[ 78.099615] dump_backtrace+0x0/0x1a8
[ 78.099617] show_stack+0x18/0x68
[ 78.099618] dump_stack+0xd0/0x12c
[ 78.099619] panic+0x16c/0x334
[ 78.099621] nmi_panic+0x8c/0x90
[ 78.099622] arm64_serror_panic+0x78/0x84
[ 78.099624] do_serror+0x38/0x98
[ 78.099625] el0_error_naked+0x14/0x1c
[ 78.099656] SMP: stopping secondary CPUs
[ 78.099658] Kernel Offset: 0x2b5cfb600000 from 0xffff800010000000
[ 78.099660] PHYS_OFFSET: 0xffff993740000000
[ 78.099661] CPU features: 0x0240022,21806008
[ 78.099663] Memory Limit: none

----------------------------------------------------------------------------
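For reference, a devmem-style read amounts to mapping the page of /dev/mem that contains the address and reading at the in-page offset. Register x5 in the dump above holds 0x460bff000, which looks like the page-aligned form of the requested address. The following is a minimal sketch of that access pattern. The function name `read_phys_u64` and the `mem_path` parameter are illustrative, and a kernel built with CONFIG_STRICT_DEVMEM may refuse access to system RAM through /dev/mem.

```python
import mmap
import os
import struct


def read_phys_u64(phys_addr, mem_path="/dev/mem"):
    """Read one 64-bit little-endian word at phys_addr by mapping its page."""
    page = mmap.ALLOCATIONGRANULARITY
    base = phys_addr & ~(page - 1)   # page-aligned start of the mapping
    offset = phys_addr - base        # position of the word inside that page
    if offset + 8 > page:
        raise ValueError("64-bit read would cross a page boundary")
    fd = os.open(mem_path, os.O_RDONLY | os.O_SYNC)
    try:
        with mmap.mmap(fd, page, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ, offset=base) as mem:
            return struct.unpack("<Q", mem[offset:offset + 8])[0]
    finally:
        os.close(fd)
```

On the target this would be called as root, for example `hex(read_phys_u64(addr))`. As the log above shows, reading an address that the SoC cannot service can end in an SError panic, so the address has to be a valid system address first.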

Rose

yipingwang · ‎03-04-2024

My previous decode is only valid if the address is below 2 GB. In your case the address is beyond 2 GB, so please see below:

The address listed in the ECC register is relative to the DDR controller BNDS registers: 0x3_e0bf_ff00. To read this location from the SoC, you need to map it to the system memory map.

See the system memory map in the RM.

DRAM region #1 (0 to 2 GB of the memory controller) maps to 0x8000_0000 to 0xFFFF_FFFF

DRAM region #2 (2 GB to 128 GB) maps to 0x20_8000_0000 to 0x3F_FFFF_FFFF

The address 0x3_e0bf_ff00 falls in region #2 and maps to the system memory map as follows:

0x3_e0bf_ff00 - 0x8000_0000 + 0x20_8000_0000 = 0x23_e0bf_ff00
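The same mapping can be written as a small helper. This is a sketch based only on the two regions listed above, assuming the controller-side address counts from 0. The function name `controller_to_system` and the constant names are illustrative.

```python
TWO_GB = 0x8000_0000
REGION1_BASE = 0x8000_0000          # system address where DRAM region #1 starts
REGION2_BASE = 0x20_8000_0000       # system address where DRAM region #2 starts
CONTROLLER_LIMIT = 0x20_0000_0000   # 128 GB, end of region #2 on the controller side


def controller_to_system(addr):
    """Map a DDR-controller address (as reported by ECC) to a system address."""
    if not 0 <= addr < CONTROLLER_LIMIT:
        raise ValueError(f"address {addr:#x} is outside the 0 to 128 GB range")
    if addr < TWO_GB:
        # Region #1: the first 2 GB map to 0x8000_0000 .. 0xFFFF_FFFF
        return addr + REGION1_BASE
    # Region #2: the rest maps to 0x20_8000_0000 .. 0x3F_FFFF_FFFF
    return addr - TWO_GB + REGION2_BASE


print(hex(controller_to_system(0x3_e0bf_ff00)))  # 0x23e0bfff00
```

For addresses below 2 GB this reduces to the earlier "add 0x8000_0000" rule, which is why that rule gave a wrong result for 0x3_e0bf_ff00.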

yipingwang · ‎03-03-2024

I will discuss it with the AE team.