How to check that a DDR4 single-bit error has been corrected on LX2080A


RoseWen
Contributor I

Hi,

When I test ECC error injection with the EDAC driver on an LX2080A board, it prints the following debug messages:

1) Error Detect Register
2) Fault Data bit
3) Expected Data and ECC
4) Captured Data and ECC
5) Error address
6) PFN

The ce_count keeps increasing until I manually disable the single-bit error interrupt. (I asked in another question why the EDAC interrupt keeps triggering even though error injection is disabled, but have had no reply yet.)

But ce_count is the "correctable" error count. I want to know how to check that a single-bit error "has been corrected".

Also, how can I read the DDR data at the error address in Linux? (Example: Err addr: 0x3e0bfff00)

Please refer to the attached test log.


Thanks,

Rose

yipingwang
NXP TechSupport

The ECC in the LX2080A is SECDED, meaning single-bit error correction and double-bit error detection. Each single-bit error increments the SBEC (single-bit error counter) until it reaches the SBET (single-bit error threshold). Once the threshold is reached, the SBE flag in the ERR_DETECT register is set. You can detect single-bit errors by checking either the SBEC or the ERR_DETECT register.
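
On the Linux side, one way to watch these events is through the EDAC sysfs counters rather than the raw registers. The sketch below is only an illustration: it assumes the memory controller is registered as mc0 under /sys/devices/system/edac/mc/ (the index may differ on your board), and the function names are made up for this example. A ce_count that rises while ue_count stays flat is consistent with single-bit errors being reported as corrected, but it does not replace checking SBEC or ERR_DETECT.

```python
from pathlib import Path

# Assumed default location; the "mc0" index may differ on a given system.
DEFAULT_MC_DIR = "/sys/devices/system/edac/mc/mc0"


def read_edac_counts(mc_dir=DEFAULT_MC_DIR):
    """Return the correctable/uncorrectable error counters as integers."""
    base = Path(mc_dir)
    counts = {}
    for name in ("ce_count", "ue_count"):
        # Each sysfs file holds a single decimal number followed by a newline.
        counts[name] = int((base / name).read_text().strip())
    return counts


def counts_delta(before, after):
    """Return how much each counter changed between two snapshots."""
    return {name: after[name] - before[name] for name in before}
```

A typical use would be to call read_edac_counts() before and after an injection run and pass both snapshots to counts_delta().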

For multi-bit errors, only detection of any two-bit flip is guaranteed, and the MBE flag in ERR_DETECT is set. If your test flips more than two bits, the ECC may or may not detect it.
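
To make the SECDED behaviour concrete, here is a toy sketch using an extended Hamming(8,4) code. This is not the code the DDR controller actually uses; the bit layout and the function names are assumptions chosen only to illustrate "correct one flipped bit, detect two".

```python
def secded_encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0: overall parity, indices 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the total parity even
    return code


def secded_decode(code):
    """Return (status, data), status being 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # The XOR of the positions of all set bits is zero for a valid codeword.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    parity_bad = sum(code) % 2 == 1
    if syndrome == 0 and not parity_bad:
        status = "ok"
    elif parity_bad:
        # Odd number of flips, treated as a single-bit error at 'syndrome'
        # (syndrome 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped, detect only.
        status = "uncorrectable"
    data = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, data
```

With three or more flipped bits this toy decoder can miscorrect or miss the error entirely, which matches the "may or may not detect it" caveat above.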

RoseWen
Contributor I

Hi,

The information you mentioned can be found in the LX2080A datasheet, and it doesn't answer my question at all.

My question is: "How to check that the single-bit error has been corrected?"

How do I read the error address 0x3e0bfff00? Is it a physical address or a virtual address?

 

Rose

 

yipingwang
NXP TechSupport

Investigating

yipingwang
NXP TechSupport

The starting address from the ECC (memory controller) point of view is the value listed in the BNDS registers. From the SoC point of view, the DDR memory space starts at 0x8000_0000, so you need to add 0x8000_0000 to the address read from the ECC register to get the failing address.

RoseWen
Contributor I

Hi

How do I add 0x8000_0000 to the address read from the ECC register?

Rose

yipingwang
NXP TechSupport

0x3_e0bf_ff00 + 0x8000_0000 = 0x4_60bf_ff00

RoseWen
Contributor I

Hi

Reading this address via devmem causes a kernel panic, with the message below:

-----------------------------------------------------------------------------

root@localhost:/dni# devmem 0x460bfff00
[ 78.099537] SError Interrupt on CPU0, code 0xbf000002 -- SError
[ 78.099539] CPU: 0 PID: 2997 Comm: devmem Tainted: G O 5.10.35 #1
[ 78.099541] Hardware name: NXP Layerscape LX2160ARDB (DT)
[ 78.099543] pstate: 60000000 (nZCv daif -PAN -UAO -TCO BTYPE=--)
[ 78.099545] pc : 0000aaaae2190d04
[ 78.099546] lr : 0000aaaae2190c1c
[ 78.099548] sp : 0000ffffcacfee70
[ 78.099549] x29: 0000ffffcacfee70 x28: 0000000000000000
[ 78.099554] x27: 0000000000000000 x26: 0000000000000000
[ 78.099558] x25: 0000000000000000 x24: 0000000000000000
[ 78.099562] x23: 0000000000000000 x22: 0000000000000000
[ 78.099566] x21: 0000aaaae21908f0 x20: 0000000000000000
[ 78.099570] x19: 0000aaaae2190df0 x18: 0000000073516240
[ 78.099574] x17: 0000ffffad5aec50 x16: 0000aaaae21a1f98
[ 78.099577] x15: 000000006fffff47 x14: 0000000000000000
[ 78.099581] x13: 0000000000000000 x12: 0000ffffad4e5208
[ 78.099585] x11: ffffffffffffffff x10: 0000000000000010
[ 78.099589] x9 : 000000000000000f x8 : 00000000000000de
[ 78.099593] x7 : 0000ffffad5f38a8 x6 : 0000000000001000
[ 78.099597] x5 : 0000000460bff000 x4 : 0000000000000003
[ 78.099601] x3 : 0000000000000001 x2 : 0000000000000001
[ 78.099604] x1 : 0000ffffad68d000 x0 : 0000000000000008
[ 78.099609] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 78.099611] CPU: 0 PID: 2997 Comm: devmem Tainted: G O 5.10.35 #1
[ 78.099612] Hardware name: NXP Layerscape LX2160ARDB (DT)
[ 78.099614] Call trace:
[ 78.099615] dump_backtrace+0x0/0x1a8
[ 78.099617] show_stack+0x18/0x68
[ 78.099618] dump_stack+0xd0/0x12c
[ 78.099619] panic+0x16c/0x334
[ 78.099621] nmi_panic+0x8c/0x90
[ 78.099622] arm64_serror_panic+0x78/0x84
[ 78.099624] do_serror+0x38/0x98
[ 78.099625] el0_error_naked+0x14/0x1c
[ 78.099656] SMP: stopping secondary CPUs
[ 78.099658] Kernel Offset: 0x2b5cfb600000 from 0xffff800010000000
[ 78.099660] PHYS_OFFSET: 0xffff993740000000
[ 78.099661] CPU features: 0x0240022,21806008
[ 78.099663] Memory Limit: none

----------------------------------------------------------------------------

Rose

yipingwang
NXP TechSupport

My previous decode is only valid if the address is below 2 GB. In your case the address is beyond 2 GB, so please examine the following.

The address listed in the ECC register comes from the DDR controller BNDS registers: 0x3_e0bf_ff00. To read this location from the SoC, you need to map it to the system memory map.

See the system memory map in the RM:

DRAM region #1 (0 to 2 GB of the memory controller) maps to 0x8000_0000 to 0xFFFF_FFFF
DRAM region #2 (2 GB to 128 GB) maps to 0x20_8000_0000 to 0x3F_FFFF_FFFF

The address 0x3_e0bf_ff00 falls in region #2 and maps to the system memory map as follows:

0x3_e0bf_ff00 - 0x8000_0000 + 0x20_8000_0000 = 0x23_e0bf_ff00
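
The same calculation can be written as a small helper so it is easy to repeat for other error addresses. This is only a sketch of the arithmetic above: the region boundaries are taken from this reply, the function and constant names are made up for illustration, and it assumes the controller address starts at 0 (check your BNDS settings and the RM before relying on it).

```python
REGION1_SIZE = 0x8000_0000       # first 2 GB of controller address space
REGION1_BASE = 0x8000_0000       # DRAM region #1 start in the system map
REGION2_BASE = 0x20_8000_0000    # DRAM region #2 start in the system map
REGION2_END = 0x3F_FFFF_FFFF     # DRAM region #2 end in the system map


def ddr_to_system_address(ddr_addr):
    """Map a DDR controller address (as captured by ECC) to a system address."""
    if ddr_addr < 0:
        raise ValueError("address must be non-negative")
    if ddr_addr < REGION1_SIZE:
        # Region #1: the first 2 GB sit directly above 0x8000_0000.
        return REGION1_BASE + ddr_addr
    # Region #2: everything above 2 GB is relocated to 0x20_8000_0000.
    sys_addr = ddr_addr - REGION1_SIZE + REGION2_BASE
    if sys_addr > REGION2_END:
        raise ValueError("address is beyond DRAM region #2")
    return sys_addr
```

For the address in this thread, hex(ddr_to_system_address(0x3e0bfff00)) gives 0x23e0bfff00, which is the value you would then try with devmem.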

yipingwang
NXP TechSupport

I will discuss it with the AE team.
