ls1046a kernel panic in xhci_irq. How do I parse NXP custom ISS field in esr_el1?

AbelianMeme
Contributor III

I am trying to resolve a problem I'm having with a kernel panic on the LS1046a. Our custom board has a PCIe switch and 12 PCIe slots, with each PCIe slot containing an ASM3142 dual USB host controller.

Intermittently, I get kernel panics in routines belonging to the Linux xHCI driver. They always occur in one of two places, either xhci_irq or xhci_handshake, and in both cases the panic is reported on the instruction:

dmb oshld

This is a data memory barrier instruction, and the panic is caused by an SError (shown below).  I have no idea what could possibly cause an SError at this instruction, as there were no obvious illegal memory accesses prior to this. I am hoping the ESR_EL1 register may offer some clues, but I don't know how to parse it.

According to the AArch64-Registers document on developer.arm.com, the ESR_EL1 register value 0xbf000002 can be parsed as:

(bits 31:26) ESR_EL1.EC = 0b101111   ==>  SError

(bit 25)  ESR_EL1.IL = 0b1 (32 bit instruction)

(bit 24)  ESR_EL1.IDS = 0b1 (Implementation defined)

Then, ISS bits 23:0 are an implementation defined value and I can't find any documentation that explains NXP's custom codes for the arm64 implementation in the LS1046a.
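
To make the field split above easier to check, here is a minimal Python sketch that extracts the same bit ranges from the raw value. The function name `decode_esr` and the dictionary keys are only illustrative; the bit positions are the ones listed above.

```python
def decode_esr(esr: int) -> dict:
    """Split a 32-bit ESR_ELx value into the fields discussed above."""
    return {
        "EC": (esr >> 26) & 0x3F,    # bits 31:26, exception class
        "IL": (esr >> 25) & 0x1,     # bit 25, instruction length
        "IDS": (esr >> 24) & 0x1,    # bit 24, set when the syndrome is implementation defined
        "ISS_23_0": esr & 0xFFFFFF,  # bits 23:0, implementation defined here
    }


fields = decode_esr(0xBF000002)
for name, value in fields.items():
    print(f"{name} = {value:#x}")
```

For 0xbf000002 this gives EC = 0x2f (0b101111), IL = 1, IDS = 1 and a remaining ISS value of 0x2.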

 

Does anyone know how to interpret the ISS value 0x000002 on an SError exception for  the LS1046a? Or where I can download something that documents this? Importantly, does this offer any assistance at all in figuring out why a data memory barrier command would generate such an exception?

Alternatively, does anyone have any clue at all what might be happening here?  I'm stumped on what could be the problem.

 

Any assistance is appreciated.

 

------------------------------

[ 353.446849] SError Interrupt on CPU0, code 0x00000000bf000002 -- SError
[ 353.446857] CPU: 0 PID: 185 Comm: kworker/0:3 Tainted: G O 6.1.41-devel #49
[ 353.446862] Hardware name: LS1046A Copier Board (DT)
[ 353.446864] Workqueue: events xhci_handle_command_timeout
[ 353.446873] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 353.446877] pc : xhci_handshake+0x68/0x110
[ 353.446884] lr : xhci_handshake+0x60/0x110
[ 353.446887] sp : ffffffc00adf3b40
[ 353.446889] x29: ffffffc00adf3b40 x28: 0000000000000000 x27: 0000000000000000
[ 353.446895] x26: 0000000000000000 x25: ffffffc00b379210 x24: 0000000000000000
[ 353.446899] x23: 0000000000000000 x22: 0000000000000008 x21: ffffffc00b360038
[ 353.446904] x20: 00000053693ee8a8 x19: 00000000004c4b40 x18: ffffffc00e463c88
[ 353.446909] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000001
[ 353.446913] x14: 00000000000003bb x13: 0000000000000000 x12: 0000000000000000
[ 353.446917] x11: 0000000000000000 x10: 00000000000009e0 x9 : ffffffc00adf3d70
[ 353.446922] x8 : ffffff8801c7fb40 x7 : fefefefefefefeff x6 : 000000023074d81b
[ 353.446926] x5 : 00ffffffffffffff x4 : 002e7ddb00000000 x3 : 0000000000000018
[ 353.446931] x2 : 0000000000000000 x1 : ffffffc00adf3b00 x0 : 0000000000000000
[ 353.446936] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 353.446938] CPU: 0 PID: 185 Comm: kworker/0:3 Tainted: G O 6.1.41-devel ##4
[ 353.446942] Hardware name: LS1046A Copier Board (DT)
[ 353.446943] Workqueue: events xhci_handle_command_timeout
[ 353.446947] Call trace:
[ 353.446948] dump_backtrace+0xf0/0x130
[ 353.446955] show_stack+0x18/0x28
[ 353.446959] dump_stack_lvl+0x68/0x84
[ 353.446965] dump_stack+0x18/0x34
[ 353.446970] panic+0x1a0/0x998
[ 353.446974] nmi_panic+0xac/0xb0
[ 353.446979] arm64_serror_panic+0x64/0x78
[ 353.446982] do_serror+0x34/0x80
[ 353.446984] el1h_64_error_handler+0x34/0x50
[ 353.446987] el1h_64_error+0x64/0x68
[ 353.446990] xhci_handshake+0x68/0x110
[ 353.446994] xhci_handle_command_timeout+0x180/0x5c0
[ 353.446997] process_one_work+0x1fc/0x350
[ 353.447001] worker_thread+0x44/0x440
[ 353.447004] kthread+0xf8/0x110
[ 353.447007] ret_from_fork+0x10/0x20
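
When comparing several of these panics, it can help to pull the syndrome code and the reported pc out of each log automatically. The following is a rough sketch that assumes the log lines look like the ones above; the regular expressions and the name `summarize_panic` are my own and not part of any kernel tooling. Note that the kernel reports this as an asynchronous SError, so the pc is where the exception was taken, which is not necessarily the access that caused it.

```python
import re

# Patterns written against the log format shown above (an assumption).
SERROR_RE = re.compile(r"SError Interrupt on CPU(\d+), code (0x[0-9a-fA-F]+)")
PC_RE = re.compile(r"\bpc : (\w+)\+(0x[0-9a-fA-F]+)/(0x[0-9a-fA-F]+)")


def summarize_panic(log: str) -> dict:
    """Extract the CPU, the ESR code and the pc symbol+offset from a panic log."""
    summary = {}
    match = SERROR_RE.search(log)
    if match:
        summary["cpu"] = int(match.group(1))
        summary["esr"] = int(match.group(2), 16)
    match = PC_RE.search(log)
    if match:
        summary["pc_symbol"] = match.group(1)
        summary["pc_offset"] = int(match.group(2), 16)
    return summary


sample = """[ 353.446849] SError Interrupt on CPU0, code 0x00000000bf000002 -- SError
[ 353.446877] pc : xhci_handshake+0x68/0x110
"""
print(summarize_panic(sample))
```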

 

1 Solution
yipingwang
NXP TechSupport

Refer to the "ARM Cortex-A72 MPCore Processor" document for the interpretation of the "ISS" field in an SError exception.

ISS=2 means "Slave Error". It might be related to a PCIe error.

Please comment out "ls_pcie_fix_error_response" in "drivers/pci/controller/dwc/pci-layerscape.c", run the test again, and then share the crash log.
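
As a small illustration of the ISS=2 interpretation above, the lookup could be sketched like this in Python. Only the value 2 ("Slave Error") comes from this reply; the table `KNOWN_ISS` and the function `describe_serror_iss` are hypothetical names, and any other value is deliberately reported as unknown so that it gets checked against the Cortex-A72 document rather than guessed.

```python
# Only ISS == 2 is taken from the reply above; other encodings are
# intentionally left out and must be looked up in the Cortex-A72 document.
KNOWN_ISS = {0x000002: "Slave Error"}


def describe_serror_iss(esr: int) -> str:
    """Describe ISS bits 23:0 of an SError syndrome value."""
    iss = esr & 0xFFFFFF
    if iss in KNOWN_ISS:
        return KNOWN_ISS[iss]
    return f"unknown ISS 0x{iss:06x} (check the Cortex-A72 document)"


print(describe_serror_iss(0xBF000002))
```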




yipingwang
NXP TechSupport

We are discussing this with the AE team and will provide an update later.
