I am trying to resolve a problem I'm having with a kernel panic on the LS1046a. Our custom board has a PCIe switch and 12 PCIe slots, with each PCIe slot containing an ASM3142 dual USB host controller.
Intermittently, I receive kernel panics from routines related to the Linux XHCI implementation. They always seem to occur in one of two places, either the xhci_irq routine or the xhci_handshake routine, and in both cases the panic occurs on the instruction:
dmb oshld
This is a data memory barrier instruction, and the panic is caused by an SError (shown below). I have no idea what could possibly cause an SError at this instruction, as there were no obvious illegal memory accesses prior to this. I am hoping the ESR_EL1 register may offer some clues, but I don't know how to parse it.
According to the AArch64-Registers document on developer.arm.com, the ESR_EL1 register value 0xbf000002 can be parsed as:
(bits 31:26) ESR_EL1.EC = 0b101111 ==> SError
(bit 25) ESR_EL1.IL = 0b1 (32 bit instruction)
(bit 24) ESR_EL1.IDS = 0b1 (Implementation defined)
The remaining ISS bits 23:0 are then an implementation defined value, and I can't find any documentation that explains NXP's custom codes for the arm64 implementation in the LS1046a.
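For anyone who wants to repeat the field split above on other ESR values, here is a minimal Python sketch. The function name decode_esr_el1 and the dictionary keys are my own illustrative choices, not anything from the kernel or NXP documentation; it only extracts the bit ranges listed above and does not interpret the implementation defined part.

```python
def decode_esr_el1(esr: int) -> dict:
    """Split a 32-bit ESR_EL1 value into the bit fields listed above.

    The names here are illustrative only; the bit positions follow the
    breakdown given in the post (EC 31:26, IL 25, bit 24, then 23:0).
    """
    return {
        "ec": (esr >> 26) & 0x3F,    # bits 31:26, exception class
        "il": (esr >> 25) & 0x1,     # bit 25, instruction length
        "ids": (esr >> 24) & 0x1,    # bit 24, implementation defined syndrome flag
        "impl_iss": esr & 0xFFFFFF,  # bits 23:0, implementation defined when ids == 1
    }


fields = decode_esr_el1(0xBF000002)
# Show each field in hex so it can be compared with the manual breakdown.
print({name: hex(value) for name, value in fields.items()})
```

For 0xbf000002 this gives EC = 0x2f (0b101111), IL = 1, bit 24 = 1, and 0x2 in bits 23:0, which matches the manual parse.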
Does anyone know how to interpret the ISS value 0x000002 on an SError exception for the LS1046a? Or where I can download something that documents this? Importantly, does this offer any assistance at all in figuring out why a data memory barrier instruction would generate such an exception?
Alternatively, does anyone have any clue at all what might be happening here? I'm stumped on what could be the problem.
Any assistance is appreciated.
------------------------------
[ 353.446849] SError Interrupt on CPU0, code 0x00000000bf000002 -- SError
[ 353.446857] CPU: 0 PID: 185 Comm: kworker/0:3 Tainted: G O 6.1.41-devel #49
[ 353.446862] Hardware name: LS1046A Copier Board (DT)
[ 353.446864] Workqueue: events xhci_handle_command_timeout
[ 353.446873] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 353.446877] pc : xhci_handshake+0x68/0x110
[ 353.446884] lr : xhci_handshake+0x60/0x110
[ 353.446887] sp : ffffffc00adf3b40
[ 353.446889] x29: ffffffc00adf3b40 x28: 0000000000000000 x27: 0000000000000000
[ 353.446895] x26: 0000000000000000 x25: ffffffc00b379210 x24: 0000000000000000
[ 353.446899] x23: 0000000000000000 x22: 0000000000000008 x21: ffffffc00b360038
[ 353.446904] x20: 00000053693ee8a8 x19: 00000000004c4b40 x18: ffffffc00e463c88
[ 353.446909] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000001
[ 353.446913] x14: 00000000000003bb x13: 0000000000000000 x12: 0000000000000000
[ 353.446917] x11: 0000000000000000 x10: 00000000000009e0 x9 : ffffffc00adf3d70
[ 353.446922] x8 : ffffff8801c7fb40 x7 : fefefefefefefeff x6 : 000000023074d81b
[ 353.446926] x5 : 00ffffffffffffff x4 : 002e7ddb00000000 x3 : 0000000000000018
[ 353.446931] x2 : 0000000000000000 x1 : ffffffc00adf3b00 x0 : 0000000000000000
[ 353.446936] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 353.446938] CPU: 0 PID: 185 Comm: kworker/0:3 Tainted: G O 6.1.41-devel ##4
[ 353.446942] Hardware name: LS1046A Copier Board (DT)
[ 353.446943] Workqueue: events xhci_handle_command_timeout
[ 353.446947] Call trace:
[ 353.446948] dump_backtrace+0xf0/0x130
[ 353.446955] show_stack+0x18/0x28
[ 353.446959] dump_stack_lvl+0x68/0x84
[ 353.446965] dump_stack+0x18/0x34
[ 353.446970] panic+0x1a0/0x998
[ 353.446974] nmi_panic+0xac/0xb0
[ 353.446979] arm64_serror_panic+0x64/0x78
[ 353.446982] do_serror+0x34/0x80
[ 353.446984] el1h_64_error_handler+0x34/0x50
[ 353.446987] el1h_64_error+0x64/0x68
[ 353.446990] xhci_handshake+0x68/0x110
[ 353.446994] xhci_handle_command_timeout+0x180/0x5c0
[ 353.446997] process_one_work+0x1fc/0x350
[ 353.447001] worker_thread+0x44/0x440
[ 353.447004] kthread+0xf8/0x110
[ 353.447007] ret_from_fork+0x10/0x20
Refer to the "ARM Cortex-A72 MPCore Processor" document for the interpretation of the "ISS" field in an SError exception.
ISS=2 means "Slave Error". It might be related to a PCIe error.
Please comment out "ls_pcie_fix_error_response" in "drivers/pci/controller/dwc/pci-layerscape.c", run the test again, and then share the crash log.
I am discussing this with the AE team and will provide an update later.