T1042 e5500: PCI target abort freezes CPU core

antoinedurand · 11-12-2018

Hi,

On a T1042 target (Linux), I talk to a PCI endpoint (EP). This device is a bridge to a legacy, old-school bus. It may generate a PCI target abort in reaction to a Bus ERRor on the underlying bus. This has to be handled in my application. In my context, all PCI transactions to this device are non-posted.

When it occurs, the core on which the PCI device driver was running freezes on the load instruction (the load that ended in the PCI target abort).

I've found many discussions about what look like similar issues on P2020, MPC85xx, etc.

And some Linux kernel patches trying to handle that.

Discussions:

PCIe errors causes CPU to crash

Freescale P2020 CPU Freeze over PCIe abort signal

And this (never accepted) patch, which did not help me:

powerpc/fsl: Add support for pci(e) machine check exception on E500MC / E5500 - Patchwork

Following this one:

[2/2,V8] powerpc/85xx: Add machine check handler to fix PCIe erratum on mpc85xx - Patchwork

Does anybody know whether the MPC85xx erratum mentioned in this last link still applies to the e5500?

What I can see is that the FSL PCI EDAC driver's interrupt handler runs when the "PEX PCIe RC logic" detects the target abort. So the path from the PCI device (EP) to the T1042 SoC (through a PCIe switch and a PCIe/PCI bridge) seems to behave correctly.

I don't think any machine check exception handler runs at all, because I don't see anything on the console, even after adding printk() calls to the relevant functions in arch/powerpc/kernel/traps.c.

And I read 0 for all cores in the "machine check" entry of /proc/interrupts.
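
A minimal sketch for checking this count programmatically instead of reading it by eye. It assumes the usual PowerPC layout, where /proc/interrupts has a line labelled "MCE:" with one counter per online CPU followed by a description. sum_mce_counts is a hypothetical helper name, not a kernel or libc function.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sum the per-CPU counters on the "MCE:" line of a /proc/interrupts dump.
 * Assumption: the line looks like
 *   "MCE:     0     0     0     0   Machine check exceptions"
 * Returns the total, or -1 if no "MCE:" line is present.
 */
long sum_mce_counts(const char *text)
{
    const char *line = text;

    while (line && *line) {
        const char *p = line;

        /* Labels are right-aligned, so skip leading blanks. */
        while (*p == ' ' || *p == '\t')
            p++;

        if (strncmp(p, "MCE:", 4) == 0) {
            long total = 0;

            p += 4;
            for (;;) {
                char *end;

                while (*p == ' ' || *p == '\t')
                    p++;
                /* Stop at the description text or end of line. */
                if (!isdigit((unsigned char)*p))
                    break;
                total += (long)strtoul(p, &end, 10);
                p = end;
            }
            return total;
        }

        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}
```

Given the contents of /proc/interrupts read into a NUL-terminated buffer, a result of 0 corresponds to the observation above: no machine check was counted on any core.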

But sometimes I get a "bad kernel stack pointer" message on the console, which is weird because it must come from exception handling.

Does anyone know whether this is a known issue?

Is there any related erratum that applies to the whole e500 family, including the e5500?

The problem occurs with linux-4.1.8 and linux-4.19.1 (for example), with or without the FSL MPC85xx EDAC and AER drivers active.

Other symptoms:

The CPU is reported as stalled by the other cores on the console. No more jiffies are counted for the frozen core in /proc/stat.
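
The missing jiffies can be spotted by comparing two snapshots of /proc/stat taken a few seconds apart. The sketch below is only an illustration: cpu_total_jiffies and cpu_is_stalled are hypothetical helper names, and it assumes the standard "cpu<N> user nice system idle ..." line format.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sum every time field on the "cpu<N>" line of a /proc/stat dump.
 * Returns -1 if the line for that CPU is not found.
 */
long long cpu_total_jiffies(const char *stat_text, int cpu)
{
    char tag[16];
    const char *line = stat_text;
    /* Trailing space so that "cpu1 " does not match "cpu10 ". */
    int taglen = snprintf(tag, sizeof(tag), "cpu%d ", cpu);

    while (line && *line) {
        if (strncmp(line, tag, (size_t)taglen) == 0) {
            const char *p = line + taglen;
            long long total = 0;

            for (;;) {
                char *end;

                while (*p == ' ')
                    p++;
                /* Stop at end of line or any non-numeric text. */
                if (*p < '0' || *p > '9')
                    break;
                total += strtoll(p, &end, 10);
                p = end;
            }
            return total;
        }
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}

/*
 * A core whose total did not move between two snapshots accumulated no
 * time at all (not even idle time), which is the symptom described above.
 */
int cpu_is_stalled(const char *before, const char *after, int cpu)
{
    long long a = cpu_total_jiffies(before, cpu);
    long long b = cpu_total_jiffies(after, cpu);

    return a >= 0 && b >= 0 && a == b;
}
```

A live core, even a fully idle one, should normally show a growing total between snapshots, so an unchanged total points at the frozen core.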

What I would expect is that a dedicated exception handler aborts the load instruction, so that the core resumes and the application can deal with the problem.

Regards.

Thanks

liuxiang_1999 · 12-20-2018

Hi,
Did you solve the problem?
My P2020 board has a similar problem. On my board, the P2020 is connected to a Broadcom switch.
Sometimes the CPU core stalls when transferring data over the PCI bus, but it is very
hard to reproduce. How do you produce the PCI abort situation?

By the way, I found this in the e5500 datasheet:
Note that there is no MCSR error status bit for CoreNet data errors. If a CoreNet data error occurs on a
load or instruction fetch and the instruction reaches the bottom of the completion buffer, an error report
occurs. But, because there is no MCSR error status bit for data errors, the core does not generate an
asynchronous machine check. The device that detects the error is expected to report it. For example,
assume that the core attempts to perform a load from a PCI device that encounters an error. The PCI device
would signal a “PCI Master Abort” and would signal the error to the programmable interrupt controller
(PIC).
The core's memory transaction should be completed with a data error so that the core is not hung awaiting
the transaction. Eventually, the PIC should interrupt the core (the PIC should be programmed to direct such
an error to take a machine check interrupt).
Error reports are intended to be a mechanism to stop the propagation of bad data; the asynchronous
machine check is intended to allow software to attempt to recover from errors gracefully.

So by default a PCI error cannot generate a machine check; you may need to program the PIC to do so.

I hope this is helpful for you.

antoinedurand · 11-19-2018

The first and last points of my previous message are not a Linux version issue: the problem was that CONFIG_IRQSOFF_TRACER was enabled on 4.19.1 (and not on 3.12.19)!

So I now get a single PCI EDAC interrupt when a "PCI Target Abort" occurs, and a "Machine Check Exception in kernel mode" with any Linux kernel.

The goal now is to try to recover in fsl_pci_mcheck_exception() ...

antoinedurand · 11-14-2018

Hi,

Actually, there are multiple problems to solve, one by one:

First, on linux-3.12.19 (3.12.19-rt30-QorIQ-SDK-V1.7), a PCI target abort ends in machine_check_exception(). On linux-4.19.1, it ends "somewhere" (I didn't understand where; maybe in an exception that fails before reaching machine_check_exception()). I will diff arch/powerpc/kernel/traps.c and exceptions-64e.S to try to get machine_check_exception() called on linux-4.19.1 too.

Then, once machine_check_exception() is called, the patch mentioned above can be tried: "powerpc/fsl: Add support for pci(e) machine check exception on E500MC / E5500 - Patchwork". Still, this patch is not fully valid. As discussed in the link, the e5500 core does not report the fault address in the MCAR register, and the faulting address is not always in the SPRN_DEAR register either! (I've checked that myself on linux-3.12.19: we can't rely on it; sometimes the ioread() address is there, sometimes not.) So maybe there is no good solution to the problem.
In the exception handler, with half of the patch applied (to get into fsl_pci_mcheck_exception()), I will try to use my PCI device driver to check in the device's registers whether it sent back a PCI target abort in a previous coupled (non-posted) transaction. If so, I clear this flag, consider the exception as coming from this PCI device, and recover (the userspace application will clean up and exit); otherwise, I continue with the machine check exception (Guarded Load Error, caused by MCSR=a000, Load Error Report). Obviously this will be very specific to my application and not fully robust...
Last, a separate, parallel problem has to be fixed: on a PCI target abort, the PEX IRQ (which is a level interrupt) stays high forever, even with the EDAC PCI driver active. The EDAC PCI IRQ handler clears the flag (the PCAC bit in the PEXx_PEX_ERR_DR register), but that is not enough to deassert the IRQ line, so the IRQ handler is called continuously (or, if EDAC is not active, it ends in 'Disabling IRQ' after 100000 interrupts with no handler taking care of them).
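
The recovery decision described above (check the bridge for a latched target abort, clear it and recover, otherwise fall through to the normal machine check path) could be sketched roughly as below. This is a userspace model under stated assumptions, not the patch's code: struct bridge_regs, BRIDGE_STAT_TARGET_ABORT and mcheck_is_bridge_target_abort are hypothetical names, and the bridge's status register is modelled as a plain variable. The two MCSR bit values are as I understand the e500mc definitions in Linux's arch/powerpc/include/asm/reg_booke.h and should be verified against your kernel tree; they are at least consistent with MCSR=a000 being reported as both a load and a guarded load error report.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed e500mc/e5500 MCSR bits (verify in asm/reg_booke.h). */
#define MCSR_LD   0x00008000u /* load error report */
#define MCSR_LDG  0x00002000u /* guarded load error report */

/* Hypothetical bridge status bit: "target abort signalled". */
#define BRIDGE_STAT_TARGET_ABORT 0x0001u

/* Hypothetical stand-in for the bridge's memory-mapped status register. */
struct bridge_regs {
    uint16_t status;
};

/*
 * Decide whether a machine check can be attributed to a target abort
 * signalled by our bridge.
 *
 * Returns true (and clears the latched flag) if the MCSR reports a load
 * error and the bridge has its target-abort flag set; the caller would
 * then recover instead of treating the machine check as fatal.
 * Returns false otherwise, so the normal machine check path continues.
 */
bool mcheck_is_bridge_target_abort(uint32_t mcsr, struct bridge_regs *regs)
{
    /* Only load error reports are candidates for this recovery. */
    if (!(mcsr & (MCSR_LD | MCSR_LDG)))
        return false;

    /* The bridge must have latched a target abort. */
    if (!(regs->status & BRIDGE_STAT_TARGET_ABORT))
        return false;

    /* Clear the flag so the next machine check is not misattributed. */
    regs->status &= (uint16_t)~BRIDGE_STAT_TARGET_ABORT;
    return true;
}
```

In a real handler the status read would itself be an access to the device, and, as noted above, the decision stays specific to one device: a load error from any other source that happens while the flag is set would be wrongly attributed to the bridge.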

If anyone can think of a point I could have missed...

Thank you,

Regards.

antoinedurand · 11-13-2018

I tried to debug with KGDB, but it freezes the same way and leaves no chance to understand what is happening.

powerpc64-unknown-linux-gnu-gdb ./vmlinux
GNU gdb (crosstool-NG crosstool-ng-1.22.0) 7.10
Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "--host=x86_64-build_pc-linux-gnu --target=powerpc64-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./vmlinux...done.
(gdb) target remote /dev/tty_dgrp_a_1
Remote debugging using /dev/tty_dgrp_a_1
Ignoring packet error, continuing...
0xc0000000000fd0e0 in arch_kgdb_breakpoint () at kernel/debug/debug_core.c:1071
1071        wmb(); /* Sync point before breakpoint */
(gdb) b ioread16be
Breakpoint 1 at 0xc000000000027230: file ./arch/powerpc/include/asm/io.h, line 169.
(gdb) c
Continuing.
[New Thread 1403]
[Switching to Thread 1403]
Breakpoint 1, ioread16be (addr=0x8000080090113000) at arch/powerpc/kernel/iomap.c:28
28        return readw_be(addr);
(gdb) s
readw_be (addr=<optimized out>) at arch/powerpc/kernel/iomap.c:28
28        return readw_be(addr);
(gdb) s
in_be16 (addr=<optimized out>) at ./arch/powerpc/include/asm/io.h:169
169    DEF_MMIO_IN_D(in_be16, 16, lhz);
(gdb) l
164    
165    DEF_MMIO_IN_D(in_8,     8, lbz);
166    DEF_MMIO_OUT_D(out_8,   8, stb);
167    
168    #ifdef __BIG_ENDIAN__
169    DEF_MMIO_IN_D(in_be16, 16, lhz);
170    DEF_MMIO_IN_D(in_be32, 32, lwz);
171    DEF_MMIO_IN_X(in_le16, 16, lhbrx);
172    DEF_MMIO_IN_X(in_le32, 32, lwbrx);
173    
(gdb) s

It freezes here, forever!

Same behavior in all the configurations I tried:

with or without multi-core

with or without the PREEMPT-RT patch applied and active