On a T1042 target (Linux), I talk to a PCI endpoint (EP). This device is a bridge to a legacy, old-school bus. It may generate a PCI target abort in reaction to a bus error on the underlying bus. This has to be handled in my application. In my context, all PCI transactions to this device are non-posted.
When the abort occurs, the core on which the PCI device driver was running freezes on the load instruction (the load that ended in the PCI target abort).
I've found many discussions about what look like similar issues on P2020, mpc85xx, etc.
And some Linux kernel patches trying to handle that:
And this (never accepted) patch (it did not help me):
Following this one:
Does anybody know if the mpc85xx erratum mentioned in this last link is still applicable to the e5500?
What I can see is that the fsl PCI EDAC driver's interrupt handler runs when the "PEX pcie RC logic" detects the target abort. So the path from the PCI EP device to the T1042 SoC (through the PCIe switch and the PCIe/PCI bridge) seems to behave correctly.
I don't think any machine check exception handler runs at all, because I don't see anything on the console, even after spreading printk() calls through the relevant functions in arch/powerpc/kernel/traps.c.
And I read 0 for all cores in the "machine check" entry of /proc/interrupts.
But sometimes I get a "bad kernel stack pointer" on the console, which is weird because it must come from exception handling.
Does anyone know whether this is a known issue?
Is there any related erratum that applies to the whole e500 family, including the e5500?
The problem occurs with linux-4.1.8 and linux-4.19.1 (for example), with or without the fsl mpc85xx EDAC and AER drivers active.
The CPU is reported as stalled by the other cores on the console. No more jiffies are counted for the frozen core in /proc/stat.
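To make the /proc/stat observation easier to check from a script or a watchdog, here is a minimal sketch in C that sums the tick counters on one per-CPU line. The function name cpu_line_ticks is hypothetical, and the idea is only an assumption about how one might monitor this: sample each "cpuN" line twice, a short interval apart, and treat a core whose total did not move as a candidate for the freeze described above.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: sum all tick counters on one per-CPU line of
 * /proc/stat (e.g. "cpu1 10 0 5 100 ...").
 * Stores the CPU index in *cpu and returns the total, or returns -1 if
 * the line is not a per-CPU line (the aggregate "cpu " line is rejected).
 */
static long long cpu_line_ticks(const char *line, int *cpu)
{
    int consumed = 0;
    long long total = 0;
    long long value;

    /* Require "cpu" immediately followed by a digit. */
    if (strncmp(line, "cpu", 3) != 0 || !isdigit((unsigned char)line[3]))
        return -1;
    line += 3;

    /* Read the CPU index and remember how many characters it used. */
    if (sscanf(line, "%d%n", cpu, &consumed) != 1)
        return -1;
    line += consumed;

    /* Add up every remaining numeric field on the line. */
    while (sscanf(line, "%lld%n", &value, &consumed) == 1) {
        total += value;
        line += consumed;
    }
    return total;
}
```

A caller would read /proc/stat with fgets(), pass each line to this function, and compare the totals of two samples per core.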
What I would expect is that a dedicated exception handler aborts the load instruction, so that the core resumes and the application can deal with the problem.
Did you solve the problem?
My P2020 board has a similar problem. On my board, the P2020 is connected to a Broadcom switch.
Sometimes the CPU core stalls when transferring data over the PCI bus. But it is very
hard to reproduce. How do you produce the PCI abort situation?
By the way, I found this in the e5500 datasheet:
Note that there is no MCSR error status bit for CoreNet data errors. If a CoreNet data error occurs on a
load or instruction fetch and the instruction reaches the bottom of the completion buffer, an error report
occurs. But, because there is no MCSR error status bit for data errors, the core does not generate an
asynchronous machine check. The device that detects the error is expected to report it. For example,
assume that the core attempts to perform a load from a PCI device that encounters an error. The PCI device
would signal a “PCI Master Abort” and would signal the error to the programmable interrupt controller
The core's memory transaction should be completed with a data error so that the core is not hung awaiting
the transaction. Eventually, the PIC should interrupt the core (the PIC should be programmed to direct such
an error to take a machine check interrupt).
Error reports are intended to be a mechanism to stop the propagation of bad data; the asynchronous
machine check is intended to allow software to attempt to recover from errors gracefully.
So by default a PCI error cannot generate a machine check; you may need to program the PIC to do it.
I hope this is helpful for you.
The first and last bullets of my previous message are not a Linux version issue: the problem was that CONFIG_IRQSOFF_TRACER was enabled on 4.19.1 (and not on 3.12.19)!
So I now get a single PCI EDAC interrupt when a "PCI target Abort" occurs
and a "Machine Check Exception in kernel Mode" with any Linux kernel.
The goal now is to try to recover in fsl_pci_mcheck_exception() ...
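For reference, this is the kind of recovery I have in mind, written as a standalone C sketch rather than real kernel code. It assumes the handler already knows the faulting address lies in PCI memory space and has fetched the instruction word at NIP. The names struct fake_regs and emulate_aborted_load are hypothetical stand-ins (the real handler works on struct pt_regs), and only three D-form loads are decoded. The idea is to fill the destination register with all-ones, as a failed PCI read would return, and step over the load so execution can resume.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the kernel's struct pt_regs. */
struct fake_regs {
    uint64_t gpr[32]; /* general purpose registers */
    uint64_t nip;     /* address of the faulting instruction */
};

/* Primary opcodes (bits 0..5) of the D-form loads handled here. */
enum { OP_LWZ = 32, OP_LBZ = 34, OP_LHZ = 40 };

/*
 * If inst is a supported load, put an all-ones value of the access size
 * in its destination register and advance nip past the instruction.
 * Returns true when the load was emulated, false otherwise.
 */
static bool emulate_aborted_load(struct fake_regs *regs, uint32_t inst)
{
    unsigned int opcode = inst >> 26;       /* primary opcode */
    unsigned int rd = (inst >> 21) & 0x1f;  /* destination register */

    switch (opcode) {
    case OP_LWZ:
        regs->gpr[rd] = 0xffffffffu;
        break;
    case OP_LHZ:
        regs->gpr[rd] = 0xffffu;
        break;
    case OP_LBZ:
        regs->gpr[rd] = 0xffu;
        break;
    default:
        return false; /* not a load this sketch knows how to skip */
    }

    regs->nip += 4; /* PowerPC instructions are 4 bytes long */
    return true;
}
```

With this, the driver would see 0xffff come back from ioread16be() and could treat it as an error marker, provided the device never returns that value legitimately.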
Actually there are multiple problems to solve one by one :
If anyone can think of a point I may have missed...
I tried to debug with KGDB, but it freezes the same way and leaves no chance to understand what is happening.
powerpc64-unknown-linux-gnu-gdb ./vmlinux
GNU gdb (crosstool-NG crosstool-ng-1.22.0) 7.10
Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "--host=x86_64-build_pc-linux-gnu --target=powerpc64-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./vmlinux...done.
(gdb) target remote /dev/tty_dgrp_a_1
Remote debugging using /dev/tty_dgrp_a_1
Ignoring packet error, continuing...
0xc0000000000fd0e0 in arch_kgdb_breakpoint () at kernel/debug/debug_core.c:1071
1071    wmb(); /* Sync point before breakpoint */
(gdb) b ioread16be
Breakpoint 1 at 0xc000000000027230: file ./arch/powerpc/include/asm/io.h, line 169.
(gdb) c
Continuing.
[New Thread 1403]
[Switching to Thread 1403]
Breakpoint 1, ioread16be (addr=0x8000080090113000) at arch/powerpc/kernel/iomap.c:28
28      return readw_be(addr);
(gdb) s
readw_be (addr=<optimized out>) at arch/powerpc/kernel/iomap.c:28
28      return readw_be(addr);
(gdb) s
in_be16 (addr=<optimized out>) at ./arch/powerpc/include/asm/io.h:169
169     DEF_MMIO_IN_D(in_be16, 16, lhz);
(gdb) l
164
165     DEF_MMIO_IN_D(in_8, 8, lbz);
166     DEF_MMIO_OUT_D(out_8, 8, stb);
167
168     #ifdef __BIG_ENDIAN__
169     DEF_MMIO_IN_D(in_be16, 16, lhz);
170     DEF_MMIO_IN_D(in_be32, 32, lwz);
171     DEF_MMIO_IN_X(in_le16, 16, lhbrx);
172     DEF_MMIO_IN_X(in_le32, 32, lwbrx);
173
(gdb) s

It freezes here, forever!
The behavior is the same in every configuration I tried:
with or without multi-core
with or without PREEMPT-RT patch applied and active