
t1042 PCIe errors at high speed

Question asked by THORSTEN POHLMANN on Sep 21, 2016
Latest reply on Mar 15, 2017 by Ji Hyun Yoon

Hi!

We are using a T1042E with the NXP Linux SDK 2.0 (also tried with SDK 1.9).

 

We have an Altera EP4CGX22 FPGA attached via PCIe (the FPGA is the EP, the CPU is the RC). The problem occurs on different boards and with FPGA test cards from different vendors.

Two memory blocks on the PPC are allocated with dma_alloc_coherent() (128 kB), and the DMA handles are stored in the FPGA. For now the FPGA does nothing more than read 2 kB from the CPU's memory block #1 and write the data to block #2. There are no IRQs, and the Linux driver does nothing except start the FPGA.

 

Everything works fine, and data is read and stored correctly, as long as there is a hardcoded delay of about 2 microseconds between successive 256-byte burst writes from the FPGA.

 

Without that artificial delay, the following happens immediately after telling the FPGA to start (by writing to a BAR-mapped flag).

(It only happens when the FPGA writes to CPU memory; reads run as fast as possible without any problem.)

 

PCIe error(s) detected
PCIe ERR_DR register: 0x00800000              <- This says "Outbound ATMU crossing"??
PCIe ERR_CAP_STAT register: 0x80000001
Machine check in kernel mode.
Caused by (from MCSR=a000): Load Error Report
Guarded Load Error Report
Oops: Machine check, sig: 7 [#1]
SMP NR_CPUS=8 CoreNet Generic
Modules linked in: tetronik_fpga(+)
CPU: 3 PID: 943 Comm: modprobe Not tainted 4.1.8-rt8-v0.1+ #5
task: e917c050 ti: effc8000 task.ti: e925c000
NIP: f1d3e290 LR: f1d3e27c CTR: c036ed80
REGS: effc9f10 TRAP: 0204   Not tainted  (4.1.8-rt8-v0.1+)
MSR: 00029002 <CE,EE,ME>  CR: 22008882  XER: 20000000
DEAR: f1d44040 ESR: 00800000
GPR00: f1d3e27c e925dc70 e917c050 00000020 00029002 000000b4 c036fdd0 0000000f
GPR08: c09408e0 00000000 c09408e0 00000230 22008822 100a625c 75666665 20646572
GPR16: 20435055 2e313233 34353635 37383930 31323334 35363738 39303132 f1d3ebc8
GPR24: f1d3ebd4 00000000 f1d3ebe0 f1d3ebec 2e313233 f1d40000 f1d44000 f1d3ef00
NIP [f1d3e290] fpga_probe+0xd80/0xe50 [tetronik_fpga]
LR [f1d3e27c] fpga_probe+0xd6c/0xe50 [tetronik_fpga]
Call Trace:
[e925dc70] [f1d3e27c] fpga_probe+0xd6c/0xe50 [tetronik_fpga] (unreliable)
[e925dcd0] [c03274a8] pci_device_probe+0x98/0x100
[e925dcf0] [c038808c] really_probe+0x17c/0x340
[e925dd10] [c03883a8] __driver_attach+0xc8/0xd0
[e925dd30] [c0385e9c] bus_for_each_dev+0x6c/0xc0
[e925dd60] [c03875e8] bus_add_driver+0x178/0x250
[e925dd80] [c0388928] driver_register+0x88/0x140

 

PCIe ERR_CAP_R0 register: 0x00000800
PCIe ERR_CAP_R1 register: 0x00000000
PCIe ERR_CAP_R2 register: 0x00000000
PCIe ERR_CAP_R3 register: 0x00000000
PCIe error(s) detected
PCIe ERR_DR register: 0x80400000
PCIe ERR_CAP_STAT register: 0x80000001
PCIe ERR_CAP_R0 register: 0x00000800
PCIe ERR_CAP_R1 register: 0x00000000
PCIe ERR_CAP_R2 register: 0x00000000
PCIe ERR_CAP_R3 register: 0x00000000
pcieport 0003:08:00.0: AER: Multiple Uncorrected (Non-Fatal) error received: id=0000
PCIe error(s) detected
PCIe ERR_DR register: 0x00400000
PCIe ERR_CAP_STAT register: 0x80000001
PCIe ERR_CAP_R0 register: 0x00000800
PCIe ERR_CAP_R1 register: 0x00000000
PCIe ERR_CAP_R2 register: 0x00000000
PCIe ERR_CAP_R3 register: 0x00000000

 

Multiple repetitions of these, then

 

Unable to handle kernel paging request for instruction fetch

 

pcieport 0003:08:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0800(Requester ID)
pcieport 0003:08:00.0:   device [1957:0824] error status/mask=00004000/00000000
pcieport 0003:08:00.0:    [14] Completion Timeout     (First)
pcieport 0003:08:00.0: broadcast error_detected message
INFO: rcu_sched detected stalls on CPUs/tasks:
        (detected by 0, t=5252 jiffies, g=-179, c=-180, q=7)
All QSes seen, last rcu_sched kthread activity 5085 (-51805--56890), jiffies_till_next_fqs=1, root ->qsmask 0x0
swapper/0       R running      0     0      0 0x00000000
Call Trace:
[c0979d20] [c0076348] rcu_check_callbacks+0x758/0x760 (unreliable)
[c0979d80] [c00796bc] update_process_times+0x3c/0x70
[c0979d90] [c008e018] tick_sched_timer+0x68/0xd0
[c0979dc0] [c007a4f4] __run_hrtimer.isra.34+0x54/0xe0
[c0979de0] [c007ada8] hrtimer_interrupt+0xf8/0x300
[c0979e40] [c000952c] __timer_interrupt+0xac/0x1b0
[c0979e60] [c0009870] timer_interrupt+0xc0/0xf0
[c0979e80] [c000f610] ret_from_except+0x0/0x18
--- interrupt: 901 at arch_cpu_idle+0x24/0x70
    LR = arch_cpu_idle+0x24/0x70
[c0979f40] [c0073dd8] rcu_idle_enter+0xa8/0xe0 (unreliable)
[c0979f50] [c0060594] cpu_startup_entry+0x1b4/0x240
[c0979fb0] [c08ee9c0] start_kernel+0x3a4/0x3b8
[c0979ff0] [c00003d8] set_ivor+0x140/0x17c
rcu_sched kthread starved for 5085 jiffies!
Faulting instruction address: 0x00000010
Oops: Kernel access of bad area, sig: 11 [#2]
SMP NR_CPUS=8 CoreNet Generic
Modules linked in: tetronik_fpga(+)
CPU: 3 PID: 0 Comm: swapper/3 Tainted: G      D         4.1.8-rt8-v0.1+ #5
task: e9064c90 ti: e906a000 task.ti: e906a000
NIP: 00000010 LR: 00000010 CTR: c0054d40
REGS: e906be50 TRAP: 0400   Tainted: G      D          (4.1.8-rt8-v0.1+)
MSR: 00021002 <CE,ME>  CR: 00000010  XER: 00000000
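
For reference, the AER status word in the pcieport lines above (error status 00004000) can be decoded bit by bit. This is a minimal sketch in C, assuming the standard PCIe AER Uncorrectable Error Status bit layout. The function names are made up for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Name of one bit in the PCIe AER Uncorrectable Error Status register.
 * Bits not listed are treated as reserved/unknown in this sketch. */
const char *aer_uncor_bit_name(unsigned bit)
{
    assert(bit < 32);
    switch (bit) {
    case 4:  return "Data Link Protocol Error";
    case 5:  return "Surprise Down";
    case 12: return "Poisoned TLP";
    case 13: return "Flow Control Protocol Error";
    case 14: return "Completion Timeout";
    case 15: return "Completer Abort";
    case 16: return "Unexpected Completion";
    case 17: return "Receiver Overflow";
    case 18: return "Malformed TLP";
    case 19: return "ECRC Error";
    case 20: return "Unsupported Request";
    default: return NULL;
    }
}

/* Index of the lowest set bit in status, or -1 if no bit is set. */
int aer_lowest_set_bit(uint32_t status)
{
    for (int bit = 0; bit < 32; bit++) {
        if (status & (UINT32_C(1) << bit))
            return bit;
    }
    return -1;
}
```

Feeding 0x00004000 to `aer_lowest_set_bit` yields bit 14, which matches the "[14] Completion Timeout" line the kernel printed.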

 

 

 

Linux version 4.1.8-rt8-v0.1+ (gcc version 4.8.2 (GCC) #5 SMP

 

lspci -vv:

 

0003:09:00.0 Unassigned class [ff00]: Altera Corporation Device 0004 (rev 01)
        Subsystem: Altera Corporation Device 0004
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 43
        Region 0: Memory at c30000000 (32-bit, non-prefetchable) [size=128M]
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [80] Express (v1) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #1, Speed 2.5GT/s, Width x1, ASPM L0s, Exit Latency L0s unlimited, L1 unlimited
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed+ WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
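
One detail in the lspci output above: DevCtl shows a MaxPayload of 128 bytes (DevCap allows 256), while the FPGA issues 256-byte burst writes. Under PCIe rules a memory write may not carry more payload than the configured MaxPayload and may not cross a 4 KiB address boundary, so each burst has to go out as more than one TLP. The C sketch below only counts the TLPs that would be needed. `count_write_tlps` is a hypothetical helper, and whether the FPGA's PCIe core actually performs this splitting is not known from the information here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the memory-write TLPs needed to transfer len bytes starting at
 * bus address addr, with each payload capped at mps bytes and no TLP
 * crossing a 4 KiB address boundary. */
size_t count_write_tlps(uint64_t addr, size_t len, size_t mps)
{
    assert(mps > 0);
    size_t count = 0;

    while (len > 0) {
        /* Bytes left before the next 4 KiB boundary. */
        size_t to_boundary = 4096 - (size_t)(addr & 4095);
        size_t chunk = len < mps ? len : mps;

        if (chunk > to_boundary)
            chunk = to_boundary;

        addr += chunk;
        len -= chunk;
        count++;
    }
    return count;
}
```

With a 128-byte MaxPayload, an aligned 256-byte burst needs two TLPs, and the full 2 kB transfer needs sixteen.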

 

>cat /proc/iomem
00000000-7fffffff : System RAM
c00000000-c0fffffff : /pcie@ffe240000
  c00000000-c0fffffff : PCI Bus 0000:01
c10000000-c1fffffff : /pcie@ffe250000
  c10000000-c1fffffff : PCI Bus 0001:03
c20000000-c2fffffff : /pcie@ffe260000
  c20000000-c2fffffff : PCI Bus 0002:05
c30000000-c3fffffff : /pcie@ffe270000
  c30000000-c3fffffff : PCI Bus 0003:09
    c30000000-c37ffffff : 0003:09:00.0

 

 

 

Any ideas what these dumps are trying to tell me? Why does a write from an EP generate an "outgoing window" error?

 

Thanks for your help!

 

 
