t1042 PCIe errors at high speed


thorstenpohlman
Contributor II

Hi!

We are using a T1042E with the NXP Linux SDK 2.0 (also tried it with 1.9).

We have an Altera EP4CGX22 FPGA attached via PCIe (the FPGA is the EP, the CPU is the RC). The problem occurs on different boards and with FPGA test cards from different vendors.

Two memory blocks on the PPC side are allocated with dma_alloc_coherent() (128 kB), and the DMA handles are stored in the FPGA. For now the FPGA does nothing more than read 2 kB from the CPU's memory block #1 and write the data to block #2. There are no IRQs and no other code on the Linux side apart from starting the FPGA.

Everything works fine, data is read correctly and stored correctly...

As long as there is a hardcoded delay of about 2 µs between every 256-byte burst write from the FPGA.
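
For scale, the cost of that delay can be estimated. The sketch below is a rough calculation, not a measurement. It takes the link parameters from the lspci output further down (2.5 GT/s, x1), assumes 8b/10b line coding, ignores TLP header and DLLP overhead, and assumes the 2 µs gap is added after each burst. All function names are made up for illustration.

```python
# Back-of-the-envelope timing for the FPGA's write bursts.
# Assumptions: payload-only wire time (no TLP/DLLP overhead) and a fixed
# gap appended after every burst. Names are hypothetical.

LINK_GT_PER_S = 2.5e9       # lane rate reported by lspci (LnkSta)
ENCODING_EFFICIENCY = 0.8   # 8b/10b coding: 8 payload bits per 10 line bits
LANES = 1                   # "Width x1" in lspci


def payload_bytes_per_second(gt_per_s=LINK_GT_PER_S, lanes=LANES):
    """Upper bound on the payload bandwidth of the link, in bytes/s."""
    return gt_per_s * ENCODING_EFFICIENCY * lanes / 8


def burst_wire_time_us(burst_bytes):
    """Time one burst occupies the link, in microseconds."""
    return burst_bytes / payload_bytes_per_second() * 1e6


def throttled_rate_mb_per_s(burst_bytes, delay_us):
    """Write rate in MB/s when each burst is followed by delay_us of idle time."""
    period_us = burst_wire_time_us(burst_bytes) + delay_us
    return burst_bytes / period_us  # bytes per microsecond equals MB/s


def bursts_per_transfer(transfer_bytes, burst_bytes):
    """Number of bursts needed for one transfer (rounded up)."""
    return -(-transfer_bytes // burst_bytes)


if __name__ == "__main__":
    print(burst_wire_time_us(256))            # wire time of one 256-byte burst
    print(throttled_rate_mb_per_s(256, 2.0))  # rate with the 2 µs gap
    print(bursts_per_transfer(2048, 256))     # bursts per 2 kB copy
```

Under these assumptions a 256-byte burst occupies the link for about 1 µs, so the 2 µs gap brings the write rate down to roughly a third of the link's payload bandwidth (about 85 MB/s instead of 250 MB/s).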

Without that artificial delay, the following happens immediately after telling the FPGA to start (by writing to a BAR-mapped flag). It only happens when the FPGA writes to CPU memory; reads work as fast as possible without any problem:

PCIe error(s) detected
PCIe ERR_DR register: 0x00800000              <- This says "Outbound ATMU crossing"??
PCIe ERR_CAP_STAT register: 0x80000001
Machine check in kernel mode.
Caused by (from MCSR=a000): Load Error Report
Guarded Load Error Report
Oops: Machine check, sig: 7 [#1]
SMP NR_CPUS=8 CoreNet Generic
Modules linked in: tetronik_fpga(+)
CPU: 3 PID: 943 Comm: modprobe Not tainted 4.1.8-rt8-v0.1+ #5
task: e917c050 ti: effc8000 task.ti: e925c000
NIP: f1d3e290 LR: f1d3e27c CTR: c036ed80
REGS: effc9f10 TRAP: 0204   Not tainted  (4.1.8-rt8-v0.1+)
MSR: 00029002 <CE,EE,ME>  CR: 22008882  XER: 20000000
DEAR: f1d44040 ESR: 00800000
GPR00: f1d3e27c e925dc70 e917c050 00000020 00029002 000000b4 c036fdd0 0000000f
GPR08: c09408e0 00000000 c09408e0 00000230 22008822 100a625c 75666665 20646572
GPR16: 20435055 2e313233 34353635 37383930 31323334 35363738 39303132 f1d3ebc8
GPR24: f1d3ebd4 00000000 f1d3ebe0 f1d3ebec 2e313233 f1d40000 f1d44000 f1d3ef00
NIP [f1d3e290] fpga_probe+0xd80/0xe50 [tetronik_fpga]
LR [f1d3e27c] fpga_probe+0xd6c/0xe50 [tetronik_fpga]
Call Trace:
[e925dc70] [f1d3e27c] fpga_probe+0xd6c/0xe50 [tetronik_fpga] (unreliable)
[e925dcd0] [c03274a8] pci_device_probe+0x98/0x100
[e925dcf0] [c038808c] really_probe+0x17c/0x340
[e925dd10] [c03883a8] __driver_attach+0xc8/0xd0
[e925dd30] [c0385e9c] bus_for_each_dev+0x6c/0xc0
[e925dd60] [c03875e8] bus_add_driver+0x178/0x250
[e925dd80] [c0388928] driver_register+0x88/0x140

PCIe ERR_CAP_R0 register: 0x00000800
PCIe ERR_CAP_R1 register: 0x00000000
PCIe ERR_CAP_R2 register: 0x00000000
PCIe ERR_CAP_R3 register: 0x00000000
PCIe error(s) detected
PCIe ERR_DR register: 0x80400000
PCIe ERR_CAP_STAT register: 0x80000001
PCIe ERR_CAP_R0 register: 0x00000800
PCIe ERR_CAP_R1 register: 0x00000000
PCIe ERR_CAP_R2 register: 0x00000000
PCIe ERR_CAP_R3 register: 0x00000000
pcieport 0003:08:00.0: AER: Multiple Uncorrected (Non-Fatal) error received: id=0000
PCIe error(s) detected
PCIe ERR_DR register: 0x00400000
PCIe ERR_CAP_STAT register: 0x80000001
PCIe ERR_CAP_R0 register: 0x00000800
PCIe ERR_CAP_R1 register: 0x00000000
PCIe ERR_CAP_R2 register: 0x00000000
PCIe ERR_CAP_R3 register: 0x00000000

Multiple repetitions of these, then:

Unable to handle kernel paging request for instruction fetch

pcieport 0003:08:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0800(Requester ID)
pcieport 0003:08:00.0:   device [1957:0824] error status/mask=00004000/00000000
pcieport 0003:08:00.0:    [14] Completion Timeout     (First)
pcieport 0003:08:00.0: broadcast error_detected message
INFO: rcu_sched detected stalls on CPUs/tasks:
        (detected by 0, t=5252 jiffies, g=-179, c=-180, q=7)
All QSes seen, last rcu_sched kthread activity 5085 (-51805--56890), jiffies_till_next_fqs=1, root ->qsmask 0x0
swapper/0       R running      0     0      0 0x00000000
Call Trace:
[c0979d20] [c0076348] rcu_check_callbacks+0x758/0x760 (unreliable)
[c0979d80] [c00796bc] update_process_times+0x3c/0x70
[c0979d90] [c008e018] tick_sched_timer+0x68/0xd0
[c0979dc0] [c007a4f4] __run_hrtimer.isra.34+0x54/0xe0
[c0979de0] [c007ada8] hrtimer_interrupt+0xf8/0x300
[c0979e40] [c000952c] __timer_interrupt+0xac/0x1b0
[c0979e60] [c0009870] timer_interrupt+0xc0/0xf0
[c0979e80] [c000f610] ret_from_except+0x0/0x18
--- interrupt: 901 at arch_cpu_idle+0x24/0x70
    LR = arch_cpu_idle+0x24/0x70
[c0979f40] [c0073dd8] rcu_idle_enter+0xa8/0xe0 (unreliable)
[c0979f50] [c0060594] cpu_startup_entry+0x1b4/0x240
[c0979fb0] [c08ee9c0] start_kernel+0x3a4/0x3b8
[c0979ff0] [c00003d8] set_ivor+0x140/0x17c
rcu_sched kthread starved for 5085 jiffies!
Faulting instruction address: 0x00000010
Oops: Kernel access of bad area, sig: 11 [#2]
SMP NR_CPUS=8 CoreNet Generic
Modules linked in: tetronik_fpga(+)
CPU: 3 PID: 0 Comm: swapper/3 Tainted: G      D         4.1.8-rt8-v0.1+ #5
task: e9064c90 ti: e906a000 task.ti: e906a000
NIP: 00000010 LR: 00000010 CTR: c0054d40
REGS: e906be50 TRAP: 0400   Tainted: G      D          (4.1.8-rt8-v0.1+)
MSR: 00021002 <CE,ME>  CR: 00000010  XER: 00000000

Linux version 4.1.8-rt8-v0.1+ (gcc version 4.8.2 (GCC) #5 SMP
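
The AER line in the log above reports `error status/mask=00004000/00000000` and names bit 14 as Completion Timeout. As a cross-check, the sketch below decodes an AER Uncorrectable Error Status value using the bit assignments from the PCI Express specification. The function name and return format are illustrative assumptions, not kernel API.

```python
# Decode a PCIe AER Uncorrectable Error Status value into bit names.
# Bit assignments follow the PCI Express Base Specification; the helper
# itself is a hypothetical illustration.

AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
}


def decode_aer_uncorrectable(status, mask=0):
    """Return a list of (bit, name, is_masked) for every set status bit."""
    result = []
    for bit in range(32):
        if status & (1 << bit):
            name = AER_UNCORRECTABLE_BITS.get(bit, "Reserved/unknown")
            result.append((bit, name, bool(mask & (1 << bit))))
    return result


if __name__ == "__main__":
    # Values taken from the "error status/mask=" line in the log.
    for bit, name, masked in decode_aer_uncorrectable(0x00004000, 0x00000000):
        print(bit, name, "masked" if masked else "unmasked")
```

For the values in the log this yields a single entry, bit 14 (Completion Timeout), unmasked, which matches what the kernel printed.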

lspci -vv:

0003:09:00.0 Unassigned class [ff00]: Altera Corporation Device 0004 (rev 01)
        Subsystem: Altera Corporation Device 0004
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 43
        Region 0: Memory at c30000000 (32-bit, non-prefetchable) [size=128M]
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [80] Express (v1) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #1, Speed 2.5GT/s, Width x1, ASPM L0s, Exit Latency L0s unlimited, L1 unlimited
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed+ WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
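
One detail in the lspci output may be worth checking: DevCap advertises MaxPayload 256 bytes, but DevCtl is programmed to MaxPayload 128 bytes. Under the PCIe rules a write TLP's payload may not exceed the Max_Payload_Size set in Device Control, and a memory request may not cross a 4 KB address boundary. The sketch below shows one legal way to split a burst under those two rules. It is a model for reasoning only; whether the FPGA's PCIe core or the user logic does this splitting is not known from the post, and the function name is made up.

```python
# Split a memory write into TLP-sized pieces.
# Rules modelled: payload <= Max_Payload_Size, no 4 KB boundary crossing.
# A real DMA engine may choose a different (equally legal) split.

FOUR_KB = 4096


def split_write(addr, length, max_payload=128):
    """Return a list of (start_address, byte_count) pieces for one write."""
    if length <= 0 or max_payload <= 0:
        raise ValueError("length and max_payload must be positive")
    tlps = []
    while length > 0:
        to_boundary = FOUR_KB - (addr % FOUR_KB)  # bytes left in this 4 KB page
        chunk = min(length, max_payload, to_boundary)
        tlps.append((addr, chunk))
        addr += chunk
        length -= chunk
    return tlps


if __name__ == "__main__":
    # A 256-byte burst with MaxPayload 128 needs at least two write TLPs.
    for start, count in split_write(0x1000, 256, max_payload=128):
        print(hex(start), count)
```

With MaxPayload 128, each 256-byte burst becomes at least two write TLPs, and the 2 kB copy becomes at least 16.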

>cat /proc/iomem
00000000-7fffffff : System RAM
c00000000-c0fffffff : /pcie@ffe240000
  c00000000-c0fffffff : PCI Bus 0000:01
c10000000-c1fffffff : /pcie@ffe250000
  c10000000-c1fffffff : PCI Bus 0001:03
c20000000-c2fffffff : /pcie@ffe260000
  c20000000-c2fffffff : PCI Bus 0002:05
c30000000-c3fffffff : /pcie@ffe270000
  c30000000-c3fffffff : PCI Bus 0003:09
    c30000000-c37ffffff : 0003:09:00.0
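
Since the error register is read here as pointing at an outbound window, one possible sanity check (a suggestion, not something done in the original setup) is to confirm that the addresses handed to the FPGA fall in System RAM and not inside one of the PCIe windows listed above. The sketch parses /proc/iomem-style text and returns the innermost region containing an address. It assumes bus addresses equal CPU physical addresses, which depends on how the inbound windows are set up; the helper names are illustrative.

```python
# Classify a physical address against /proc/iomem-style text.
# Assumption: two spaces of indentation per nesting level, as in the
# listing from the post. Helper names are hypothetical.
import re

SAMPLE = """\
00000000-7fffffff : System RAM
c00000000-c0fffffff : /pcie@ffe240000
  c00000000-c0fffffff : PCI Bus 0000:01
c10000000-c1fffffff : /pcie@ffe250000
  c10000000-c1fffffff : PCI Bus 0001:03
c20000000-c2fffffff : /pcie@ffe260000
  c20000000-c2fffffff : PCI Bus 0002:05
c30000000-c3fffffff : /pcie@ffe270000
  c30000000-c3fffffff : PCI Bus 0003:09
    c30000000-c37ffffff : 0003:09:00.0
"""

_LINE = re.compile(r"^(\s*)([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.*)$")


def parse_iomem(text):
    """Return a list of (depth, start, end, name) tuples."""
    regions = []
    for line in text.splitlines():
        match = _LINE.match(line)
        if not match:
            continue
        indent, start, end, name = match.groups()
        regions.append((len(indent) // 2, int(start, 16), int(end, 16), name.strip()))
    return regions


def innermost_region(regions, addr):
    """Return the name of the most deeply nested region containing addr, or None."""
    best = None
    for depth, start, end, name in regions:
        if start <= addr <= end and (best is None or depth > best[0]):
            best = (depth, name)
    return best[1] if best else None


if __name__ == "__main__":
    regions = parse_iomem(SAMPLE)
    print(innermost_region(regions, 0x10000000))   # an address in RAM
    print(innermost_region(regions, 0xc30000010))  # an address in the FPGA BAR
```

A DMA target that resolved to anything other than System RAM would be a reason to look more closely at the addresses programmed into the FPGA.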

Any ideas what these dumps are trying to tell us? Why does a write from an EP generate an "outgoing window" error?
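
As a first step in reading the register dumps, it helps to pin down which bits are actually set, because Power Architecture documentation typically numbers bits with bit 0 as the most significant bit, while most tools count from the least significant bit. The helper below lists the set bits of a 32-bit value in both conventions. It does not attach field names; those would have to come from the T1042 reference manual. The function name is made up for illustration.

```python
# List the set bits of a register value in both numbering conventions.
# lsb0: bit 0 is the least significant bit (common tool convention).
# msb0: bit 0 is the most significant bit (typical in Power Architecture manuals).


def set_bits(value, width=32):
    """Return (lsb0, msb0) index pairs for every set bit, highest bit first."""
    pairs = []
    for lsb0 in range(width - 1, -1, -1):
        if value & (1 << lsb0):
            pairs.append((lsb0, width - 1 - lsb0))
    return pairs


if __name__ == "__main__":
    # ERR_DR values copied from the log output in this post.
    for err_dr in (0x00800000, 0x80400000, 0x00400000):
        print(hex(err_dr), set_bits(err_dr))
```

For example, the first ERR_DR value, 0x00800000, has exactly one bit set: bit 23 counting from the least significant end, which is bit 8 when counting from the most significant end.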

Thanks for your help!

 

1 Reply

tachyon11
Contributor I

Is there any update?

Thanks
