
Network failure w/ fsl-gianfar "transmit queue N timed out" traceback

Question asked by David Dadson on Jul 27, 2016
Latest reply on Nov 3, 2016 by David Dadson

Hi,

 

We are running a P1022-based board with a QorIQ SDK 1.9 kernel (3.12.37-rt51) and a 100Mb/s Ethernet connection to an HP2520-24 switch. Our PHY is a KSZ9031RNXCA (which differs from the one on the Freescale P1022DS reference board). About once or twice a day we encounter the following kernel traceback:

 

...

NETDEV WATCHDOG: eth0 (fsl-gianfar): transmit queue 1 timed out

------------[ cut here ]------------

WARNING: at /home/ddadson/sandbox/gateway_4g/externalsrc/linux-qoriq-sdk/net/sched/sch_generic.c:279

Modules linked in: cxd2850(PO) allachie(O) mhor(O) cadam(O) gwdcb(O) tabasco(O)

CPU: 0 PID: 0 Comm: swapper/0 Tainted: P           O 3.12.37-rt51-Lochranza-Gateway_4G-=1.4.0=-unknown-dev #2

task: c1136300 ti: effea000 task.ti: c1178000

NIP: c053c648 LR: c053c648 CTR: c03450a4

REGS: effebe50 TRAP: 0700   Tainted: P           O  (3.12.37-rt51-Lochranza-Gateway_4G-=1.4.0=-unknown-dev)

MSR: 00029000 <CE,EE,ME>  CR: 44044022  XER: 20000000

 

GPR00: c053c648 effebf00 c1136300 0000003f c1abf484 c1abf9d0 0098e000 effea000

GPR08: 00000007 00000000 0098e000 000001af 24044084 1001a678 c0721ac0 00000100

GPR16: effea000 00000000 c1132200 c117d060 01c4bc65 c117d0c4 c128eb14 ffffffff

GPR24: 00000000 00000000 00000004 effea000 c1170000 c1180000 ee0bd000 00000001

NIP [c053c648] dev_watchdog+0x2e0/0x2f0

LR [c053c648] dev_watchdog+0x2e0/0x2f0

Call Trace:

[effebf00] [c053c648] dev_watchdog+0x2e0/0x2f0 (unreliable)

[effebf30] [c004a86c] call_timer_fn.isra.29+0x28/0x84

[effebf50] [c004aa48] run_timer_softirq+0x180/0x1fc

[effebf90] [c00434b8] __do_softirq+0x10c/0x1d8

[effebff0] [c000da5c] call_do_softirq+0x24/0x3c

[c1179e60] [c0004924] do_softirq+0x8c/0xb4

[c1179e80] [c0043e6c] irq_exit+0xa4/0xc8

[c1179e90] [c0009edc] timer_interrupt+0x1a4/0x1d4

[c1179ec0] [c000fb08] ret_from_except+0x0/0x18

--- Exception: 901 at arch_cpu_idle+0x24/0x5c

    LR = arch_cpu_idle+0x24/0x5c

[c1179f80] [c00a7990] rcu_idle_enter+0xac/0xec (unreliable)

[c1179f90] [c0080454] cpu_startup_entry+0x114/0x164

[c1179fc0] [c07e47bc] start_kernel+0x2dc/0x2f0

[c1179ff0] [c00003fc] skpinv+0x2e8/0x324

Instruction dump:

4e800421 80be0204 4bffff48 7fc3f378 4bfe7c01 7fc4f378 7c651b78 3c60c07a

7fe6fb78 38639af0 4cc63182 480c50e9 <0fe00000> 39200001 993cb406 4bffffb4

---[ end trace 49ed838668e6193f ]---

...

 

After this happens there appears to be no successful outbound network traffic (we haven't been able to check inbound yet), but the switch does see continuous bad transmissions coming from the unit. It seems to occur under moderate load (around 15-20Mb/s of fairly steady transmission). Very heavy (e.g. nearly 100Mb/s) and very light network traffic don't seem to trigger the issue. It also seems to be very sensitive to something as yet unidentified: we are struggling to replicate it as consistently on other units (although it has been seen on them).

 

The network can be recovered by using ifconfig down/ifconfig up via the serial port.
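
As a stopgap, the manual recovery could in principle be automated. The following is only a rough sketch, not something we have deployed: it assumes the kernel log is available to a userspace script as a stream of lines, that Python is present on the target, and that the watchdog message always has the format shown in the traceback above. The function names (parse_watchdog_line, bounce_commands, watch) are made up for illustration. It does nothing about the root cause.

```python
import re
import subprocess

# Matches the message format seen in the traceback above, e.g.
# "NETDEV WATCHDOG: eth0 (fsl-gianfar): transmit queue 1 timed out"
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def parse_watchdog_line(line):
    """Return (iface, driver, queue) for a TX watchdog timeout line, else None."""
    match = WATCHDOG_RE.search(line)
    if match is None:
        return None
    return match.group("iface"), match.group("driver"), int(match.group("queue"))


def bounce_commands(iface):
    """Commands equivalent to the manual ifconfig down / ifconfig up recovery."""
    return [["ifconfig", iface, "down"], ["ifconfig", iface, "up"]]


def watch(lines, run=subprocess.check_call):
    """Scan kernel log lines and bounce the interface on each timeout seen.

    `run` is injectable so the logic can be exercised without touching
    a real interface. Returns the number of recoveries attempted.
    """
    recoveries = 0
    for line in lines:
        parsed = parse_watchdog_line(line)
        if parsed is None:
            continue
        for cmd in bounce_commands(parsed[0]):
            run(cmd)
        recoveries += 1
    return recoveries
```

On a real unit, `lines` might be an open file on the kernel log; where exactly that comes from depends on the target's userspace and is an assumption here. Passing a different `run` callable (for example one that only logs the command) allows a dry run first.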

 

No issue has been seen with the same unit/config/switch but at 1Gb/s.

 

Are there any known issues like this, and any suggested fixes/workarounds?

 

Thanks for any help,

David Dadson
