Network failure w/ fsl-gianfar "transmit queue N timed out" traceback

ddadson · ‎07-27-2016

Hi,

We are running a P1022-based board with a QorIQ SDK 1.9 kernel (3.12.37-rt51) and a 100Mb/s Ethernet connection to an HP2520-24 switch. Our PHY is a KSZ9031RNXCA (which differs from the Freescale P1022DS reference board). About once or twice a day we encounter the following kernel traceback:

...

NETDEV WATCHDOG: eth0 (fsl-gianfar): transmit queue 1 timed out

------------[ cut here ]------------

WARNING: at /home/ddadson/sandbox/gateway_4g/externalsrc/linux-qoriq-sdk/net/sched/sch_generic.c:279

Modules linked in: cxd2850(PO) allachie(O) mhor(O) cadam(O) gwdcb(O) tabasco(O)

CPU: 0 PID: 0 Comm: swapper/0 Tainted: P O 3.12.37-rt51-Lochranza-Gateway_4G-=1.4.0=-unknown-dev #2

task: c1136300 ti: effea000 task.ti: c1178000

NIP: c053c648 LR: c053c648 CTR: c03450a4

REGS: effebe50 TRAP: 0700 Tainted: P O (3.12.37-rt51-Lochranza-Gateway_4G-=1.4.0=-unknown-dev)

MSR: 00029000 <CE,EE,ME> CR: 44044022 XER: 20000000

GPR00: c053c648 effebf00 c1136300 0000003f c1abf484 c1abf9d0 0098e000 effea000

GPR08: 00000007 00000000 0098e000 000001af 24044084 1001a678 c0721ac0 00000100

GPR16: effea000 00000000 c1132200 c117d060 01c4bc65 c117d0c4 c128eb14 ffffffff

GPR24: 00000000 00000000 00000004 effea000 c1170000 c1180000 ee0bd000 00000001

NIP [c053c648] dev_watchdog+0x2e0/0x2f0

LR [c053c648] dev_watchdog+0x2e0/0x2f0

Call Trace:

[effebf00] [c053c648] dev_watchdog+0x2e0/0x2f0 (unreliable)

[effebf30] [c004a86c] call_timer_fn.isra.29+0x28/0x84

[effebf50] [c004aa48] run_timer_softirq+0x180/0x1fc

[effebf90] [c00434b8] __do_softirq+0x10c/0x1d8

[effebff0] [c000da5c] call_do_softirq+0x24/0x3c

[c1179e60] [c0004924] do_softirq+0x8c/0xb4

[c1179e80] [c0043e6c] irq_exit+0xa4/0xc8

[c1179e90] [c0009edc] timer_interrupt+0x1a4/0x1d4

[c1179ec0] [c000fb08] ret_from_except+0x0/0x18

--- Exception: 901 at arch_cpu_idle+0x24/0x5c

LR = arch_cpu_idle+0x24/0x5c

[c1179f80] [c00a7990] rcu_idle_enter+0xac/0xec (unreliable)

[c1179f90] [c0080454] cpu_startup_entry+0x114/0x164

[c1179fc0] [c07e47bc] start_kernel+0x2dc/0x2f0

[c1179ff0] [c00003fc] skpinv+0x2e8/0x324

Instruction dump:

4e800421 80be0204 4bffff48 7fc3f378 4bfe7c01 7fc4f378 7c651b78 3c60c07a

7fe6fb78 38639af0 4cc63182 480c50e9 <0fe00000> 39200001 993cb406 4bffffb4

---[ end trace 49ed838668e6193f ]---

...

After this happens there appears to be no successful outbound network traffic (we haven't been able to check inbound yet), but the switch does see continuous bad transmissions coming from the unit. It seems to occur under moderate load (a fairly steady 15-20Mb/s of transmission). Very heavy (e.g. nearly 100Mb/s) and very light network traffic don't seem to trigger the issue. It also seems to be very sensitive to something as yet unidentified: we are struggling to replicate it as consistently on other units (although it has been seen there).

The network can be recovered by running ifconfig down/ifconfig up via the serial port.

No issue has been seen with the same unit/config/switch but at 1Gb/s.

Are there any known issues like this, and any suggested fixes/workarounds?

Thanks for any help,

David Dadson

bpe · ‎07-30-2016

The netdev watchdog goes off when the eTSEC stops servicing TxBDs. If there are no fatal errors in IEVENT (the gianfar driver should report them on the console), this most likely indicates a problem with the transmit clock. If the clock is missing or badly shaped, the interface freezes.

Have a great day,
Platon


ddadson · ‎11-03-2016

Hi Platon,

Thanks for your reply - it has taken us quite a long time to investigate, but we now believe the problem is the DMA engine stalling, although we have no idea why.

Currently we can work around the problem by stopping and restarting the DMA with gfar_halt()/gfar_start() when num_txbdfree drops below a threshold and gfar_poll_tx_sq() hasn't been called recently. This isn't really ideal. Are you aware of any driver fixes or P1022 errata that might be related?

Thanks,

David
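
For readers following along, the stall check described in the workaround above could look roughly like the sketch below. It is not the actual patch from this thread: the struct, the `last_tx_clean_ms` timestamp, the helper name and both threshold values are hypothetical stand-ins, and only `num_txbdfree` is a name taken from the post. In a real driver the caller would invoke gfar_halt()/gfar_start() when the check returns true.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of one TX queue; NOT the real gianfar structures. */
struct txq_snapshot {
    unsigned int num_txbdfree;   /* free TxBDs in the ring (name from the post) */
    uint64_t last_tx_clean_ms;   /* assumed: time the TX cleanup last ran */
};

/* Assumed thresholds, chosen only for illustration. */
#define TXBD_LOW_WATERMARK 8u    /* "nearly full" if fewer free BDs than this */
#define TX_CLEAN_STALL_MS  200u  /* cleanup considered overdue after this long */

/*
 * Returns true when the ring is nearly full AND TX cleanup has not run
 * recently, i.e. the condition under which the workaround restarts the DMA.
 * Assumes now_ms >= q->last_tx_clean_ms (monotonic clock).
 */
static bool tx_dma_looks_stalled(const struct txq_snapshot *q, uint64_t now_ms)
{
    bool ring_nearly_full = q->num_txbdfree < TXBD_LOW_WATERMARK;
    bool cleanup_overdue = (now_ms - q->last_tx_clean_ms) > TX_CLEAN_STALL_MS;

    return ring_nearly_full && cleanup_overdue;
}
```

Requiring both conditions is what keeps the check from firing under ordinary heavy load, where the ring may be nearly full but descriptors are still being reclaimed.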

