USB silently hangs on imx6q in presence of transmission errors (EMI or data lanes shorting)


1,501 Views
aballier
Contributor I

We are investigating a strange USB hang with Linux 4.1.38-fslc (branch 4.1-1.0.x-imx). It was first found by subjecting a USB cable between an imx6q-based product and a USB LTE modem to EMI. We then discovered that we could reproduce it simply by shorting the data lines of the USB cable for a couple of seconds.

Since it is simpler to do, I'll focus on the data-line shorting method:

- Communicate with a USB device on the USB host port of the imx6q (tried with an LTE modem and a simple USB drive; the key seems to be that there is some ongoing communication)

- Short the USB data lines on the cable connecting the imx6q and the device (white & green wires)

Most of the time nothing appears in dmesg / syslog. Sometimes, with the LTE modem, I see this: `[ 63.428507] option1 ttyUSB1: option_instat_callback: error -71`, and the USB port seems to be dead. The devices are not removed from /dev; they simply do not respond at all.

If I unplug and replug the device, nothing happens either. I can short VBUS to ground to trigger an overcurrent condition, which reinitializes part of the USB stack: it then detects that the device is gone and that a device has just been plugged in, but enumeration still fails:

```

[ 512.041415] usb 2-1: new full-speed USB device number 6 using ci_hdrc
[ 522.501370] usb 2-1: device not accepting address 6, error -110
[ 522.621472] usb 2-1: new high-speed USB device number 7 using ci_hdrc
[ 533.081380] usb 2-1: device not accepting address 7, error -110
[ 533.087572] usb usb2-port1: unable to enumerate USB device

```
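For reference, the numeric codes in these messages are negated Linux errno values: -71 is EPROTO and -110 is ETIMEDOUT. The kernel's USB error-code documentation associates EPROTO with low-level bus problems such as bit-stuffing errors or missing response packets. A minimal sketch for pulling the codes out of a log and naming them follows; the function names `usb_errno_name` and `decode_usb_errors` are made up for illustration, and only the two codes seen in this thread are mapped.

```shell
#!/bin/sh
# Hypothetical helpers (not part of any kernel tooling): name the two
# errno values that appear in the logs above.

usb_errno_name() {
    case "$1" in
        -71)  echo "EPROTO" ;;     # protocol error on the bus
        -110) echo "ETIMEDOUT" ;;  # the request timed out
        *)    echo "unknown" ;;
    esac
}

# usage: dmesg | decode_usb_errors
# Prints one "<code> <name>" line for every log line ending in "error -N".
decode_usb_errors() {
    sed -n 's/.*error \(-[0-9][0-9]*\)$/\1/p' | while read -r code; do
        printf '%s %s\n' "$code" "$(usb_errno_name "$code")"
    done
}
```

Lines that do not end in an error code, such as the final "unable to enumerate USB device" message, are skipped.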

After enabling some EHCI debugging in the kernel, I see a lot of these messages when I short the data lines:

```

[ 379.546153] ci_hdrc ci_hdrc.1: detected XactErr len 0/4096 retry 16
[ 379.552440] ci_hdrc ci_hdrc.1: detected XactErr len 0/1 retry 17
[ 379.558458] ci_hdrc ci_hdrc.1: detected XactErr len 0/4096 retry 16
[ 379.564735] ci_hdrc ci_hdrc.1: detected XactErr len 0/4096 retry 17
[ 379.571013] ci_hdrc ci_hdrc.1: detected XactErr len 0/10 retry 3
[ 379.577056] ci_hdrc ci_hdrc.1: detected XactErr len 0/1 retry 18
[ 379.583075] ci_hdrc ci_hdrc.1: detected XactErr len 0/4096 retry 17
[ 379.589353] ci_hdrc ci_hdrc.1: detected XactErr len 0/4096 retry 18
[ 379.595645] ci_hdrc ci_hdrc.1: detected XactErr len 0/1 retry 19
[ 379.601664] ci_hdrc ci_hdrc.1: detected XactErr len 0/4096 retry 18

```

If I keep the data lines shorted, this stream of errors eventually stops. It seems to me that the EHCI controller does see the errors but hangs at some point.

Has anyone seen this before? Can it be fixed or recovered from? I've tried unbinding/rebinding usb2, and that does not recover from this state. If I soft-reboot the system (`reboot -n -f` from the command line), it recovers: devices are enumerated and work properly. One thing to note is that this command takes much longer than usual when USB is hung (normally the reboot is instant; with USB hung it takes about 30 seconds).
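To compare runs, the debug output can be condensed into a count of XactErr lines and the highest retry value reached. This is only a sketch: the function name `xacterr_summary` and its output format are invented here, and it assumes the retry counter is the last field of each line, as in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: summarize EHCI "detected XactErr" debug lines.

# usage: dmesg | xacterr_summary
xacterr_summary() {
    awk '
        /detected XactErr/ {
            count++
            # Assumes the retry counter is the last field on the line.
            retry = $NF + 0
            if (retry > max) max = retry
        }
        END { printf "xacterr_lines=%d max_retry=%d\n", count, max }
    '
}
```

Running it periodically while the lines are shorted would show whether the line count stops increasing, which is the point where the controller appears to hang.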

2 Replies

1,270 Views
CarlosCasillas
NXP Employee

Hi Alexis,
In general, it seems to be a hardware (layout) issue. The behavior you describe (shorting the data and power lines for USB recovery) seems to indicate signal integrity issues. Have you measured signal integrity (for example, by capturing an eye diagram)?

You could refer to the following documents for additional guidelines about USB:

Finally, you could also consider testing with an NXP BSP, as the mentioned 4.1.38-fslc is not supported by NXP. You could refer to the link below where you could get “L4.14.98_2.0.0” or “L4.1.15_2.0.0” BSPs:

Hope this will be useful for you.
Best regards!
/Carlos


1,270 Views
aballier
Contributor I

Hi Carlos,

I have not personally verified signal integrity, but I have been told that a USB protocol analyzer shows the bus going completely silent after this bug occurs. One thing to note is that the USB PHY registers include TX/RX error counters, and those definitely increase when the issue is triggered, so at least the hardware is aware that something is going wrong.

I have tested with the latest 4.14-2.0.x-imx branch of the linux-fslc repo, which I believe is what L4.14.98_2.0.0 uses, and got the same behavior. Note also that our kernel is based on the L4.1.15_1.0.0 BSP; it is a bit old and we carry custom patches for extra hardware drivers, but it is not something completely foreign either.

I finally found a way to recover: unbinding the entire USB controller with `echo ci_hdrc.1 > /sys/bus/platform/drivers/ci_hdrc/unbind` and then rebinding it. Unloading and reloading the USB controller modules also allows recovery.

So my understanding is that a full reset of the controller fixes the issue. It does not recover without userspace intervention, which is not ideal, but at least we can avoid rebooting.
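The unbind/rebind sequence could be wrapped in a small shell function along these lines. This is a sketch under assumptions rather than a tested recovery tool: the function name `rebind_usb_controller`, the optional arguments, and the one-second pause between unbind and bind are illustrative choices; only the `ci_hdrc.1` device name and the driver path come from the command above. It must run as root on the target.

```shell
#!/bin/sh
# Hypothetical wrapper around the sysfs unbind/bind sequence.

# usage: rebind_usb_controller [device] [driver_dir] [delay_seconds]
rebind_usb_controller() {
    dev="${1:-ci_hdrc.1}"
    drv="${2:-/sys/bus/platform/drivers/ci_hdrc}"
    delay="${3:-1}"   # assumed settle time, not taken from the post

    if [ ! -e "$drv/unbind" ] || [ ! -e "$drv/bind" ]; then
        echo "driver directory $drv has no bind/unbind files" >&2
        return 1
    fi

    # Detach the controller instance from the ci_hdrc driver.
    printf '%s\n' "$dev" > "$drv/unbind" || return 1
    sleep "$delay"
    # Reattach it so the driver probes the controller again.
    printf '%s\n' "$dev" > "$drv/bind" || return 1
}
```

Calling `rebind_usb_controller` with no arguments targets `ci_hdrc.1`. The driver directory is a parameter mainly so the function can be tried against a scratch directory before touching real hardware.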
