Hardware / Software Summary:
Our application required more USB Host ports than were implemented by the i.MX7 SABRE board, so the Microchip USB3503 Hub was included to add the two additional ports that we needed. The guidelines for hardware design for both the i.MX7D and the USB3503 were taken into account and the critical routing of the DATA and STROBE signals was implemented to be less than 1" with their length matched to within just a little more than 5 mils. Most boards work correctly 99% of the time, but once in a while I have seen a system fail to enumerate the USB Hub. In these cases, a reboot (equivalent to a cold boot because the PMIC is forced off then turned back on again) will result in the USB Hub enumerating correctly.
This was true until I discovered two boards that fail 99.9% of the time on a cold boot / reboot. With these boards, I actually have something to work with in trying to find the root cause. First, here is how the issue shows up in the serial console output:
ci_hdrc ci_hdrc.2: EHCI Host Controller
ci_hdrc ci_hdrc.2: new USB bus registered, assigned bus number 2
ci_hdrc ci_hdrc.2: USB 2.0 started, EHCI 1.00
usb usb2: New USB device found, idVendor=1d6b, idProduct=0002
usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
usb usb2: Product: EHCI Host Controller
usb usb2: Manufacturer: Linux 4.1.15+ ehci_hcd
usb usb2: SerialNumber: ci_hdrc.2
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 1 port detected
usb 2-1: new high-speed USB device number 2 using ci_hdrc
usb 2-1: device no response, device descriptor read/64, error -71
usb 2-1: device no response, device descriptor read/64, error -71
usb 2-1: new high-speed USB device number 3 using ci_hdrc
usb 2-1: device no response, device descriptor read/64, error -71.......
A normal system with a USB Flash drive on a USB Hub downstream port looks like this:
ci_hdrc ci_hdrc.2: EHCI Host Controller
ci_hdrc ci_hdrc.2: new USB bus registered, assigned bus number 2
ci_hdrc ci_hdrc.2: USB 2.0 started, EHCI 1.00
usb usb2: New USB device found, idVendor=1d6b, idProduct=0002
usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
usb usb2: Product: EHCI Host Controller
usb usb2: Manufacturer: Linux 4.1.15+ ehci_hcd
usb usb2: SerialNumber: ci_hdrc.2
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 1 port detected
usb 2-1: new high-speed USB device number 2 using ci_hdrc
usb 2-1: New USB device found, idVendor=0424, idProduct=3503
usb 2-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
hub 2-1:1.0: USB hub found
hub 2-1:1.0: 2 ports detected
usb 2-1.2: new high-speed USB device number 3 using ci_hdrc
usb 2-1.2: New USB device found, idVendor=0781, idProduct=5595
usb 2-1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
usb 2-1.2: Product: Ultra USB 3.0
usb 2-1.2: Manufacturer: SanDisk
usb 2-1.2: SerialNumber: 4C531001641115101474
usb-storage 2-1.2:1.0: USB Mass Storage device detected
.......
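For anyone who wants to tally failures over many boot cycles, here is a minimal sketch (not part of the original debugging) that scans captured console output for the two signatures shown above. The helper name classify_boot and the sample strings are made up for illustration; the only facts relied on are the log lines quoted in this post and that errno 71 is EPROTO ("Protocol error") on Linux.

```python
import re

# On Linux, errno 71 is EPROTO; the kernel log prints it negated as "error -71".
EPROTO = 71

# Signatures taken from the console output quoted in the post.
HUB_FOUND = re.compile(r"New USB device found, idVendor=0424, idProduct=3503")
DESC_ERROR = re.compile(r"device descriptor read/64, error -(\d+)")


def classify_boot(console_log):
    """Summarize one boot's console output (hypothetical helper)."""
    hub_enumerated = False
    eproto_errors = 0
    for line in console_log.splitlines():
        if HUB_FOUND.search(line):
            hub_enumerated = True
        match = DESC_ERROR.search(line)
        if match and int(match.group(1)) == EPROTO:
            eproto_errors += 1
    return {"hub_enumerated": hub_enumerated, "eproto_errors": eproto_errors}


# Shortened excerpts of the failing and normal logs above.
bad = """\
usb 2-1: new high-speed USB device number 2 using ci_hdrc
usb 2-1: device no response, device descriptor read/64, error -71
usb 2-1: device no response, device descriptor read/64, error -71
"""
good = """\
usb 2-1: new high-speed USB device number 2 using ci_hdrc
usb 2-1: New USB device found, idVendor=0424, idProduct=3503
hub 2-1:1.0: 2 ports detected
"""

print(classify_boot(bad))
print(classify_boot(good))
```

Feeding it one captured log per cold boot gives a simple pass/fail count per board, which is easier to compare than reading the raw output.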
Over the last 2 to 3 weeks I have taken the following steps to isolate the cause of the problem:
ci_hdrc ci_hdrc.2: detected XactErr len 0/8 retry 1
Looking deeper into ehci-hcd.c and adding some visibility into which EHCI register read / write operations are being done, I found that on a "bad board" the UEI bit of the USB2_USBSTS register is being set, indicating that a USB error interrupt has been detected. The following debug output (non-standard code added by me) shows the first few register read / write operations when it tries to enumerate the USB3503 Hub.
On a "bad board":
usb 2-1: new high-speed USB device number 2 using ci_hdrc
ehci_readl: 0xf5b30144 = 0x00000080
ehci_writel: 0xf5b30140 = 0x00010b25
ehci_readl: 0xf5b30140 = 0x00010b25
ehci_readl: 0xf5b30144 = 0x00008082 <<-- read of USB2_USBSTS with UEI bit set
ehci_writel: 0xf5b30144 = 0x00000002 <<-- write of USB2_USBSTS to clear UEI bit
ehci_readl: 0xf5b30140 = 0x00010b25
ci_hdrc ci_hdrc.2: detected XactErr len 0/8 retry 1
ehci_readl: 0xf5b30144 = 0x00008082
ehci_writel: 0xf5b30144 = 0x00000002
ehci_readl: 0xf5b30140 = 0x00010b25
On a normally good board or during a resume from suspend:
usb 2-1: new high-speed USB device number 6 using ci_hdrc
ehci_readl: 0xf5b30144 = 0x00000080
ehci_writel: 0xf5b30140 = 0x00010b25
ehci_readl: 0xf5b30140 = 0x00010b25
ehci_readl: 0xf5b30144 = 0x00048081
ehci_writel: 0xf5b30144 = 0x00000001
ehci_readl: 0xf5b30140 = 0x00010b25
usb 2-1: usb_start_wait_urb length=18, retval=0
ehci_readl: 0xf5b30184 = 0x0a001205
ehci_writel: 0xf5b30184 = 0x0a001301
ehci_readl: 0xf5b30140 = 0x00010b25
usb usb2: usb_start_wait_urb length=0, retval=0
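To make the two USBSTS values above easier to compare, here is a small sketch that expands a raw register value into named bits. Bits 0 to 5 and 12 to 15 follow the USBSTS layout in the EHCI specification. The names for bits 7 and 18 are ChipIdea extensions as I understand them and are marked as assumptions; please check them against the i.MX7D reference manual before relying on them.

```python
# Bits 0-5 and 12-15: standard EHCI USBSTS layout.
# Bits 7 and 18: ChipIdea-specific, ASSUMED here, verify in the reference manual.
USBSTS_BITS = {
    0: "UI (USB interrupt, transfer completed)",
    1: "UEI (USB error interrupt)",
    2: "PCI (port change detect)",
    3: "FRI (frame list rollover)",
    4: "SEI (host system error)",
    5: "AAI (interrupt on async advance)",
    7: "SRI (SOF received) [assumed]",
    12: "HCH (host controller halted)",
    13: "RCL (reclamation)",
    14: "PS (periodic schedule status)",
    15: "AS (asynchronous schedule status)",
    18: "UAI (host async interrupt) [assumed]",
}


def decode_usbsts(value):
    """Return the names of all bits set in a USBSTS value, lowest bit first."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            names.append(USBSTS_BITS.get(bit, "bit %d (not decoded here)" % bit))
    return names


for label, raw in (("bad board", 0x00008082), ("good board", 0x00048081)):
    print(label, hex(raw))
    for name in decode_usbsts(raw):
        print("   ", name)
```

With this decoding, the only difference in the low bits is that the bad board reports UEI (bit 1) where the good board reports UI (bit 0), which matches the XactErr message printed by the driver.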
At this point, I've run out of places to look for new information, so I'm stuck and need someone's deep insight into what can trigger this error and particularly why it only happens on a cold boot / reboot. What is different for a suspend / resume than what is done on a cold boot / reboot? Why does the interface to the USB3503 work flawlessly if you can get past the initial enumeration, but that part is variable from board to board and between boot cycles?
Thanks,
Bill Gessaman
We have encountered a similar problem at my company, with the same enumeration failure and the same error message from the kernel. We found this thread particularly insightful. We use the same hub chip (USB3503) connected to the HSIC interface of an i.MX8QM processor. We have noticed that some variations in the electronics (physical layout of the wires) may lead to this problem. Here are our suggestions on how to investigate the issue further.
- Check the quality of the strobe and data signals. Because it's high speed data transmission, these wires are especially sensitive to interference.
- To improve signal quality, look at the pad configuration of the data and strobe signals in the device tree. Assuming you have the correct pin mux configuration, you could try different values of DSE (drive impedance). We made a slightly different layout of the wires in a newer version of the HW, which led to poorer signal quality. Decreasing the drive impedance improved the quality again, and the enumeration problem disappeared. Praise goes to a brilliant colleague of mine. Hope you find these suggestions helpful.
Bill and All,
Thanks for sharing the info.
I have another question. Does anyone know the maximum number of USB endpoints on the i.MX7D?
A USB endpoint may be used for Control, Data-In or Data-Out, as described in the following link.
https://docs.microsoft.com/en-us/windows-hardware/drivers/usbcon/usb-endpoints-and-their-pipes
If we add too many devices (e.g. modem, USB flash disk, ...), some of them may not work because there are not enough endpoint resources.
I studied the i.MX 7Dual Datasheet, and it mentions two high-speed USB 2.0 OTG ports and one high-speed USB 2.0 host. It does not mention a maximum number of USB endpoints. I also studied the i.MX 7Dual Reference Manual and found "Up to 32 elements" mentioned in Figure 11-89, End Point Queue Head Diagram. I am not sure whether this is the maximum number of USB endpoints on the i.MX7D.
Regards,
Jordan
Bill and others-
Thank you kindly for your methodical debug/logging; you saved me much time.
I've been able to get the USB3503 <> MX6Q (feels like the same IP as MX7) to work reliably. It seems to point towards order of device bring-up.
The following changes were required:
This approach will yield the same 'error -71' messages that Bill saw during the time BETWEEN MX6 HSIC probe complete and the completion of the user-space application that brings up the USB3503. After the user-space application is complete you should see the USB3503 properly enumerate and function properly.
best
Tim
Folks-
After some further testing the above method still may encounter failures. I want to share the results of further debugging that I believe fully addresses the issue.
Recommendations:
Update PORTSC settings:
It appears there is an error in Marvell HSIC PHYs that is corrected for in the kernel. This appears to significantly mess up MX6|7 <=> USB3503. Navigate to drivers/usb/chipidea/host.c. Find the comment line that starts with "Marvell 28nm HSIC PHY"...
From here there are 2 options:
After updating to this approach I no longer see any of the mentioned "error -71..." errors. I have yet to see any errors/failure to enumerate after a few hundred boots.
best
Tim
Allen-
Add this to your top:
usbnophy0: usbphy@0 { compatible = "usb-nop-xceiv"; };	/* for Controller Core 2 */
usbnophy1: usbphy@1 { compatible = "usb-nop-xceiv"; };	/* for Controller Core 3 */
Down in your I2C DTS portion, Include (only Controller Core 2, CC3 is similar):
im2: i2c@2 {
	#address-cells = <1>;
	#size-cells = <0>;
	reg = <2>;
	usbhub2: usb3503@08 {
		compatible = "smsc,usb3503";
		reg = <0x08>;
		initial-mode = <1>; // HUB
		reset-gpios = <&gpio6 4 GPIO_ACTIVE_LOW>;
		status = "okay";
	};
};
And update usbh2 (CC2 only):
&usbh2 {
	pinctrl-names = "idle", "active";
	pinctrl-0 = <&pinctrl_usbh2_1>;
	pinctrl-1 = <&pinctrl_usbh2_2>;
	phy_type = "hsic";
	dr_mode = "host";
	fsl,usbphy = <&usbnophy0>;
	fsl,anatop = <&anatop>;
	status = "okay";
};
Unfortunately the SOM we are using does not have an I2C connection to the USB3503 controller. I have verified the "compatible" settings and that our settings match those under &usbh2 in your last comment. Additionally, I have added the check to skip the hw_port_test_set calls when an iMX device is in use. I am still experiencing the error detailed in the original post.
Do you believe the i2c connection is critical to the solution? If so what leads you to suspect that?
Thanks
Allen-
The usb3503 driver resets the device and sets reasonable defaults. I can't say whether it's essential or not, as my board has the connection. Unfortunately I don't have time to test without loading the driver module.
Is it possible for you to use some test points to connect the MX6|7 to the 3503?
Bill,
Thanks for the details in your message. As usual, you only pass on the real hard questions :)
In your notes above, you indicated that a comparison with a known good board shows no difference. Have you compared the transactions to the spec? The timing may be a bit off from spec, and the parts tolerate this on most systems but not all.
Also, to answer one of your questions about the differences between a power up and a resume from a power down mode: the main difference is the reset sequence. For power up, there is a full POR sequence going on, and for a resume, much of the POR is assumed completed. So the handshake from the hub device seems to be seen on the resume, and not on power up.
Check the timing of the reset and handshake of the hub device.
thanks,
mark
Hi Mark,
I do my best to avoid wasting people's time with mundane questions. I was pretty happy to see you pop in on this post as you have always found the answers to the hard questions!
I agree with you that it feels like the "timing" may be off and some systems just manage to tolerate it better than others. When I think worst case, this timing would be the timing of the DATA and STROBE signals of the HSIC interface, but scoping these signals while not being intrusive, then being sure that they are meeting the spec is quite challenging. Were you referring to the timing of these two signals, or some more general timing like the relationship of the reference clock, reset or connect signals to the external Hub?
The thing that seems to make the interface work is to put the USB3503 Hub through a reset cycle after all of the i.MX7 USB initialization is complete. At power up the USB3503 gets initialized about 1 second before the HSIC Host interface in the i.MX7 gets initialized. When doing a resume from LPSR power down mode, the i.MX7 peripheral registers get reloaded from the copy in internal RAM very early on, then the usbmisc_imx driver detects that power was lost so it runs an init function, and lastly the USB3503 Hub driver resume function gets called to put it back into its connected state. This may be important, and we may be looking at a side effect of how we constructed our Linux device tree, since we didn't find any examples of other reference designs that actually implemented USB via an HSIC interface. I'm going to see if I can figure out how to insert a reference to the USB3503 in the "usbh" section of the device tree so that the USB3503 probe function gets deferred until after the "usbh" device probe is complete.
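As an illustration only, the "reset cycle after host init" experiment could be sketched from user space roughly like this. Everything specific here is an assumption: the sysfs GPIO path is hypothetical, the hold and settle times are placeholders rather than USB3503 datasheet values, and the sysfs route only works if no kernel driver has already claimed the reset GPIO. It reproduces the sequence described above and is not a confirmed fix.

```python
import time
from pathlib import Path


def pulse_hub_reset(value_file, hold_s=0.01, settle_s=0.1):
    """Drive an active-low reset line low, then release it.

    value_file: path to a GPIO "value" file (hypothetical, board specific).
    hold_s, settle_s: placeholder delays, NOT taken from the hub datasheet.
    """
    path = Path(value_file)
    path.write_text("0")  # assert reset (line is active low)
    time.sleep(hold_s)
    path.write_text("1")  # release reset so the hub can start up again
    time.sleep(settle_s)


# Example call, to be run only after the HSIC host controller has probed.
# The GPIO number below is made up; substitute the one wired to the hub reset.
# pulse_hub_reset("/sys/class/gpio/gpio164/value")
```

Running it from an init script after the ci_hdrc messages appear would give a quick way to test whether ordering alone changes the failure rate on a given board.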
Thanks,
Bill
Bill,
It seems that you have the root cause figured out here. Do you need my help with the dtsi file so the peripherals are initialized in the proper order?
Please let me know,
mark
Mark,
I don't really feel like there is a known root cause at this time. There is circumstantial evidence that suspend / resume corrects the state of the HSIC interface to the external Hub, but there is no real "smoking gun". I've been unable to work on this issue full time the last few days, but I do have some updates on things that I tried.
I intend to keep digging but will be out of office until next Wednesday.
Thanks,
Bill
Bill,
In discussing this with the BSP team, they indicate that the order of execution is NOT controlled in any way by the DTS file. This file is used for the addresses of the devices. So, making changes here will not change the order of execution, which you have already shown.
Thanks for the data point that the board still fails even when the i.MX init has completed prior to the USB3503. This is somewhat puzzling. I will continue to discuss this here, and if I come up with anything pertinent, I will let you know. Please keep me informed on your progress.
thanks,
mark
Mark,
My apologies for the number of questions in this post, but I have found no solution yet. First I would like to address the prior topic of ordering of peripherals in the DTSI file. In our application, we have two USB host ports by using the peripherals "usbotg2" and "usbh" as shown in the standard imx7d.dtsi file, and when Linux starts up they become "usb1" and "usb2" respectively. If I alter the order of these peripherals in the imx7d.dtsi file as shown below, then "usbh" becomes "usb1" and "usbotg2" becomes "usb2". The initialization shown in dmesg does show usb1 always getting initialized first and then usb2, so this maneuver does successfully swap the order of initialization.
usbh: usb@30b30000 {
	compatible = "fsl,imx7d-usb", "fsl,imx27-usb";
	reg = <0x30b30000 0x200>;
	interrupts = <GIC_SPI 40 IRQ_TYPE_LEVEL_HIGH>;
	clocks = <&clks IMX7D_USB_CTRL_CLK>;
	fsl,usbphy = <&usbphy_nop3>;
	fsl,usbmisc = <&usbmisc3 0>;
	phy_type = "hsic";
	dr_mode = "host";
	phy-clkgate-delay-us = <400>;
	status = "disabled";
};

usbotg1: usb@30b10000 {
	compatible = "fsl,imx7d-usb", "fsl,imx27-usb";
	reg = <0x30b10000 0x200>;
	interrupts = <GIC_SPI 43 IRQ_TYPE_LEVEL_HIGH>;
	clocks = <&clks IMX7D_USB_CTRL_CLK>;
	fsl,usbphy = <&usbphy_nop1>;
	fsl,usbmisc = <&usbmisc1 0>;
	phy-clkgate-delay-us = <400>;
	status = "disabled";
};

usbotg2: usb@30b20000 {
	compatible = "fsl,imx7d-usb", "fsl,imx27-usb";
	reg = <0x30b20000 0x200>;
	interrupts = <GIC_SPI 42 IRQ_TYPE_LEVEL_HIGH>;
	clocks = <&clks IMX7D_USB_CTRL_CLK>;
	fsl,usbphy = <&usbphy_nop2>;
	fsl,usbmisc = <&usbmisc2 0>;
	phy-clkgate-delay-us = <400>;
	status = "disabled";
};
I don't want to get into an argument with your BSP team (because I desperately need their help right now!), but I do want to be clear about what I have seen. Device tree is a mysterious beast and it is easy to get unpredictable results.
Now the questions:
Enough for now - I'm not sure exactly what to look into next but I intend to keep digging until I figure this out.
Thanks,
Bill
Bill,
I would like to try and answer your questions here:
1) We do not have a customer reference design. We usually have smd connectors on our board on the hsic stubs. This is used to verify the timing and diagrams of the outputs.
2) Yes, we do have customers using this interface.
3) My initial findings indicate the controller for the 7D is the same as the imx6SX; the PHYs are different due to the different process nodes.
4 & 5) I need to pass this onto the software driver engineers to answer.
Regarding the dts file changing the order of initialization: while your testing does show this, some of our experts indicate that it will not be deterministic, since the kernel can send the tasks to different cores at different times. If you would like, I can have your dts file examined by our software experts here. I know some of your information is confidential, so you can send it to my NXP email address rather than posting it here (mark.ruthenbeck@nxp.com).
Also, on your HSIC hub chip: I have not pulled up the datasheet (hence my questions). Is there a POR or other reset line that may be timed differently on the boards that fail?
Let me know,
mark
Hi Mark,
I agree that I probably got lucky with the dts file change that I made, but the important part is that the result happened to change the initialization order without it fixing the issue so I was able to determine that the order was not the issue.
The USB3503 does have an active low reset that is controlled by an i.MX7 GPIO output. This is how the system puts the hub into "standby" to get low power consumption during suspend. There is also an i.MX7 GPIO output that controls the HUB_CONNECT pin on the hub. The USB3503 driver can be configured to initiate the transition to Hub Communication Stage where the enumeration and subsequent USB traffic happens. We have both the HUB_CONNECT GPIO as well as the I2C interface which can also be used to trigger the state transition, but I haven't been able to demonstrate any difference based on which one is used.
I'm going to send you an email which will include a copy of our device tree file. One area where we have always had issues is properly defining reference clocks to be output by the i.MX7 and used by external components. The USB3503 is a case of this in that we route 24 MHz out of the i.MX7 to the USB3503. Our usual challenge is having the clock come on early enough and then keeping the Linux power management from turning it back off because it thinks there are no consumers. In fact, we have included "clk_ignore_unused" on the Linux command line to avoid this power management optimization getting in our way.
Thanks,
Bill
Hi Bill,
I am facing the exact same issue with our custom IMX7 dual board and HSIC USB3503 chip. Have you resolved this issue? Can you share the solution with us?
Thanks and regards,
Gopinath S
Hi Gopinath,
I fully expected that someone would experience the same issue with the i.MX7D HSIC interface and the Microchip USB3503 Hub, so your question didn't come as much of a surprise.
We were never able to resolve the issue, so we replaced the USB3503 with a USB2513 that we had several years of experience with in a previous product. We had to abandon using the HSIC interface and used the USB-OTG2 port on the i.MX7D to connect to the upstream port on the USB2513. We had been using only two downstream ports on the USB3503 Hub, so this was acceptable in that we now use all three downstream ports on the USB2513 Hub.
I had very good support from an FAE at Microchip, and Mark Ruthenbeck as always did everything humanly possible to get information from the technical team at NXP, but no evidence ever surfaced that made the i.MX7D or the USB3503 the prime suspect in the issue. It didn't help that only some circuit boards would fail to enumerate the USB3503, and then it only happened some of the time! There was a temperature sensitivity to the issue as well. Don't forget that it only happened on a cold start and would never happen on a system suspend / resume, which would re-enumerate the USB3503 Hub during the resume. Once the USB3503 had enumerated, its operation as a USB Hub was flawless with the i.MX7D.
My instincts were telling me that it was some critical initialization timing issue that might be related to getting a PLL settled properly, clock distribution or ?? The complexity of the Linux USB software hierarchy became the biggest challenge in trying to alter the timing of the USB3503 initialization and enumeration in a way that affected the issue. I really hate to leave issues like this unresolved but the timeline to get our product production ready became more important. I'm wishing you much success in finding a solution!
I really do wish that NXP reference designs made it possible to prove out the USB HSIC interface with a real circuit like the USB3503. We might have been able to avoid getting in this deep before finding out that there is some sort of system issue here.
Best Regards,
Bill Gessaman