Hi,
I would like some advice on a USB issue that we are struggling with. We are using BSP 4.19.35.
The modem that is connected to the i.MX8qxp via a USB hub sometimes misbehaves and we see 'USB transaction errors' for all transfers in the xhci traces. At least we think it is the modem that is misbehaving. The modem is the only device connected to the hub.
The 'USB transaction error' causes a loop in which the xhci driver seems to reset the endpoint, set the TR dequeue pointer and then start a new transfer, which also fails, and then the loop repeats itself. This generates a lot of events on the event ring, which causes "xhci-cdns3: ERROR unknown event type 37" spam, and as a consequence almost all CPU time is spent executing xhci interrupts. This sometimes leads to RCU stalls and sometimes to kernel panics. The RCU stalls have been discussed in https://community.nxp.com/thread/524191.
When we end up in a loop where all transfers fail, we have also discovered that the loop cannot be stopped by pulling the reset pin of the hub. We can see the TRB_PORT_STATUS event being handled in the xhci_handle_event() function, but the loop continues despite there no longer being anything connected to USB.
The only thing we can do to get out of this situation is either to unbind the driver or reboot the board.
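For anyone who wants to script the unbind workaround, here is a minimal Python sketch. It only assumes the standard sysfs `unbind` file that drivers expose under /sys/bus/<bus>/drivers/<driver>/. The driver and device names are hypothetical placeholders and have to be looked up on the actual board, and the `sysfs_root` parameter is there only so the function can be tried against a scratch directory.

```python
from pathlib import Path


def unbind_device(driver, device, bus="platform", sysfs_root="/sys"):
    """Write `device` to the driver's sysfs `unbind` file.

    `driver` and `device` are placeholders: look up the real names
    under <sysfs_root>/bus/<bus>/drivers/ on the target board.
    Needs root privileges when run against the real /sys.
    """
    unbind = Path(sysfs_root) / "bus" / bus / "drivers" / driver / "unbind"
    if not unbind.exists():
        # Wrong driver name, wrong bus, or driver not loaded.
        raise FileNotFoundError(f"no unbind file at {unbind}")
    unbind.write_text(device)
    return unbind


# Hypothetical usage (names are NOT taken from this thread):
#   unbind_device("some-usb-driver", "some-device-name")
```

Writing the same device name to the neighbouring `bind` file would attach the driver again, which may be quicker than rebooting the board.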
What can we do to prevent the i.MX8 from becoming unreliable when a USB device is misbehaving?
Please find attached an extract of an xhci-cdns3 trace showing the symptoms.
BR,
Jonas
Hi Jonas Karlsson,
Could you try attached patch and see how it goes? This patch should fix the error:
xhci-cdns3: ERROR unknown event type 37
Best regards,
Danwei
Hi danweiluo,
We tried this patch earlier and saw that it reduced the frequency of the "xhci-cdns3: ERROR unknown event type 37" message by a factor of 30, from 150 per second to 5 per second, when getting USB transaction errors.
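In case someone wants to compare message rates before and after a patch, here is a rough Python sketch that counts the "unknown event type 37" lines per second in a saved kernel log. It assumes printk timestamps are enabled, so that lines start with "[ seconds.micros]"; the function name is made up for illustration. Keep in mind that the log may not contain one line per event, so this is only an indication of the event rate.

```python
import re
from collections import Counter

# Matches a timestamped kernel log line containing the error text, e.g.
#   [  123.456789] ... ERROR unknown event type 37
EVENT_RE = re.compile(r"^\[\s*(\d+)\.\d+\]\s.*ERROR unknown event type 37")


def events_per_second(log_lines):
    """Return {whole_second: count} for 'unknown event type 37' lines."""
    counts = Counter()
    for line in log_lines:
        match = EVENT_RE.match(line)
        if match:
            counts[int(match.group(1))] += 1
    return dict(counts)


# Usage, assuming the log was saved with `dmesg > kernel.log`:
#   with open("kernel.log") as f:
#       print(events_per_second(f))
```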
We also still see RCU stalls despite applying the last patch that you recommended in this thread: rcu_preempt self-detected stall on CPU and usb xhci event error 37
The behavior is hard to debug using trace and/or printk, since it seems to change when they are added.
BR,
Jonas
Hi Jonas Karlsson,
1. Could you also share the kernel log when you have USB ERROR and RCU stall crash?
2. Are you using i.MX8QXP MEK board or your own design board? If you are using MEK, could you share the steps to reproduce the issue, like the USB modem type and command you are using?
Best regards,
Danwei
Hi danweiluo,
I captured some kernel logs this morning. I got two kernel panics, and this time they seem to be similar. You will find the logs attached.
The kernel used has this commit reverted, as you recommended:
commit 077506972ba23772b752e08b1ab7052cf5f04511
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date: Mon Jul 9 13:47:30 2018 -0700
rcu: Make need_resched() respond to urgent RCU-QS needs
I have also applied this patch:
commit 2337d89627e33c72a44dc3f0a1b7651a657c855e
Author: Peter Chen <peter.chen@nxp.com>
Date: Fri Nov 15 18:50:00 2019 +0200
usb: host: xhci: update event ring dequeue pointer on purpose
I have also posted a question on the Linux-USB mailing list and Greg K-H has responded on that question. Please have a look at that here: https://marc.info/?l=linux-usb&m=158324886017513&w=2
BR,
Jonas
Hi danweiluo,
As a side note I can mention that we have found an easy way to generate an event storm that causes a lot of unknown event 37 prints. This scenario, however, has not led to any kernel panics so far. Only RCU stalls which it recovers from.
To start the event storm, we power up the modem and wait until it has enumerated and the ACM ttys have been created. Then we power off the VUSB supply to the modem. This supply feeds the USB port of the modem. This immediately causes an "event storm". When printing the event ring content we see a lot of USB transaction errors. We also see high CPU load in irq context.
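To put a number on the CPU load in irq context, one option is to sample the aggregate "cpu" line of /proc/stat twice and look at the irq and softirq columns. Below is a small Python sketch of that idea. The function names are invented for illustration, and it assumes the usual /proc/stat column order (user, nice, system, idle, iowait, irq, softirq, steal, ...).

```python
def parse_cpu_line(stat_text):
    """Return the counters of the aggregate 'cpu' line as a list of ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def irq_share(stat_before, stat_after):
    """Fraction of CPU time spent in hardirq + softirq between two samples."""
    before = parse_cpu_line(stat_before)
    after = parse_cpu_line(stat_after)
    deltas = [b - a for a, b in zip(before, after)]
    # Only the first eight columns are summed; guest time is already
    # accounted for in the user and nice columns.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return (deltas[5] + deltas[6]) / total  # irq + softirq


# Usage on the target:
#   import time
#   first = open("/proc/stat").read()
#   time.sleep(1)
#   second = open("/proc/stat").read()
#   print(irq_share(first, second))
```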
BR,
Jonas
Hi danweiluo,
Another way to cause an event storm is to power up the modem and let it enumerate. Then I pull the reset pin on the usb hub and keep it in reset. This causes unknown event type 37 to be printed continuously. It never stops.
According to the hub datasheet the reset causes the PHYs on the hub to be disabled, and the differential pairs will be in a high-impedance state.
Disabling power management for usb on the host side as suggested by Peter Chen in this thread does not make a difference. The event storm continues. https://marc.info/?l=linux-usb&m=158345830319993&w=2
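For reference, one common way to turn off runtime power management for a USB device is to write "on" to its sysfs `power/control` file ("auto" turns it back on). Whether this is exactly the method suggested in the linked mailing-list thread is an assumption on my side. A minimal Python sketch follows; the device name is a hypothetical placeholder, and `sysfs_root` exists only for trying it against a scratch directory.

```python
from pathlib import Path


def set_usb_power_control(device, mode="on", sysfs_root="/sys"):
    """Write `mode` to <sysfs_root>/bus/usb/devices/<device>/power/control.

    "on"   keeps the device always powered (runtime PM disabled).
    "auto" lets the kernel runtime-suspend the device.
    `device` is a placeholder such as a bus-port name; check
    /sys/bus/usb/devices/ on the target for the real one.
    """
    if mode not in ("on", "auto"):
        raise ValueError("mode must be 'on' or 'auto'")
    control = (Path(sysfs_root) / "bus" / "usb" / "devices"
               / device / "power" / "control")
    if not control.exists():
        raise FileNotFoundError(f"no power/control file at {control}")
    control.write_text(mode)
    return control


# Hypothetical usage (device name not taken from this thread):
#   set_usb_power_control("1-1", "on")
```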
When I power down vusb to the modem, as I mentioned in the last post, the usb host normally disconnects the device after a while.
BR,
Jonas
Hi Jonas Karlsson,
I saw that the issue seems to have been solved with help from Oliver Neukum:
USB transaction errors causing RCU stalls and kernel panics
It also seems that reverting the RCU commit is not needed.
I've made a patch set according to your tests in that thread, and I would like to confirm with you that these are the patches that solved your issue.
Thanks for the help!
Best regards,
Danwei
Hi danweiluo,
The patches from Oliver Neukum seem to stop the event spam that we have experienced. However, as I understand it, Oliver also says that these patches should not be needed, which indicates that the host controller does not behave as it should.
I think NXP should look into this together with the IP vendor of the host controller in the i.MX8 (Cadence).
Oliver has not commented on my last test result. I'm not sure if Oliver thinks this patch should be included or not:
commit 7c8f7af078a4eda73f347667d12584736e613062
Author: Oliver Neukum <oneukum@suse.com>
Date: Thu Mar 5 11:16:02 2020 +0100
cdc-acm: close race betrween suspend() and acm_softint
I will do some further testing today and hopefully Oliver gives his opinion on what patches should be included.
I will get back to you when I know more.
BR,
Jonas
Hi,
I have told Oliver that the patchset seems to work fine. According to Oliver the "cdc-acm: close race betrween suspend() and acm_softint" patch will also be included. They will go through review and will be pushed upstream.
Will you start an investigation on why the host controller needs these patches?
BR,
Jonas
Hi Jonas Karlsson,
According to feedback from R&D, this issue is not related to our driver; it belongs to the common USB CDC driver.
Best regards,
Danwei
Hi @danwei_luo,
The same issue that @jonas_karlsson faced, "ERROR unknown event type 37", is happening on the i.MX8MP EVK too. We connect a USB video capture device to the USB port, and we cannot get the desired resolution. I also posted the question in another thread.
We have already tried applying all the patches mentioned here, still with no success.
We believe that there is an issue in the NXP USB host controller hardware or in the USB host driver code.
We have struggled with this for a long time, so please help.
Thanks in advance.
Hi @jeffson ,
The issue discussed in this thread is about USB errors causing kernel panics. It may not have the same root cause as your issue. I suggest you continue to follow the thread https://community.nxp.com/t5/i-MX-Processors/Can-not-get-the-maximum-frame-rate-on-IMX8mp-evk-device... since this thread is already closed.
Best regards,
Danwei
We also saw the same kernel panic; I just did not mention it. I think our issue is very close to this one.
Can you tell me how you managed to solve this issue?
Hi @jeffson ,
Please check the previous reply in this thread; the previous issue was solved by the attached patch set.
I saw that the issue seems to have been solved with help from Oliver Neukum:
USB transaction errors causing RCU stalls and kernel panics
It also seems that reverting the RCU commit is not needed.
I've made a patch set according to your tests in that thread, and I would like to confirm with you that these are the patches that solved your issue.
Best regards,
Danwei
Hi Jonas Karlsson,
Thanks for the info. I've created a JIRA ticket for the USB owner and am waiting for feedback.
I'll keep you updated.
Best regards,
Danwei
Hi danweiluo,
1. I only have kernel logs saved from crashes that occurred before applying the xhci patch that updates the dequeue pointer and before reverting the RCU stall patch. I will gather new logs and send them to you as soon as possible.
2. We have a custom board. We have not tried to reproduce the error on the MEK board.
danweiluo can you comment here?