Dear NXP Team,
We are working on kernel 4.14.98 (GA BSP) on iMX8DX dual core NXP processor. We are using this device to connect to LTE using xhci interface via usb using pppd. Currently we are facing two issues
RCU stall warnings and hence system hangs after seeing crashes. The issue occurrence is random but exists. Sometimes we observe RCU stall warnings immediately after device boots up and sometimes we observe after running the device for longs hours (> 12 hours). This issue was not present when we used L4.14.62 Beta BSP.
USB Event Ring Full error. “xhci-cdns3: ERROR unknown event type 37”. We get continuous prints in our serial console making system unstable after keeping devices for long hours.
One the above issues is being produced when the device boots up using GA BSP L4.14.98.
We have tried applying the patch as mentioned in below link for fair.c file in kernel for RCU crash.
For USB event error, below are patches used. However we observe LTE ping command not working sometimes
https://patchwork.kernel.org/patch/10880073/
https://patchwork.kernel.org/patch/11165607/
With above patches, issue is getting reproduced less frequent, but still exist.
Can you please help us in how we can resolve the issue. How to debug and narrow down the root cause. Without resolving this issue, we are blocked.
Attached Logs for more information:
INFO: rcu_preempt self-detected stall on CPU
[ 1076.614300] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1076.616581] 0-...: (1 GPs behind) idle=e42/2/0 softirq=40213/40214 fqs=1438
[ 1076.622233] 0-...: (1 GPs behind) idle=e42/2/0 softirq=40213/40214 fqs=1438
[ 1076.629359]
[ 1076.636490]
[ 1076.636493] (t=5253 jiffies g=14886 c=14885 q=74)
[ 1076.636497] Task dump for CPU 0:
[ 1076.638075] (detected by 1, t=5253 jiffies, g=14886, c=14885, q=74)
[ 1076.639646] swapper/0 R running task 0 0 0 0x00000002
[ 1076.660972] Call trace:
[ 1076.663426] [<ffff000008089668>] dump_backtrace+0x0/0x3c8
[ 1076.668831] [<ffff000008089a44>] show_stack+0x14/0x20
[ 1076.673886] [<ffff0000080f5cf4>] sched_show_task+0x14c/0x180
[ 1076.679549] [<ffff0000080f6c48>] dump_cpu_task+0x40/0x50
[ 1076.684868] [<ffff000008127288>] rcu_dump_cpu_stacks+0x94/0xd4
[ 1076.690705] [<ffff00000812604c>] rcu_check_callbacks+0x64c/0x958
[ 1076.696716] [<ffff00000812a1f4>] update_process_times+0x2c/0x58
[ 1076.702643] [<ffff000008139660>] tick_sched_handle.isra.4+0x30/0x48
[ 1076.708913] [<ffff0000081396b8>] tick_sched_timer+0x40/0x90
[ 1076.714490] [<ffff00000812ae70>] __hrtimer_run_queues+0xe8/0x168
[ 1076.720502] [<ffff00000812b0f0>] hrtimer_interrupt+0xa8/0x230
[ 1076.726256] [<ffff0000085f2808>] arch_timer_handler_phys+0x28/0x48
[ 1076.732441] [<ffff000008119950>] handle_percpu_devid_irq+0x80/0x138
[ 1076.738711] [<ffff0000081141bc>] generic_handle_irq+0x24/0x38
[ 1076.744462] [<ffff00000811482c>] __handle_domain_irq+0x5c/0xb8
[ 1076.750299] [<ffff000008080fe0>] gic_handle_irq+0x78/0x174
[ 1076.755789] Exception stack(0xffff000008003d90 to 0xffff000008003ed0)
[ 1076.762236] 3d80: 0000000000000000 ffff000008d07880
[ 1076.770073] 3da0: 0000000000000000 000000010002e1a9 000000000000002e 0000000000000002
[ 1076.777909] 3dc0: 0000000000000000 0000000040000000 ffff8000121fc248 ffff000008003c70
[ 1076.785745] 3de0: 00000000960c0650 0000000000000001 0000000000000401 ffff000009205750
[ 1076.793581] 3e00: 0000000000000000 0000000000000000 ffff00000813e220 0000000000000000
[ 1076.801418] 3e20: 0000000000000000 ffff000008b4f000 000000000000001a ffff000008d07880
[ 1076.809254] 3e40: ffff000008b4f018 ffff800012405780 ffff000008d06000 000000010002e1a8
[ 1076.817090] 3e60: ffff000008004000 ffff000008890000 0000000000000001 ffff000008003ed0
[ 1076.824926] 3e80: ffff0000080d0340 ffff000008003ed0 ffff000008081184 0000000040000145
[ 1076.832762] 3ea0: ffff8000125ab0a4 ffff8000125ab000 0000ffffffffffff ffff8000125ab028
[ 1076.840597] 3ec0: ffff000008003ed0 ffff000008081184
[ 1076.845480] [<ffff000008082a30>] el1_irq+0xb0/0x124
[ 1076.850362] [<ffff000008081184>] __do_softirq+0xa4/0x218
[ 1076.855681] [<ffff0000080d0340>] irq_exit+0xd0/0xf0
[ 1076.860561] [<ffff000008114830>] __handle_domain_irq+0x60/0xb8
[ 1076.866399] [<ffff000008080fe0>] gic_handle_irq+0x78/0x174
[ 1076.871888] Exception stack(0xffff000008c83dd0 to 0xffff000008c83f10)
[ 1076.878327] 3dc0: 0000000000000000 0000000000000000
[ 1076.886164] 3de0: 0000000000000001 0000000000000000 ffff000008b53108 ffff000008c83f00
[ 1076.894001] 3e00: 000080003740d000 ffff80003ff65b60 ffff000008c952c0 ffff000008c83e90
[ 1076.901836] 3e20: 00000000000008e0 0000000000000000 0000000000000001 0000000000000000
[ 1076.909673] 3e40: 00000000f6045434 0000000000000000 ffff00000813e220 0000000000000000
[ 1076.917509] 3e60: 0000000000000000 ffff000008b4f018 ffff000008c8a000 ffff000008c8a000
[ 1076.925345] 3e80: ffff000008b57a30 ffff000008c8aa60 0000000000000000 0000000000000000
[ 1076.933181] 3ea0: ffff000008c94980 0000000000000400 0000000080cf0018 ffff000008c83f10
[ 1076.941017] 3ec0: ffff000008084efc ffff000008c83f10 ffff000008084f00 0000000000000145
[ 1076.948853] 3ee0: 0000000000000000 00000000bfe9eb84 ffffffffffffffff ffff0000081397dc
[ 1076.956688] 3f00: ffff000008c83f10 ffff000008084f00
[ 1076.961570] [<ffff000008082a30>] el1_irq+0xb0/0x124
[ 1076.966454] [<ffff000008084f00>] arch_cpu_idle+0x10/0x18
[ 1076.971770] [<ffff000008106f50>] do_idle+0x120/0x1e0
[ 1076.976739] [<ffff0000081071a8>] cpu_startup_entry+0x20/0x28
[ 1076.982406] [<ffff00000886b314>] rest_init+0xcc/0xd8
[ 1076.987377] [<ffff000008af0c1c>] start_kernel+0x3d4/0x3e8
[ 1076.992774] Task dump for CPU 0:
[ 1076.995997] swapper/0 R running task 0 0 0 0x00000022
[ 1077.003050] Call trace:
[ 1077.005497] [<ffff0000080853ac>] __switch_to+0x94/0xd8
Status update:
We checked the difference between L4.14.62 and L4.14.98 on kernel/rcu/, only one commit is added on L4.14.98:
commit 077506972ba23772b752e08b1ab7052cf5f04511
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date: Mon Jul 9 13:47:30 2018 -0700
rcu: Make need_resched() respond to urgent RCU-QS needs
After revert this commit and apply the USB patch, the RCU stall issue is fixed on L4.14.98.
Best regards,
Danwei
XinyuChen please continue with the follow up.
Hi,
We are also experiencing the same problem on iMX8QXP using the 4.14.98 (GA BSP). We have also tested the 4.19.35 BSP which does not solve the problem.
In our case we have a modem connected to USB.
We have also applied the patch [v2,1/1] usb: host: xhci: update event ring dequeue pointer on purpose - Patchwork
which reduces the amount of “xhci-cdns3: ERROR unknown event type 37”, but it does not solve the root cause. When we have applied the patch and the “xhci-cdns3: ERROR unknown event type 37” situation occurs we see that the cpu load increases significantly. According to 'top' the increased cpu load is in irq context. However, we do not see a significant increase in the amount of interrupts when looking in /proc/interrupts. When we unbind the usb port the cpu load goes back to normal.
Occasionally this issue has also led to kernel panics. All kernel panics that have been analyzed are preceded by a time out in the SCU communication. We see "Timeout for IPC response!" in the logs.
I have been trying to capture traces by enabling tracing in the kernel and then enabling the following:
echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
echo 1 > /sys/kernel/debug/tracing/events/cdns3/enable
echo "(irq == 30) || (irq == 27)" > /sys/kernel/debug/tracing/events/irq/irq_handler_entry/filter
echo "(irq == 30) || (irq == 27)" > /sys/kernel/debug/tracing/events/irq/irq_handler_exit/filter
echo 1 > /sys/kernel/debug/tracing/events/irq/enable
echo 1 > /sys/kernel/debug/tracing/tracing_on
Unfortunately I have not been able to read out the trace afterwards since I get a kernel panic every time the “xhci-cdns3: ERROR unknown event type 37” occurs when traces are enabled.
I will continue trying to capture traces to see what happens.
BR,
Jonas
From the log, seems the USB host runtime pm blocked for long time, please try to disable the runtime pm to try:
diff --git a/drivers/usb/cdns3/host.c b/drivers/usb/cdns3/host.c
index f6633de..aac6158 100644
--- a/drivers/usb/cdns3/host.c
+++ b/drivers/usb/cdns3/host.c
@@ -155,6 +155,7 @@ static int cdns3_host_start(struct cdns3 *cdns)
pm_runtime_set_active(dev);
pm_runtime_no_callbacks(dev);
pm_runtime_enable(dev);
+ pm_runtime_forbid(dev);host->hcd = __usb_create_hcd(&xhci_cdns3_hc_driver, sysdev, dev,
dev_name(dev), NULL);
Also, please check the top in background in loop before issue happen, see the irq/sirq loading.
Dear NXP Team,
Dear Xinyu,
Thanks for your reply.
We had tried applying patch (disable runtime pm of usb) as suggested from your side. Additionally, below are the list of patches that's applied
1. Downgraded USB driver from GA BSP (L 4.14.98 GA) to Beta BSP (L4.14.62)
2. [v2,1/1] usb: host: xhci: update event ring dequeue pointer on purpose - Patchwork
3. usb: xhci: Support running urb giveback in tasklet context - Patchwork
4. Reverted patch present in RCU (rcu: Make need_resched() respond to urgent RCU-QS needs)
commit 077506972ba23772b752e08b1ab7052cf5f04511
Observations:
1. Issue reproducing rate is decreased, but still exists.
Reason:
After fresh flashing, device was booted (Boot time ~40 seconds) and login prompt came. As next step, we had provided device a soft reboot from terminal using reboot command. Immediately upon coming of login prompt in next reboot, we had received RCU crash.
2. If the RCU crash is not reproduced, the device is capable to run for overnight. Max time we have observed is 12 hours. Longer hours still we need to test and see the behavior.
Is it possible to provide the full RCU stall warning log?
as this is general linux issue please post it on kernel mail list
Majordomo Lists at VGER.KERNEL.ORG
Best regards
igor
No, RCU stall can be caused by various of delay, live-lock in kernel/driver, etc.
Dear NXP team,
Thanks for your analysis and reply. I am attaching the log captured for crash occurred. This behavior we didn't observed when we used L4.14.62 Beta BSP.
Logs:
Hi Jeffin
such xhci error is described on
Re: kernel: xhci_hcd 0000:00:14.0: ERROR unknown event type 37 - Kernel 4.19.13 — Linux USB
Regarding "kernel 4.14.98 (GA BSP) on iMX8DX" - it is recommended to try latest rev.C0
production part (ended with "AC") and with L4.14.98_2.3.0 kernel (currently only kernel supporting C0)
Linux 4.14.98_2.3.0 Documentation
Best regards
igor
-----------------------------------------------------------------------------------------------------------------------
Note: If this post answers your question, please click the Correct Answer button. Thank you!
-----------------------------------------------------------------------------------------------------------------------