LS1043 rcu_sched self-detected stall on CPU


3,168 Views
saoan_ho
Contributor I

Hi All,

Our board is an LS1043 with 512 MB of DDR3L, and it reboots randomly.

Any help is greatly appreciated.

After seeing the crashes, we first disabled the watchdog. With the watchdog disabled, we see RCU stall warnings, after which the system hangs.

Sometimes the RCU stall warnings appear about one hour after the device boots, and sometimes only after the device has been running for many hours (> 12 hours).

We have tried many things, such as removing drivers, running memtester, and keeping the CPU busy. The issue still occurs, and its timing is still random.

However, I have observed that the stall occurs less often when CPU idle time is lower (CPU busy).

For example, with CPU idle > 90%, an RCU stall warning occurs within one hour.

With CPU idle < 10%, an RCU stall warning occurs only after more than 24 hours.

Could you please help us resolve this issue? How can we debug it and narrow down the root cause?

 

Logs attached for more information:

[285165.790740] INFO: rcu_sched self-detected stall on CPU
[285165.795988] 0-...: (1 GPs behind) idle=7a9/140000000000001/0 softirq=78438968/78438969 fqs=2
[285165.804689] (t=60000 jiffies g=34009425 c=34009424 q=50872)
[285165.810530] rcu_sched kthread starved for 60000 jiffies! g34009425 c34009424 f0x0 s3 ->state=0x1
[285165.819405] Task dump for CPU 0:
[285165.822716] memtester R running 0 25934 1 0x00000002
[285165.829167] Backtrace:
[285165.831710] [<8001aec4>] (dump_backtrace) from [<8001b0bc>] (show_stack+0x18/0x1c)
[285165.839368] r7:8058ac40 r6:8057c4c0 r5:9ac3d800 r4:00000000
[285165.845138] [<8001b0a4>] (show_stack) from [<80044458>] (sched_show_task+0xb4/0xd4)
[285165.852889] [<800443a4>] (sched_show_task) from [<80045778>] (dump_cpu_task+0x34/0x44)
[285165.860895] r5:00000000 r4:00000000
[285165.864570] [<80045744>] (dump_cpu_task) from [<80062aa0>] (rcu_dump_cpu_stacks+0x7c/0xa0)
[285165.872923] r5:00000000 r4:8058ac40
[285165.876597] [<80062a24>] (rcu_dump_cpu_stacks) from [<80065bd4>] (rcu_check_callbacks+0x22c/0x6ac)
[285165.885646] r9:00000000 r8:8057c504 r7:8057c100 r6:0000c6b8 r5:9fdb3080 r4:8058ac40
[285165.893507] [<800659a8>] (rcu_check_callbacks) from [<80067ec8>] (update_process_times+0x38/0x64)
[285165.902469] r10:00000001 r9:00000000 r8:805d5158 r7:00010312 r6:dcf44000 r5:00000001
[285165.910411] r4:9ac3d800
[285165.913035] [<80067e90>] (update_process_times) from [<80073cac>] (tick_periodic+0xa8/0xc0)
[285165.921475] r5:00000000 r4:8057c140
[285165.925148] [<80073c04>] (tick_periodic) from [<80073d54>] (tick_handle_periodic+0x90/0xa4)
[285165.933588] r5:00000000 r4:9fdb5580
[285165.937262] [<80073cc4>] (tick_handle_periodic) from [<802f18d4>] (arch_timer_handler_phys+0x30/0x40)
[285165.946571] r9:00023148 r8:00000011 r7:9b410880 r6:8059c1c4 r5:9b405cc0 r4:9fdb5580
[285165.954432] [<802f18a4>] (arch_timer_handler_phys) from [<8005de74>] (handle_percpu_devid_irq+0x70/0x8c)
[285165.964010] [<8005de04>] (handle_percpu_devid_irq) from [<8005a198>] (generic_handle_irq+0x20/0x30)
[285165.973145] r9:00023148 r8:9b408000 r7:80574fec r6:00000011 r5:00000000 r4:00000000
[285165.981006] [<8005a178>] (generic_handle_irq) from [<8005a4a0>] (__handle_domain_irq+0x94/0xbc)
[285165.989798] [<8005a40c>] (__handle_domain_irq) from [<800093b8>] (gic_handle_irq+0x58/0x9c)
[285165.998238] r9:00023148 r8:a0050000 r7:8059c2bc r6:9a35ffb0 r5:8057c760 r4:a004f000
[285166.006099] [<80009360>] (gic_handle_irq) from [<8000a748>] (__irq_usr+0x48/0x60)
[285166.013672] Exception stack(0x9a35ffb0 to 0x9a35fff8)
[285166.018811] ffa0: 00000000 00647348 ffefffff ffefffff
[285166.027083] ffc0: 00191cd2 75112000 76011808 00022f0c 003bfe02 00023148 00000001 0001869f
[285166.035352] ffe0: 00023034 7e957d48 00011d44 00011308 60000010 ffffffff
[285166.042054] r9:00023148 r8:30c5387d r7:30c5383d r6:ffffffff r5:60000010 r4:00011308
[285166.049912] Kernel panic - not syncing: RCU Stall
[285166.049912]
[285166.056272] CPU: 0 PID: 25934 Comm: memtester Tainted: P 4.4.74 #0
[285166.063843] Hardware name: Freescale LS1043A
[285166.068197] Backtrace:
[285166.070734] [<8001aec4>] (dump_backtrace) from [<8001b0bc>] (show_stack+0x18/0x1c)
[285166.078391] r7:8057c100 r6:60000193 r5:60000193 r4:00000000
[285166.084160] [<8001b0a4>] (show_stack) from [<8015cbdc>] (dump_stack+0x84/0xa4)
[285166.091476] [<8015cb58>] (dump_stack) from [<8002549c>] (panic+0x90/0x204)
[285166.098439] r5:9fdb3080 r4:8058ac40
[285166.102111] [<80025410>] (panic) from [<80065c34>] (rcu_check_callbacks+0x28c/0x6ac)
[285166.109943] r3:00000001 r2:10f5f520 r1:60000193 r0:8049a23e
[285166.115705] r7:8057c100
[285166.118329] [<800659a8>] (rcu_check_callbacks) from [<80067ec8>] (update_process_times+0x38/0x64)
[285166.127290] r10:00000001 r9:00000000 r8:805d5158 r7:00010312 r6:dcf44000 r5:00000001
[285166.135232] r4:9ac3d800
[285166.137856] [<80067e90>] (update_process_times) from [<80073cac>] (tick_periodic+0xa8/0xc0)
[285166.146296] r5:00000000 r4:8057c140
[285166.149969] [<80073c04>] (tick_periodic) from [<80073d54>] (tick_handle_periodic+0x90/0xa4)
[285166.158409] r5:00000000 r4:9fdb5580
[285166.162082] [<80073cc4>] (tick_handle_periodic) from [<802f18d4>] (arch_timer_handler_phys+0x30/0x40)
[285166.171392] r9:00023148 r8:00000011 r7:9b410880 r6:8059c1c4 r5:9b405cc0 r4:9fdb5580
[285166.179253] [<802f18a4>] (arch_timer_handler_phys) from [<8005de74>] (handle_percpu_devid_irq+0x70/0x8c)
[285166.188829] [<8005de04>] (handle_percpu_devid_irq) from [<8005a198>] (generic_handle_irq+0x20/0x30)
[285166.197965] r9:00023148 r8:9b408000 r7:80574fec r6:00000011 r5:00000000 r4:00000000
[285166.205826] [<8005a178>] (generic_handle_irq) from [<8005a4a0>] (__handle_domain_irq+0x94/0xbc)
[285166.214618] [<8005a40c>] (__handle_domain_irq) from [<800093b8>] (gic_handle_irq+0x58/0x9c)
[285166.223059] r9:00023148 r8:a0050000 r7:8059c2bc r6:9a35ffb0 r5:8057c760 r4:a004f000
[285166.230918] [<80009360>] (gic_handle_irq) from [<8000a748>] (__irq_usr+0x48/0x60)
[285166.238489] Exception stack(0x9a35ffb0 to 0x9a35fff8)
[285166.243629] ffa0: 00000000 00647348 ffefffff ffefffff
[285166.251900] ffc0: 00191cd2 75112000 76011808 00022f0c 003bfe02 00023148 00000001 0001869f
[285166.260170] ffe0: 00023034 7e957d48 00011d44 00011308 60000010 ffffffff
[285166.266871] r9:00023148 r8:30c5387d r7:30c5383d r6:ffffffff r5:60000010 r4:00011308
[285167.343488] SMP: failed to stop secondary CPUs
[285167.364376] Rebooting in 3 seconds..
[285171.436756] SMP: failed to stop secondary CPUs
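For anyone comparing several such logs, a minimal Python sketch for pulling the numbers out of the stall summary line may help. The field meanings in the comments follow my reading of the kernel's RCU stall-warning documentation and should be treated as assumptions; `parse_stall_line` is a hypothetical helper name, not part of any tool.

```python
import re

# Matches the summary line printed after "self-detected stall on CPU", e.g.
# "[285165.804689] (t=60000 jiffies g=34009425 c=34009424 q=50872)"
STALL_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+"
    r"\(t=(?P<t>\d+) jiffies g=(?P<g>\d+) c=(?P<c>\d+) q=(?P<q>\d+)\)"
)


def parse_stall_line(line):
    """Return the stall fields as a dict, or None if the line does not match."""
    m = STALL_RE.search(line)
    if m is None:
        return None
    return {
        "uptime_s": float(m.group("ts")),       # printk timestamp, in seconds
        "stall_jiffies": int(m.group("t")),     # jiffies since the grace period started
        "gp": int(m.group("g")),                # current grace-period number
        "completed": int(m.group("c")),         # last completed grace period
        "queued_callbacks": int(m.group("q")),  # RCU callbacks queued
    }


line = "[285165.804689] (t=60000 jiffies g=34009425 c=34009424 q=50872)"
info = parse_stall_line(line)
print(info)
print(round(info["uptime_s"] / 3600, 1))  # uptime in hours at the stall: 79.2
```

Converting `stall_jiffies` to seconds needs the kernel's CONFIG_HZ value, which is not given in this thread, so the sketch leaves it in jiffies.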

2 Replies

2,654 Views
yipingwang
NXP TechSupport

An idle CPU that is not receiving scheduling-clock interrupts is said to be "dyntick-idle".

The CONFIG_NO_HZ_IDLE=y Kconfig option causes the kernel to avoid sending scheduling-clock interrupts to idle CPUs, which is critically important both to battery-powered devices and to highly virtualized mainframes.
However, dyntick-idle mode adds work on the path to and from the idle loop. Therefore, systems with aggressive real-time response constraints often run CONFIG_HZ_PERIODIC=y kernels (or CONFIG_NO_HZ=n) in order to avoid degrading from-idle transition latencies.
The boot parameter "nohz=" can be used to disable dyntick-idle mode in a CONFIG_NO_HZ_IDLE=y kernel by specifying "nohz=off".

By default, the system boots with "nohz=on", enabling dyntick-idle mode.
For your case, please specify "nohz=off" to disable dyntick-idle mode.
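To confirm that the parameter actually reached the kernel, one option is to inspect the boot command line after the board comes up. The following is only a sketch: `nohz_setting` is a hypothetical helper, and the sample command line is made up for illustration. On the target you would pass in the contents of /proc/cmdline instead.

```python
def nohz_setting(cmdline):
    """Return the value of the last nohz= parameter, or None if it is absent."""
    value = None
    for token in cmdline.split():
        # Match only "nohz=", not related parameters such as "nohz_full=".
        if token.startswith("nohz="):
            value = token[len("nohz="):]
    return value


# Hypothetical command line, for illustration only.
sample = "console=ttyS0,115200 root=/dev/mmcblk0p2 rw nohz=off"

setting = nohz_setting(sample)
if setting == "off":
    print("dyntick-idle disabled via nohz=off")
elif setting is None:
    print("nohz= not given; the kernel default applies")
else:
    print("nohz=" + setting)
```

If the function returns None on the target, the bootloader's bootargs probably did not include the parameter.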


If the problem remains, please send me your Linux kernel configuration file so that I can investigate further.

2,654 Views
saoan_ho
Contributor I

This problem still occurs.

We applied a patch to remove NO_HZ_IDLE, but the RCU stall still occurs.

Please find attached our Linux Kernel configuration file.
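As a quick way to double-check which tick-related options a configuration file really sets, here is a minimal Python sketch. The `SAMPLE` text is hypothetical and is not our attached configuration; `tick_options` is likewise just an illustrative helper name.

```python
import re

# Tick-related Kconfig symbols worth checking in a 4.x kernel configuration.
TICK_OPTIONS = (
    "CONFIG_HZ_PERIODIC",
    "CONFIG_NO_HZ_IDLE",
    "CONFIG_NO_HZ_FULL",
    "CONFIG_NO_HZ",
    "CONFIG_HZ",
)


def tick_options(config_text):
    """Map each symbol to its value, "n" if explicitly unset, None if absent."""
    found = {name: None for name in TICK_OPTIONS}
    for raw in config_text.splitlines():
        line = raw.strip()
        m = re.fullmatch(r"(CONFIG_\w+)=(.*)", line)
        if m and m.group(1) in found:
            found[m.group(1)] = m.group(2)
            continue
        m = re.fullmatch(r"# (CONFIG_\w+) is not set", line)
        if m and m.group(1) in found:
            found[m.group(1)] = "n"
    return found


# Hypothetical configuration fragment, for illustration only.
SAMPLE = """\
CONFIG_HZ_PERIODIC=y
# CONFIG_NO_HZ_IDLE is not set
# CONFIG_NO_HZ_FULL is not set
CONFIG_HZ=100
"""

for name, value in tick_options(SAMPLE).items():
    print(name, value)
```

To use it on a real file, read the file's text and pass it to `tick_options` in place of `SAMPLE`.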

Thank you!
