Hi,
We are using IMX53 based custom board & linux 2.6.35 kernel.
We have some test scripts for interfaces, while running the script the system get stuck into soft lock.
The flow of test script is as below:
====================================================================
<test routine for interface> /* perform interface test*/
<can interface up> /* Making up the can interface*/
cansend $CAN_DEVICE -i $CAN_MSG /*send the interface test result over can*/
echo " print 1"
sleep 1 /* here the system goes into soft lock mode */
echo " print 2" /* This print never comes on debug port.*/
<can interface down> /* making down the can interface */
=====================================================================
If I comment the "sleep 1", the test scripts runs fine & system never goes into soft lock.
Please help me to debug this issue. Is the problem in our can driver?
How sleep is causing the soft lock?
or there is a bug in our kernel, softlock due to sleep command is a sideeffect of that bug?
After enabling the Debug print, I can see following oops message.
BUG: soft lockup - CPU#0 stuck for 61s! [sh:1113]
Modules linked in:
Pid: 1113, comm: sh
CPU: 0 Not tainted (2.6.35-ts-armv7l+ #85)
PC is at __do_softirq+0x4c/0x128
LR is at __do_softirq+0x38/0x128
pc : [<8006fc88>] lr : [<8006fc74>] psr: 20000113
sp : 8dc35df8 ip : 8d00052c fp : 809d3778
r10: 80000013 r9 : 7f82243c r8 : 807ddb80
r7 : 0000000a r6 : 00000000 r5 : 807ddbc4 r4 : 00000040
r3 : 00000000 r2 : 8dc34000 r1 : 00000003 r0 : 807ddb80
Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
Control: 10c5387d Table: 7dc58019 DAC: 00000015
Kernel panic - not syncing: softlockup: hung tasks
[<8004af70>] (unwind_backtrace+0x0/0xe0) from [<80387644>] (panic+0x58/0xd4)
[<80387644>] (panic+0x58/0xd4) from [<800943b0>] (softlockup_tick+0x154/0x180)
[<800943b0>] (softlockup_tick+0x154/0x180) from [<800758d4>] (update_process_times+0x30/0x50)
[<800758d4>] (update_process_times+0x30/0x50) from [<8008a968>] (tick_sched_timer+0x84/0xb8)
[<8008a968>] (tick_sched_timer+0x84/0xb8) from [<800827c8>] (__run_hrtimer.isra.23+0x84/0xd8)
[<800827c8>] (__run_hrtimer.isra.23+0x84/0xd8) from [<80082ed8>] (hrtimer_interrupt+0x124/0x2c0)
[<80082ed8>] (hrtimer_interrupt+0x124/0x2c0) from [<800562ec>] (mxc_timer_interrupt+0x24/0x34)
[<800562ec>] (mxc_timer_interrupt+0x24/0x34) from [<80094b60>] (handle_IRQ_event+0x24/0xdc)
[<80094b60>] (handle_IRQ_event+0x24/0xdc) from [<800963a8>] (handle_level_irq+0xb0/0x12c)
[<800963a8>] (handle_level_irq+0xb0/0x12c) from [<80046068>] (asm_do_IRQ+0x68/0x84)
[<80046068>] (asm_do_IRQ+0x68/0x84) from [<8004690c>] (__irq_svc+0x4c/0xcc)
Exception stack(0x8dc35db0 to 0x8dc35df8)
5da0: 807ddb80 00000003 8dc34000 00000000
5dc0: 00000040 807ddbc4 00000000 0000000a 807ddb80 7f82243c 80000013 809d3778
5de0: 8d00052c 8dc35df8 8006fc74 8006fc88 20000113 ffffffff
[<8004690c>] (__irq_svc+0x4c/0xcc) from [<8006fc88>] (__do_softirq+0x4c/0x128)
[<8006fc88>] (__do_softirq+0x4c/0x128) from [<80070138>] (irq_exit+0x48/0x98)
[<80070138>] (irq_exit+0x48/0x98) from [<8004606c>] (asm_do_IRQ+0x6c/0x84)
[<8004606c>] (asm_do_IRQ+0x6c/0x84) from [<8004690c>] (__irq_svc+0x4c/0xcc)
Exception stack(0x8dc35e48 to 0x8dc35e90)
5e40: 807aa720 80000013 8da99a80 00000001 80000013 00000000
5e60: 000000c0 807aa6f0 000000d0 00000020 80000013 809d3778 00000060 8dc35e90
5e80: 8038bbe0 8038bbe4 60000013 ffffffff
[<8004690c>] (__irq_svc+0x4c/0xcc) from [<8038bbe4>] (_raw_spin_unlock_irqrestore+0x10/0x38)
[<8038bbe4>] (_raw_spin_unlock_irqrestore+0x10/0x38) from [<800bd19c>] (slob_alloc.isra.14+0xe4/0x1f4)
[<800bd19c>] (slob_alloc.isra.14+0xe4/0x1f4) from [<800bd360>] (kmem_cache_alloc_node+0x20/0x7c)
[<800bd360>] (kmem_cache_alloc_node+0x20/0x7c) from [<800d3314>] (dup_fd+0x28/0x2f4)
[<800d3314>] (dup_fd+0x28/0x2f4) from [<80069ba0>] (copy_process.part.46+0x37c/0xbe0)
[<80069ba0>] (copy_process.part.46+0x37c/0xbe0) from [<8006a5dc>] (do_fork+0x1b4/0x380)
[<8006a5dc>] (do_fork+0x1b4/0x380) from [<80046e00>] (ret_fast_syscall+0x0/0x30)
Serial reinitilized!
Thanks,
Praveen
Hi praveen
this is kernel feature, described on linux forums:
What is a CPU Soft Lockup | InetServicesCloud Knowledgebase
Time threshold can be changed in linux/kernel/softlockup.c
Best regards
igor
-----------------------------------------------------------------------------------------------------------------------
Note: If this post answers your question, please click the Correct Answer button. Thank you!
-----------------------------------------------------------------------------------------------------------------------
On further debugging, I found the CPU is stalled by RCU.
Please see the log below:
====================================================================
INFO: rcu_sched_state detected stall on CPU 0 (t=1000 jiffies)
INFO: rcu_sched_state detected stall on CPU 0 (t=4000 jiffies)
BUG: soft lockup - CPU#0 stuck for 61s! [grep:1219]
Last softirq was rcu_process_callbacks+0x0/0x30
Modules linked in:
Pid: 1219, comm: grep
CPU: 0 Not tainted (2.6.35-ts-armv7l+ #96)
PC is at __do_softirq+0x4c/0x128
LR is at __do_softirq+0x38/0x128
pc : [<8006fd1c>] lr : [<8006fd08>] psr: 20000113
sp : 8da7dde0 ip : 00000000 fp : 7878c1cf
r10: 8db55800 r9 : 7f8195dc r8 : 807e69e0
r7 : 0000000a r6 : 00000000 r5 : 807e6a24 r4 : 00000202
r3 : 00000000 r2 : 8da7c000 r1 : 00000003 r0 : 807e69e0
Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
Control: 10c5387d Table: 7dcf0019 DAC: 00000015
Kernel panic - not syncing: softlockup: hung tasks
[<8004ae24>] (unwind_backtrace+0x0/0xe0) from [<8038af44>] (panic+0x58/0xd4)
[<8038af44>] (panic+0x58/0xd4) from [<80094444>] (softlockup_tick+0x154/0x180)
[<80094444>] (softlockup_tick+0x154/0x180) from [<80075968>] (update_process_times+0x30/0x50)
[<80075968>] (update_process_times+0x30/0x50) from [<8008a9fc>] (tick_sched_timer+0x84/0xb8)
[<8008a9fc>] (tick_sched_timer+0x84/0xb8) from [<8008285c>] (__run_hrtimer.isra.23+0x84/0xd8)
[<8008285c>] (__run_hrtimer.isra.23+0x84/0xd8) from [<80082f6c>] (hrtimer_interrupt+0x124/0x2c0)
[<80082f6c>] (hrtimer_interrupt+0x124/0x2c0) from [<800561ac>] (mxc_timer_interrupt+0x24/0x34)
[<800561ac>] (mxc_timer_interrupt+0x24/0x34) from [<80094bf4>] (handle_IRQ_event+0x24/0xdc)
[<80094bf4>] (handle_IRQ_event+0x24/0xdc) from [<8009643c>] (handle_level_irq+0xb0/0x12c)
[<8009643c>] (handle_level_irq+0xb0/0x12c) from [<80046068>] (asm_do_IRQ+0x68/0x84)
[<80046068>] (asm_do_IRQ+0x68/0x84) from [<8004690c>] (__irq_svc+0x4c/0xcc)
Exception stack(0x8da7dd98 to 0x8da7dde0)
dd80: 807e69e0 00000003
dda0: 8da7c000 00000000 00000202 807e6a24 00000000 0000000a 807e69e0 7f8195dc
ddc0: 8db55800 7878c1cf 00000000 8da7dde0 8006fd08 8006fd1c 20000113 ffffffff
[<8004690c>] (__irq_svc+0x4c/0xcc) from [<8006fd1c>] (__do_softirq+0x4c/0x128)
[<8006fd1c>] (__do_softirq+0x4c/0x128) from [<800701cc>] (irq_exit+0x48/0x98)
[<800701cc>] (irq_exit+0x48/0x98) from [<8004606c>] (asm_do_IRQ+0x6c/0x84)
[<8004606c>] (asm_do_IRQ+0x6c/0x84) from [<8004690c>] (__irq_svc+0x4c/0xcc)
Exception stack(0x8da7de30 to 0x8da7de78)
de20: 8878cf00 8878d000 00000040 00000002
de40: 00000000 8db1a940 40100208 8dd50b70 80938180 8db55a9c 8db55800 7878c1cf
de60: 40100008 8da7de78 8004c2c0 8004defc 80000113 ffffffff
[<8004690c>] (__irq_svc+0x4c/0xcc) from [<8004defc>] (v7_flush_kern_dcache_area+0x18/0x2c)
[<8004defc>] (v7_flush_kern_dcache_area+0x18/0x2c) from [<8004c2c0>] (__flush_dcache_page+0x2c/0x38)
[<8004c2c0>] (__flush_dcache_page+0x2c/0x38) from [<8004c1e0>] (update_mmu_cache+0x8c/0xb8)
[<8004c1e0>] (update_mmu_cache+0x8c/0xb8) from [<800af3f4>] (handle_mm_fault+0x6e4/0x720)
[<800af3f4>] (handle_mm_fault+0x6e4/0x720) from [<8004bc98>] (do_page_fault+0x100/0x204)
[<8004bc98>] (do_page_fault+0x100/0x204) from [<800461e4>] (do_DataAbort+0x34/0x98)
[<800461e4>] (do_DataAbort+0x34/0x98) from [<80046da0>] (ret_from_exception+0x0/0x10)
===========================================================================
I am not sure why RCU is causing stall. can anybody suggest something?
Regards,
Praveen
Hi Igor,
Thanks for the link, but I still have doubts:
1. In my script if I comment the "sleep 1", soft lock does not occur. can "sleep 1" cause soft lock?
2. In the kernel oops message, the PC & LR both are at __do_softirq+.../... Does it mean the softirq is causing problem? I am unsure how to start debugging using above oops message.
Any Input will be helpful.
Thanks,
Praveen