Soft lock issue in i.mx53

pravverma · ‎03-23-2016

Hi,

We are using IMX53 based custom board & linux 2.6.35 kernel.

We have some test scripts for interfaces, while running the script the system get stuck into soft lock.

The flow of test script is as below:

====================================================================

<test routine for interface> /* perform interface test*/

<can interface up> /* Making up the can interface*/

cansend $CAN_DEVICE -i $CAN_MSG /*send the interface test result over can*/

echo " print 1"

sleep 1 /* here the system goes into soft lock mode */

echo " print 2" /* This print never comes on debug port.*/

<can interface down> /* making down the can interface */

=====================================================================

If I comment the "sleep 1", the test scripts runs fine & system never goes into soft lock.

Please help me to debug this issue. Is the problem in our can driver?

How sleep is causing the soft lock?

or there is a bug in our kernel, softlock due to sleep command is a sideeffect of that bug?

After enabling the Debug print, I can see following oops message.

BUG: soft lockup - CPU#0 stuck for 61s! [sh:1113]

Modules linked in:

Pid: 1113, comm: sh

CPU: 0 Not tainted (2.6.35-ts-armv7l+ #85)

PC is at __do_softirq+0x4c/0x128

LR is at __do_softirq+0x38/0x128

pc : [<8006fc88>] lr : [<8006fc74>] psr: 20000113

sp : 8dc35df8 ip : 8d00052c fp : 809d3778

r10: 80000013 r9 : 7f82243c r8 : 807ddb80

r7 : 0000000a r6 : 00000000 r5 : 807ddbc4 r4 : 00000040

r3 : 00000000 r2 : 8dc34000 r1 : 00000003 r0 : 807ddb80

Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user

Control: 10c5387d Table: 7dc58019 DAC: 00000015

Kernel panic - not syncing: softlockup: hung tasks

[<8004af70>] (unwind_backtrace+0x0/0xe0) from [<80387644>] (panic+0x58/0xd4)

[<80387644>] (panic+0x58/0xd4) from [<800943b0>] (softlockup_tick+0x154/0x180)

[<800943b0>] (softlockup_tick+0x154/0x180) from [<800758d4>] (update_process_times+0x30/0x50)

[<800758d4>] (update_process_times+0x30/0x50) from [<8008a968>] (tick_sched_timer+0x84/0xb8)

[<8008a968>] (tick_sched_timer+0x84/0xb8) from [<800827c8>] (__run_hrtimer.isra.23+0x84/0xd8)

[<800827c8>] (__run_hrtimer.isra.23+0x84/0xd8) from [<80082ed8>] (hrtimer_interrupt+0x124/0x2c0)

[<80082ed8>] (hrtimer_interrupt+0x124/0x2c0) from [<800562ec>] (mxc_timer_interrupt+0x24/0x34)

[<800562ec>] (mxc_timer_interrupt+0x24/0x34) from [<80094b60>] (handle_IRQ_event+0x24/0xdc)

[<80094b60>] (handle_IRQ_event+0x24/0xdc) from [<800963a8>] (handle_level_irq+0xb0/0x12c)

[<800963a8>] (handle_level_irq+0xb0/0x12c) from [<80046068>] (asm_do_IRQ+0x68/0x84)

[<80046068>] (asm_do_IRQ+0x68/0x84) from [<8004690c>] (__irq_svc+0x4c/0xcc)

Exception stack(0x8dc35db0 to 0x8dc35df8)

5da0: 807ddb80 00000003 8dc34000 00000000

5dc0: 00000040 807ddbc4 00000000 0000000a 807ddb80 7f82243c 80000013 809d3778

5de0: 8d00052c 8dc35df8 8006fc74 8006fc88 20000113 ffffffff

[<8004690c>] (__irq_svc+0x4c/0xcc) from [<8006fc88>] (__do_softirq+0x4c/0x128)

[<8006fc88>] (__do_softirq+0x4c/0x128) from [<80070138>] (irq_exit+0x48/0x98)

[<80070138>] (irq_exit+0x48/0x98) from [<8004606c>] (asm_do_IRQ+0x6c/0x84)

[<8004606c>] (asm_do_IRQ+0x6c/0x84) from [<8004690c>] (__irq_svc+0x4c/0xcc)

Exception stack(0x8dc35e48 to 0x8dc35e90)

5e40: 807aa720 80000013 8da99a80 00000001 80000013 00000000

5e60: 000000c0 807aa6f0 000000d0 00000020 80000013 809d3778 00000060 8dc35e90

5e80: 8038bbe0 8038bbe4 60000013 ffffffff

[<8004690c>] (__irq_svc+0x4c/0xcc) from [<8038bbe4>] (_raw_spin_unlock_irqrestore+0x10/0x38)

[<8038bbe4>] (_raw_spin_unlock_irqrestore+0x10/0x38) from [<800bd19c>] (slob_alloc.isra.14+0xe4/0x1f4)

[<800bd19c>] (slob_alloc.isra.14+0xe4/0x1f4) from [<800bd360>] (kmem_cache_alloc_node+0x20/0x7c)

[<800bd360>] (kmem_cache_alloc_node+0x20/0x7c) from [<800d3314>] (dup_fd+0x28/0x2f4)

[<800d3314>] (dup_fd+0x28/0x2f4) from [<80069ba0>] (copy_process.part.46+0x37c/0xbe0)

[<80069ba0>] (copy_process.part.46+0x37c/0xbe0) from [<8006a5dc>] (do_fork+0x1b4/0x380)

[<8006a5dc>] (do_fork+0x1b4/0x380) from [<80046e00>] (ret_fast_syscall+0x0/0x30)

Serial reinitilized!

Thanks,

Praveen

igorpadykov · ‎03-23-2016

Hi praveen

this is kernel feature, described on linux forums:

What is a CPU Soft Lockup | InetServicesCloud Knowledgebase

Time threshold can be changed in linux/kernel/softlockup.c

Best regards

igor

-----------------------------------------------------------------------------------------------------------------------

Note: If this post answers your question, please click the Correct Answer button. Thank you!

-----------------------------------------------------------------------------------------------------------------------

pravverma · ‎03-28-2016

On further debugging, I found the CPU is stalled by RCU.

Please see the log below:

====================================================================

INFO: rcu_sched_state detected stall on CPU 0 (t=1000 jiffies)

INFO: rcu_sched_state detected stall on CPU 0 (t=4000 jiffies)

BUG: soft lockup - CPU#0 stuck for 61s! [grep:1219]

Last softirq was rcu_process_callbacks+0x0/0x30

Modules linked in:

Pid: 1219, comm: grep

CPU: 0 Not tainted (2.6.35-ts-armv7l+ #96)

PC is at __do_softirq+0x4c/0x128

LR is at __do_softirq+0x38/0x128

pc : [<8006fd1c>] lr : [<8006fd08>] psr: 20000113

sp : 8da7dde0 ip : 00000000 fp : 7878c1cf

r10: 8db55800 r9 : 7f8195dc r8 : 807e69e0

r7 : 0000000a r6 : 00000000 r5 : 807e6a24 r4 : 00000202

r3 : 00000000 r2 : 8da7c000 r1 : 00000003 r0 : 807e69e0

Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user

Control: 10c5387d Table: 7dcf0019 DAC: 00000015

Kernel panic - not syncing: softlockup: hung tasks

[<8004ae24>] (unwind_backtrace+0x0/0xe0) from [<8038af44>] (panic+0x58/0xd4)

[<8038af44>] (panic+0x58/0xd4) from [<80094444>] (softlockup_tick+0x154/0x180)

[<80094444>] (softlockup_tick+0x154/0x180) from [<80075968>] (update_process_times+0x30/0x50)

[<80075968>] (update_process_times+0x30/0x50) from [<8008a9fc>] (tick_sched_timer+0x84/0xb8)

[<8008a9fc>] (tick_sched_timer+0x84/0xb8) from [<8008285c>] (__run_hrtimer.isra.23+0x84/0xd8)

[<8008285c>] (__run_hrtimer.isra.23+0x84/0xd8) from [<80082f6c>] (hrtimer_interrupt+0x124/0x2c0)

[<80082f6c>] (hrtimer_interrupt+0x124/0x2c0) from [<800561ac>] (mxc_timer_interrupt+0x24/0x34)

[<800561ac>] (mxc_timer_interrupt+0x24/0x34) from [<80094bf4>] (handle_IRQ_event+0x24/0xdc)

[<80094bf4>] (handle_IRQ_event+0x24/0xdc) from [<8009643c>] (handle_level_irq+0xb0/0x12c)

[<8009643c>] (handle_level_irq+0xb0/0x12c) from [<80046068>] (asm_do_IRQ+0x68/0x84)

[<80046068>] (asm_do_IRQ+0x68/0x84) from [<8004690c>] (__irq_svc+0x4c/0xcc)

Exception stack(0x8da7dd98 to 0x8da7dde0)

dd80: 807e69e0 00000003

dda0: 8da7c000 00000000 00000202 807e6a24 00000000 0000000a 807e69e0 7f8195dc

ddc0: 8db55800 7878c1cf 00000000 8da7dde0 8006fd08 8006fd1c 20000113 ffffffff

[<8004690c>] (__irq_svc+0x4c/0xcc) from [<8006fd1c>] (__do_softirq+0x4c/0x128)

[<8006fd1c>] (__do_softirq+0x4c/0x128) from [<800701cc>] (irq_exit+0x48/0x98)

[<800701cc>] (irq_exit+0x48/0x98) from [<8004606c>] (asm_do_IRQ+0x6c/0x84)

[<8004606c>] (asm_do_IRQ+0x6c/0x84) from [<8004690c>] (__irq_svc+0x4c/0xcc)

Exception stack(0x8da7de30 to 0x8da7de78)

de20: 8878cf00 8878d000 00000040 00000002

de40: 00000000 8db1a940 40100208 8dd50b70 80938180 8db55a9c 8db55800 7878c1cf

de60: 40100008 8da7de78 8004c2c0 8004defc 80000113 ffffffff

[<8004690c>] (__irq_svc+0x4c/0xcc) from [<8004defc>] (v7_flush_kern_dcache_area+0x18/0x2c)

[<8004defc>] (v7_flush_kern_dcache_area+0x18/0x2c) from [<8004c2c0>] (__flush_dcache_page+0x2c/0x38)

[<8004c2c0>] (__flush_dcache_page+0x2c/0x38) from [<8004c1e0>] (update_mmu_cache+0x8c/0xb8)

[<8004c1e0>] (update_mmu_cache+0x8c/0xb8) from [<800af3f4>] (handle_mm_fault+0x6e4/0x720)

[<800af3f4>] (handle_mm_fault+0x6e4/0x720) from [<8004bc98>] (do_page_fault+0x100/0x204)

[<8004bc98>] (do_page_fault+0x100/0x204) from [<800461e4>] (do_DataAbort+0x34/0x98)

[<800461e4>] (do_DataAbort+0x34/0x98) from [<80046da0>] (ret_from_exception+0x0/0x10)

===========================================================================

I am not sure why RCU is causing stall. can anybody suggest something?

Regards,

Praveen

pravverma · ‎03-24-2016

Hi Igor,

Thanks for the link, but I still have doubts:

1. In my script if I comment the "sleep 1", soft lock does not occur. can "sleep 1" cause soft lock?

2. In the kernel oops message, the PC & LR both are at __do_softirq+.../... Does it mean the softirq is causing problem? I am unsure how to start debugging using above oops message.

Any Input will be helpful.

Thanks,

Praveen

Soft lock issue in i.mx53

Soft lock issue in i.mx53

i.MX53

Linux