We have a product that uses a greyscale sensor and a Bayer sensor (the first on CSI0, the second on a MIPI channel). The system is based on kernel 3.0.35, with extensive changes to enable greyscale and Bayer capture, to reserve kernel memory for capture, and to fix a bug in the MIPI channel handling, which assumed that the MIPI capture channel was always zero. The changes are in a set of patches attached in another thread (Re: i.MX6 OV5647 Bayer sensor driver (ERROR: v4l2 capture: mxc_v4l_dqueue timeout enc_counter 0)).
The system works pretty well. However, we are occasionally (rarely) seeing a kernel oops in mxc_v4l_dqueue in the mxc_v4l2_capture module:
BUG: soft lockup - CPU#2 stuck for 22s! [app:31294]
Modules linked in: ppp_deflate bsd_comp ppp_async crc_ccitt ppp_generic slhc tun bma250 mxc_v4l2_capture ipu_fg_overlay_sdc ipu_csi_enc ipu_prp_enc ipu_still ipu_bg_overlay_sdc option usb_wwan wl12xx_sdio usbserial mt9m031_camera ov5647_camera_mipi camera_sensor_clock
Pid: 31294, comm: app
CPU: 2 Not tainted (3.0.35+ #107)
PC is at __raw_spin_lock_irqsave+0x84/0xa0
LR is at __raw_spin_lock_irqsave+0x54/0xa0
pc : [<805ae3cc>] lr : [<805ae39c>] psr: 200f0013
sp : a2603948 ip : 00000001 fp : 00000002
r10: 00000000 r9 : bfcc0800 r8 : 00000017
r7 : 00000001 r6 : a2602000 r5 : 00000001 r4 : 80c883a8
r3 : 00000001 r2 : 00000000 r1 : 00000000 r0 : 00000001
Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
Control: 10c53c7d Table: 3330004a DAC: 00000015
Kernel panic - not syncing: softlockup: hung tasks
[<8004f038>] (unwind_backtrace+0x0/0xf8) from [<805a608c>] (panic+0x64/0x184)
[<805a608c>] (panic+0x64/0x184) from [<800b7d20>] (watchdog_timer_fn+0x178/0x194)
[<800b7d20>] (watchdog_timer_fn+0x178/0x194) from [<8009e2c4>] (__run_hrtimer.isra.19+0x44/0xd8)
[<8009e2c4>] (__run_hrtimer.isra.19+0x44/0xd8) from [<8009ec44>] (hrtimer_interrupt+0xf8/0x274)
[<8009ec44>] (hrtimer_interrupt+0xf8/0x274) from [<8004da24>] (ipi_timer+0x28/0x2c)
[<8004da24>] (ipi_timer+0x28/0x2c) from [<800421a0>] (do_local_timer+0x5c/0x7c)
[<800421a0>] (do_local_timer+0x5c/0x7c) from [<800478cc>] (__irq_svc+0x4c/0xe8)
Exception stack(0xa2603900 to 0xa2603948)
3900: 00000001 00000000 00000000 00000001 80c883a8 00000001 a2602000 00000001
3920: 00000017 bfcc0800 00000000 00000002 00000001 a2603948 805ae39c 805ae3cc
3940: 200f0013 ffffffff
[<800478cc>] (__irq_svc+0x4c/0xe8) from [<805ae3cc>] (__raw_spin_lock_irqsave+0x84/0xa0)
[<805ae3cc>] (__raw_spin_lock_irqsave+0x84/0xa0) from [<803f0c28>] (ipu_clear_irq+0x24/0x8c)
[<803f0c28>] (ipu_clear_irq+0x24/0x8c) from [<8028e074>] (mxcfb_pan_display+0x324/0x530)
[<8028e074>] (mxcfb_pan_display+0x324/0x530) from [<80274ed0>] (fb_pan_display+0xbc/0x13c)
[<80274ed0>] (fb_pan_display+0xbc/0x13c) from [<80281f84>] (bit_update_start+0x14/0x30)
[<80281f84>] (bit_update_start+0x14/0x30) from [<8027f8e4>] (fbcon_switch+0x2a4/0x4a8)
[<8027f8e4>] (fbcon_switch+0x2a4/0x4a8) from [<802b1ff8>] (redraw_screen+0x14c/0x244)
[<802b1ff8>] (redraw_screen+0x14c/0x244) from [<8027ef3c>] (fbcon_blank+0x1b4/0x258)
[<8027ef3c>] (fbcon_blank+0x1b4/0x258) from [<802b3604>] (do_unblank_screen+0x94/0x17c)
[<802b3604>] (do_unblank_screen+0x94/0x17c) from [<802568b0>] (bust_spinlocks+0x24/0x4c)
[<802568b0>] (bust_spinlocks+0x24/0x4c) from [<8004b440>] (die+0xa8/0x294)
[<8004b440>] (die+0xa8/0x294) from [<805a543c>] (__do_kernel_fault.part.3+0x54/0x74)
[<805a543c>] (__do_kernel_fault.part.3+0x54/0x74) from [<80052d28>] (do_page_fault+0x2b4/0x31c)
[<80052d28>] (do_page_fault+0x2b4/0x31c) from [<80042338>] (do_DataAbort+0x34/0x9c)
[<80042338>] (do_DataAbort+0x34/0x9c) from [<80047850>] (__dabt_svc+0x70/0xa0)
Exception stack(0xa2603da0 to 0xa2603de8)
3da0: ba65c06c 73bbef20 00000068 a2603e5c a2603e48 00000000 ba65c000 ba65c008
3dc0: ba65c088 ba65c06c 600f0013 b9912140 7f1198d8 a2603de8 8007a3e0 7f11a760
3de0: 200f0093 ffffffff
[<80047850>] (__dabt_svc+0x70/0xa0) from [<7f11a760>] (mxc_v4l_do_ioctl+0xe88/0x2260 [mxc_v4l2_capture])
[<7f11a760>] (mxc_v4l_do_ioctl+0xe88/0x2260 [mxc_v4l2_capture]) from [<803ba970>] (video_usercopy+0x98/0x4cc)
[<803ba970>] (video_usercopy+0x98/0x4cc) from [<803b9f0c>] (v4l2_ioctl+0x11c/0x138)
[<803b9f0c>] (v4l2_ioctl+0x11c/0x138) from [<8010e4d4>] (do_vfs_ioctl+0x80/0x54c)
[<8010e4d4>] (do_vfs_ioctl+0x80/0x54c) from [<8010e9d8>] (sys_ioctl+0x38/0x5c)
[<8010e9d8>] (sys_ioctl+0x38/0x5c) from [<80047e80>] (ret_fast_syscall+0x0/0x30)
CPU3: stopping
[<8004f038>] (unwind_backtrace+0x0/0xf8) from [<800422dc>] (do_IPI+0x11c/0x144)
[<800422dc>] (do_IPI+0x11c/0x144) from [<800478cc>] (__irq_svc+0x4c/0xe8)
Exception stack(0xbff07f98 to 0xbff07fe0)
7f80: 00000001 00000188
7fa0: 00000000 00000000 bff06000 80c67f84 805b042c 80bff394 1000406a 412fc09a
7fc0: 00000000 00000000 00000005 bff07fe0 800a9418 80048b30 200f0013 ffffffff
[<800478cc>] (__irq_svc+0x4c/0xe8) from [<80048b30>] (cpu_idle+0x38/0xfc)
[<80048b30>] (cpu_idle+0x38/0xfc) from [<105a23b4>] (0x105a23b4)
CPU1: stopping
[<8004f038>] (unwind_backtrace+0x0/0xf8) from [<800422dc>] (do_IPI+0x11c/0x144)
[<800422dc>] (do_IPI+0x11c/0x144) from [<800478cc>] (__irq_svc+0x4c/0xe8)
Exception stack(0xbffaff98 to 0xbffaffe0)
ff80: 00000001 00000188
ffa0: 00000000 00000000 bffae000 80c67f84 805b042c 80bff394 1000406a 412fc09a
ffc0: 00000000 00000000 00000005 bffaffe0 800a9418 80048b30 200f0013 ffffffff
[<800478cc>] (__irq_svc+0x4c/0xe8) from [<80048b30>] (cpu_idle+0x38/0xfc)
[<80048b30>] (cpu_idle+0x38/0xfc) from [<105a23b4>] (0x105a23b4)
SMP: failed to stop secondary CPUs
Running under kgdb, I traced this to a data abort caused by a corrupted variable (frame) in the following function:
/*!
 * Dequeue one V4L capture buffer
 *
 * @param cam structure cam_data *
 * @param buf structure v4l2_buffer *
 *
 * @return status 0 success, EINVAL invalid frame number,
 * ETIME timeout, ERESTARTSYS interrupted by user
 */
static int mxc_v4l_dqueue(cam_data *cam, struct v4l2_buffer *buf)
{
	int retval = 0;
	struct mxc_v4l_frame *frame;
	unsigned long lock_flags;

	pr_debug("In MVC:mxc_v4l_dqueue\n");

	if (!wait_event_interruptible_timeout(cam->enc_queue,
					      cam->enc_counter != 0, 10 * HZ)) {
		pr_err("ERROR: v4l2 capture: mxc_v4l_dqueue timeout "
			"enc_counter %x\n",
			cam->enc_counter);
		return -ETIME;
	} else if (signal_pending(current)) {
		pr_err("ERROR: v4l2 capture: mxc_v4l_dqueue() "
			"interrupt received\n");
		return -ERESTARTSYS;
	}

	if (down_interruptible(&cam->busy_lock))
		return -EBUSY;

	spin_lock_irqsave(&cam->dqueue_int_lock, lock_flags);
	cam->enc_counter--;

	frame = list_entry(cam->done_q.next, struct mxc_v4l_frame, queue);
	list_del(cam->done_q.next);

	if (frame->buffer.flags & V4L2_BUF_FLAG_DONE) {
		frame->buffer.flags &= ~V4L2_BUF_FLAG_DONE;
	} else if (frame->buffer.flags & V4L2_BUF_FLAG_QUEUED) {
		pr_err("ERROR: v4l2 capture: VIDIOC_DQBUF: "
			"Buffer not filled.\n");
		frame->buffer.flags &= ~V4L2_BUF_FLAG_QUEUED;
		retval = -EINVAL;
	} else if ((frame->buffer.flags & 0x7) == V4L2_BUF_FLAG_MAPPED) {
		pr_err("ERROR: v4l2 capture: VIDIOC_DQBUF: "
			"Buffer not queued.\n");
		retval = -EINVAL;
	}

	buf->bytesused = cam->v2f.fmt.pix.sizeimage;
	buf->index = frame->index;
	buf->flags = frame->buffer.flags;
	buf->m = cam->frame[frame->index].buffer.m;
	buf->timestamp = cam->frame[frame->index].buffer.timestamp;

	spin_unlock_irqrestore(&cam->dqueue_int_lock, lock_flags);
	up(&cam->busy_lock);
	return retval;
}
Essentially, the frame variable, as retrieved from the done_q, is invalid. The bug reproduces only under severe system load (2 x capture, encoding, all four cores pinned with processing) sustained for hours at a time. Other variables within this stack frame and in the stack frames above it look fine, but the frame variable, although it appears to be a valid pointer, seems to point to garbage.
The kernel seems to always stop here and always with the same problem: frame is corrupt, everything else is okay.
So my questions are as follows:
Thanks in advance for any help offered.
Dave
Hi Dave
You can try the recent bug fixes for L3.0.35_4.1.0 provided below:
IMX6Q system hang-up problem / linux kernel(3.0.35)
L3.0.101_4.1.1_SOURCE_BSP : i.MX 6Quad, i.MX 6Dual, i.MX 6DualLite, i.MX 6Solo and i.MX 6Sololite Linux BSP Source Code
Best regards
igor
Thanks Igor,
I will give these fixes a try and let you know. To confirm, the patches I will try to apply are:
0001-ENGR00276652-ARM-mx6-Use-IOMUXC-irq-instead-of-125.patch
0009-ARM-imx-return-zero-in-case-next-event-gets-a-large-.patch
0010-ENGR00295714-GPT-Status-register-bits-are-cleared-in.patch
0020-ENGR00300417-MX6SL-Add-100M-and-24M-MMDC-parameters-.patch
0021-ENGR00267929-MX6SL-Add-DDR3-support-for-MX6SL.patch
0034-ENGR00316182-imx6x-Fix-bugs-in-clock-and-cpufreq-cod.patch
0035-ENGR00316180-iMX6x-Support-IRAM-page-table-when-DDR-.patch
0037-ENGR00320182-iMX6DQ-iMX6DL-Fix-bug-in-DLL-off-mode-c.patch
0063-ENGR00331050-imx6qdl-Fix-the-workaround-for-ERR00577.patch
0068-ENGR00331611-imx6x-Set-SCU-CPU-Power-status-register.patch
0069-ENGR00327364-imx6x-Ensure-that-the-bandgap-self-bias.patch
0070-ENGR00334447-imx6qdl-Fix-random-failures-caused-by-d.patch
0071-MLK-9663-imx6x-Fix-the-DDR-self-refresh-entry-proced.patch
0072-MLK-9664-imx6x-Remove-unused-commented-code.patch
0075-MLK-9685-ARM-imx6x-Ensure-that-the-kernel-can-boot-w.patch
0076-MLK-9698-ARM-imx6x-Fix-build-break-when-CONFIG_SMP-i.patch
I am using the Variscite fork of the kernel with many patches of our own, so I will have to apply these patches carefully.
Best,
Dave