i.MX6Q Concurrent Capture Kernel Oops

davemcmordie · ‎10-30-2014

We have a product based on using a Greyscale and a Bayer sensor (the first on CSI0, the second on a MIPI channel). The system is based on 3.0.35, with extensive changes to enable greyscale and bayer capture, reserved kernel memory for capture and to fix a bug in the MIPI channel handling which assumed that the MIPI capture channel was always zero. The changes are in a set of patches attached in another thread (Re: i.MX6 OV5647 Bayer sensor driver (ERROR: v4l2 capture: mxc_v4l_dqueue timeout enc_counter 0)

The system works pretty well. However, we are occasionally (rarely) seeing a kernel oops in mxc_v4l2_capture mxc_v4l_dqueue:

BUG: soft lockup - CPU#2 stuck for 22s! [app:31294]
Modules linked in: ppp_deflate bsd_comp ppp_async crc_ccitt ppp_generic slhc tun bma250 mxc_v4l2_capture ipu_fg_overlay_sdc ipu_csi_enc ipu_prp_enc ipu_still ipu_bg_overlay_sdc option usb_wwan wl12xx_sdio usbserial mt9m031_camera ov5647_camera_mipi camera_sensor_clock


Pid: 31294, comm:                  app
CPU: 2    Not tainted  (3.0.35+ #107)
PC is at __raw_spin_lock_irqsave+0x84/0xa0
LR is at __raw_spin_lock_irqsave+0x54/0xa0
pc : [<805ae3cc>]    lr : [<805ae39c>]    psr: 200f0013
sp : a2603948  ip : 00000001  fp : 00000002
r10: 00000000  r9 : bfcc0800  r8 : 00000017
r7 : 00000001  r6 : a2602000  r5 : 00000001  r4 : 80c883a8
r3 : 00000001  r2 : 00000000  r1 : 00000000  r0 : 00000001
Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 10c53c7d  Table: 3330004a  DAC: 00000015
Kernel panic - not syncing: softlockup: hung tasks
[<8004f038>] (unwind_backtrace+0x0/0xf8) from [<805a608c>] (panic+0x64/0x184)
[<805a608c>] (panic+0x64/0x184) from [<800b7d20>] (watchdog_timer_fn+0x178/0x194)
[<800b7d20>] (watchdog_timer_fn+0x178/0x194) from [<8009e2c4>] (__run_hrtimer.isra.19+0x44/0xd8)
[<8009e2c4>] (__run_hrtimer.isra.19+0x44/0xd8) from [<8009ec44>] (hrtimer_interrupt+0xf8/0x274)
[<8009ec44>] (hrtimer_interrupt+0xf8/0x274) from [<8004da24>] (ipi_timer+0x28/0x2c)
[<8004da24>] (ipi_timer+0x28/0x2c) from [<800421a0>] (do_local_timer+0x5c/0x7c)
[<800421a0>] (do_local_timer+0x5c/0x7c) from [<800478cc>] (__irq_svc+0x4c/0xe8)
Exception stack(0xa2603900 to 0xa2603948)
3900: 00000001 00000000 00000000 00000001 80c883a8 00000001 a2602000 00000001
3920: 00000017 bfcc0800 00000000 00000002 00000001 a2603948 805ae39c 805ae3cc
3940: 200f0013 ffffffff
[<800478cc>] (__irq_svc+0x4c/0xe8) from [<805ae3cc>] (__raw_spin_lock_irqsave+0x84/0xa0)
[<805ae3cc>] (__raw_spin_lock_irqsave+0x84/0xa0) from [<803f0c28>] (ipu_clear_irq+0x24/0x8c)
[<803f0c28>] (ipu_clear_irq+0x24/0x8c) from [<8028e074>] (mxcfb_pan_display+0x324/0x530)
[<8028e074>] (mxcfb_pan_display+0x324/0x530) from [<80274ed0>] (fb_pan_display+0xbc/0x13c)
[<80274ed0>] (fb_pan_display+0xbc/0x13c) from [<80281f84>] (bit_update_start+0x14/0x30)
[<80281f84>] (bit_update_start+0x14/0x30) from [<8027f8e4>] (fbcon_switch+0x2a4/0x4a8)
[<8027f8e4>] (fbcon_switch+0x2a4/0x4a8) from [<802b1ff8>] (redraw_screen+0x14c/0x244)
[<802b1ff8>] (redraw_screen+0x14c/0x244) from [<8027ef3c>] (fbcon_blank+0x1b4/0x258)
[<8027ef3c>] (fbcon_blank+0x1b4/0x258) from [<802b3604>] (do_unblank_screen+0x94/0x17c)
[<802b3604>] (do_unblank_screen+0x94/0x17c) from [<802568b0>] (bust_spinlocks+0x24/0x4c)
[<802568b0>] (bust_spinlocks+0x24/0x4c) from [<8004b440>] (die+0xa8/0x294)
[<8004b440>] (die+0xa8/0x294) from [<805a543c>] (__do_kernel_fault.part.3+0x54/0x74)
[<805a543c>] (__do_kernel_fault.part.3+0x54/0x74) from [<80052d28>] (do_page_fault+0x2b4/0x31c)
[<80052d28>] (do_page_fault+0x2b4/0x31c) from [<80042338>] (do_DataAbort+0x34/0x9c)
[<80042338>] (do_DataAbort+0x34/0x9c) from [<80047850>] (__dabt_svc+0x70/0xa0)
Exception stack(0xa2603da0 to 0xa2603de8)
3da0: ba65c06c 73bbef20 00000068 a2603e5c a2603e48 00000000 ba65c000 ba65c008
3dc0: ba65c088 ba65c06c 600f0013 b9912140 7f1198d8 a2603de8 8007a3e0 7f11a760
3de0: 200f0093 ffffffff
[<80047850>] (__dabt_svc+0x70/0xa0) from [<7f11a760>] (mxc_v4l_do_ioctl+0xe88/0x2260 [mxc_v4l2_capture])
[<7f11a760>] (mxc_v4l_do_ioctl+0xe88/0x2260 [mxc_v4l2_capture]) from [<803ba970>] (video_usercopy+0x98/0x4cc)
[<803ba970>] (video_usercopy+0x98/0x4cc) from [<803b9f0c>] (v4l2_ioctl+0x11c/0x138)
[<803b9f0c>] (v4l2_ioctl+0x11c/0x138) from [<8010e4d4>] (do_vfs_ioctl+0x80/0x54c)
[<8010e4d4>] (do_vfs_ioctl+0x80/0x54c) from [<8010e9d8>] (sys_ioctl+0x38/0x5c)
[<8010e9d8>] (sys_ioctl+0x38/0x5c) from [<80047e80>] (ret_fast_syscall+0x0/0x30)
CPU3: stopping
[<8004f038>] (unwind_backtrace+0x0/0xf8) from [<800422dc>] (do_IPI+0x11c/0x144)
[<800422dc>] (do_IPI+0x11c/0x144) from [<800478cc>] (__irq_svc+0x4c/0xe8)
Exception stack(0xbff07f98 to 0xbff07fe0)
7f80:                                                       00000001 00000188
7fa0: 00000000 00000000 bff06000 80c67f84 805b042c 80bff394 1000406a 412fc09a
7fc0: 00000000 00000000 00000005 bff07fe0 800a9418 80048b30 200f0013 ffffffff
[<800478cc>] (__irq_svc+0x4c/0xe8) from [<80048b30>] (cpu_idle+0x38/0xfc)
[<80048b30>] (cpu_idle+0x38/0xfc) from [<105a23b4>] (0x105a23b4)
CPU1: stopping
[<8004f038>] (unwind_backtrace+0x0/0xf8) from [<800422dc>] (do_IPI+0x11c/0x144)
[<800422dc>] (do_IPI+0x11c/0x144) from [<800478cc>] (__irq_svc+0x4c/0xe8)
Exception stack(0xbffaff98 to 0xbffaffe0)
ff80:                                                       00000001 00000188
ffa0: 00000000 00000000 bffae000 80c67f84 805b042c 80bff394 1000406a 412fc09a
ffc0: 00000000 00000000 00000005 bffaffe0 800a9418 80048b30 200f0013 ffffffff
[<800478cc>] (__irq_svc+0x4c/0xe8) from [<80048b30>] (cpu_idle+0x38/0xfc)
[<80048b30>] (cpu_idle+0x38/0xfc) from [<105a23b4>] (0x105a23b4)
SMP: failed to stop secondary CPUs

Running under kgdb, I was able to trace this to a segfault due to a corrupted variable (frame) in the following:

/*!
 * Dequeue one V4L capture buffer
 *
 * @param cam         structure cam_data *
 * @param buf         structure v4l2_buffer *
 *
 * @return  status    0 success, EINVAL invalid frame number,
 *                    ETIME timeout, ERESTARTSYS interrupted by user
 */
static int mxc_v4l_dqueue(cam_data *cam, struct v4l2_buffer *buf)
{
  int retval = 0;
  struct mxc_v4l_frame *frame;
  unsigned long lock_flags;


  pr_debug("In MVC:mxc_v4l_dqueue\n");


  if (!wait_event_interruptible_timeout(cam->enc_queue,
       cam->enc_counter != 0, 10 * HZ)) {
  pr_err("ERROR: v4l2 capture: mxc_v4l_dqueue timeout "
  "enc_counter %x\n",
        cam->enc_counter);
  return -ETIME;
  } else if (signal_pending(current)) {
  pr_err("ERROR: v4l2 capture: mxc_v4l_dqueue() "
  "interrupt received\n");
  return -ERESTARTSYS;
  }


  if (down_interruptible(&cam->busy_lock))
  return -EBUSY;


  spin_lock_irqsave(&cam->dqueue_int_lock, lock_flags);
  cam->enc_counter--;


  frame = list_entry(cam->done_q.next, struct mxc_v4l_frame, queue);
  list_del(cam->done_q.next);
  if (frame->buffer.flags & V4L2_BUF_FLAG_DONE) {
  frame->buffer.flags &= ~V4L2_BUF_FLAG_DONE;
  } else if (frame->buffer.flags & V4L2_BUF_FLAG_QUEUED) {
  pr_err("ERROR: v4l2 capture: VIDIOC_DQBUF: "
  "Buffer not filled.\n");
  frame->buffer.flags &= ~V4L2_BUF_FLAG_QUEUED;
  retval = -EINVAL;
  } else if ((frame->buffer.flags & 0x7) == V4L2_BUF_FLAG_MAPPED) {
  pr_err("ERROR: v4l2 capture: VIDIOC_DQBUF: "
  "Buffer not queued.\n");
  retval = -EINVAL;
  }


  buf->bytesused = cam->v2f.fmt.pix.sizeimage;
  buf->index = frame->index;
  buf->flags = frame->buffer.flags;
  buf->m = cam->frame[frame->index].buffer.m;
  buf->timestamp = cam->frame[frame->index].buffer.timestamp;
  spin_unlock_irqrestore(&cam->dqueue_int_lock, lock_flags);


  up(&cam->busy_lock);
  return retval;
}

Essentially, the frame variable, as retrieved from the done_q, is invalid. The bug can be made to happen only during severe loading of the system (2 x capture, encoding, all four cores pinned with processing) for hours at a time. Other variables within this stack frame and above stack frames look fine, but the frame variable, although it appears to be a valid pointer, seems to point to garbage:

The kernel seems to always stop here and always with the same problem: frame is corrupt, everything else is okay.

So my questions are as follows:

What could cause the frame variable to become corrupted?
Is it possible that there is a concurrency problem between the two instances of the mxc_v4l2_capture module?
What can I do to track down the source of the corruption of this variable?

Thanks in advance for any help offered.

Dave

igorpadykov · ‎11-03-2014

Hi Dave

you can try recent bug fixes for L3.0.35_4.1.0 provided below

IMX6Q system hang-up problem / linux kernel(3.0.35)

L3.0.101_4.1.1_SOURCE_BSP : i.MX 6Quad, i.MX 6Dual, i.MX 6DualLite, i.MX 6Solo and i.MX 6Sololite Linux BSP Source Code

Best regards

igor

davemcmordie · ‎11-07-2014

Thanks Igor,

I will give these fixes a try and let you know. To confirm, the patches I will try to apply are:

0001-ENGR00276652-ARM-mx6-Use-IOMUXC-irq-instead-of-125.patch

0009-ARM-imx-return-zero-in-case-next-event-gets-a-large-.patch

0010-ENGR00295714-GPT-Status-register-bits-are-cleared-in.patch

0020-ENGR00300417-MX6SL-Add-100M-and-24M-MMDC-parameters-.patch

0021-ENGR00267929-MX6SL-Add-DDR3-support-for-MX6SL.patch

0034-ENGR00316182-imx6x-Fix-bugs-in-clock-and-cpufreq-cod.patch

0035-ENGR00316180-iMX6x-Support-IRAM-page-table-when-DDR-.patch

0037-ENGR00320182-iMX6DQ-iMX6DL-Fix-bug-in-DLL-off-mode-c.patch

0063-ENGR00331050-imx6qdl-Fix-the-workaround-for-ERR00577.patch

0068-ENGR00331611-imx6x-Set-SCU-CPU-Power-status-register.patch

0069-ENGR00327364-imx6x-Ensure-that-the-bandgap-self-bias.patch

0070-ENGR00334447-imx6qdl-Fix-random-failures-caused-by-d.patch

0071-MLK-9663-imx6x-Fix-the-DDR-self-refresh-entry-proced.patch

0072-MLK-9664-imx6x-Remove-unused-commented-code.patch

0075-MLK-9685-ARM-imx6x-Ensure-that-the-kernel-can-boot-w.patch

0076-MLK-9698-ARM-imx6x-Fix-build-break-when-CONFIG_SMP-i.patch

I am using the Variscite fork of the kernel with many patches of our own, so I will have to apply these patches carefully.

Best,

Dave

i.MX6Q Concurrent Capture Kernel Oops

i.MX6Q Concurrent Capture Kernel Oops

i.MX6Quad

Linux

Multimedia

Suspected Software Defect