AnsweredAssumed Answered

i.MX6Q Concurrent Capture Kernel Oops

Question asked by Dave McMordie on Oct 30, 2014
Latest reply on Nov 7, 2014 by Dave McMordie

We have a product based on using a Greyscale and a Bayer sensor (the first on CSI0, the second on a MIPI channel).  The system is based on 3.0.35, with extensive changes to enable greyscale and bayer capture, reserved kernel memory for capture and to fix a bug in the MIPI channel handling which assumed that the MIPI capture channel was always zero.  The changes are in a set of patches attached in another thread (Re: i.MX6 OV5647 Bayer sensor driver (ERROR: v4l2 capture: mxc_v4l_dqueue timeout enc_counter 0)


The system works pretty well.  However, we are occasionally (rarely) seeing a kernel oops in mxc_v4l2_capture mxc_v4l_dqueue:


BUG: soft lockup - CPU#2 stuck for 22s! [app:31294]
Modules linked in: ppp_deflate bsd_comp ppp_async crc_ccitt ppp_generic slhc tun bma250 mxc_v4l2_capture ipu_fg_overlay_sdc ipu_csi_enc ipu_prp_enc ipu_still ipu_bg_overlay_sdc option usb_wwan wl12xx_sdio usbserial mt9m031_camera ov5647_camera_mipi camera_sensor_clock

Pid: 31294, comm:                  app
CPU: 2    Not tainted  (3.0.35+ #107)
PC is at __raw_spin_lock_irqsave+0x84/0xa0
LR is at __raw_spin_lock_irqsave+0x54/0xa0
pc : [<805ae3cc>]    lr : [<805ae39c>]    psr: 200f0013
sp : a2603948  ip : 00000001  fp : 00000002
r10: 00000000  r9 : bfcc0800  r8 : 00000017
r7 : 00000001  r6 : a2602000  r5 : 00000001  r4 : 80c883a8
r3 : 00000001  r2 : 00000000  r1 : 00000000  r0 : 00000001
Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 10c53c7d  Table: 3330004a  DAC: 00000015
Kernel panic - not syncing: softlockup: hung tasks
[<8004f038>] (unwind_backtrace+0x0/0xf8) from [<805a608c>] (panic+0x64/0x184)
[<805a608c>] (panic+0x64/0x184) from [<800b7d20>] (watchdog_timer_fn+0x178/0x194)
[<800b7d20>] (watchdog_timer_fn+0x178/0x194) from [<8009e2c4>] (__run_hrtimer.isra.19+0x44/0xd8)
[<8009e2c4>] (__run_hrtimer.isra.19+0x44/0xd8) from [<8009ec44>] (hrtimer_interrupt+0xf8/0x274)
[<8009ec44>] (hrtimer_interrupt+0xf8/0x274) from [<8004da24>] (ipi_timer+0x28/0x2c)
[<8004da24>] (ipi_timer+0x28/0x2c) from [<800421a0>] (do_local_timer+0x5c/0x7c)
[<800421a0>] (do_local_timer+0x5c/0x7c) from [<800478cc>] (__irq_svc+0x4c/0xe8)
Exception stack(0xa2603900 to 0xa2603948)
3900: 00000001 00000000 00000000 00000001 80c883a8 00000001 a2602000 00000001
3920: 00000017 bfcc0800 00000000 00000002 00000001 a2603948 805ae39c 805ae3cc
3940: 200f0013 ffffffff
[<800478cc>] (__irq_svc+0x4c/0xe8) from [<805ae3cc>] (__raw_spin_lock_irqsave+0x84/0xa0)
[<805ae3cc>] (__raw_spin_lock_irqsave+0x84/0xa0) from [<803f0c28>] (ipu_clear_irq+0x24/0x8c)
[<803f0c28>] (ipu_clear_irq+0x24/0x8c) from [<8028e074>] (mxcfb_pan_display+0x324/0x530)
[<8028e074>] (mxcfb_pan_display+0x324/0x530) from [<80274ed0>] (fb_pan_display+0xbc/0x13c)
[<80274ed0>] (fb_pan_display+0xbc/0x13c) from [<80281f84>] (bit_update_start+0x14/0x30)
[<80281f84>] (bit_update_start+0x14/0x30) from [<8027f8e4>] (fbcon_switch+0x2a4/0x4a8)
[<8027f8e4>] (fbcon_switch+0x2a4/0x4a8) from [<802b1ff8>] (redraw_screen+0x14c/0x244)
[<802b1ff8>] (redraw_screen+0x14c/0x244) from [<8027ef3c>] (fbcon_blank+0x1b4/0x258)
[<8027ef3c>] (fbcon_blank+0x1b4/0x258) from [<802b3604>] (do_unblank_screen+0x94/0x17c)
[<802b3604>] (do_unblank_screen+0x94/0x17c) from [<802568b0>] (bust_spinlocks+0x24/0x4c)
[<802568b0>] (bust_spinlocks+0x24/0x4c) from [<8004b440>] (die+0xa8/0x294)
[<8004b440>] (die+0xa8/0x294) from [<805a543c>] (__do_kernel_fault.part.3+0x54/0x74)
[<805a543c>] (__do_kernel_fault.part.3+0x54/0x74) from [<80052d28>] (do_page_fault+0x2b4/0x31c)
[<80052d28>] (do_page_fault+0x2b4/0x31c) from [<80042338>] (do_DataAbort+0x34/0x9c)
[<80042338>] (do_DataAbort+0x34/0x9c) from [<80047850>] (__dabt_svc+0x70/0xa0)
Exception stack(0xa2603da0 to 0xa2603de8)
3da0: ba65c06c 73bbef20 00000068 a2603e5c a2603e48 00000000 ba65c000 ba65c008
3dc0: ba65c088 ba65c06c 600f0013 b9912140 7f1198d8 a2603de8 8007a3e0 7f11a760
3de0: 200f0093 ffffffff
[<80047850>] (__dabt_svc+0x70/0xa0) from [<7f11a760>] (mxc_v4l_do_ioctl+0xe88/0x2260 [mxc_v4l2_capture])
[<7f11a760>] (mxc_v4l_do_ioctl+0xe88/0x2260 [mxc_v4l2_capture]) from [<803ba970>] (video_usercopy+0x98/0x4cc)
[<803ba970>] (video_usercopy+0x98/0x4cc) from [<803b9f0c>] (v4l2_ioctl+0x11c/0x138)
[<803b9f0c>] (v4l2_ioctl+0x11c/0x138) from [<8010e4d4>] (do_vfs_ioctl+0x80/0x54c)
[<8010e4d4>] (do_vfs_ioctl+0x80/0x54c) from [<8010e9d8>] (sys_ioctl+0x38/0x5c)
[<8010e9d8>] (sys_ioctl+0x38/0x5c) from [<80047e80>] (ret_fast_syscall+0x0/0x30)
CPU3: stopping
[<8004f038>] (unwind_backtrace+0x0/0xf8) from [<800422dc>] (do_IPI+0x11c/0x144)
[<800422dc>] (do_IPI+0x11c/0x144) from [<800478cc>] (__irq_svc+0x4c/0xe8)
Exception stack(0xbff07f98 to 0xbff07fe0)
7f80:                                                       00000001 00000188
7fa0: 00000000 00000000 bff06000 80c67f84 805b042c 80bff394 1000406a 412fc09a
7fc0: 00000000 00000000 00000005 bff07fe0 800a9418 80048b30 200f0013 ffffffff
[<800478cc>] (__irq_svc+0x4c/0xe8) from [<80048b30>] (cpu_idle+0x38/0xfc)
[<80048b30>] (cpu_idle+0x38/0xfc) from [<105a23b4>] (0x105a23b4)
CPU1: stopping
[<8004f038>] (unwind_backtrace+0x0/0xf8) from [<800422dc>] (do_IPI+0x11c/0x144)
[<800422dc>] (do_IPI+0x11c/0x144) from [<800478cc>] (__irq_svc+0x4c/0xe8)
Exception stack(0xbffaff98 to 0xbffaffe0)
ff80:                                                       00000001 00000188
ffa0: 00000000 00000000 bffae000 80c67f84 805b042c 80bff394 1000406a 412fc09a
ffc0: 00000000 00000000 00000005 bffaffe0 800a9418 80048b30 200f0013 ffffffff
[<800478cc>] (__irq_svc+0x4c/0xe8) from [<80048b30>] (cpu_idle+0x38/0xfc)
[<80048b30>] (cpu_idle+0x38/0xfc) from [<105a23b4>] (0x105a23b4)
SMP: failed to stop secondary CPUs


Running under kgdb, I was able to trace this to a segfault due to a corrupted variable (frame) in the following:


 * Dequeue one V4L capture buffer
 * @param cam         structure cam_data *
 * @param buf         structure v4l2_buffer *
 * @return  status    0 success, EINVAL invalid frame number,
 *                    ETIME timeout, ERESTARTSYS interrupted by user
static int mxc_v4l_dqueue(cam_data *cam, struct v4l2_buffer *buf)
  int retval = 0;
  struct mxc_v4l_frame *frame;
  unsigned long lock_flags;

  pr_debug("In MVC:mxc_v4l_dqueue\n");

  if (!wait_event_interruptible_timeout(cam->enc_queue,
       cam->enc_counter != 0, 10 * HZ)) {
  pr_err("ERROR: v4l2 capture: mxc_v4l_dqueue timeout "
  "enc_counter %x\n",
  return -ETIME;
  } else if (signal_pending(current)) {
  pr_err("ERROR: v4l2 capture: mxc_v4l_dqueue() "
  "interrupt received\n");
  return -ERESTARTSYS;

  if (down_interruptible(&cam->busy_lock))
  return -EBUSY;

  spin_lock_irqsave(&cam->dqueue_int_lock, lock_flags);

  frame = list_entry(cam->, struct mxc_v4l_frame, queue);
  if (frame->buffer.flags & V4L2_BUF_FLAG_DONE) {
  frame->buffer.flags &= ~V4L2_BUF_FLAG_DONE;
  } else if (frame->buffer.flags & V4L2_BUF_FLAG_QUEUED) {
  pr_err("ERROR: v4l2 capture: VIDIOC_DQBUF: "
  "Buffer not filled.\n");
  frame->buffer.flags &= ~V4L2_BUF_FLAG_QUEUED;
  retval = -EINVAL;
  } else if ((frame->buffer.flags & 0x7) == V4L2_BUF_FLAG_MAPPED) {
  pr_err("ERROR: v4l2 capture: VIDIOC_DQBUF: "
  "Buffer not queued.\n");
  retval = -EINVAL;

  buf->bytesused = cam->v2f.fmt.pix.sizeimage;
  buf->index = frame->index;
  buf->flags = frame->buffer.flags;
  buf->m = cam->frame[frame->index].buffer.m;
  buf->timestamp = cam->frame[frame->index].buffer.timestamp;
  spin_unlock_irqrestore(&cam->dqueue_int_lock, lock_flags);

  return retval;


Essentially, the frame variable, as retrieved from the done_q, is invalid.  The bug can be made to happen only during severe loading of the system (2 x capture, encoding, all four cores pinned with processing) for hours at a time.  Other variables within this stack frame and above stack frames look fine, but the frame variable, although it appears to be a valid pointer, seems to point to garbage:



The kernel seems to always stop here and always with the same problem: frame is corrupt, everything else is okay.


So my questions are as follows:


  1. What could cause the frame variable to become corrupted?
  2. Is it possible that there is a concurrency problem between the two instances of the mxc_v4l2_capture module?
  3. What can I do to track down the source of the corruption of this variable?


Thanks in advance for any help offered.