i.MX6Q Concurrent Capture Kernel Oops

cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 

i.MX6Q Concurrent Capture Kernel Oops

1,928 Views
davemcmordie
Contributor III

We have a product based on using a Greyscale and a Bayer sensor (the first on CSI0, the second on a MIPI channel).  The system is based on 3.0.35, with extensive changes to enable greyscale and bayer capture, reserved kernel memory for capture and to fix a bug in the MIPI channel handling which assumed that the MIPI capture channel was always zero.  The changes are in a set of patches attached in another thread (Re: i.MX6 OV5647 Bayer sensor driver (ERROR: v4l2 capture: mxc_v4l_dqueue timeout enc_counter 0)

The system works pretty well.  However, we are occasionally (rarely) seeing a kernel oops in mxc_v4l2_capture mxc_v4l_dqueue:

BUG: soft lockup - CPU#2 stuck for 22s! [app:31294]

Modules linked in: ppp_deflate bsd_comp ppp_async crc_ccitt ppp_generic slhc tun bma250 mxc_v4l2_capture ipu_fg_overlay_sdc ipu_csi_enc ipu_prp_enc ipu_still ipu_bg_overlay_sdc option usb_wwan wl12xx_sdio usbserial mt9m031_camera ov5647_camera_mipi camera_sensor_clock

Pid: 31294, comm:                  app

CPU: 2    Not tainted  (3.0.35+ #107)

PC is at __raw_spin_lock_irqsave+0x84/0xa0

LR is at __raw_spin_lock_irqsave+0x54/0xa0

pc : [<805ae3cc>]    lr : [<805ae39c>]    psr: 200f0013

sp : a2603948  ip : 00000001  fp : 00000002

r10: 00000000  r9 : bfcc0800  r8 : 00000017

r7 : 00000001  r6 : a2602000  r5 : 00000001  r4 : 80c883a8

r3 : 00000001  r2 : 00000000  r1 : 00000000  r0 : 00000001

Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user

Control: 10c53c7d  Table: 3330004a  DAC: 00000015

Kernel panic - not syncing: softlockup: hung tasks

[<8004f038>] (unwind_backtrace+0x0/0xf8) from [<805a608c>] (panic+0x64/0x184)

[<805a608c>] (panic+0x64/0x184) from [<800b7d20>] (watchdog_timer_fn+0x178/0x194)

[<800b7d20>] (watchdog_timer_fn+0x178/0x194) from [<8009e2c4>] (__run_hrtimer.isra.19+0x44/0xd8)

[<8009e2c4>] (__run_hrtimer.isra.19+0x44/0xd8) from [<8009ec44>] (hrtimer_interrupt+0xf8/0x274)

[<8009ec44>] (hrtimer_interrupt+0xf8/0x274) from [<8004da24>] (ipi_timer+0x28/0x2c)

[<8004da24>] (ipi_timer+0x28/0x2c) from [<800421a0>] (do_local_timer+0x5c/0x7c)

[<800421a0>] (do_local_timer+0x5c/0x7c) from [<800478cc>] (__irq_svc+0x4c/0xe8)

Exception stack(0xa2603900 to 0xa2603948)

3900: 00000001 00000000 00000000 00000001 80c883a8 00000001 a2602000 00000001

3920: 00000017 bfcc0800 00000000 00000002 00000001 a2603948 805ae39c 805ae3cc

3940: 200f0013 ffffffff

[<800478cc>] (__irq_svc+0x4c/0xe8) from [<805ae3cc>] (__raw_spin_lock_irqsave+0x84/0xa0)

[<805ae3cc>] (__raw_spin_lock_irqsave+0x84/0xa0) from [<803f0c28>] (ipu_clear_irq+0x24/0x8c)

[<803f0c28>] (ipu_clear_irq+0x24/0x8c) from [<8028e074>] (mxcfb_pan_display+0x324/0x530)

[<8028e074>] (mxcfb_pan_display+0x324/0x530) from [<80274ed0>] (fb_pan_display+0xbc/0x13c)

[<80274ed0>] (fb_pan_display+0xbc/0x13c) from [<80281f84>] (bit_update_start+0x14/0x30)

[<80281f84>] (bit_update_start+0x14/0x30) from [<8027f8e4>] (fbcon_switch+0x2a4/0x4a8)

[<8027f8e4>] (fbcon_switch+0x2a4/0x4a8) from [<802b1ff8>] (redraw_screen+0x14c/0x244)

[<802b1ff8>] (redraw_screen+0x14c/0x244) from [<8027ef3c>] (fbcon_blank+0x1b4/0x258)

[<8027ef3c>] (fbcon_blank+0x1b4/0x258) from [<802b3604>] (do_unblank_screen+0x94/0x17c)

[<802b3604>] (do_unblank_screen+0x94/0x17c) from [<802568b0>] (bust_spinlocks+0x24/0x4c)

[<802568b0>] (bust_spinlocks+0x24/0x4c) from [<8004b440>] (die+0xa8/0x294)

[<8004b440>] (die+0xa8/0x294) from [<805a543c>] (__do_kernel_fault.part.3+0x54/0x74)

[<805a543c>] (__do_kernel_fault.part.3+0x54/0x74) from [<80052d28>] (do_page_fault+0x2b4/0x31c)

[<80052d28>] (do_page_fault+0x2b4/0x31c) from [<80042338>] (do_DataAbort+0x34/0x9c)

[<80042338>] (do_DataAbort+0x34/0x9c) from [<80047850>] (__dabt_svc+0x70/0xa0)

Exception stack(0xa2603da0 to 0xa2603de8)

3da0: ba65c06c 73bbef20 00000068 a2603e5c a2603e48 00000000 ba65c000 ba65c008

3dc0: ba65c088 ba65c06c 600f0013 b9912140 7f1198d8 a2603de8 8007a3e0 7f11a760

3de0: 200f0093 ffffffff

[<80047850>] (__dabt_svc+0x70/0xa0) from [<7f11a760>] (mxc_v4l_do_ioctl+0xe88/0x2260 [mxc_v4l2_capture])

[<7f11a760>] (mxc_v4l_do_ioctl+0xe88/0x2260 [mxc_v4l2_capture]) from [<803ba970>] (video_usercopy+0x98/0x4cc)

[<803ba970>] (video_usercopy+0x98/0x4cc) from [<803b9f0c>] (v4l2_ioctl+0x11c/0x138)

[<803b9f0c>] (v4l2_ioctl+0x11c/0x138) from [<8010e4d4>] (do_vfs_ioctl+0x80/0x54c)

[<8010e4d4>] (do_vfs_ioctl+0x80/0x54c) from [<8010e9d8>] (sys_ioctl+0x38/0x5c)

[<8010e9d8>] (sys_ioctl+0x38/0x5c) from [<80047e80>] (ret_fast_syscall+0x0/0x30)

CPU3: stopping

[<8004f038>] (unwind_backtrace+0x0/0xf8) from [<800422dc>] (do_IPI+0x11c/0x144)

[<800422dc>] (do_IPI+0x11c/0x144) from [<800478cc>] (__irq_svc+0x4c/0xe8)

Exception stack(0xbff07f98 to 0xbff07fe0)

7f80:                                                       00000001 00000188

7fa0: 00000000 00000000 bff06000 80c67f84 805b042c 80bff394 1000406a 412fc09a

7fc0: 00000000 00000000 00000005 bff07fe0 800a9418 80048b30 200f0013 ffffffff

[<800478cc>] (__irq_svc+0x4c/0xe8) from [<80048b30>] (cpu_idle+0x38/0xfc)

[<80048b30>] (cpu_idle+0x38/0xfc) from [<105a23b4>] (0x105a23b4)

CPU1: stopping

[<8004f038>] (unwind_backtrace+0x0/0xf8) from [<800422dc>] (do_IPI+0x11c/0x144)

[<800422dc>] (do_IPI+0x11c/0x144) from [<800478cc>] (__irq_svc+0x4c/0xe8)

Exception stack(0xbffaff98 to 0xbffaffe0)

ff80:                                                       00000001 00000188

ffa0: 00000000 00000000 bffae000 80c67f84 805b042c 80bff394 1000406a 412fc09a

ffc0: 00000000 00000000 00000005 bffaffe0 800a9418 80048b30 200f0013 ffffffff

[<800478cc>] (__irq_svc+0x4c/0xe8) from [<80048b30>] (cpu_idle+0x38/0xfc)

[<80048b30>] (cpu_idle+0x38/0xfc) from [<105a23b4>] (0x105a23b4)

SMP: failed to stop secondary CPUs

Running under kgdb, I was able to trace this to a segfault due to a corrupted variable (frame) in the following:

/*!

* Dequeue one V4L capture buffer

*

* @param cam         structure cam_data *

* @param buf         structure v4l2_buffer *

*

* @return  status    0 success, EINVAL invalid frame number,

*                    ETIME timeout, ERESTARTSYS interrupted by user

*/

static int mxc_v4l_dqueue(cam_data *cam, struct v4l2_buffer *buf)

{

  int retval = 0;

  struct mxc_v4l_frame *frame;

  unsigned long lock_flags;

  pr_debug("In MVC:mxc_v4l_dqueue\n");

  if (!wait_event_interruptible_timeout(cam->enc_queue,

       cam->enc_counter != 0, 10 * HZ)) {

  pr_err("ERROR: v4l2 capture: mxc_v4l_dqueue timeout "

  "enc_counter %x\n",

        cam->enc_counter);

  return -ETIME;

  } else if (signal_pending(current)) {

  pr_err("ERROR: v4l2 capture: mxc_v4l_dqueue() "

  "interrupt received\n");

  return -ERESTARTSYS;

  }

  if (down_interruptible(&cam->busy_lock))

  return -EBUSY;

  spin_lock_irqsave(&cam->dqueue_int_lock, lock_flags);

  cam->enc_counter--;

  frame = list_entry(cam->done_q.next, struct mxc_v4l_frame, queue);

  list_del(cam->done_q.next);

  if (frame->buffer.flags & V4L2_BUF_FLAG_DONE) {

  frame->buffer.flags &= ~V4L2_BUF_FLAG_DONE;

  } else if (frame->buffer.flags & V4L2_BUF_FLAG_QUEUED) {

  pr_err("ERROR: v4l2 capture: VIDIOC_DQBUF: "

  "Buffer not filled.\n");

  frame->buffer.flags &= ~V4L2_BUF_FLAG_QUEUED;

  retval = -EINVAL;

  } else if ((frame->buffer.flags & 0x7) == V4L2_BUF_FLAG_MAPPED) {

  pr_err("ERROR: v4l2 capture: VIDIOC_DQBUF: "

  "Buffer not queued.\n");

  retval = -EINVAL;

  }

  buf->bytesused = cam->v2f.fmt.pix.sizeimage;

  buf->index = frame->index;

  buf->flags = frame->buffer.flags;

  buf->m = cam->frame[frame->index].buffer.m;

  buf->timestamp = cam->frame[frame->index].buffer.timestamp;

  spin_unlock_irqrestore(&cam->dqueue_int_lock, lock_flags);

  up(&cam->busy_lock);

  return retval;

}

Essentially, the frame variable, as retrieved from the done_q, is invalid.  The bug can be made to happen only during severe loading of the system (2 x capture, encoding, all four cores pinned with processing) for hours at a time.  Other variables within this stack frame and above stack frames look fine, but the frame variable, although it appears to be a valid pointer, seems to point to garbage:

FramePointerGarbaged.png

The kernel seems to always stop here and always with the same problem: frame is corrupt, everything else is okay.

So my questions are as follows:

  1. What could cause the frame variable to become corrupted?
  2. Is it possible that there is a concurrency problem between the two instances of the mxc_v4l2_capture module?
  3. What can I do to track down the source of the corruption of this variable?

Thanks in advance for any help offered.


Dave

2 Replies

526 Views
igorpadykov
NXP Employee
NXP Employee

Hi Dave

you can try recent bug fixes for L3.0.35_4.1.0 provided below

IMX6Q system hang-up problem /  linux kernel(3.0.35)

L3.0.101_4.1.1_SOURCE_BSP : i.MX 6Quad, i.MX 6Dual, i.MX 6DualLite, i.MX 6Solo and i.MX 6Sololite Linux BSP Source Code

Best regards

igor

0 Kudos

527 Views
davemcmordie
Contributor III

Thanks Igor,

I will give these fixes a try and let you know.  To confirm, the patches I will try to apply are:

0001-ENGR00276652-ARM-mx6-Use-IOMUXC-irq-instead-of-125.patch

0009-ARM-imx-return-zero-in-case-next-event-gets-a-large-.patch

0010-ENGR00295714-GPT-Status-register-bits-are-cleared-in.patch

0020-ENGR00300417-MX6SL-Add-100M-and-24M-MMDC-parameters-.patch

0021-ENGR00267929-MX6SL-Add-DDR3-support-for-MX6SL.patch

0034-ENGR00316182-imx6x-Fix-bugs-in-clock-and-cpufreq-cod.patch

0035-ENGR00316180-iMX6x-Support-IRAM-page-table-when-DDR-.patch

0037-ENGR00320182-iMX6DQ-iMX6DL-Fix-bug-in-DLL-off-mode-c.patch

0063-ENGR00331050-imx6qdl-Fix-the-workaround-for-ERR00577.patch

0068-ENGR00331611-imx6x-Set-SCU-CPU-Power-status-register.patch

0069-ENGR00327364-imx6x-Ensure-that-the-bandgap-self-bias.patch

0070-ENGR00334447-imx6qdl-Fix-random-failures-caused-by-d.patch

0071-MLK-9663-imx6x-Fix-the-DDR-self-refresh-entry-proced.patch

0072-MLK-9664-imx6x-Remove-unused-commented-code.patch

0075-MLK-9685-ARM-imx6x-Ensure-that-the-kernel-can-boot-w.patch

0076-MLK-9698-ARM-imx6x-Fix-build-break-when-CONFIG_SMP-i.patch

I am using the Variscite fork of the kernel with many patches of our own, so I will have to apply these patches carefully.

Best,


Dave

0 Kudos