IMX6Q system hang-up problem / linux kernel(3.0.35)


raoxd
Contributor II

hi

     We use L3.0.35_4.1.0_130816 to build u-boot, the kernel and the rootfs for an i.MX6Q. When we run our app (a Qt app), the system hangs after running for 3 days. This has happened several times.

     In the kernel log you may find notifications like:

     [ 73.247613] INFO: rcu_preempt_state detected stalls on CPUs/tasks: { 0} (detected by 3, t=6002 jiffies)
     [   77.257610] INFO: rcu_bh_state detected stalls on CPUs/tasks: { 0} (detected by 3, t=6002 jiffies)
     [  253.567611] INFO: rcu_preempt_state detected stalls on CPUs/tasks: { 0} (detected by 3, t=24034 jiffies)
     [  257.577611] INFO: rcu_bh_state detected stalls on CPUs/tasks: { 0} (detected by 3, t=24034 jiffies)
     [  433.887610] INFO: rcu_preempt_state detected stalls on CPUs/tasks: { 0} (detected by 3, t=42066 jiffies)
     [  437.897611] INFO: rcu_bh_state detected stalls on CPUs/tasks: { 0} (detected by 3, t=42066 jiffies)
     [  614.207612] INFO: rcu_preempt_state detected stalls on CPUs/tasks: { 0} (detected by 3, t=60098 jiffies)
     [  618.217610] INFO: rcu_bh_state detected stalls on CPUs/tasks: { 0} (detected by 3, t=60098 jiffies)
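
For reading these messages: the jiffies counts can be turned into seconds once the tick rate is known, and the log itself gives enough to estimate it. Below is a minimal Python sketch that estimates HZ from two reports of the same flavor. The helper names parse_stalls and estimate_hz are made up for this illustration, and the regular expression only targets the message format quoted above.

```python
import re

# Matches the "detected stalls on CPUs/tasks" lines quoted above.
STALL_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: (?P<flavor>rcu_\w+) detected stalls "
    r"on CPUs/tasks: \{(?P<cpus>[\d ]*)\} "
    r"\(detected by (?P<by>\d+), t=(?P<jiffies>\d+) jiffies\)"
)


def parse_stalls(log_text):
    """Return one dict per stall report found in log_text."""
    stalls = []
    for m in STALL_RE.finditer(log_text):
        stalls.append({
            "ts": float(m.group("ts")),
            "flavor": m.group("flavor"),
            "cpus": [int(c) for c in m.group("cpus").split()],
            "detected_by": int(m.group("by")),
            "jiffies": int(m.group("jiffies")),
        })
    return stalls


def estimate_hz(stalls, flavor):
    """Estimate ticks per second from the first and last report of one flavor."""
    rows = [s for s in stalls if s["flavor"] == flavor]
    if len(rows) < 2:
        raise ValueError("need at least two reports of the same flavor")
    first, last = rows[0], rows[-1]
    return (last["jiffies"] - first["jiffies"]) / (last["ts"] - first["ts"])


log = """\
[ 73.247613] INFO: rcu_preempt_state detected stalls on CPUs/tasks: { 0} (detected by 3, t=6002 jiffies)
[  253.567611] INFO: rcu_preempt_state detected stalls on CPUs/tasks: { 0} (detected by 3, t=24034 jiffies)
"""
stalls = parse_stalls(log)
hz = estimate_hz(stalls, "rcu_preempt_state")
print(f"estimated HZ: {hz:.1f}")
print(f"first report: stalled for about {stalls[0]['jiffies'] / hz:.0f} s")
```

On the two rcu_preempt_state lines used here the estimate comes out at about 100 ticks per second, so t=6002 jiffies would mean roughly 60 seconds without CPU 0 passing through a quiescent state.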

     

    In the community I found someone who ran into the same problem. Following "Re: Re: freescale android kernel have scheduling problem !!!", I took the steps below to patch my kernel.

    1. Modify time.c

"rcu_preempt_state detected stalls" could be fixed with the following patch, please try:

--- linux-3.0.35/arch/arm/plat-mxc/time.c.orig  2013-12-11 09:25:29.518910709 +1300
+++ linux-3.0.35/arch/arm/plat-mxc/time.c  2013-12-11 09:26:12.958912361 +1300
@@ -165,7 +165,8 @@
         __raw_writel(tcmp, timer_base + V2_TCMP);

-        return 0;
+        return evt < 0x7fffffff &&
+            (int)(tcmp - __raw_readl(timer_base + V2_TCN)) < 0 ? ETIME : 0;
 }

 #ifdef DEBUG
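
To see what the added return expression checks, here is a small Python model of the 32-bit arithmetic. It is not kernel code. The names to_int32 and missed_deadline are invented for this illustration, and it assumes tcmp was computed as the current counter value plus evt before the line shown in the patch, which the hunk itself does not show. The signed difference is negative when the counter has already passed the compare value. The evt < 0x7fffffff guard skips the check for very large increments, where the signed comparison can no longer be trusted.

```python
MASK32 = 0xFFFFFFFF


def to_int32(value):
    """Reinterpret the low 32 bits of value as signed, like a C (int) cast."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def missed_deadline(tcn_before, evt, tcn_after):
    """Model of the patched return expression (True where the patch reports an error).

    tcn_before: counter value when the compare value was computed (assumption)
    evt:        requested delta in timer ticks
    tcn_after:  counter value read back after writing the compare register
    """
    tcmp = (tcn_before + evt) & MASK32
    return evt < 0x7FFFFFFF and to_int32(tcmp - tcn_after) < 0


# Deadline still ahead of the counter: nothing to report.
print(missed_deadline(1000, 50, 1010))                # False
# Counter already ran past the compare value: the event would be missed.
print(missed_deadline(1000, 5, 1010))                 # True
# Counter wraps between programming and expiry: still handled correctly.
print(missed_deadline(0xFFFFFFF0, 0x20, 0xFFFFFFF8))  # False
# Huge delta: the signed difference is negative, but the guard ignores it.
print(missed_deadline(100, 0xFFFFFFFE, 105))          # False
```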

   2. Apply the patches 0027-ENGR00306276-iMX6-Add-workaround-for-ARM-errata-7613.patch, 0026-ENGR00306257-1027-fix-system-hang-up-issue-caused-by.patch and 0025-ENGR00295714-GPT-Status-register-bits-are-cleared-in.patch

   But after making these changes, the kernel still hangs as before. I also tried to update the GPU libraries to p13, but I found that the libraries (libEGL, libGAL) in gpu-viv-bin-mx6q-3.0.35-4.1.0 do not work with p13. Can p13 not be used on Linux?

   Is there anything else I can try to fix this problem on Linux?

34 Replies

federicoloda
Contributor I

Hi, we are using Linux 3.0.35-4.1.0 + Xenomai 2.6.4 on an ARM i.MX6 and we are facing a very similar problem:
after a couple of hours or some days the system freezes.

We checked cat /proc/timer_list and found that the timer seems to have missed an event:

Timer List Version: v0.6
HRTIMER_MAX_CLOCK_BASES: 3
now at 105225868014163 nsecs

cpu: 0
 clock 0:
  .base:       80d05940
  .index:      0
  .resolution: 1 nsecs
  .get_time:   ktime_get
  .offset:     0 nsecs
active timers:
 #0: <80d05e60>, tick_sched_timer, S:01
 # expires at 104075730000000-104075730000000 nsecs [in -1150138014163 to -1150138014163 nsecs]
 #1: <8e771a80>, hrtimer_wakeup, S:01
 # expires at 104075736565408-104075736615408 nsecs [in -1150131448755 to -1150131398755 nsecs]
 #2: <8f001a80>, hrtimer_wakeup, S:01
 # expires at 104075764916408-104075764984514 nsecs [in -1150103097755 to -1150103029649 nsecs]
 #3: <8fae9a80>, hrtimer_wakeup, S:01
 # expires at 104077575167741-104077580167738 nsecs [in -1148292846422 to -1148287846425 nsecs]

......


Tick Device: mode:     1
Per CPU device: 0
Clock Event Device: mxc_timer1
 max_delta_ns:   4294967295
 min_delta_ns:   85000
 mult:           1
 shift:          0
 mode:           3
 next_event:     104075730000000 nsecs
 set_next_event: xnarch_next_htick_shot
 set_mode:       xnarch_switch_htick_mode
 event_handler:  hrtimer_interrupt
 retries:        0

We suspect the problem is that the next_event value is already behind the current time:

now at 105225868014163 nsecs

next_event:     104075730000000 nsecs

We also noticed that all periodic tasks freeze.


We also checked, and all the patches for the time.c problem are already in this kernel release.

Can this problem be related to this thread? Does anyone have a solution?
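
The gap between the two values can be worked out directly from the dump. A minimal Python sketch (the function name next_event_lag_ns is made up here, and it only looks at the first "now at" and "next_event" lines it finds):

```python
import re


def next_event_lag_ns(timer_list_text):
    """Return now - next_event in nanoseconds; positive means the event is overdue."""
    now = int(re.search(r"now at (\d+) nsecs", timer_list_text).group(1))
    nxt = int(re.search(r"next_event:\s+(\d+) nsecs", timer_list_text).group(1))
    return now - nxt


# The two lines quoted from /proc/timer_list above.
sample = """\
now at 105225868014163 nsecs
 next_event:     104075730000000 nsecs
"""
lag = next_event_lag_ns(sample)
print(f"next_event is {lag} ns ({lag / 1e9:.1f} s) in the past")
```

For the values above the lag is 1150138014163 ns, the same figure the dump prints as the (negative) time to expiry of tick_sched_timer. In other words the programmed event is about 19 minutes overdue.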

sei-umemura
Contributor I

Hi rao and igorpadykov

Our Linux kernel version:

    Timesys LinuxLink 3.0.35-ts-armv71

    * based on [ L3.0.35_4.0.0 ]

    * we changed the kernel setting CONFIG_RCU_CPU_STALL_TIMEOUT from 30 sec to 10 sec.

CPU: i.MX6Q

I am developing an image-processing device using Linux and the i.MX6Q.

My system sometimes hangs in the way described in this thread.

As for frequency: when 15 boards run continuously for two days, it occurs on 1-2 boards.

I checked a few things regarding individual differences between the hardware units.

When my system hangs, it may or may not output the RCU stall log.

The logs follow.

a)

[Thu Mar 05 08:14:11.496 2015] INFO: rcu_preempt_state detected stalls on CPUs/tasks: { 2 3} (detected by 0, t=1002 jiffies)

b)

Mar  5 06:16:09 (none) user.err kernel: INFO: rcu_preempt_state detected stalls on CPUs/tasks: { 1 3} (detected by 0, t=1002 jiffies)

c)

Mar  5 10:32:47 (none) user.err kernel: INFO: rcu_preempt_state detected stall on CPU 3 (t=1001 jiffies)

Mar  5 10:32:47 (none) user.err kernel: INFO: rcu_preempt_state detected stalls on CPUs/tasks: { 3} (detected by 2, t=1002 jiffies)

In the application we develop, the main thread creates nine child threads

and monitors the activity of the child threads.

When my system hangs, either the main thread or a child thread may hang.

If the main thread hangs, the watchdog timer triggers a power-on reset,

so I cannot capture a useful log.

I traced a child thread during a hang: the thread did not wake up after its sleep call.

I set the sleep to 1 sec or 100 ms, but the thread woke up 10 sec later.

I implemented code that sets a global stop flag

when there is no response for more than 10 seconds.

It seemed to me that the wakeup was linked to the change of that global variable.

Questions:

1) Has your system ever hung without outputting the RCU stall log?

2) Could you tell me what you think the cause of this problem is?
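
The late wakeup described above (a 1 sec or 100 ms sleep returning about 10 sec later) can be logged from user space by comparing the requested sleep time with the elapsed monotonic time. This is a minimal Python sketch, not the code of the application in question. The function name sleep_overrun and the 0.5 s threshold are illustrative choices, and it only helps if the monotonic clock keeps advancing while the wakeup is delayed.

```python
import time


def sleep_overrun(requested_s):
    """Sleep for requested_s seconds and return how much longer it really took."""
    start = time.monotonic()
    time.sleep(requested_s)
    return (time.monotonic() - start) - requested_s


LATE_THRESHOLD_S = 0.5  # illustrative limit for "woke up far too late"

overrun = sleep_overrun(0.1)
if overrun > LATE_THRESHOLD_S:
    print(f"late wakeup: {overrun:.3f} s after the requested time")
else:
    print(f"wakeup on time (overrun {overrun * 1000:.1f} ms)")
```

Running this in a loop in each monitored thread and logging only the late cases would show whether the delayed wakeups come before or after the first RCU stall message.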

PeterChan
NXP Employee

Hi Rao xd,

Do you have any simple way to reproduce the "rcu_preempt_state detected stalls on CPUs/tasks" error on L3.0.35_4.1.0?

Hi Bateman Cai,

Unfortunately, Android 4.2.2 is based on L3.0.35_4.0.0.

raoxd
Contributor II

hi PeterChan:

     We can reproduce the problem by running a USB stress test. Attached is our test tool, which runs on the i.MX6.

     We see the "rcu_preempt_state detected stalls on CPUs/tasks" error when we run these tests overnight.

     Our test steps are:

     1. Attach a USB disk to the i.MX6 USB interface and mount it, for example: mount /dev/sdb1 /mnt

     2. Run the USB test:

         ./FileSysTest 2 /mnt 100000 3 &

         ./FileSysTest 2 /root 100000 3 500M &

       We found that if CPU iowait reaches 90% or higher, the error is easy to reproduce (iowait can be seen by running "top").
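
For logging the iowait level during an overnight run instead of watching top, the same figure can be computed from two samples of the aggregate "cpu" line in /proc/stat. A minimal Python sketch follows. The function names are made up for this illustration and the sample numbers are invented, not measured on a board.

```python
def cpu_times(stat_line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counts."""
    fields = stat_line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(v) for v in fields[1:]]


def iowait_percent(before_line, after_line):
    """Share of elapsed CPU time spent in iowait between two samples."""
    before = cpu_times(before_line)
    after = cpu_times(after_line)
    deltas = [a - b for a, b in zip(after, before)]
    total = sum(deltas)
    if total == 0:
        return 0.0
    # Field order in /proc/stat: user nice system idle iowait irq softirq ...
    return 100.0 * deltas[4] / total


# Made-up sample values, not taken from a real run.
before = "cpu  100 0 50 800 50 0 0"
after = "cpu  105 0 55 800 140 0 0"
print(f"iowait: {iowait_percent(before, after):.1f}%")
```

On the target, the two lines would come from reading the first line of /proc/stat a few seconds apart.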

    


PeterChan
NXP Employee

Hi rao xd,

I see that your tool is using the i.MX6 usb host to read / write usb removable drive. In L3.0.35_4.1.0, there is a software bug that can make the usb host stop working. In order to ensure the hang-up is not caused by the usb host driver, would you please apply this usb memory patch for L3.0.35_4.1.0 and test this problem again?

Thanks,

Peter

raoxd
Contributor II

hi PeterChan:

   I updated the system to L3.0.101_4.1.1_141016 and ran the test again. The problem still persists.

   When I force the i.MX6Q to run on a single core, the problem disappears.

  

PeterChan
NXP Employee

Hi rao xd,

Unfortunately, I am not able to reproduce this issue on i.MX6Q in L3.0.101_4.1.1_141016 by running the test "./FileSysTest 2 /mnt/ 100000 3 500M &". Actually, the fixes for the "rcu_preempt_state detected stall on CPU" error are:

1.

--- linux-3.0.35/arch/arm/plat-mxc/time.c.orig  2013-12-11 09:25:29.518910709 +1300
+++ linux-3.0.35/arch/arm/plat-mxc/time.c  2013-12-11 09:26:12.958912361 +1300
@@ -165,7 +165,8 @@
         __raw_writel(tcmp, timer_base + V2_TCMP);

-        return 0;
+        return evt < 0x7fffffff &&
+            (int)(tcmp - __raw_readl(timer_base + V2_TCN)) < 0 ? ETIME : 0;
 }

 #ifdef DEBUG

2.

patch 0027-ENGR00306276-iMX6-Add-workaround-for-ARM-errata-7613.patch, 0026-ENGR00306257-1027-fix-system-hang-up-issue-caused-by.patch and 0025-ENGR00295714-GPT-Status-register-bits-are-cleared-in.patch

and they are all merged into L3.0.101_4.1.1_141016.

Regards,

Peter

asim_zaidi
NXP Employee

Hi raoxd

Can you kindly provide an update on your testing? Do you still observe the issue with the patches provided by PeterChan?

Thanks

Asim

raoxd
Contributor II

I have updated to 3.0.101 and run it on the Freescale SDB board, but the problem still persists.

The tests include (all run at the same time):

1. eMMC stress test

2. network stress

3. memtest

We found that the problem appears easily when CPU iowait is high (on Linux, run top and check whether x.x%wa is high).

jobs
Contributor III

This is my problem, and it comes up when traffic pressure increases.

MPC85XX T2080 E6500 RCU detects stall on CPU X - NXP Community (nxp.com)

asim_zaidi
NXP Employee

Hi

Can you kindly provide more details on the steps to reproduce the issue? We are not observing this issue in our testing with 3.0.101.


Can you also try enabling the Performance governor?
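
For a scripted test run, the governor can be switched through the standard cpufreq sysfs files. A minimal Python sketch follows. The function name set_governor is made up here, it must run as root on the target, and whether this BSP exposes a scaling_governor file for every core is an assumption.

```python
import glob
import os


def set_governor(governor, sysfs_root="/sys/devices/system/cpu"):
    """Write governor to every cpuN/cpufreq/scaling_governor file found.

    Returns the list of files that were written.
    """
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    changed = []
    for path in sorted(glob.glob(pattern)):
        with open(path, "w") as f:
            f.write(governor)
        changed.append(path)
    return changed


# On the target, as root:
#   set_governor("performance")
```

Reading the same files back afterwards confirms that the kernel accepted the governor name.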


Thanks


Asim

raoxd
Contributor II

Attached is the test tool.

1. memtest

Run memTest --? for help.

We run ./memTest 100M 1000000

2. FileSysTest

Attach a USB disk to the i.MX6 USB interface and mount it, for example: mount /dev/sdb1 /mnt

Run the USB test:

./FileSysTest 2 /mnt 100000 3 &

./FileSysTest 2 /root 100000 3 500M &


3. NetStress test

Use two devices for the test: one as client, the other as server. (NetCntTest and test_NetCnt_stress.sh are for Linux; netCntTest.exe and test_netCnt.bat are for Windows.)

Unzip NetCnt.rar.

Client:

./NetCntTest  deviceType ScrIp DstIp DstPort packetlen

deviceType: 0 for client, 1 for server

For example: ./NetCntTest 0 172.17.179.4 172.17.0.61 20001 10000

(20001 is the destination port)

Server:

./NetCntTest  deviceType ScrIp DstIp ScrPort packetlen

For example:

./NetCntTest 1 172.17.0.61 20001 10000

(20001 is the local port)


Steps 1, 2 and 3 run at the same time.


PS: PCIe is also still enabled on the SDB, connected to a network card.

ko-hey
Senior Contributor II

Dear raoxd

Please let me confirm your progress.

My customer has the same issue and the same error messages.

The message my customer received is below.

INFO: rcu_preempt_state detected stalls on CPUs/tasks: { 2 3} (detected by 0, t=1002 jiffies)

Were you able to resolve the issue in the end?

If yes, please share what you did to resolve it.

Best Regards,

Ko-hey

raoxd
Contributor II

No, the problem still persists.

I have provided the test tools and am waiting for a reply from Freescale.

ko-hey
Senior Contributor II

Hi rao xd

I was expecting that it had already been resolved…

Did you contact Freescale's FAE some other way?

Ko-hey

jianzhu
Contributor I

Hi Ko-hey

How did you solve it? We have run into a similar problem, but we get no log output on the serial port. The system hangs and we cannot log in via the serial port or the network. Our kernel version is 3.0.35_4.1.0 (Yocto 1.5 Dora).

BR

Jerry

jay0725jay
Contributor I

Hi,PeterChan :

    1. I also hit the same problem with these patches applied: 0027-ENGR00306276-iMX6-Add-workaround-for-ARM-errata-7613.patch, 0026-ENGR00306257-1027-fix-system-hang-up-issue-caused-by.patch, 0025-ENGR00295714-GPT-Status-register-bits-are-cleared-in.patch and 0009-ARM-imx-return-zero-in-case-next-event-gets-a-large-.patch.

   2. If I build the kernel with "# LOCAL_TIMERS is not set", the problem seems to be solved, but if I build with "LOCAL_TIMERS=y" the system hangs again after a few days.

   3. With "LOCAL_TIMERS is not set", however, the USB communication between our app and the i.MX6Q suffers from delays or data loss.

By the way, my kernel is 3.0.35_4.0.0, Android 4.2.2, i.MX6.

  Is there anything I can try to fix this problem on Linux? Thanks!

igorpadykov
NXP Employee

Hi rao

Please try the attached patches.

There are GPU fixes in the latest release, 3.10.17-1.0.1, and

L3.10.17 1.0.2 is planned for the end of Oct.

Freescale OpenEmbedded/Yocto Layers discussion list

Best regards

igor

raoxd
Contributor II

Hi igorpadykov

   The patches you provided do not apply to Linux3.0.35_4.1.0. When I apply them, patch reports failures. The patches that failed are:

   0002-ENGR00316180-iMX6x-Support-IRAM-page-table-when-DDR-.patch

   0035-ENGR00334447-imx6qdl-Fix-random-failures-caused-by-d.patch

   Can you provide patches suitable for Linux3.0.35_4.1.0? Or can you tell me which Linux version you used? Thanks.

igorpadykov
NXP Employee

The patch will be released next week on the official i.MX6 site.
