imx6sx, kernel hangs at rootfs mount

6,365 Views
angelo_dureghel
Contributor III

Hi all,

I am working on a 4.14.113 kernel with all patches from freescale linux-imx 4.14.78_1.0.0_ga_var01 applied, on a custom imx6sx board similar to sabresd.

I am experiencing a total kernel hang around the ext4 rootfs mount. There are no messages useful for debugging: no initcall is ongoing, there are no oops messages, and the freeze point varies in time depending on whether I fully power cycle the board or just reset it with the pwr reset button. If I set a timer in a driver probe, it also stops, so the kernel freezes completely.

I finally found out that the issue is caused by CONFIG_CPU_IDLE. With it removed, the boot process completes properly and I reach the prompt. Also, rolling back to kernel 4.9, it works properly with CONFIG_CPU_IDLE enabled.

Considering that the CONFIG_CPU_IDLE related code may not be the real cause, and that I could have some hw issues too, I am wondering if you know of any imx6sx CONFIG_CPU_IDLE related issue, maybe fixed after 4.14, or of any similar issue.

Thanks

angelo

20 Replies

5,599 Views
angelo_dureghel
Contributor III

Hi Simon,

thanks for your feedback, I have now used

prep_anatop_bypass();
set_anatop_bypass(1);
finish_anatop_bypass();

//udelay(100);
printf("LDO %08x\n", *((unsigned int *)0x20C8140));

LDO 007c001f

It is enabled. I also enabled it in the device tree by adding:


&reg_arm {
vin-supply = <&sw1a_reg>;
regulator-allow-bypass;
};

&reg_soc {
vin-supply = <&sw1a_reg>;
regulator-allow-bypass;
};

At least in our case, we get no benefit from this.
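
For anyone checking a similar readback, here is a minimal sketch of how the value printed from 0x20C8140 can be decoded. It assumes the usual i.MX6 PMU_REG_CORE layout (REG0_TARG in bits 4:0 for ARM, REG1_TARG in bits 13:9 for PU, REG2_TARG in bits 22:18 for SOC, with 0x1f meaning bypass and 0x00 meaning power gated); please verify the field positions against the reference manual for your exact part. The struct and function names are made up for illustration.

```c
#include <stdint.h>

/* Assumed PMU_REG_CORE field layout (verify against your reference manual). */
#define REG0_TARG_SHIFT 0    /* VDD_ARM_CAP target */
#define REG1_TARG_SHIFT 9    /* VDD_PU_CAP target  */
#define REG2_TARG_SHIFT 18   /* VDD_SOC_CAP target */
#define TARG_MASK       0x1fu
#define TARG_BYPASS     0x1fu /* power FET fully on, no regulation */
#define TARG_GATED      0x00u /* supply gated off */

/* Hypothetical helper type: the three LDO target fields. */
struct ldo_targets {
    uint32_t arm;
    uint32_t pu;
    uint32_t soc;
};

/* Split a raw PMU_REG_CORE value into its three target fields. */
static struct ldo_targets decode_reg_core(uint32_t val)
{
    struct ldo_targets t;

    t.arm = (val >> REG0_TARG_SHIFT) & TARG_MASK;
    t.pu  = (val >> REG1_TARG_SHIFT) & TARG_MASK;
    t.soc = (val >> REG2_TARG_SHIFT) & TARG_MASK;
    return t;
}

/* Human-readable name for one target field. */
static const char *targ_name(uint32_t targ)
{
    if (targ == TARG_BYPASS)
        return "bypass";
    if (targ == TARG_GATED)
        return "gated off";
    return "regulated";
}
```

Under that assumed layout, 0x007c001f decodes to ARM and SOC in bypass and the PU field at zero, which matches the "It is enabled" reading above.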

Thanks

angelo

5,600 Views
angelo_dureghel
Contributor III

Hi Simon and Igor,

Igor,

does the fact that the cpu hangs when entering low power wait mode (wfi instruction, with clock gating) give you any idea? Every hint is welcome.

Simon,

thanks for your tip, I tried it but it gives no benefit.

Btw, does LDO bypass (fsl,ldo-bypass=<1>;) mean that the CPU core and SoC power supplies (such as VDDARM_IN and VDDSOC_IN) need to be provided from an external LDO or PMIC? If yes, since we also have a board without a pmic, I don't think it suits our case.

I did some other tests:

Setting "Timer tick handling" to "Periodic timer ticks (constant rate, no dynticks)" and putting a 10 msec mdelay before entering low power/idle, everything works. This is just a test of course, but the curious thing is that it only works with the mdelay "before" entering idle (wfi).

Looking at the kernel code for the delay, it ends up in a cpu_relax(), which seems to result in a "barrier()". Btw, it is clear that applying some "relax" to the cpu before entering LPM/idle allows the cpu to enter wait mode properly.

Also,

- we have 2 boards here, and the issue is exactly the same on both, so I would exclude soldering / manufacturing issues.

- we have another board model with exactly the same imx6sx cpu, and it does not hang

- another curious thing: I tried kernel 5.2.0 from mainline with a minimal dts, and the cpu hangs in the same way

So this makes me think there is some hw design issue in this board model only, such as conflicting pins, insufficient or unstable supply current, or something to check in the VCAP capacitors / voltages.

For now, my best fix still seems to be disabling CONFIG_CPU_IDLE. In my opinion the increase in consumption should be minimal, but I cannot find appropriate measurements on this. Any reference is welcome.

Regards,

angelo

5,602 Views
simonlocke
Contributor III

Angelo,

We've arrived at much the same conclusion. We also have several boards, all the same, which have the same behaviour with this version of the kernel. I have seen changes in the behaviour of the lock up depending on the delay before entering do_idle. We also are working around the problem (of booting with LDOs enabled) by applying a kernel patch to make the imx6sx_enter_wait function just return index.

Ref: https://source.codeaurora.org/external/imx/linux-imx/tree/arch/arm/mach-imx/cpuidle-imx6sx.c?h=imx_4...
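
As a rough illustration of that kind of workaround (not the actual patch), the sketch below shows a cpuidle enter callback that skips the low power sequence entirely and only reports the requested state index back. The function name imx6sx_enter_wait_noop is hypothetical, and the two structs are only forward-declared so the fragment compiles outside the kernel tree; in the kernel they come from linux/cpuidle.h.

```c
#include <stddef.h>

/* Forward declarations standing in for the kernel's cpuidle types. */
struct cpuidle_device;
struct cpuidle_driver;

/*
 * Hypothetical stand-in for the driver's enter callback: it never touches
 * the low power mode setup and never executes wfi, it just returns the
 * index of the state it was asked to enter.
 */
static int imx6sx_enter_wait_noop(struct cpuidle_device *dev,
                                  struct cpuidle_driver *drv, int index)
{
    (void)dev;
    (void)drv;

    /* Low power entry intentionally skipped (workaround, costs power). */
    return index;
}
```

The trade-off is the one described in this post: the core never actually idles through this path, so power consumption goes up.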

We have then found that by booting with LDOs enabled and no CPU do_idles happening, the processor seems to work well. Although it is using more power than it would otherwise do. We then do not have any issues with PCIe or USB enumeration. However we do still have an issue with M4 co-processor booting, so that may well be a different issue.

The LDO bypass mode is useful when you are not using a PMIC and the voltage you are supplying is not high enough to allow the CPU to voltage/frequency scale correctly. See Ref Manual Section 48, and Data sheet section 4.1.3. Be careful though, as with no LDO your VDD_ARM_CAP and VDD_SOC_CAP could exceed specification.

I'm slightly surprised that it did not allow your board to boot by enabling LDO bypass. There are several parts to it though. Firstly you must configure the mode from uboot, because uboot reads your kernel device tree before it boots the kernel. Look at the sabre dev board example here:

https://source.codeaurora.org/external/imx/uboot-imx/tree/board/freescale/mx6sxsabresd/mx6sxsabresd....

The value passed into this function is true if fsl,ldo-bypass=<1> is set in your kernel device tree. Your uboot code must call prep_anatop_bypass(), set_anatop_bypass(1) and finish_anatop_bypass(). This switches the LDO into bypass mode just before it boots the kernel.

Next you may also need the enable-bypass options on the core regulators in your device tree:

https://source.codeaurora.org/external/imx/linux-imx/tree/arch/arm/boot/dts/imx6sx-sdb.dts?h=imx_4.1...

However, be aware of my comments from above. By enabling LDO bypass we caused ourselves other issues; like PCIe not working. However, it may work OK for you, depending on your set up.

5,602 Views
igorpadykov
NXP Employee

Hi Angelo

"putting a 10msec before entering low power /idle, all works" may point to power supply (LDO) timing issues, similar to ERR005852 for i.MX6Q; see the attached old patch and the errata description for LDO ramp times. Before executing wfi (and on exit), the processor supplies should be in accordance with the "Standby/DSM Mode" values in Table 10, Operating Ranges, of the datasheet for the specific i.MX6SX part. So one can measure VDD_ARM_IN (CAP) and VDD_SOC_IN (CAP) with an oscilloscope and check, for example, whether the voltage is lowered to "Standby Mode" levels while the processor still executes some thread in normal mode. The same goes for the exit procedure. In hardware, one can check whether the capacitors on VDDHIGH_CAP, NVCC_PLL_OUT, VDD_ARM_CAP and VDD_SOC_CAP are in accordance with the latest i.MX6SX Hardware Guide:

https://www.nxp.com/webapp/Download?colCode=IMX6SXHDG 

Best regards
igor

5,602 Views
angelo_dureghel
Contributor III

Hi Igor and Simon,

Igor,

many thanks, will check those voltages.

Simon,

I have an older 2016 u-boot (pre-dts), so I have now enabled LDO bypass from board_init:

unsigned int volatile *p = (unsigned int volatile *)0x20C8140;

for (x = 0; x < 4; ++x) {
   reg = *p;
   reg |= 0x1f;
   reg |= (0x1f << 18);
   *p++ = reg;
}

LDO bypass should be enabled, but the kernel still hangs later on at the same point.

5,602 Views
simonlocke
Contributor III

Hi Angelo,

Looks like maybe it's not working. However, I'm not sure you should be modifying all four of the control regs in sequence (Value, Set, Clear & Toggle). You probably only want to write to the first register, which is the 'value', as opposed to the set, clear & toggle.
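
A minimal sketch of what "write only the value register" could look like, assuming the same 0x1f target fields at bits 4:0 (ARM) and 22:18 (SOC) that the loop in the previous post uses. The helper names are hypothetical, and this does not replace the prep/set/finish sequence in uboot, which does more than this single write.

```c
#include <stdint.h>

/* Field positions taken from the shifts used in the loop quoted earlier. */
#define REG0_TARG_SHIFT 0    /* ARM LDO target */
#define REG2_TARG_SHIFT 18   /* SOC LDO target */
#define TARG_MASK       0x1fu

/* Hypothetical helper: return 'old' with ARM and SOC targets forced to 0x1f. */
static uint32_t reg_core_with_bypass(uint32_t old)
{
    old |= TARG_MASK << REG0_TARG_SHIFT;
    old |= TARG_MASK << REG2_TARG_SHIFT;
    return old;
}

/*
 * One read-modify-write of the value register only. The pointer is never
 * incremented, so the Set, Clear and Toggle registers that follow it
 * are left alone.
 */
static void set_ldo_bypass(volatile uint32_t *reg_core)
{
    *reg_core = reg_core_with_bypass(*reg_core);
}
```

On the target this would be called as set_ldo_bypass((volatile uint32_t *)0x20C8140), using the address from the earlier posts; on a host it can be exercised against an ordinary variable.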

I appreciate the point about an older uboot. We were using a 2015-04 until recently when we moved our whole code base up to the latest. However, looking in that version I have found the relevant functions:

https://source.codeaurora.org/external/imx/uboot-imx/tree/arch/arm/cpu/armv7/mx6/soc.c?h=nxp/imx_v20...

This shows the 3 functions to switch over to LDO mode safely. Presumably, there should also be the ldo_mode_set function in your board specific source code? You could just make that call the three LDO bypass functions directly.

We will in the background continue to research this issue, and do hope to fully resolve it. Any more info I get on it I will post here.

Anyway, we now have a work around for our first 3 issues, and we've found the solution to the 4th issue, which I am posting for other people to know about:

https://community.nxp.com/message/1169957

5,602 Views
simonlocke
Contributor III

Hi Angelo,

I have sympathy for your frustration!

We've been investigating some very similar issues with a custom iMX6SX board, where we are using 4.14.78. We previously used 4.1.15 with no issues on a very similar board. Now we want to move to a later kernel and have had 4 different issues that cause the A9 Linux to freeze. They are:

  1. Booting with LDO bypass mode not set (we are using a direct power supply, no controlled PMIC)
  2. Enumerating a driver to a PCIe device
  3. Enumerating a driver to a USB device
  4. Booting the M4 coprocessor, then booting Linux

We have also noted that it appears to potentially be something to do with CPU_FREQ or CPU_IDLE, but I have recently noted the following issue: https://community.nxp.com/thread/440575 which references ERR009572 and have begun wondering if this is anything to do with it, as our silicon does appear to be in the matching date code range.

I'm afraid I cannot suggest anything to help, as we are pretty stumped about this issue. I think I may try looking at some of the things you suggest above though.

5,602 Views
angelo_dureghel
Contributor III

Hi Simon,

thanks, glad to share a similar issue, but mine, although very similar, may not be the same. I have a v.1.3, so it should not be affected by ERR009572.

This is my current analysis, as a list of facts:

1) the system hang happens while or after executing the "wfi" assembly instruction (entering low power mode),

2) the system hang does _not_ happen on the same custom imx6sx-based board with kernel 4.9.175,

3) the system hang happens on the same hardware with kernel 4.14.78 (or 113), even using the same dtb and the same config as 4.9,

4) disabling nearly all the devices in the device tree does not change anything; the system hang is still there,

5) setting "Timer tick handling" to "Periodic timer ticks (constant rate, no dynticks)", the issue seems related to the frequency of entering/exiting idle (cpu_do_idle()), since with some printk or a 20ms delay added before entering low power mode, the system works,

6) on a similar custom imx6sx-based board with kernel 4.14.78 (or 113), the issue does not manifest,

7) with kernel 4.14.78, pre-starting and loading the M4 FW from u-boot or not does not change anything,

8) a system hang after the "wfi" instruction really seems to be connected to ERR007265, but the fix for this has been there since long before 4.14 and is still the same in mainline,

9) neither imx6sx board model uses a PMIC; both use a simpler LDO-based PS,

10) with the CCM CLPCR BM_CLPCR_ARM_CLK_DIS_ON_LPM bit not set when entering idle, the issue disappears, but the system does not seem stable (getting some oops related to memory).

So, since the frequency of entering/exiting idle matters, one theory is that this issue is related to the cpu power supply design. Maybe in kernel 4.9 the frequency of entering idle is lower; I will try to measure this by generating a signal on a gpio.

Also, I will try to connect by usb to run the ddr3 tuning; maybe the ddr3 init sequence is not perfect for this board.

Will let you know.

5,602 Views
simonlocke
Contributor III

Hi again Angelo & Igor,

I have now successfully narrowed the 1st issue I described down to the SAME point that Angelo describes. The ARMv7 core freezes on the call to cpu_do_idle, which according to the built System.map tallies with the wfi instruction in the arch/arm/mm/proc-v7.S assembler, ENTRY(cpu_v7_do_idle). The behaviour of how quickly it ends up frozen is affected by how much trace I put into the build. This again appears to tally with Angelo's results.

Angelo, the way we found to get around this boot stopping issue was to set the LDO bypass in the device tree; which is done by setting "fsl,ldo-bypass=<1>;" on the gpc node. However, this is not a valid work around for us, as it then seems to mean that we have the 3 other issues I described earlier.

Igor, do you or any other colleagues have any further thoughts on this one?

5,602 Views
simonlocke
Contributor III

Thanks Angelo,

I think we've ruled ERR009572 out now as we have confirmed we do have the PCIe Enabled variant of the device (based on the part number). So that was a false lead.

However, generally speaking some of what you describe does sound very similar in principle to our issue in that the processor hangs and we have been able to get it down to being something to do with sleep mode. I will try changing the frequency of entering/exiting idle as you suggest to see if that has an effect as you describe.

Also, we have run the DDR3L calibration tool on the board to give us correct DDR settings, and this has been built into uboot. Doing this did not change anything for us.

5,602 Views
angelo_dureghel
Contributor III

Hi Igor,

thanks for the support.

I tested linux-imx-4.14.98_2.0.0_ga, and boot fails in the same way.

[ 2.606218] mmc1: host does not support reading read-only switch, assuming write-enable
[ 2.621992] mmc1: new high speed SDHC card at address e624
[ 2.629370] mmcblk1: mmc1:e624 SL08G 7.40 GiB
[ 2.638338] mmcblk1: p1 p2
[ 2.685734] vf610-adc 2280000.adc: Timeout for adc calibration
[ 2.702368] fsl-asrc 2034000.asrc: failed to get spba clock
[ 2.712155] fsl-ssi-dai 202c000.ssi: No cache defaults, reading back from HW
[ 2.728317] NET: Registered protocol family 10
[ 2.736145] Segment Routing with IPv6
[ 2.740085] sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver
[ 2.747842] NET: Registered protocol family 17
[ 2.752357] can: controller area network core (rev 20170425 abi 9)
[ 2.758828] NET: Registered protocol family 29
[ 2.763312] can: raw protocol (rev 20170425)
[ 2.767817] can: broadcast manager protocol (rev 20170425 t)
[ 2.773534] can: netlink gateway (rev 20170425) max_hops=1
[ 2.779732] Key type dns_resolver registered
[ 2.818704] Registering SWP/SWPB emulation handler
[ 2.881997] A9-M4 sema4 num 6, A9-M4 magic number 0x12345678 - 0x7ed17773.
[ 2.890141] genpd_xlate_onecell: invalid domain index 2
[ 2.896870] genpd_xlate_onecell: invalid domain index 2
[ 2.905122] (NULL device *): hwmon_device_register() is deprecated. Please convert the driver to use hwmon_device_register_with_info().
[ 2.918771] imx_thermal 2000000.aips-bus:tempmon: Industrial CPU temperature grade - max:105C critical:100C passive:95C
[ 2.931360] genpd_xlate_onecell: invalid domain index 2
[ 2.937863] genpd_xlate_onecell: invalid domain index 2
[ 2.946213] snvs_rtc 20cc000.snvs:snvs-rtc-lp: setting system clock to 1970-01-01 00:32:32 UTC (1952)
[ 2.957696] usb_otg1_vbus: disabling
[ 2.961398] backlight-pwr: disabling
[ 2.965000] PSU-5V0: disabling
[ 2.968256] can-en: disabling
[ 2.971256] can-stby: disabling
[ 2.974423] ALSA device list:
[ 2.977467] No soundcards found.
[ 2.993011] entering cpu idle
[ 2.996015] exiting cpu idle
[ 3.000387] entering cpu idle
[ 3.003390] exiting cpu idle
[ 3.010460] entering cpu idle
[ 3.013463] exiting cpu idle
[ 3.017413] entering cpu idle
[ 3.020441] exiting cpu idle
[ 3.023976] entering cpu idle
[ 3.026998] exiting cpu idle
[ 3.030578] entering cpu idle
[ 3.033599] exiting cpu idle
[ 3.040581] entering cpu idle  <<<--- and never exits (wfi instruction, arch/arm/mm/proc-v7.S  ENTRY(cpu_v7_do_idle))
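
Since the idle entry frequency keeps coming up in this thread, the timestamps of the "entering cpu idle" debug lines can be turned into a rough entry interval. The sketch below is a small host-side helper (function names are hypothetical) that parses the "[ sec.usec]" prefix and averages the spacing between entries. The debug printk output itself takes time, so the number is only indicative.

```c
#include <stdio.h>
#include <string.h>

/* Parse a "[ sec.usec] ..." kernel log prefix into microseconds.
 * Returns 1 on success, 0 if the line has no such prefix. */
static int parse_ts_us(const char *line, long *us)
{
    long sec, frac;

    if (sscanf(line, "[ %ld.%6ld]", &sec, &frac) != 2)
        return 0;
    *us = sec * 1000000L + frac;
    return 1;
}

/* Mean spacing, in microseconds, between consecutive "entering cpu idle"
 * lines. Returns -1 if fewer than two such lines are found. */
static long mean_idle_entry_interval_us(const char *const *lines, int n)
{
    long first = 0, last = 0, ts;
    int entries = 0;

    for (int i = 0; i < n; i++) {
        if (!strstr(lines[i], "entering cpu idle"))
            continue;
        if (!parse_ts_us(lines[i], &ts))
            continue;
        if (entries == 0)
            first = ts;
        last = ts;
        entries++;
    }
    return entries > 1 ? (last - first) / (entries - 1) : -1;
}
```

Fed with the idle lines from the log above, this gives a mean of roughly 7.9 ms between idle entries with the debug prints enabled.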

Regards

angelo

5,602 Views
angelo_dureghel
Contributor III

Adding some other findings:

  • reducing the cpu_do_idle() call frequency by adding a 10 msec delay before entering, the kernel does not hang anymore.
  • also, entering/exiting low power mode without disabling clocks (CCM CLPCR bit 5, BM_CLPCR_ARM_CLK_DIS_ON_LPM, not set), the kernel does not hang.

5,600 Views
angelo_dureghel
Contributor III

Hi Igor,

so, do you have any other hint about this? The issue seems clearly connected to enabling/disabling the ARM clock when entering low power mode, and in particular to the idle enter/exit frequency.

On similar issues someone disabled CONFIG_CPU_IDLE. I applied a temporary fix, avoiding setting the BM_CLPCR_ARM_CLK_DIS_ON_LPM bit in the CCM CLPCR register when entering LP mode (pm-imx6.c near line 680).
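
To make the bit manipulation concrete, here is a small sketch of how a CLPCR value for wait mode could be computed with and without ARM clock gating. Bit 5 for BM_CLPCR_ARM_CLK_DIS_ON_LPM comes from this thread; the LPM field in bits 1:0 with 01 meaning WAIT is an assumption based on how pm-imx6.c defines it, and clpcr_for_wait is a hypothetical helper, not a kernel function.

```c
#include <stdint.h>

/* Bit definitions modelled on pm-imx6.c (assumed, verify in your tree). */
#define BP_CLPCR_LPM                 0
#define BM_CLPCR_LPM                 (0x3u << 0)
#define BM_CLPCR_ARM_CLK_DIS_ON_LPM  (0x1u << 5)

/*
 * Hypothetical helper: compute the CLPCR value for entering WAIT mode.
 * gate_arm_clk != 0 keeps the stock behaviour (ARM clock disabled in LPM);
 * gate_arm_clk == 0 models the temporary workaround (bit left clear).
 */
static uint32_t clpcr_for_wait(uint32_t clpcr, int gate_arm_clk)
{
    clpcr &= ~BM_CLPCR_LPM;
    clpcr |= 0x1u << BP_CLPCR_LPM;   /* LPM = 01: WAIT mode */

    if (gate_arm_clk)
        clpcr |= BM_CLPCR_ARM_CLK_DIS_ON_LPM;
    else
        clpcr &= ~BM_CLPCR_ARM_CLK_DIS_ON_LPM;

    return clpcr;
}
```

Starting from a cleared register, this yields 0x21 with clock gating and 0x01 without it; only the second case corresponds to the workaround described here.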

Regards

angelo

5,600 Views
igorpadykov
NXP Employee

Hi Angelo

could you try to reproduce the issue on the NXP Sabre SD reference board with the Demo Images from i.MX Software | NXP?

Best regards
igor

5,600 Views
simonlocke
Contributor III

Incidentally, one thing we have been trying to do is to reproduce it on the Sabre Dev board, but so far have failed to do so. This is unhelpful for the investigation.

5,600 Views
angelo_dureghel
Contributor III

Hi Igor, 

sorry, I don't have a sabre sd for the test.

5,600 Views
igorpadykov
NXP Employee

Hi Angelo

if an industrial part (800 MHz) is used, the dts settings should be adjusted:

        cpu0: cpu@0 {
            operating-points = <..
            fsl,soc-operating-points = <..

Best regards
igor

5,600 Views
angelo_dureghel
Contributor III

Hi Igor,

operating points for the industrial part are already set (we have IMX6X2CVN08A8):

operating-points = <
/* kHz uV */
792000 1175000
396000 1075000
198000 975000
>;
fsl,soc-operating-points = <
/* ARM kHz SOC uV */
792000 1175000
396000 1175000
198000 1175000
>;

As said, the 4.9 kernel runs fine, so it must be some issue related to 4.14.

There are several threads in this community that solved similar issues by removing CONFIG_CPU_IDLE, but none of them had a final explanation of the real issue.

It seems to be related to ERR007265, since the cpu is not able to exit IDLE. But the fix for this is already there.

Regards,

angelo

5,600 Views
igorpadykov
NXP Employee

Hi Angelo

L4.14.113 kernel is not supported by nxp, one can try with latest official Linux L4.14.98_2.0.0

linux-imx - i.MX Linux kernel 

Documentation

i.MX Software | NXP 

Best regards
igor

5,600 Views
angelo_dureghel
Contributor III

Investigating further, I have found the "hang" point:

the imx6sx cpu enters idle (wfi instruction, arch/arm/mm/proc-v7.S  ENTRY(cpu_v7_do_idle)) and never exits.

wfi requires an interrupt to wake up the cpu, and I would expect at least some timer interrupt to happen, but the system is completely frozen.

Any hint is welcome.

Thanks

angelo