Hi all,
I am working on a 4.14.113 kernel with all patches from Freescale linux-imx 4.14.78_1.0.0_ga_var01 applied, on a custom imx6sx board similar to sabresd.
I am experiencing a total kernel hang around the ext4 rootfs mount. There are no messages useful for debugging, no initcall is ongoing, there are no oops messages, and the freeze point varies in time depending on whether I fully power cycle the board or just reset it with the pwr reset button. If I set a timer in a driver probe, it also stops, so the kernel freezes completely.
I finally found out that the issue is caused by CONFIG_CPU_IDLE. With it removed, the boot process completes properly and I reach the prompt. Also, rolling back to kernel 4.9, it works properly with CONFIG_CPU_IDLE enabled.
Considering that the CONFIG_CPU_IDLE related code may not be the real cause, and that I could have some hw issues too, I am wondering if you know of any imx6sx CONFIG_CPU_IDLE related issue, maybe fixed after 4.14, or of any similar issue.
Thanks
angelo
Hi Simon,
thanks for your feedback, I have now used
prep_anatop_bypass();
set_anatop_bypass(1);
finish_anatop_bypass();
//udelay(100);
printf("LDO %08x\n", *((unsigned int *)0x20C8140));
LDO 007c001f
It is enabled. I also enabled it in the device tree by adding
&reg_arm {
	vin-supply = <&sw1a_reg>;
	regulator-allow-bypass;
};
&reg_soc {
	vin-supply = <&sw1a_reg>;
	regulator-allow-bypass;
};
At least in our case, we get no benefit from this.
Thanks
angelo
Hi Simon and Igor,
Igor,
does the fact that the cpu hangs when entering low power mode at wait (wfi instruction, with clocks gated) give you any idea? Every hint is welcome.
Simon,
thanks for your tip, I tried it but it gives no benefit.
Btw, does LDO bypass (fsl,ldo-bypass=<1>;) mean that the CPU core and SoC power supplies (VDDARM_IN, VDDSOC_IN) need to be provided by external LDOs or a PMIC? If yes, since we also have a board without a PMIC, I don't think it suits our case.
I did some other tests:
setting "Timer tick handling" to "Periodic timer ticks (constant rate, no dynticks)", and putting a 10 ms mdelay before
entering low power / idle, everything works. This is just a test of course, but the curious thing is that it only works with the mdelay "before" entering idle (wfi).
Looking at the kernel code for msleep, it ends up in a cpu_relax(), which seems to result in a "barrier()". Btw, it is clear that applying some "relax" to the cpu before entering LPM/idle allows the cpu to enter wait mode properly.
Also,
- we have 2 boards here and the issue is exactly the same on both, so I would exclude soldering / manufacturing issues.
- we have another board model with exactly the same imx6sx cpu and it is not hanging
- another curious thing: I tried kernel 5.2.0 from mainline with a minimal dts, and the cpu hangs in the same way
So this makes me think there is some hw design issue in this board model only, such as conflicting pins, insufficient or unstable supply current, or something to check in the VCAP capacitors / voltages.
For now, my best fix still seems to be disabling CONFIG_CPU_IDLE. In my opinion the increase in power consumption should be minimal, but I cannot find appropriate measurements on this. Any reference is welcome.
Regards,
angelo
Angelo,
We've arrived at much the same conclusion. We also have several boards, all the same, which have the same behaviour with this version of the kernel. I have seen changes in the behaviour of the lock up depending on the delay before entering do_idle. We also are working around the problem (of booting with LDOs enabled) by applying a kernel patch to make the imx6sx_enter_wait function just return index.
Ref: https://source.codeaurora.org/external/imx/linux-imx/tree/arch/arm/mach-imx/cpuidle-imx6sx.c?h=imx_4...
We have then found that by booting with LDOs enabled and no CPU do_idles happening, the processor seems to work well. Although it is using more power than it would otherwise do. We then do not have any issues with PCIe or USB enumeration. However we do still have an issue with M4 co-processor booting, so that may well be a different issue.
The LDO bypass mode is useful when you are not using a PMIC and the voltage you are supplying is not high enough to allow the CPU to voltage/frequency scale correctly. See Ref Manual Section 48, and Data sheet section 4.1.3. Be careful though, as with no LDO your VDD_ARM_CAP and VDD_SOC_CAP could exceed specification.
I'm slightly surprised that it did not allow your board to boot by enabling LDO bypass. There are several parts to it though. Firstly you must configure the mode from uboot, because uboot reads your kernel device tree before it boots the kernel. Look at the sabre dev board example here:
The value passed into this function is true if fsl,ldo-bypass=<1> is set in your kernel device tree. Your uboot code must call prep_anatop_bypass(), set_anatop_bypass(1) and finish_anatop_bypass(). This switches the LDO into bypass mode just before it boots the kernel.
Next you may also need the enable-bypass options on the core regulators in your device tree:
However, be aware of my comments from above. By enabling LDO bypass we caused ourselves other issues; like PCIe not working. However, it may work OK for you, depending on your set up.
Hi Angelo
"putting a 10msec before entering low power /idle, all works" may point to
power supply (LDO) timing issues, similar to ERR005852 for i.MX6Q and the attached old patch
(errata description for LDO ramp times). Before executing wfi (and on exit), the processor supplies
should comply with the "Standby/DSM Mode" rows of Table 10, Operating Ranges, in the datasheet
for the specific i.MX6SX part. So one can check and measure VDD_ARM_IN (CAP) and
VDD_SOC_IN (CAP) with an oscilloscope, looking for cases where, for example, the voltage is lowered to "Standby Mode" while the processor
still executes some thread in normal mode. The same applies to the exit procedure. In hardware, one
can check the capacitors on VDDHIGH_CAP, NVCC_PLL_OUT, VDD_ARM_CAP and VDD_SOC_CAP:
are they in accordance with the latest i.MX6SX Hardware Guide?
https://www.nxp.com/webapp/Download?colCode=IMX6SXHDG
Best regards
igor
Hi Igor and Simon,
Igor,
many thanks, will check those voltages.
Simon,
I have an older 2016 u-boot, pre-dts, so I have now enabled LDO bypass from board_init:
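unsigned int volatile *p = (unsigned int volatile *)0x20C8140;
unsigned int reg;
int x;

/* read-modify-write four consecutive 32-bit registers starting at 0x20C8140 */
for (x = 0; x < 4; ++x) {
	reg = *p;
	reg |= 0x1f;		/* bits 4:0 */
	reg |= (0x1f << 18);	/* bits 22:18 */
	*p++ = reg;
}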
LDO bypass should be enabled, but kernel still hangs later on at the same point.
Hi Angelo,
Looks like maybe it's not working. However, I'm not sure you should be modifying all four of the control regs in sequence (Value, Set, Clear & Toggle). You probably only want to write to the first register, which is the 'value', as opposed to the set, clear & toggle.
I appreciate the point about an older uboot. We were using a 2015-04 until recently, when we moved our whole code base up to the latest. However, looking in that version I have found the relevant functions:
This shows the 3 functions to switch over to LDO mode safely. Presumably, there should also be the ldo_mode_set function in your board specific source code? You could just make that call the three LDO bypass functions directly.
We will in the background continue to research this issue, and do hope to fully resolve it. Any more info I get on it I will post here.
Anyway, we now have a work around for our first 3 issues, and we've found the solution to the 4th issue, which I am posting for other people to know about:
Hi Angelo,
I have sympathy for your frustration!
We've been investigating some very similar issues with a custom iMX6SX board, where we are using 4.14.78. We previously used 4.1.15 with no issues on a very similar board. Now we want to move to a later kernel and have had 4 different issues that cause the A9 Linux to freeze. They are:
We have also noted that it appears to potentially be something to do with CPU_FREQ or CPU_IDLE, but I have recently noted the following issue: https://community.nxp.com/thread/440575 which references ERR009572 and have begun wondering if this is anything to do with it, as our silicon does appear to be in the matching date code range.
I'm afraid I cannot suggest anything to help, as we are pretty stumped about this issue. I think I may try looking at some of the things you suggest above though.
Hi Simon,
thanks, happy to share a similar issue, but mine, although very similar, does not seem to be the same. I have a v.1.3, so it should not be affected by ERR009572.
This is my current analysis, as a facts list:
1) the system hang happens while or after executing the "wfi" assembly instruction (entering low power mode),
2) the system hang is _not_ happening on the same custom imx6sx-based board with kernel 4.9.175,
3) the system hang is happening on the same hardware with kernel 4.14.78 (or 113), even using the same dtb and the same config as 4.9,
4) disabling nearly all the devices in the device tree does not change anything, the system hang is still there,
5) with "Timer tick handling" set to "Periodic timer ticks (constant rate, no dynticks)", the issue seems related to the frequency of entering/exiting idle (cpu_do_idle()), since after adding some printk or a 20ms delay before entering low power mode the system works,
6) on a similar custom board, imx6sx-based, with kernel 4.14.78 (or 113), the issue is not manifesting,
7) with kernel 4.14.78, pre-starting and loading the M4 FW from u-boot or not does not change anything,
8) a system hang after the "wfi" instruction really seems to be connected to ERR007265, but the fix for this has been there since long before 4.14 and is still the same in mainline,
9) both imx6sx board models are not using a PMIC but a simpler LDO-based PS,
10) when the CCM CLPCR BM_CLPCR_ARM_CLK_DIS_ON_LPM bit is not set on entering idle, the issue disappears, but the system does not seem stable (getting some oops related to memory).
So, since the frequency of entering/exiting idle matters, one theory is that this issue is related to the cpu power supply design. Maybe in kernel 4.9 the frequency of entering idle is lower; I will try to measure this by generating a signal on a gpio.
Also, I will try to connect by usb to run the ddr3 tuning, maybe the ddr3 init sequence is not perfect for this board.
Will let you know.
Hi again Angelo & Igor,
I have now successfully narrowed the 1st issue I described down to the SAME point that Angelo describes. The ARM (A9) core freezes on the call to cpu_do_idle, which according to the built System.map tallies with the wfi instruction in the arch/arm/mm/proc-v7.S assembler, ENTRY(cpu_v7_do_idle). How quickly it ends up frozen is affected by how much trace I put into the build. This again appears to tally with Angelo's results.
Angelo, the way we found to get around this boot stopping issue was to set the LDO bypass in the device tree; which is done by setting "fsl,ldo-bypass=<1>;" on the gpc node. However, this is not a valid work around for us, as it then seems to mean that we have the 3 other issues I described earlier.
Igor, do you or any other colleagues have any further thoughts on this one?
Thanks Angelo,
I think we've ruled ERR009572 out now as we have confirmed we do have the PCIe Enabled variant of the device (based on the part number). So that was a false lead.
However, generally speaking some of what you describe does sound very similar in principle to our issue in that the processor hangs and we have been able to get it down to being something to do with sleep mode. I will try changing the frequency of entering/exiting idle as you suggest to see if that has an effect as you describe.
Also, we have run the DDR3L calibration tool on the board to give us correct DDR settings, and this has been built into uboot. Doing this did not change anything for us.
Hi Igor,
thanks for the support.
I tested linux-imx-4.14.98_2.0.0_ga, and boot fails in the same way.
[ 2.606218] mmc1: host does not support reading read-only switch, assuming write-enable
[ 2.621992] mmc1: new high speed SDHC card at address e624
[ 2.629370] mmcblk1: mmc1:e624 SL08G 7.40 GiB
[ 2.638338] mmcblk1: p1 p2
[ 2.685734] vf610-adc 2280000.adc: Timeout for adc calibration
[ 2.702368] fsl-asrc 2034000.asrc: failed to get spba clock
[ 2.712155] fsl-ssi-dai 202c000.ssi: No cache defaults, reading back from HW
[ 2.728317] NET: Registered protocol family 10
[ 2.736145] Segment Routing with IPv6
[ 2.740085] sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver
[ 2.747842] NET: Registered protocol family 17
[ 2.752357] can: controller area network core (rev 20170425 abi 9)
[ 2.758828] NET: Registered protocol family 29
[ 2.763312] can: raw protocol (rev 20170425)
[ 2.767817] can: broadcast manager protocol (rev 20170425 t)
[ 2.773534] can: netlink gateway (rev 20170425) max_hops=1
[ 2.779732] Key type dns_resolver registered
[ 2.818704] Registering SWP/SWPB emulation handler
[ 2.881997] A9-M4 sema4 num 6, A9-M4 magic number 0x12345678 - 0x7ed17773.
[ 2.890141] genpd_xlate_onecell: invalid domain index 2
[ 2.896870] genpd_xlate_onecell: invalid domain index 2
[ 2.905122] (NULL device *): hwmon_device_register() is deprecated. Please convert the driver to use hwmon_device_register_with_info().
[ 2.918771] imx_thermal 2000000.aips-bus:tempmon: Industrial CPU temperature grade - max:105C critical:100C passive:95C
[ 2.931360] genpd_xlate_onecell: invalid domain index 2
[ 2.937863] genpd_xlate_onecell: invalid domain index 2
[ 2.946213] snvs_rtc 20cc000.snvs:snvs-rtc-lp: setting system clock to 1970-01-01 00:32:32 UTC (1952)
[ 2.957696] usb_otg1_vbus: disabling
[ 2.961398] backlight-pwr: disabling
[ 2.965000] PSU-5V0: disabling
[ 2.968256] can-en: disabling
[ 2.971256] can-stby: disabling
[ 2.974423] ALSA device list:
[ 2.977467] No soundcards found.
[ 2.993011] entering cpu idle
[ 2.996015] exiting cpu idle
[ 3.000387] entering cpu idle
[ 3.003390] exiting cpu idle
[ 3.010460] entering cpu idle
[ 3.013463] exiting cpu idle
[ 3.017413] entering cpu idle
[ 3.020441] exiting cpu idle
[ 3.023976] entering cpu idle
[ 3.026998] exiting cpu idle
[ 3.030578] entering cpu idle
[ 3.033599] exiting cpu idle
[ 3.040581] entering cpu idle <<<--- and never exits (wfi instruction, arch/arm/mm/proc-v7.S ENTRY(cpu_v7_do_idle))
Regards
angelo
Adding some other findings:
Hi Igor,
so, do you have any other hint about this? The issue seems clearly connected to enabling/disabling the ARM clock when entering low power mode, and in particular to the idle enter / exit frequency.
On similar issues someone disabled CONFIG_CPU_IDLE. I applied a temporary fix, avoiding setting BM_CLPCR_ARM_CLK_DIS_ON_LPM bit on CCM CLPCR register entering LP mode (pm-imx6.c near line 680).
Regards
angelo
Hi Angelo
could you try to reproduce issue on NXP Sabre SD reference board with Demo Images from
Incidentally, one thing we have been trying to do is to reproduce it on the Sabre Dev board, but so far have failed to do so. This is unhelpful for the investigation.
Hi Igor,
sorry, i don't have a sabre sd for the test.
Hi Angelo
if an industrial part is used (800 MHz), the dts settings should be adjusted:
cpu0: cpu@0 {
operating-points = <..
fsl,soc-operating-points = <..
Best regards
igor
Hi Igor,
operating points for the industrial part are already set (we have IMX6X2CVN08A8):
operating-points = <
	/* kHz    uV */
	792000  1175000
	396000  1075000
	198000  975000
>;
fsl,soc-operating-points = <
	/* ARM kHz  SOC uV */
	792000  1175000
	396000  1175000
	198000  1175000
>;
As said, 4.9 kernel runs fine so it must be some issue related with 4.14.
There are several threads in this community that solved similar issues removing CONFIG_CPU_IDLE but none of them had a final explanation about the real issue.
It seems to be related to ERR007265 since cpu is not able to exit IDLE. But the fix for this is already there.
Regards,
angelo
Hi Angelo
L4.14.113 kernel is not supported by nxp, one can try with latest official Linux L4.14.98_2.0.0
Documentation
Best regards
igor
Investigating further, I have found the "hang" point:
the imx6sx cpu enters idle (wfi instruction, arch/arm/mm/proc-v7.S ENTRY(cpu_v7_do_idle)) and never exits.
wfi requires an interrupt to wake up the cpu, and I would expect at least some timer interrupt to happen, but the system is completely frozen.
Any hint is welcome.
Thanks
angelo