We noticed that suspend/resume is broken on the i.MX 8QuadPlus when using the SCFW from Linux 5.4.70_2.3.7 Patch. It used to work fine on the i.MX 8QuadPlus with the previous SCFW, and it still works fine on the i.MX 8QuadMax. Therefore my question: did NXP ever validate any of this on the i.MX 8QuadPlus? What exactly could be the issue? Thanks!
While resuming, the Apalis iMX8 QuadPlus gets stuck and does not proceed:
root@apalis-imx8-06602842:~# echo +10 > /sys/class/rtc/rtc1/wakealarm && echo enabled > /sys/class/tty/ttyLP1/power/wakeup && echo mem > /sys/power/state
[ 13.453866] PM: suspend entry (deep)
[ 13.462965] Filesystems sync: 0.004 seconds
[ 13.775599] Freezing user space processes ... (elapsed 0.002 seconds) done.
[ 13.784960] OOM killer disabled.
[ 13.788255] Freezing remaining freezable tasks ... (elapsed 0.074 seconds) done.
[ 13.873266] mwifiex_pcie 0000:01:00.0: None of the WOWLAN triggers enabled
[ 13.886494] pcieport 0001:02:01.0: pciehp: Timeout on hotplug command 0x1038 (issued 9628 msec ago)
[ 15.914487] pcieport 0001:02:01.0: pciehp: Timeout on hotplug command 0x0008 (issued 2020 msec ago)
[ 16.731332] fec 5b040000.ethernet eth0: Link is Down
[ 16.738094] usb3503 3-0008: switched to STANDBY mode
[ 16.919295] PM: suspend devices took 3.052 seconds
[ 16.975181] Disabling non-boot CPUs ...
[ 16.979980] CPU1: shutdown
[ 16.982795] psci: CPU1 killed (polled 0 ms)
[ 16.990797] CPU2: shutdown
[ 16.993600] psci: CPU2 killed (polled 0 ms)
[ 17.001593] CPU3: shutdown
[ 17.004412] psci: CPU3 killed (polled 0 ms)
[ 17.011968] CPU4: shutdown
[ 17.014796] psci: CPU4 killed (polled 0 ms)
[ 17.022091] Enabling non-boot CPUs ...
[ 17.026662] Detected VIPT I-cache on CPU1
[ 17.026692] GICv3: CPU1: found redistributor 1 region 0:0x0000000051b20000
[ 17.026740] CPU1: Booted secondary processor 0x0000000001 [0x410fd034]
[ 17.027716] CPU1 is up
[ 17.048469] Detected VIPT I-cache on CPU2
[ 17.048484] GICv3: CPU2: found redistributor 2 region 0:0x0000000051b40000
[ 17.048506] CPU2: Booted secondary processor 0x0000000002 [0x410fd034]
[ 17.048970] CPU2 is up
[ 17.069645] Detected VIPT I-cache on CPU3
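When comparing several captured console logs, a small helper can tell a hung resume from a completed one. This is only a sketch: the function name resume_completed is made up here, and it assumes the kernel prints "PM: suspend entry" when going down and "PM: suspend exit" once resume has finished, as the logs in this thread do.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if every "PM: suspend entry" in a
# captured console log is matched by a "PM: suspend exit".
resume_completed() {
    log_file="$1"
    # grep -c prints 0 and returns non-zero when nothing matches,
    # so "|| true" keeps the count without aborting under "set -e".
    entries=$(grep -c 'PM: suspend entry' "$log_file" || true)
    exits=$(grep -c 'PM: suspend exit' "$log_file" || true)
    [ "$entries" -gt 0 ] && [ "$entries" -eq "$exits" ]
}
```

Usage could look like: resume_completed console.log && echo "resume OK" || echo "resume hung", where console.log is whatever file the serial console was captured to.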
Reply from the expert team:
=======================
I tried on a QP part (socketed MEK) and it works fine for me.
Some questions for the customer reporting the issue:
=======================
> I tried on a QP part (socketed MEK) and it works fine for me.
And what exact versions of things did you use for that test?
> 1. Can they please list all the different components (linux, scfw, uboot, atf etc)?
As mentioned before it is all based on downstream NXP Linux BSP 5.4.70_2.3.0 at the level of Linux 5.4.70_2.3.7 Patch:
Linux https://git.toradex.com/cgit/linux-toradex.git/log/?h=toradex_5.4-2.3.x-imx
SCFW https://github.com/toradex/i.MX-System-Controller-Firmware
U-Boot https://git.toradex.com/cgit/u-boot-toradex.git/log/?h=toradex_imx_v2020.04_5.4.70_2.3.0
ATF https://git.toradex.com/cgit/imx-atf.git/log/?h=toradex_imx_5.4.70_2.3.0
Anyway, it is basically the OpenEmbedded/Yocto Project repo manifest from here:
https://git.toradex.com/cgit/toradex-manifest.git/log/?h=dunfell-5.x.y
> 2. Is the failure random or fails every time?
Fails every time.
> 3. Which earlier version of SCFW was working?
Linux 5.4.70_2.3.5 Patch (as we skipped the later Linux 5.4.70_2.3.6 Patch)
> 4. Any board changes between the two tests?
No, not really. And as mentioned before, it works just fine on the QuadMax, just not on the QuadPlus.
Thanks!
As I cannot reproduce the issue on our side, more debugging needs to be done on the board.
I would suggest the following to narrow down the issue:
1. Replace one at a time (uboot/atf/linux/scfw) from the 2.3.5 to 2.3.7. This may help identify which module is causing the issue.
2. Build SCFW with debug monitor (M=1) and enable SCFW uart. Type power.r at the point of failure to see the status of A72. It should be up since the voltage rail is at 1.1V. Also would like to know if there is any error reported on the SCFW console.
3. Connect with a JTAG debugger to see where A72 is hung.
4. Offline all A53 cores before suspend and see if the issue still persists.
> 1. Replace one at a time (uboot/atf/linux/scfw) from the 2.3.5 to 2.3.7. This may help identify which module is causing the issue.
Turns out it is indeed not the boot container (U-Boot/ATF/SCFW) but rather Linux itself! I also tried with the exact i.MX 8QuadMax device tree (instead of the QuadPlus one) but that did not help.
> 2. Build SCFW with debug monitor (M=1) and enable SCFW uart. Type power.r at the point of failure to see the status of A72. It should be up since the voltage rail is at 1.1V. Also would like to know if there is any error reported on the SCFW console.
Please find it attached:
apalis-imx8qp_scfw_bsp-5.6_booted-suspended-resumed.log: This is based on 2.3.5
apalis-imx8qp_scfw_bsp-5.7_booted-suspended-resume_failed.log: This is based on 2.3.7
> 3. Connect with a JTAG debugger to see where A72 is hung.
Unfortunately, I currently do not have any such environment available.
> 4. Offline all A53 cores before suspend and see if the issue still persists.
Indeed, that helps! What exactly does that mean now?
root@apalis-imx8-06677517:~# echo 0 > /sys/devices/system/cpu/cpu0/online
[ 415.415305] CPU0: shutdown
[ 415.418077] psci: CPU0 killed (polled 0 ms)
root@apalis-imx8-06677517:~# echo 0 > /sys/devices/system/cpu/cpu1/online
[ 419.167260] CPU1: shutdown
[ 419.169990] psci: CPU1 killed (polled 0 ms)
root@apalis-imx8-06677517:~# echo 0 > /sys/devices/system/cpu/cpu2/online
[ 422.375296] CPU2: shutdown
[ 422.378021] psci: CPU2 killed (polled 0 ms)
root@apalis-imx8-06677517:~# echo 0 > /sys/devices/system/cpu/cpu3/online
[ 425.428610] CPU3: shutdown
[ 425.431405] psci: CPU3 killed (polled 0 ms)
[ 427.298482] read temp sensor 0 failed, could be SS NOT powered up, return 0 for this thermal zone, ret -1
root@apalis-imx8-06677517:~# echo +10 > /sys/class/rtc/rtc1/wakealarm; echo mem > /sys/power/state
[ 440.696760] PM: suspend entry (deep)
[ 440.706600] Filesystems sync: 0.006 seconds
[ 440.808392] Freezing user space processes ... (elapsed 0.001 seconds) done.
[ 440.817354] OOM killer disabled.
[ 440.820613] Freezing remaining freezable tasks ... (elapsed 0.102 seconds) done.
[ 440.930796] printk: Suspending console(s) (use no_console_suspend to debug)
[ 440.950219] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 440.951417] sd 0:0:0:0: [sda] Stopping disk
[ 440.994201] pcieport 0000:02:01.0: pciehp: Timeout on hotplug command 0x1038 (issued 436772 msec ago)
[ 443.014216] pcieport 0000:02:01.0: pciehp: Timeout on hotplug command 0x0008 (issued 2020 msec ago)
[ 443.807018] fec 5b040000.ethernet eth0: Link is Down
[ 443.808691] usb3503 3-0008: switched to STANDBY mode
[ 443.933242] PM: suspend devices took 2.996 seconds
[ 443.978926] Disabling non-boot CPUs ...
[ 444.120230] imx6q-pcie 5f000000.pcie: PCIe PLL locked after 0 us.
[ 444.434230] imx6q-pcie 5f000000.pcie: Link up
[ 444.434234] imx6q-pcie 5f000000.pcie: Link: Gen2 disabled
[ 444.434238] imx6q-pcie 5f000000.pcie: Link up, Gen1
[ 444.488704] [drm] Started firmware!
[ 444.489817] [drm] HDP FW Version - ver 34219 verlib 20560
[ 444.489825] [drm] Pixel clock: 0 KHz, character clock: 0, bpc is 0-bit.
[ 444.489829] [drm] Pixel clk (0 KHz) not supported, color depth (0-bit)
[ 444.489845] [drm:cdns_hdmi_phy_set_imx8qm [cdns_mhdp_imx]] *ERROR* failed to set phy pclock
[ 444.491567] caam 31400000.crypto: registering rng-caam
[ 444.503369] usb3503 3-0008: switched to HUB mode
[ 444.768496] usb usb4: root hub lost power or was reset
[ 444.768501] usb usb5: root hub lost power or was reset
[ 444.962231] usb 3-1: reset high-speed USB device number 2 using ci_hdrc
[ 445.309767] configfs-gadget gadget: high-speed config #1: c
[ 445.434227] usb 3-1.2: reset high-speed USB device number 3 using ci_hdrc
[ 445.858229] usb 3-1.2.3: reset full-speed USB device number 4 using ci_hdrc
[ 445.981972] ahci-imx 5f020000.sata: external osc is used.
[ 445.984857] sd 0:0:0:0: [sda] Starting disk
[ 446.106221] pcieport 0000:02:01.0: pciehp: Timeout on hotplug command 0x0008 (issued 1672 msec ago)
[ 446.458220] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 446.480726] ata1.00: configured for UDMA/133
[ 447.304393] fec 5b040000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[ 448.126236] pcieport 0000:02:01.0: pciehp: Timeout on hotplug command 0x1028 (issued 2020 msec ago)
[ 448.127621] PM: resume devices took 3.640 seconds
[ 448.322846] OOM killer enabled.
[ 448.325984] Restarting tasks ... done.
[ 448.386447] PM: suspend exit
root@apalis-imx8-06677517:~#
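The manual steps above can be collapsed into a loop. This is a minimal sketch, assuming cpu0 to cpu3 are the A53 cores as in the log above; the function name set_cpus_online and the CPU_SYSFS override are introduced here for illustration only.

```shell
#!/bin/sh
# Defaults to the real sysfs location; can be pointed elsewhere for a dry run.
CPU_SYSFS="${CPU_SYSFS:-/sys/devices/system/cpu}"

# Hypothetical helper: write the given state (0 = offline, 1 = online)
# to the "online" file of every listed CPU index.
set_cpus_online() {
    state="$1"
    shift
    for n in "$@"; do
        echo "$state" > "$CPU_SYSFS/cpu$n/online" || return 1
    done
}

# Example, as root on the target:
#   set_cpus_online 0 0 1 2 3     # take the A53 cores offline
#   echo +10 > /sys/class/rtc/rtc1/wakealarm; echo mem > /sys/power/state
#   set_cpus_online 1 0 1 2 3     # bring them back after resume
```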
Hello,
I got the reply from the expert
==========================
As the customer mentioned, "Turns out it is indeed not the boot container (U-Boot/ATF/SCFW) but rather Linux itself!", so the question now is what changed in the Linux kernel code between 2.3.5 and 2.3.7. Since the customer is using their own kernel repository https://git.toradex.com/cgit/linux-toradex.git/tree/?h=toradex_5.4-2.3.x-imx, they need to check that.
In our Linux kernel repository https://source.codeaurora.org/external/imx/linux-imx/refs/tags, I checked the commits between tags rel_imx_5.4.70_2.3.2 and rel_imx_5.4.70_2.3.7 and found that only the following commit is related to 8QM.
If the customer does not find anything suspicious in their own changes, they can try reverting this commit to check whether it is related.
===========================
Best regards,
Jimmy
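The tag comparison suggested above could be run roughly as follows inside a kernel checkout. This is a sketch only: the function name commits_between is made up, and the paths in the example are guesses at where i.MX 8 power-management changes might live, not a confirmed list. In the Toradex repository the two revisions would have to be replaced by whatever commits correspond to the two patch levels.

```shell
#!/bin/sh
# Hypothetical helper: list commits reachable from $2 but not from $1,
# optionally limited to the paths given as further arguments.
commits_between() {
    old="$1"
    new="$2"
    shift 2
    git log --oneline "$old..$new" -- "$@"
}

# Example (tags from the linux-imx repository, paths are assumptions):
#   commits_between rel_imx_5.4.70_2.3.2 rel_imx_5.4.70_2.3.7 \
#       arch/arm64/boot/dts/freescale drivers/firmware/imx
```

Each suspect commit found this way can then be reverted on its own to see whether resume on the QuadPlus starts working again.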
> And what exact versions of things did you use for that test?
RV - Used the released version, but from an NXP internal build.
A few more questions:
1. The board design between QuadMax and QuadPlus is the same?
2. Does every QP board fail?
3. Can you measure the PMIC voltage for A72 at the point of failure? I am not sure if you are using a PMIC or not.
> 1. The board design between QuadMax and QuadPlus is the same?
Yes, exactly the same.
> 2. Does every QP board fail?
Yes.
> 3. Can you measure the PMIC voltage for A72 at the point of failure?
I looked at both VDD_A53 and VDD_A72, on the QuadMax as well as on the QuadPlus.
QM
before: both 1V
echo +30 > /sys/class/rtc/rtc1/wakealarm; echo mem > /sys/power/state
during: both off
after: both 1V
=> resumes just fine
QP
before: both 1V
echo +30 > /sys/class/rtc/rtc1/wakealarm; echo mem > /sys/power/state
during: both off
after: both 1.1V
=> just hangs
> I am not sure if you are using a PMIC or not.
Yes, as for the PMICs we do have a dual PF8100 design.
Latest logfiles attached.
Just to clarify, tested on the 5.4.70_2.3.7 release.
Any statement from NXP? Thanks!
Does it work on the MEK board?
(I don't think NXP validates the BSP on Apalis)
Well, I am not aware of any MEK board existing for the i.MX 8QuadPlus.