Dear NXP,
We are using the i.MX6DP processor on a custom board. After several tests we found an intermittent issue in the boot process that hangs the device. After debugging the problem, we detected that the hang occurs during the initialization of the Galcore module. In particular, the hang is in the function _ResetGPU() in drivers/mxc/gpu-viv/hal/kernel/arch/gc_hal_kernel_hardware.c; apparently there is a deadlock in the function that reads the registers from the physical address to identify whether the MMU is enabled. This is the call that hangs our device:
gcmkONERROR(gckOS_ReadRegisterEx(
    Hardware->os,
    Hardware->core,
    0x0018C,
    &regMmuCtrl
    ));
and this is the code snippet where the issue happens:
/* Force Disable MMU to guarantee setup command be read from physical addr */
if (Hardware->options.secureMode == gcvSECURE_IN_NORMAL)
{
    gctUINT32 regMmuCtrl = 0;
    gcmkONERROR(gckOS_ReadRegisterEx(
        Hardware->os,
        Hardware->core,
        0x00388,
        &regMmuCtrl
        ));

    mmuEnabled = (((((gctUINT32) (regMmuCtrl)) >> (0 ? 0:0)) & ((gctUINT32) ((((1 ? 0:0) - (0 ? 0:0) + 1) == 32) ? ~0U : (~(~0U << ((1 ? 0:0) - (0 ? 0:0) + 1)))))) );
}
else
{
    gctUINT32 regMmuCtrl = 0;
    gcmkONERROR(gckOS_ReadRegisterEx(
        Hardware->os,
        Hardware->core,
        0x0018C,
        &regMmuCtrl
        ));

    mmuEnabled = (((((gctUINT32) (regMmuCtrl)) >> (0 ? 0:0)) & ((gctUINT32) ((((1 ? 0:0) - (0 ? 0:0) + 1) == 32) ? ~0U : (~(~0U << ((1 ? 0:0) - (0 ? 0:0) + 1)))))) );
}

if (mmuEnabled)
{
    /* Not reset properly, reset again. */
    continue;
}
This issue appears intermittently during Galcore initialization, so I assume it is a race condition in the GPU reset sequence.
Originally we found this issue with a BSP based on your 4.9 release, but we also see the same issue with your BSP based on rel_imx_5.4.24_2.1.0.
Could you provide some guidance on how to fix it?
We noticed that there are several errata for this processor related to the GPU that might be relevant; in particular, ERR004341 "GPU2D: Accessing GPU2D when it is power-gated will cause a deadlock in the system" is very similar to the issue we are seeing.
Best regards
Arturo
Hello Arturo,
I have started to work on this issue. It would be great if you could provide the error log, which would help us investigate further.
Regards
Hi,
Attached you can find the console outputs, with debug lines, showing where the device hangs.
- console_output_failing_on_boot_no_debug.txt: Boot without any debug messages. The device hangs after starting the Galcore driver.
- console_output_failing_on_boot_with_debug.txt: Boot with debug messages. The device hangs in the initialization of the Galcore module; we located the issue in the gckOS_ReadRegisterEx call.
- console_output_working_on_boot_with_debug.txt: Boot with debug messages. The _ResetGPU() function works fine and the device boots without issues.
Keep in mind that the file/line numbers printed in the debug messages are shifted by the debug messages themselves.
Thanks,
Arturo.
Hello,
We have tried, but this issue is difficult to reproduce on our end, so we need some information from you:
1. Are you running any application or modifying any GPU-related registers before this hang occurs? How frequent is the issue?
2. Do you hit the 'gckOS_ReadRegisterEx' function every time you see this issue, or does the hang happen at a different point in the code each time? This will help us confirm that the issue is with the GPU only.
3. Can you provide some more debug logs from inside the function 'gckOS_ReadRegisterEx'? Since we are not able to reproduce this, detailed information on gckOS_ReadRegisterEx and on the 'readl' function would be helpful for the code walk-through and issue isolation.
Regards
Hi,
- Are you running any application or modifying any GPU related registers before this hang can occur?
No, we don't modify any GPU register. The error happens at boot, before we start any graphical backend.
- How much frequent the issue is?
This issue is intermittent, but it occurs in around 10% of boots.
- Are you facing issues with 'gckOS_ReadRegisterEx' function every time when you meet this issue or the hang is happing from a different point of code each time?
Before adding even more debug output, the issue was always found in the "gckOS_ReadRegisterEx" function (in the readl function); however, we now have several logs where the issue happens in the "gckOS_WriteRegisterEx" function (in the writel function).
- Can you provide us some more debug logs inside the function 'gckOS_ReadRegisterEx'?
When the device hangs, it always fails in the readl() or writel() functions. I cannot add debug lines inside readl() or writel(), because if I do the system does not boot; I assume these are low-level functions that run in interrupt context, where I cannot print messages.
Attached you can find several logs from different boot sequences where the device hangs; in many cases, after a few minutes there is a kernel dump and the boot process continues. It seems as if the CPU gets stuck in a read/write operation and exits via a timeout or something similar. Note that all of the attached kernel dumps are from a 4.9 kernel based on your imx_4.9.88_2.0.0_ga tag, because the product is based on this kernel.
- console_output_failing_on_boot_debug_4_9.txt: Boot with debug messages. The device hangs after starting the Galcore driver. Check the kernel dump and verify that it is stuck in the gckOS_ReadRegisterEx function.
- console_output_failing_on_boot_more_debug_4_9.txt: Boot with debug messages and timestamps. The device hangs after starting the Galcore driver and continues after 2 min. Check the kernel dump and verify that it is stuck in the gckOS_WriteRegisterEx function.
- console_output_failing_on_boot_more_debug_4_9_log2.txt: Boot with debug messages and timestamps. The device hangs after starting the Galcore driver and continues after 9 min. Check the kernel dump and verify that it is stuck in the gckOS_WriteRegisterEx function.
In parallel, we found an issue in the imx_4.9.88_2.0.0_ga tag related to the i.MX6QP/DP platform. The commit "MLK-16266-02 ARM: imx: Enhance the code to support new TO for imx6qp" introduced a check in drivers/clk/imx/clk-imx6q.c that uses the function clk_on_imx6q(), which tests the machine compatibility with "fsl,imx6q" to identify the i.MX6QP/DP platforms:
static inline int clk_on_imx6q(void)
{
return of_machine_is_compatible("fsl,imx6q");
}
But these platforms do not have that machine compatible: the i.MX6QP/DP platforms use "fsl,imx6qp", so the function clk_on_imx6qp() should be used instead:
static inline int clk_on_imx6qp(void)
{
return of_machine_is_compatible("fsl,imx6qp");
}
Attached you can find the patch 0001-ARM-imx-clk-imx6q-fix-clocks-initialization-to-i.MX6.patch, which fixes this on top of imx_4.9.88_2.0.0_ga. Please take a look at this issue and let us know whether the patch is correct and whether anything else is wrong in the clock initialization. Without that patch, for example, the GPU clocks are not initialized, and maybe that is related to our boot issue in _ResetGPU().
Thanks,
Arturo.
Hello,
We have gone through the details of the log and identified that you are using L5.4.64, which has not yet been officially released by NXP. Could you try with the L5.4.24 BSP? If the issue still persists with L5.4.24, please provide the corresponding logs (including whether the issue is only with gckOS_Read/WriteRegisterEx). We also want to verify the DTS settings for the GPU module that you are using. Could you please send us the dts file?
Regards
Hi,
We found this issue in our release based on imx_4.9.88_2.0.0_ga; we only performed a quick test on rel_imx_5.4.24_2.1.0 to check whether we hit the same issue there. Our product is based on v4.9, not v5.4, so we need to fix it for imx_4.9.88_2.0.0_ga, not for imx_5.4.24_2.1.0. Could you help us debug and fix it for imx_4.9.88_2.0.0_ga?
Also, could you answer our question about the clock initialization issue that I reported in my previous comment?
In parallel, I will try to obtain more information on v5.4, but keep in mind that our product is based on v4.9, and it is not easy for us to debug on v5.4.
Thanks,
Arturo.
Hi,
I was able to reproduce the issue with a Linux kernel based on your rel_imx_5.4.24_2.1.0 tag.
Attached you can find the following files:
- console_output_failing_on_boot_with_debug_5_4_24.txt: Boot with debug messages. The device hangs in the initialization of the Galcore module, in the gckOS_ReadRegisterEx call.
- imx_5_4_24_GPU_dts.txt: DTS settings for the GPU node.
Thanks for your support,
Arturo.
Hi,
After investigating the issue in depth, we found that it is related to the power-on sequence of the GPU/VPU power domain.
Specifically, it is related to bringing the PU domain LDO up from the power-down state, and to how it is finally configured in Linux.
The LDO_PU regulator is configured in the PMU_REG_CORE register, together with LDO_SOC and LDO_ARM. By default, in U-Boot these regulators have the following configuration (LDO_PU at 1.150 V):
[regulator settings capture not preserved]
However, when Linux boots, these regulators are configured with a different voltage:
[regulator settings capture not preserved]
When the device hangs in the GPU initialization, the LDO_PU voltage remains at the original U-Boot value (1.150 V).
We verified that there are several commits in your BSP fixing different GPU crash issues, e.g. waiting for the PU LDO power-on ramp or keeping the PU domain on to avoid problems during the GPU driver probe. This is very similar to what we are seeing, so it makes sense.
We implemented a workaround, setting the PU power domain in U-Boot to the same voltage as in Linux so that nothing changes at boot; with it, we are no longer able to reproduce the issue.
We still need to test it more, because the issue is quite difficult to reproduce, but apparently this workaround fixes it.
We need some guidance about this issue, and we have several questions:
[list of questions not preserved]
We also have an additional question regarding the voltage measurements we made on these power domains. We checked them on your SABRE development board with the same result as on our custom board: when we configure either of these voltages (1.150 V or 1.175 V) on the three regulators (LDO_SOC, LDO_PU and LDO_ARM), we measure a higher value (around 15-20 mV more). According to the CPU datasheet, the core voltage is defined in 25 mV steps, so we have several questions about the accuracy of the internal CPU regulators.
Thanks,
Arturo.
Hi,
Below are inline answers to your queries:
[RS] We have verified the regulator voltages in U-Boot as well as in the kernel. In the L4.9.88 BSP, the regulator values in U-Boot are set to the same values as in your case. However, for the kernel we are getting something different (no regulation). Have you made any changes to PMU_REG_CORE in the kernel? Can you provide the output of register 0x20c8140 using memtool? Can you also share the reg_pu and reg_soc nodes from your dts file?
[RS] In the processor datasheet, Table 6 (Operating Ranges), the minimum VDD_PU_CAP requirement is 1.15 V. If, due to voltage fluctuation, the value goes below 1.15 V, there is a chance that the PU does not get enough supply to turn on, and the device hangs there.
[RS] I am looking at how the power-on ramp takes place and will get back to you on this.
[RS] What I understand is that the datasheet specifies 25 mV core voltage steps between two subsequent settings. However, it is hard to accept that the actual tolerance we get is about 15-20 mV (almost a full step). We are still investigating this. In your case, however, this should not be the cause of the issue, since the value set in U-Boot for LDO_PU (1.150 V) plus the extra 15-20 mV measured above it is enough to power the PU units.
I will get back to you with more findings on this.
Regards
Hi,
Regarding your questions about Linux behavior, you can find my comments below:
[RS] We have verified the regulator voltages in uboot as well as in kernel. In L4.9.88 BSP, regulator values have been set the same as what it is in your case for uboot. However, for the kernel, we are getting something different(No regulation).
Have you done any changes to PMU_REG_CORE in the kernel?
Can you provide us the output 0x20c8140 register using memtool?
root@ccimx6sbc:~# ./memwatch -r -w -a 0x20c8140 -l 4
0x020c8140: 0x004c260b
root@ccimx6sbc:~#
Can you please share the reg_pu, reg_soc node from dts file with us?
reg_pu: regulator-vddpu {
compatible = "fsl,anatop-regulator";
regulator-name = "vddpu";
regulator-min-microvolt = <725000>;
regulator-max-microvolt = <1450000>;
regulator-enable-ramp-delay = <150>;
anatop-reg-offset = <0x140>;
anatop-vol-bit-shift = <9>;
anatop-vol-bit-width = <5>;
anatop-delay-reg-offset = <0x170>;
anatop-delay-bit-shift = <26>;
anatop-delay-bit-width = <2>;
anatop-min-bit-val = <1>;
anatop-min-voltage = <725000>;
anatop-max-voltage = <1450000>;
regulator-allow-bypass;
};
reg_soc: regulator-vddsoc {
compatible = "fsl,anatop-regulator";
regulator-name = "vddsoc";
regulator-min-microvolt = <725000>;
regulator-max-microvolt = <1450000>;
regulator-always-on;
anatop-reg-offset = <0x140>;
anatop-vol-bit-shift = <18>;
anatop-vol-bit-width = <5>;
anatop-delay-reg-offset = <0x170>;
anatop-delay-bit-shift = <28>;
anatop-delay-bit-width = <2>;
anatop-min-bit-val = <1>;
anatop-min-voltage = <725000>;
anatop-max-voltage = <1450000>;
regulator-allow-bypass;
};
To be aligned, on your SABRE development board we are using the prebuilt images from L4.9.88_2.0.0_images_MX6QPDLSOLOX.tar.gz (md5sum e6bbd64885d563c059f0e7ca637c1ab6), in particular the image fsl-image-qt5-validation-imx-xwayland-imx6qpdlsolox.sdcard (md5sum 85bc7f57d99222db99c2cacab27dd535).
Thanks,
Arturo.
Hi,
I went through the details. Yes, you were right: I had checked with the LDO bypass setting, which is why I was getting (0x020c8140: 0x007c3e1f). After switching to LDO mode, I get (0x020c8140: 0x004c260b), which corresponds to the correct minimal voltages for the three regulators.
Now let me answer the other queries I was working on:
[RS] In this case, it looks like the issue is that the regulator does not get enough power to work. The power-on ramp should not cause a problem here, as long as the proper power sequence is maintained.
[RS] From the investigation I found that a +2% tolerance is acceptable. To be safe, it is advisable to keep the voltage set to what we use for Linux when the GPU is included, i.e. 1.175 V (~1.150 V + 2%), since the minimum requirement is 1.150 V.
Let me know if you still have any queries.
Regards
Hi,
After focusing our tests on the PU power domain rail, we are sure that we have enough voltage to supply these LDOs, because we power VDDSOC_IN from a dedicated regulator capable of 2.5 A. When the internal SoC LDOs are enabled, VDDSOC_IN is configured to 1.37 V, which should be sufficient for the 1.175 V operating voltage set for VDD_SOC and VDD_PU. In further tests we increased VDDSOC_IN to a higher value (1.47 V) and observed the same GPU hang.
We also measured your SABRE development board, and we noticed a behavior different from our device on this PU power domain when using the LDO configuration during the boot process.
We tested two different release images on your SABRE development board and obtained the same behavior:
- L4.9.88_2.0.0_images_MX6QPDLSOLOX.tar.gz (md5sum e6bbd64885d563c059f0e7ca637c1ab6): image fsl-image-qt5-validation-imx-xwayland-imx6qpdlsolox.sdcard (md5sum 85bc7f57d99222db99c2cacab27dd535)
- L5.4.47-2.2.0_images_MX6QPDLSOLOX.zip (md5sum a251187d9ac7eff142dd3f3653b242e2): image imx-image-multimedia-imx6qpdlsolox.wic (md5sum )
What we found is that in U-Boot we have the same PMU_REG_CORE value (020c8140: 004c2412), and after the boot process this register has the same value as in our design (0x020c8140: 0x004c260b). However, during the boot process your SABRE development board raises this PU power domain from 1.150 V to 1.27 V, and once the boot process finishes it is set to 1.175 V.
- Why do you increase this power domain to 1.27 V?
- What is the reason for doing that?
- Where is this voltage increase done?
Maybe this is the clue to why the problem cannot be reproduced on your SABRE board.
[off-topic]
The sdcard image from L5.4.47-2.2.0_images_MX6QPDLSOLOX.zip does not boot out of the box. I get the following error when booting the default environment:
U-Boot 2020.04-5.4.47-2.2.0+gffc3fbe7e5 (Sep 11 2020 - 19:11:41 +0000)
CPU: i.MX6QP rev1.0 996 MHz (running at 792 MHz)
CPU: Automotive temperature grade (-40C to 125C) at 32C
Reset cause: POR
Model: i.MX6 Quad SABRE Smart Device Board
Board: MX6-SabreSD
DRAM: 1 GiB
PMIC: PFUZE100! DEV_ID=0x10 REV_ID=0x21
MMC: FSL_SDHC: 1, FSL_SDHC: 2, FSL_SDHC: 3
Loading Environment from MMC... *** Warning - bad CRC, using default environment
No panel detected: default to Hannstar-XGA
Display: Hannstar-XGA (1024x768)
In: serial
Out: serial
Err: serial
switch to partitions #0, OK
mmc2 is current device
flash target is MMC:2
Net:
Warning: ethernet@02188000 using MAC address from ROM
eth0: ethernet@02188000 [PRIME]
Fastboot: Normal
Normal Boot
Hit any key to stop autoboot: 0
switch to partitions #0, OK
mmc2 is current device
9004016 bytes read in 452 ms (19 MiB/s)
Booting from mmc ...
54638 bytes read in 19 ms (2.7 MiB/s)
Wrong Image Format for bootm command
ERROR: can't get kernel image!
=>
After debugging the issue, I found that the default tee_file environment variable is set to "uTee-6qpsdb", which does not match any file in the sdcard image.
=> printenv tee_file
tee_file=uTee-6qpsdb
=>
However, in the boot partition of the sdcard there is a file called "uTee-6qsdb":
=> fatls mmc 2:1
52039 imx6dl-sabreauto.dtb
[...]
[...]
1057260 tee.bin
1057324 uTee-6qsdb
9004016 zImage
51 file(s), 0 dir(s)
=>
If I manually set tee_file to "uTee-6qsdb", I can boot these images.
Thanks,
Arturo.
Hi,
Any comment on my last questions?
- Why do you increase this power domain to 1.27 V?
- What is the reason for doing that?
- Where is this voltage increase done?
Any help in finding the root cause of the problem would be appreciated.
Thanks,
Arturo
Hi,
Any feedback about these questions?
Thanks,
Arturo.