Watchdog reboot on overloading GPU in SCM IMX6Q

parthasarathyr · ‎08-14-2017

Hi,

I am using the Linux imx-4.1.15-2.0.0_ga release with a Yocto Krogoth build.

I stress tested the GPU with the glmark2 application to check whether the i.MX6 SCM overheats under GPU load.

The i.MX6 SCM very quickly reaches the 85 degree threshold and reduces the GPU clock to 1/64. I launched multiple instances of glmark2 to overheat the module.

During this test I occasionally get a system reboot, after the system has been non-responsive for some time. The display goes blank when the system enters the non-responsive state.

Is there a recent patch that fixes this issue?

Is there any known issue specific to the SCM module?

There is a separate release for the SCM: https://community.nxp.com/docs/DOC-333955

Will it make any difference in the BSP/kernel, or fix this issue?

Looking forward to a response. In the meantime I am planning to build the above release and test it.

Thanks,

Partha

michaelguntli · ‎08-21-2017

alejandrolozano‌ juangutierrez maybe you can help with the patch "WA GPU 3D OT" to fix the arbitration within the memory controller?

The priority between the GPU2D/GPU3D QoS transfers and the non-QoS transfers from the core may be overloading one of the bus fabrics (PL301) on the way to the MMDC. The MMDC (memory controller) prioritizes traffic with the QoS flag and blocks non-QoS data traffic (CPU). If the GPU processing load gets heavy, an overrun error may occur due to a logic race and a lack of resources on the bus fabric.


parthasarathyr · ‎08-29-2017

Hi michaelguntli‌ juangutierrez‌

Thanks for the patch. We no longer get the watchdog reboot during GPU operation, but the display freezes while a GPU application is running. This happens on long runs (more than a day).

Debug consoles are active and peripherals are active except the GPU.

Do you have any suggestions on tuning QoS values to solve this issue?

Thanks in advance,

Partha

michaelguntli · ‎08-29-2017

You mean your system locks up when you run an ARM CPU-intensive application?

Interesting that the watchdog does not reset the system in that case.

parthasarathyr · ‎08-31-2017

Hi michaelguntli‌

Sorry for replying late.

No, there is no system lockup now. Only the display freezes when I run a GPU application for a long time, and I can still access the debug console (so the processor is active).

It looks like GPU rendering has stopped (locked up); I only get a proper display again after a manual reset.

Thanks,

Partha

michaelguntli · ‎08-31-2017

Which SCM-i.MX6Q memory configuration are you using? 512MB / 1GB / 2GB?
If you could post a picture of the chip that would help to identify the exact chip revision.

Example: SCM-i.MX6D with 1GB LPDDR2

parthasarathyr · ‎09-01-2017

Hi Michael,

We are using 1GB LPDDR2 from Micron (MT42L128M64D2).

The display freeze happens even at 640 x 480 resolution over HDMI.

So, without the above patch a watchdog reset happens due to a CPU freeze, and with the above patch the display freezes.

Looking forward to some help in solving this issue.

Thanks,
Partha

michaelguntli · ‎10-06-2017

Hi Partha. Any luck so far?
We are currently experimenting a little bit, since we also sporadically observe similar behavior.

You can try the following:

- Change the minimum GPU clock to 3/64 instead of 1/64, to prevent a GPU hang when the thermal protection mechanism is activated:
  https://community.nxp.com/thread/319210#comment-378939
- Increase the VDD_ARM_IN and VDD_SOC_IN voltage to 1.275V, according to the datasheet values in kernel-imx/arch/arm/boot/dts/imx6q.dtsi for LDO-enabled mode.
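
To put the first suggestion in numbers, here is a minimal sketch of the scaling arithmetic. It assumes the GPU frequency scaler takes a value N from 1 to 64 meaning N/64 of full speed (as the 1/64 and 3/64 figures in this thread suggest). The function name `gpu_scaled_khz` is hypothetical and not part of any driver, and the 528 MHz used in the note below is only an illustrative full-speed clock; check your own clock tree.

```c
#include <stdint.h>

/* Assumption: the scaler accepts 1..64, meaning N/64 of the full-speed clock. */
#define GPU_FSCALE_MAX 64u

/*
 * Hypothetical helper: returns the scaled GPU clock in kHz,
 * or 0 if the scale value is outside 1..64.
 */
static uint32_t gpu_scaled_khz(uint32_t full_khz, uint32_t fscale)
{
    if (fscale < 1u || fscale > GPU_FSCALE_MAX)
        return 0u;
    /* 64-bit intermediate so the multiplication cannot overflow. */
    return (uint32_t)(((uint64_t)full_khz * fscale) / GPU_FSCALE_MAX);
}
```

With an assumed 528000 kHz full-speed clock, a scale of 1 gives 8250 kHz and a scale of 3 gives 24750 kHz, so the 3/64 floor keeps the throttled GPU three times faster than the 1/64 floor.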

taehyukkwon · ‎10-23-2017

Hi,

I am seeing the same issue on an Android 4.3.1 build.

I tried to apply the same change to the kernel used in the build, but I can't find the files. Could you help me find the equivalent changes for the Android 4.3.1 kernel?

Thanks,

michaelguntli · ‎10-24-2017

Since the kernel is always changing, I have no idea where it's located in older kernel versions. We are using Android 5.1.1 with kernel 3.14.52.

/drivers/cpufreq/cpufreq.c is a good starting point to find the operating points.

taehyukkwon · ‎10-24-2017

Hi Michael,

The kernel version is 3.4.39.

The files in this version are very different from the ones you use. I searched and found the file arch/arm/mach-mx6/cpu_op-mx6.c. Is that the one to modify for my version?

Thanks,

parthasarathyr · ‎10-12-2017

Hi michaelguntli‌

Thanks for the patch.

With this update we are not seeing any reset or hang. We are continuing the test to confirm this.

Do you have any idea of the impact of this voltage change on SoC power consumption?

Will keep you updated once the testing is completed.

Thanks again for your continued support.

Regards,

Partha

michaelguntli · ‎10-12-2017

Good to hear!

We are working on exactly the same problem right now, which is why I was able to respond so quickly.

So I assume you are using the internal LDO of the i.MX6, and not LDO bypass for voltage generation?

Explanation of ldo-enabled vs. ldo-bypass: ventana/power – Gateworks

The setpoints above are only applied to the internal LDO. Besides power consumption, a major problem for us with the internal LDO is the additional heat it produces.

Good for you that we are already one step further: we are testing stability with LDO-bypass configuration. With LDO bypass we were able to reduce the temperature by around 15%.

LDO-Bypass:

SCM-i.MX6D has the PF0100 PMIC built in, so if you decide to use LDO bypass:

Enable fsl,ldo-bypass flag in kernel dts config
Fix the VDD_ARM_IN / VDD_SOC_IN mismatch in U-Boot (at least it's wrong for the evaluation board)
Rebuild and verify stability, enjoy the reduced heat and power consumption :-)

Detailed explanation:

How to Enable LDO Bypass Based on i.MX6 Android ICS

Set the ldo-bypass flag in your dts file: kernel//arch/arm/boot/dts/imx6dscm-freeX.dts

fsl,ldo-bypass = <1>;

There is an important change which is required in the U-Boot: Fix the inverted mapping of VDD_ARM_IN and VDD_SOC_IN to the PMIC outputs (double check with your custom hardware, at least for the evaluation board it's wrong): u-boot/board/freescale/mx6dqscmqwks/mx6dqscmqwks.c

Bug: The schematics of the evaluation board do not match the code (inverted)!

Schematics evaluation board:

PFUZE100_SW1ABVOL = VDD_SOC_IN

PFUZE100_SW1CVOL = VDD_ARM_IN

Code:

PFUZE100_SW1ABVOL = VDD_ARM_IN

PFUZE100_SW1CVOL = VDD_SOC_IN

We are currently running the following config and testing the stability (1.175V VDD_ARM_IN, 1.20V VDD_SOC_IN):

u-boot/board/freescale/mx6dqscmqwks/mx6dqscmqwks.c

void ldo_mode_set(int ldo_bypass) {
...
   /* set SW1C to 1.175V (VDD_ARM_IN) to compensate ripple */
   pmic_reg_read(pfuze, PFUZE100_SW1CVOL, &value);
   value &= ~0x3f;
   value |= 0x23;
   pmic_reg_write(pfuze, PFUZE100_SW1CVOL, value);

   /* set SW1AB to 1.20V (VDD_SOC_IN) to compensate ripple */
   pmic_reg_read(pfuze, PFUZE100_SW1ABVOL, &value);
   value &= ~0x3f;
   value |= 0x24;
   pmic_reg_write(pfuze, PFUZE100_SW1ABVOL, value);
...

FYI: depending on your hardware setup, you might have to increase SW1C and SW1AB to compensate for ripple (it should NEVER get below 1.15V)
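
For anyone adapting the register values above to other voltages, here is a minimal sketch of the arithmetic. It assumes the PFUZE100 SW1A/B/C setpoint encoding of a 6-bit code starting at 300 mV with 25 mV steps, which matches 0x23 = 1.175V and 0x24 = 1.20V in the code above. The helper names are hypothetical, and the ripple model (trough = setpoint minus half the peak-to-peak ripple) is a simplification; measure on your own hardware.

```c
/* Assumed SW1A/B/C encoding: code 0 = 300 mV, 25 mV per step, 6-bit field. */
#define SW1_MIN_UV   300000
#define SW1_STEP_UV  25000
#define SW1_MAX_CODE 0x3f

/* Convert a target voltage in microvolts to a register code, rounding up.
 * Returns -1 if the target is outside the representable range. */
static int sw1_uv_to_code(int uv)
{
    int code;

    if (uv < SW1_MIN_UV)
        return -1;
    code = (uv - SW1_MIN_UV + SW1_STEP_UV - 1) / SW1_STEP_UV;
    return (code > SW1_MAX_CODE) ? -1 : code;
}

/* Convert a register code back to microvolts. */
static int sw1_code_to_uv(int code)
{
    return SW1_MIN_UV + code * SW1_STEP_UV;
}

/* Lowest code whose trough stays at or above floor_uv, assuming the
 * trough sits half the peak-to-peak ripple below the setpoint. */
static int sw1_code_for_floor(int floor_uv, int ripple_pp_uv)
{
    return sw1_uv_to_code(floor_uv + ripple_pp_uv / 2);
}
```

For example, with the 1.15V floor mentioned above and an assumed 50 mV peak-to-peak ripple, `sw1_code_for_floor(1150000, 50000)` returns 0x23 (1.175V), the same value used for SW1C above.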

parthasarathyr · ‎10-13-2017

Hi michaelguntli, nice work.

Yes, we are using internal LDO mode. We have not seen any issue in 3 days of testing (apart from the overheating issue).

We have now started testing with LDO-bypass mode. Currently we have set the CORE and SOC voltages to 1.3V (the maximum value) in the PMIC. Will let you know how it goes.

Thanks,

Partha

parthasarathyr · ‎10-13-2017

Hi michaelguntli‌

We are seeing an issue with ldo-bypass mode that is similar to the one discussed in the thread below:

https://community.nxp.com/thread/391453

Any thoughts?

Thanks,

Partha

michaelguntli · ‎10-15-2017

Hi Partha

No sorry, we are using an "old" kernel L3.14.52-ga which is part of Android L5.1.1: https://community.nxp.com/docs/DOC-329594

parthasarathyr · ‎10-16-2017

Hi michaelguntli‌,

We found a patch to fix the issue.

Below is the patch link:

Patch "regulator: anatop: allow regulator to be in bypass mode" has been added to the 4.4-stable tre...

We have now started testing with ldo-bypass enabled.

Will let you know the results soon.

Thanks,

Partha

michaelguntli · ‎10-19-2017

Hi Partha

Any feedback regarding LDO bypass operation?

Which voltages are you currently operating at? Is the system still stable?

LPDDR2
VDD_ARM_IN
VDD_SOC_IN

parthasarathyr · ‎10-24-2017

Hi Michael,

We are able to reduce the ARM and SOC voltages to 1.2V without any hang or reboot issues. We are thoroughly testing it further.

LPDDR2 voltage is set to 1.25V as per the SCM patch.

Temperature wise we have seen some improvement.

When running the CPU and GPU stress test:

Without ldo-bypass mode:

The GPU runs at full speed for 10 seconds (until the 85 degree temperature threshold is reached).

The GPU then runs at reduced speed for 50 seconds (it takes longer to cool down to 75 degrees) before switching back to full speed.

With ldo-bypass mode (SOC and ARM at 1.2V):

The GPU runs at full speed for 10 seconds (same as above).

The GPU then runs at reduced speed for 15 seconds, i.e. the temperature drops to 75 degrees much faster in bypass mode.

Although it is better now, we still can't keep the GPU at full speed for long. We may need to reduce the maximum GPU clock to achieve consistent performance.
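
As a rough way to compare the two measurements, here is a minimal sketch that turns the timings into a duty cycle. It assumes the full-speed and reduced-speed phases simply repeat with the durations given above, and that the GPU runs at 1/64 speed while throttled; both function names are hypothetical.

```c
/* Percentage of each throttle cycle spent at full speed. */
static double full_speed_pct(double full_s, double reduced_s)
{
    return 100.0 * full_s / (full_s + reduced_s);
}

/* Time-averaged GPU clock as a fraction of full speed, where
 * reduced_scale is the clock fraction while throttled (e.g. 1.0 / 64.0). */
static double avg_clock_fraction(double full_s, double reduced_s,
                                 double reduced_scale)
{
    return (full_s + reduced_s * reduced_scale) / (full_s + reduced_s);
}
```

With the figures above, `full_speed_pct(10, 50)` is about 16.7% without ldo-bypass and `full_speed_pct(10, 15)` is 40% with it. Under the 1/64 assumption the time-averaged clock is roughly 18% versus 41% of full speed.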

Thanks,

Partha

michaelguntli · ‎10-24-2017

Looks like we are doing similar things. :-)

I recently reduced the GPU clock to 50% and didn't notice any major performance degradation (we only have an 840x480px display). Power consumption was about 300mW lower.

File: kernel-imx/arch/arm/mach-imx/clk-imx6q.c

Change: Clock for gpu3d / gpu3d_shader / gpu2d divided by 8 instead of 4

michaelguntli · ‎10-16-2017

Hi Partha

Good to know, thanks for the patch!

Important: Please measure the ripple on your custom hardware design. The voltages VDD_ARM_IN and VDD_SOC_IN should never drop below the specified minimum voltages.

FYI: We noticed higher ripple on VDD_SOC_IN when the system was stressed (e.g. GPU performance test).

parthasarathyr · ‎08-23-2017

Thanks @Michael Guntli for pointing to the workaround.

Thanks @Juan Antonio Gutierrez Rosas for the patch.

Looks like it is working properly without any watchdog triggered reboot.

I can't find any description of the modified registers (0x00C43108 and 0x00C48108) in the TRM. Is it not available to everyone?

Thanks,

Partha

