kernel memory corruption in galcore under high load

tbueno · ‎07-19-2019

Hi everyone,

I am encountering an issue with what I suspect is a galcore crash. For testing purposes, I put an i.MX6 QuadPlus under high load (CPU, VPU and GPU). After a few seconds or minutes of this workload (it's random-ish), I get a kernel Oops or panic that leads to a complete freeze of the system.

I can reproduce this problem consistently with two kernel/graphics driver versions:

- Community's linux-fslc 4.9 with Vivante 6.2.4

- NXP's latest linux-imx 4.19 with Vivante 6.4.0

I build the OS images myself using Yocto Thud (meta-freescale).

On Linux 4.9, the kernel only reports what look like memory corruptions (invalid addresses), whereas Linux 4.19 is more explicit about the root cause of the crash and explicitly mentions galcore in the stack dump.

Furthermore, the problem only occurs when GPU load is close to 100% (as measured by "on" cycles). Whatever the CPU/VPU load is, the system will not crash until I start an intensive OpenGL app such as glmark2, which pushes the GPU to its limits.

I have attached text files with a few of those crashes for each of the two kernel versions.

Is this a known issue with galcore? Are there any workarounds to prevent this from happening?

Thanks and best regards,

Théo Bueno.

oleksandr_andru · ‎12-10-2019

Thank you for the suggestions, but it seems that this is a software problem rather than a hardware one: if we set galcore.contiguousSize to 8/32/128M we see those sporadic crashes. If that size is set to 256M everything runs smoothly (with respect to crashes).

Could you please suggest what needs to be done on the galcore(?) side to avoid the crashes, and what a reasonable memory size would be? We are using Galcore version 6.2.4.190076.
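For anyone repeating this sweep, here is a minimal sketch that turns the sizes tried above into kernel command-line fragments. It assumes that galcore.contiguousSize takes a byte count and that hexadecimal values are accepted; both are assumptions to verify against the driver source for your galcore version. The helper name `bootarg` is made up for this example.

```python
MIB = 1024 * 1024

# Sizes mentioned in the post above, in MiB.
SIZES_MIB = [8, 32, 128, 256]


def bootarg(size_mib):
    """Build a command-line fragment for a given contiguous pool size.

    Assumption: the parameter is expressed in bytes and hex is accepted.
    """
    size_bytes = size_mib * MIB
    return f"galcore.contiguousSize={size_bytes:#x}"


for size in SIZES_MIB:
    print(f"{size:>4} MiB -> {bootarg(size)}")
```

Each fragment would be appended to the kernel bootargs (or passed as a module option without the "galcore." prefix if the driver is loaded as a module).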

Thank you

oleksandr_andru · ‎12-23-2019

Unfortunately, the crashes show up with any memory size we give to galcore.

One way to reproduce the crashes more frequently is to give the board 3328MB of RAM (u-boot tweaks are required). With this change the crashes are easily reproduced.

sam_raf · ‎12-10-2021

Hi oleksandr_andru

Were you then able to solve this issue? Was it a SW problem?

igorpadykov · ‎07-19-2019

Hi Théo

The reason for the panic may be DDR errors, so one can run the DDR test and update the image with the new DDR calibration coefficients found by the test:

i.MX6/7 DDR Stress Test Tool V3.00

Linux 4.19 is not supported by NXP yet; one can try the latest official Linux L4.14.98:

linux-imx - i.MX Linux kernel

"problem only occurs when GPU load is close to 100%" may also point to the power supplies; one can check the supply voltages and ripple with an oscilloscope (ripple should be < 5%).

Best regards
igor

tbueno · ‎07-22-2019

Hi Igor,

I ran the DDR overnight test for 2 days and no error came up. The board is not custom and is pretty widespread (WandBoard 6QuadPlus), so there should not be any memory problem with it.

I was able to confirm that my power supply is stable enough (about 4% variation due to ripple, 200 mV peak-to-peak between 4.95 and 5.15V). I have also tried different power supplies, just in case.

Using the same Linux images, the problem could not be reproduced on an i.MX6 Quad. I am trying to get ahold of a different QuadPlus board to run the same test, but the bug looks like a software issue to me.
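The 4% figure can be checked with a few lines of arithmetic. This is only a sketch: the 5.0 V nominal value is an assumption (the post gives just the 4.95 to 5.15 V range), and the 5% limit is the one quoted in Igor's reply.

```python
V_MIN = 4.95          # lowest measured supply voltage, from the post
V_MAX = 5.15          # highest measured supply voltage, from the post
V_NOMINAL = 5.0       # assumption: nominal 5 V input rail
LIMIT_PERCENT = 5.0   # ripple limit suggested earlier in the thread

# Peak-to-peak ripple in millivolts and as a percentage of nominal.
ripple_mv = (V_MAX - V_MIN) * 1000
ripple_percent = (V_MAX - V_MIN) / V_NOMINAL * 100
within_limit = ripple_percent < LIMIT_PERCENT

print(f"{ripple_mv:.0f} mV peak-to-peak = {ripple_percent:.1f}% of nominal")
print("within limit" if within_limit else "exceeds limit")
```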

Regards,

Théo.

igorpadykov · ‎07-22-2019

Hi Théo

Could you try to reproduce the issue on the NXP i.MX6QP Sabre SD or AI reference boards (only these boards are officially supported for i.MX6QP)?

i.MX 6QuadPlus SABRE Development Board | NXP

https://www.nxp.com/webapp/Download?colCode=SABREAI6QDPLUSQSG

with Demo Image from

i.MX Software | NXP

Best regards
igor

oleksandr_andru · ‎12-05-2019

Hi,

I am seeing similar behavior on a custom IMX8QM + Android: the board seems to work okay until some graphics load is applied, e.g. Maps.ME is the one that makes it panic. I see random crashes which have one thing in common: they are always "address between user and kernel address ranges", and mostly follow a pattern:

[ 1200.047638] Unable to handle kernel paging request at virtual address 78a3d60178a3d600
[ 591.645365] Unable to handle kernel paging request at virtual address eaecbf00636d3800
[ 135.177393] Unable to handle kernel paging request at virtual address eaecbf0008bb7dd0
[ 76.281789] Unable to handle kernel paging request at virtual address eaecbf0008241f84
[ 247.539768] Unable to handle kernel paging request at virtual address eaecbf018ee47a30
[ 107.345418] Unable to handle kernel paging request at virtual address eaecbf019cbfd3a0

Please note the high address bits (eaecbf0). The stack trace mostly points to galcore, but sometimes to some other random kernel code. I am pretty sure DDR is not the cause here; it looks like a GPU user/kernel space issue.
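To make the pattern easier to see across many crash logs, a minimal sketch like the following groups the faulting addresses by their leading hex digits and checks whether each one lies between the user and kernel ranges. The 48-bit virtual address size (VA_BITS) is an assumption about the kernel configuration, and the function names are made up for this example.

```python
import re
from collections import Counter

# Oops lines copied from the post above.
OOPS_LINES = """\
[ 1200.047638] Unable to handle kernel paging request at virtual address 78a3d60178a3d600
[ 591.645365] Unable to handle kernel paging request at virtual address eaecbf00636d3800
[ 135.177393] Unable to handle kernel paging request at virtual address eaecbf0008bb7dd0
[ 76.281789] Unable to handle kernel paging request at virtual address eaecbf0008241f84
[ 247.539768] Unable to handle kernel paging request at virtual address eaecbf018ee47a30
[ 107.345418] Unable to handle kernel paging request at virtual address eaecbf019cbfd3a0
""".splitlines()

ADDR_RE = re.compile(r"virtual address ([0-9a-fA-F]{16})\b")
VA_BITS = 48  # assumption: 48-bit virtual addresses; adjust to your kernel config


def faulting_addresses(lines):
    """Extract the 64-bit faulting addresses from oops lines."""
    return [int(m.group(1), 16) for m in map(ADDR_RE.search, lines) if m]


def is_canonical(addr, va_bits=VA_BITS):
    """True if the bits above va_bits are all 0 (user) or all 1 (kernel)."""
    top = addr >> va_bits
    return top == 0 or top == (1 << (64 - va_bits)) - 1


def high_prefix(addr, hex_digits=7):
    """Leading hex digits of the address, e.g. 'eaecbf0'."""
    return f"{addr:016x}"[:hex_digits]


addrs = faulting_addresses(OOPS_LINES)
prefix_counts = Counter(high_prefix(a) for a in addrs)
non_canonical = [a for a in addrs if not is_canonical(a)]

for prefix, count in prefix_counts.most_common():
    print(f"{prefix}: {count}")
print(f"{len(non_canonical)} of {len(addrs)} addresses are non-canonical")
```

Under that assumption, five of the six addresses share the eaecbf0 prefix and all six are non-canonical, which matches the "address between user and kernel address ranges" message.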

Any advice here?

Thank you,

Oleksandr

igorpadykov · ‎12-05-2019

The issue may be caused by memory (one can also test it with Linux memtester) and by the board power supplies (check the hardware guide for power supply guidelines).

Try the latest NXP Linux L4.19.35_1.1.0:
linux-imx - i.MX Linux kernel

Best regards
igor
