kernel memory corruption in galcore under high load


4,380 views
tbueno
Contributor I

Hi everyone,

I am encountering an issue that I suspect is a galcore crash. For testing purposes, I put an i.MX6 QuadPlus under high load (CPU, VPU and GPU). After a few seconds or minutes of this workload (the timing is random-ish), I get a kernel Oops or panic that freezes the system completely.

I can reproduce this problem consistently with two kernel/graphics driver combinations:

- Community's linux-fslc 4.9 with Vivante 6.2.4

- NXP's latest linux-imx 4.19 with Vivante 6.4.0

I build the OS images myself using Yocto Thud (meta-freescale).

On Linux 4.9, the kernel only reports what look like memory corruptions (invalid addresses), whereas Linux 4.19 is more explicit about the root cause of the crash and explicitly mentions galcore in the stack dump.

Furthermore, the problem only occurs when GPU load is close to 100% (as measured by "on" cycles). Whatever the CPU/VPU load, the system does not crash until I start an intensive OpenGL app such as glmark2, which pushes the GPU to its limits.

I have attached text files with a few of those crashes for each of the two kernel versions.

Is this a known issue with galcore? Are there any workarounds to prevent this from happening?

Thanks and best regards,

Théo Bueno.

8 replies

3,592 views
oleksandr_andru
Contributor I

Thank you for the suggestions, but this seems to be a software problem rather than a hardware one: if we set galcore.contiguousSize to 8/32/128M we see those sporadic crashes. If the size is set to 256M, everything runs smoothly (with respect to crashes).

Could you please suggest what needs to be done on the galcore(?) side to avoid the crashes, and what a reasonable memory size would be? We are using galcore version 6.2.4.190076.
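To confirm which value a given boot actually used, the parameter can be read back from the kernel command line. The following is only a minimal sketch: the sample command line is hypothetical, and it assumes the value is written as a plain integer (decimal or 0x-prefixed hex) rather than with a size suffix.

```python
import re


def galcore_contiguous_size(cmdline: str):
    """Return galcore.contiguousSize in bytes, or None if it is not set."""
    match = re.search(r"(?:^|\s)galcore\.contiguousSize=(\S+)", cmdline)
    if match is None:
        return None
    # Assumption: the value is a plain integer. Base 0 accepts decimal
    # and 0x-prefixed hexadecimal notation.
    return int(match.group(1), 0)


# Hypothetical command line, for illustration only.
sample = "console=ttymxc0,115200 root=/dev/mmcblk0p2 galcore.contiguousSize=0x10000000"

size = galcore_contiguous_size(sample)
print(size // (1024 * 1024), "MiB")
```

On a running board, the string to parse would come from /proc/cmdline instead of the sample above.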

Thank you


3,593 views
oleksandr_andru
Contributor I

Unfortunately, the crashes show up with any memory size we give to galcore.

One way to reproduce the crashes more frequently is to give the board 3328 MB of RAM (U-Boot tweaks are required). With this change the crashes are easy to reproduce.


2,741 views
sam_raf
Contributor I

Hi oleksandr_andru

Were you then able to solve this issue? Was it a SW problem?

 


3,593 views
igorpadykov
NXP Employee

Hi Théo

The panic may be caused by DDR errors, so you can run the DDR test and update the image with the new DDR calibration coefficients found by the test:

i.MX6/7 DDR Stress Test Tool V3.00 

Linux 4.19 is not supported by NXP yet; you can try the latest official Linux release, L4.14.98:

linux-imx - i.MX Linux kernel 

"problem only occurs when GPU load is close to 100%" may also point to the power supplies; you can check the power supply values and ripple with an oscilloscope (should be < 5%).

Best regards
igor


3,593 views
tbueno
Contributor I

Hi Igor,

I ran the DDR overnight test for 2 days and no errors came up. The board is not custom and is pretty widespread (WandBoard 6QuadPlus), so there should not be any memory problem with it.

I was able to confirm that my power supply is stable enough (about 4% variation due to ripple, 200 mV peak-to-peak between 4.95 and 5.15 V). I also tried different power supplies, just in case.
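As a quick check of the figure above, the ripple percentage can be worked out from the two measured voltages. This is just a sketch of the arithmetic; the 5.0 V nominal rail used as the reference is an assumption, and the 5% limit is the one suggested earlier in the thread.

```python
def ripple_percent(v_min: float, v_max: float, v_nominal: float) -> float:
    """Peak-to-peak ripple as a percentage of the nominal supply voltage."""
    return 100.0 * (v_max - v_min) / v_nominal


# Measured extremes from the oscilloscope; 5.0 V nominal is an assumption.
ripple = ripple_percent(v_min=4.95, v_max=5.15, v_nominal=5.0)
within_limit = ripple < 5.0  # limit suggested earlier in the thread

print(f"ripple = {ripple:.1f}% of nominal, within limit: {within_limit}")
```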

Using the same Linux images, the problem could not be reproduced on an i.MX6 Quad. I am trying to get hold of a different QuadPlus board to run the same test, but the bug looks like a software issue to me.

Regards,

Théo.


3,594 views
igorpadykov
NXP Employee

Hi Théo

Could you try to reproduce the issue on the NXP i.MX6QP SABRE SD or AI reference boards (only these boards are officially supported for i.MX6QP)?

i.MX 6QuadPlus SABRE Development Board | NXP 

https://www.nxp.com/webapp/Download?colCode=SABREAI6QDPLUSQSG 

with the Demo Image from

i.MX Software | NXP 

Best regards
igor


3,594 views
oleksandr_andru
Contributor I

Hi,

I am seeing similar behavior on a custom i.MX8QM + Android: the board seems to work fine until some graphics load is applied; for example, Maps.ME is one app that makes it panic. I see random crashes that have one thing in common: the faulting address is always reported as "address between user and kernel address ranges", and most of them follow a pattern:

[ 1200.047638] Unable to handle kernel paging request at virtual address 78a3d60178a3d600
[ 591.645365] Unable to handle kernel paging request at virtual address eaecbf00636d3800
[ 135.177393] Unable to handle kernel paging request at virtual address eaecbf0008bb7dd0
[   76.281789] Unable to handle kernel paging request at virtual address eaecbf0008241f84
[  247.539768] Unable to handle kernel paging request at virtual address eaecbf018ee47a30
[  107.345418] Unable to handle kernel paging request at virtual address eaecbf019cbfd3a0

Please note the high address bits (eaecbf0). The stack trace mostly points to galcore, but sometimes to other, seemingly random kernel code. I am fairly sure DDR is not the cause here; it looks like a GPU user/kernel space issue.
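To make the pattern easier to spot across many crash logs, the faulting addresses can be extracted and grouped by their high bits. The sketch below is illustrative only: it assumes a 64-bit kernel with 48-bit virtual addresses (the real value depends on the kernel configuration), and the 7-hex-digit prefix length is simply chosen to match the "eaecbf0" pattern noted above.

```python
import re
from collections import Counter

ADDR_RE = re.compile(
    r"Unable to handle kernel paging request at virtual address ([0-9a-fA-F]+)"
)

VA_BITS = 48  # assumption: 48-bit virtual addresses


def classify(addr: int, va_bits: int = VA_BITS) -> str:
    """Say whether a 64-bit address lies in the user half, the kernel half, or neither."""
    top = addr >> va_bits
    if top == 0:
        return "user"
    if top == (1 << (64 - va_bits)) - 1:
        return "kernel"
    return "between user and kernel ranges"


def summarize(log: str, prefix_hex_digits: int = 7):
    """Count faulting addresses by high-order hex prefix and by address range."""
    addrs = [int(text, 16) for text in ADDR_RE.findall(log)]
    prefixes = Counter(f"{a:016x}"[:prefix_hex_digits] for a in addrs)
    ranges = Counter(classify(a) for a in addrs)
    return prefixes, ranges


# The six messages quoted in this post.
LOG = """
[ 1200.047638] Unable to handle kernel paging request at virtual address 78a3d60178a3d600
[ 591.645365] Unable to handle kernel paging request at virtual address eaecbf00636d3800
[ 135.177393] Unable to handle kernel paging request at virtual address eaecbf0008bb7dd0
[   76.281789] Unable to handle kernel paging request at virtual address eaecbf0008241f84
[  247.539768] Unable to handle kernel paging request at virtual address eaecbf018ee47a30
[  107.345418] Unable to handle kernel paging request at virtual address eaecbf019cbfd3a0
"""

prefixes, ranges = summarize(LOG)
for prefix, count in prefixes.most_common():
    print(prefix, count)
print(dict(ranges))
```

Grouping by prefix rather than by full address keeps the recurring high bits visible even though the low bits differ on every crash.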

Any advice here?

Thank you,

Oleksandr


3,594 views
igorpadykov
NXP Employee

The issue may be caused by memory (you can also test it with the Linux memtester) and by the board power supplies (check the hardware guide for power supply guidelines).

Try the latest NXP Linux L4.19.35_1.1.0:
linux-imx - i.MX Linux kernel 

Best regards
igor
