Hi, NXP experts
I encountered a very strange hang on my custom i.MX6ULL board. The board has two RMII Ethernet interfaces, two USB host interfaces, and 1 GB of DDR3. The hardware design uses RMII with a 50 MHz clock from the MAC to the PHY. The 1 GB DDR3 was calibrated using the NXP DDR tools, and an overnight stress test completed without errors. The calibrated DDR parameters have been integrated into u-boot.
Our software is based on the L5.4.70_2.3.0 BSP. The board runs well, but the hang can be reproduced by bringing both eth0 and eth1 down with ifconfig and then waiting a short while. I am sure the CPU hangs, because at that point the shell accepts no more input and stays that way forever; if imx-watchdog is enabled, I can see a watchdog reset.
The hang does not occur under any of the following conditions:
1. Not bringing both eth0 and eth1 down with ifconfig.
Bringing down only one of the two does not lead to a hang.
2. Bringing both eth0 and eth1 down and then quickly running 'top -d 1' also does not lead to a hang,
but the board hangs soon after quitting 'top -d 1'.
3. If one or both of the two USB interfaces enumerate a USB device (e.g. a USB drive), there is no hang.
If all USB devices are unplugged, the board hangs soon after.
4. Forcing u-boot to use the EVK board's default 512 MB DDR parameters also prevents the hang. Only u-boot is changed in this case.
5. Changing the Linux kernel CPUfreq governor from 'ondemand' to 'performance' also prevents the hang.
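The trigger and the governor workaround from the list above can be sketched as a few shell helpers. This is only an illustration: the function names are made up for this post, and the governor path is the policy0 path from my board, which may differ on other kernels.

```shell
#!/bin/sh
# Sketch of the reproduction steps; helper names are illustrative only.
# GOV_FILE defaults to the policy0 governor path used on this board.
GOV_FILE="${GOV_FILE:-/sys/devices/system/cpu/cpufreq/policy0/scaling_governor}"

# Print the currently active cpufreq governor.
show_governor() {
    cat "$GOV_FILE"
}

# Select a governor, e.g. "ondemand" or "performance" (needs root).
set_governor() {
    echo "$1" > "$GOV_FILE"
}

# Bring both Ethernet interfaces down; with 'ondemand' active,
# this is the step after which the hang appears.
both_eth_down() {
    ifconfig eth0 down
    ifconfig eth1 down
}
```

Running `set_governor performance` before `both_eth_down` corresponds to condition 5; running `both_eth_down` with 'ondemand' active is the failing case.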
I cannot tell whether this is a software issue (such as the DDR parameters or the MAC/PHY driver) or a hardware issue (such as the power supply system, although the hardware team cannot find any misbehaving supply).
Any suggestions to narrow this down?
Any help is appreciated.
Hi Changbao
One can try to disable OP-TEE as described in sect. 5.6.10 "OP-TEE enablement" of the attached Yocto
Guide, and then debug it using AN4553, "Using Open Source Debugging Tools for Linux on i.MX Processors":
https://www.nxp.com/docs/en/application-note/AN4553.pdf
Best regards
igor
Hello, @igorpadykov
Thanks for your reply. OP-TEE is already disabled in my project.
Besides kernel GDB, any other suggestions?
One can verify that the uboot imx_v2020.04 version is used in this case and try to rebuild everything from scratch:
https://source.codeaurora.org/external/imx/uboot-imx/tree/?h=imx_v2020.04_5.4.70_2.3.0
Best regards
igor
Hi, @igorpadykov
I have used the NXP Yocto project Linux 5.4.70_2.3.0 + Linux 5.4.70_2.3.4 patch, whose u-boot version is already imx_v2020.04. I also tried the old u-boot 2016 version, and it made no difference.
Best regards
One can try to perform other tests, for example checking temperature dependency: heat or cool the board.
Best regards
igor
The value read from /sys/devices/virtual/thermal/thermal_zone0/temp is 36500, which seems to be in line with my room-temperature environment.
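The thermal sysfs value is reported in millidegrees Celsius, so 36500 means 36.5 °C. A small sketch of the conversion follows; the helper names are illustrative, and the zone path is the one quoted above.

```shell
# Convert a thermal-zone reading in millidegrees Celsius to degrees.
millideg_to_c() {
    awk -v m="$1" 'BEGIN { printf "%.1f\n", m / 1000 }'
}

# Read a thermal zone file (default: thermal_zone0) and print degrees Celsius.
read_zone_temp_c() {
    zone_file="${1:-/sys/devices/virtual/thermal/thermal_zone0/temp}"
    millideg_to_c "$(cat "$zone_file")"
}

millideg_to_c 36500   # prints 36.5
```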
Please try testing while heating/cooling the board, not at room temperature.
From the description this may be caused by DDR errors, so it may help to run a DDR test
at various temperatures or with memtester.
Best regards
igor
Hello, @igorpadykov
Today I put my i.MX6ULL board (with 1 GB DDR) into an environmental test chamber cycling between high and low temperature (chamber setting 65~0℃) and ran memtester. Everything works fine as long as I do not bring eth0 and eth1 down. See below:
root@root:~# cat /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
ondemand
root@Orona:~# free
              total        used        free      shared  buff/cache   available
Mem:        1024968       23488      966972        8856       34508      978312
Swap:             0           0           0
root@root:~# memtester 900M 5
memtester version 4.3.0 (32-bit)
Copyright (C) 2001-2012 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).
pagesize is 4096
pagesizemask is 0xfffff000
want 900MB (943718400 bytes)
got 900MB (943718400 bytes), trying mlock ...locked.
Loop 1/5:
Stuck Address : ok
Random Value : ok
Compare XOR : ok
Compare SUB : ok
Compare MUL : ok
Compare DIV : ok
Compare OR : ok
Compare AND : ok
Sequential Increment: ok
Solid Bits : ok
Block Sequential : ok
Checkerboard : ok
Bit Spread : ok
Bit Flip : ok
Walking Ones : ok
...
Loop 5/5:
Stuck Address : ok
Random Value : ok
Compare XOR : ok
Compare SUB : ok
Compare MUL : ok
Compare DIV : ok
Compare OR : ok
Compare AND : ok
Sequential Increment: ok
Solid Bits : ok
Block Sequential : ok
Checkerboard : ok
Bit Spread : ok
Bit Flip : ok
Walking Ones : ok
Walking Zeroes : ok
Done.
The high/low temperature memtester run will continue overnight. (Update: 24 hours have passed and the CPU runs well.)
Memtester also passes at room temperature.
A month ago, a 7x24-hour high/low temperature test without memtester also passed.
It seems that the ambient temperature does not cause DDR errors or hang the CPU.
Any other suggestions? @igorpadykov
From the description, the issue looks more related to DDR.
If the DDR timing/training was not generated properly, it can lead to random hangs or failures in
CAAM/PMIC/HDMI and other modules.
Best regards
igor
Hi, @igorpadykov
Maybe I have found the root cause. I used the new v1.1 DDR config tool (https://community.nxp.com/t5/i-MX-Processors-Knowledge-Base/i-MX6UL-ULL-ULZ-DRAM-Register-Programmin... ) to create the DDR3 parameters instead of the old v0.01 version (https://community.nxp.com/t5/i-MX-Processors-Knowledge-Base/i-MX6ULL-DDR3-Script-Aid/ta-p/1127297). With the new DDR3 parameters, I can no longer reproduce the strange hang.
The main difference between those two sets of DDR3 parameters is the DDR refresh rate, as follows:
I don't know how this difference causes the strange hang.
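For context on what a refresh setting has to satisfy: JEDEC DDR3 devices expect 8192 refresh commands per 64 ms retention window at normal case temperatures, which gives an average refresh interval (tREFI) of about 7.8 µs. The sketch below only does that arithmetic. The function name is illustrative, and it does not reproduce the register values generated by either tool.

```shell
# Average refresh interval in nanoseconds (integer division, truncated).
# $1: retention window in ms, $2: refresh commands per window.
avg_refresh_interval_ns() {
    echo $(( $1 * 1000000 / $2 ))
}

avg_refresh_interval_ns 64 8192   # prints 7812 (about 7.8 us)
```

A configuration that refreshes less often than this on average is outside what the DRAM is specified for, which is one reason the refresh setting is worth comparing between the two parameter sets.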