Dear community,
I have a custom board with an i.MX8 DualX. U-Boot is running and the Linux kernel is able to boot. I use Linux 5.10.72_2.2.0 via Yocto together with SCFW Porting Kit 1.11.0.
Unfortunately, the kernel always crashes at some point between a few seconds and half an hour after boot, with varying panic messages. Most often, the message looks like the following:
Unable to handle kernel paging request at virtual address 000000000000698b
[ 20.370464] Mem abort info:
[ 20.373259] ESR = 0x96000004
[ 20.376319] EC = 0x25: DABT (current EL), IL = 32 bits
[ 20.381633] SET = 0, FnV = 0
[ 20.384692] EA = 0, S1PTW = 0
[ 20.387835] Data abort info:
[ 20.390712] ISV = 0, ISS = 0x00000004
[ 20.394552] CM = 0, WnR = 0
[ 20.397526] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000085c66000
[ 20.403970] [000000000000698b] pgd=0000000000000000, p4d=0000000000000000
[ 20.410775] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[ 20.416349] Modules linked in:
[ 20.419414] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.72-lts-5.10.y+ga68e31b63f86 #1
[ 20.427599] Hardware name: Freescale i.MX8DX MEK (DT)
[ 20.432660] pstate: 80000085 (Nzcv daIf -PAN -UAO -TCO BTYPE=--)
[ 20.438682] pc : calc_global_load+0x18c/0x210
[ 20.443041] lr : calc_global_load+0x178/0x210
[ 20.447398] sp : ffff800011d5bee0
[ 20.450717] x29: ffff800011d5bee0 x28: ffff800011b52380
[ 20.456042] x27: ffff800011b52380 x26: ffff800011d5c000
[ 20.461367] x25: ffff800011d58000 x24: ffff800011b49360
[ 20.466693] x23: ffff800011cee000 x22: ffff800011cee000
[ 20.472018] x21: ffff800011b46000 x20: ffff800011b46a00
[ 20.477344] x19: 00000004b75a0aee x18: 0000000000000000
[ 20.482669] x17: 0000000000000000 x16: 0000000000000000
[ 20.487995] x15: 0000000fee30533a x14: 00000000000215a2
[ 20.493320] x13: 00000000000007f5 x12: 00000000fffef377
[ 20.498645] x11: 00000000000060cb x10: ffff800011cc88e0
[ 20.503971] x9 : 00000000fffef85a x8 : 0000000000000042
[ 20.509296] x7 : ffff800011cc88c0 x6 : 00000000000000c7
[ 20.514621] x5 : 00000000003fa800 x4 : 000000000002ad29
[ 20.519947] x3 : 0000000000000000 x2 : 0000000000000800
[ 20.525272] x1 : 00000000000004e3 x0 : 0000000000000055
[ 20.530598] Call trace:
[ 20.533055] calc_global_load+0x18c/0x210
[ 20.537076] do_timer+0x20/0x30
[ 20.540222] tick_do_update_jiffies64.part.0+0x78/0x114
[ 20.545449] tick_irq_enter+0xf0/0x130
[ 20.549203] irq_enter_rcu+0x64/0x70
[ 20.552780] irq_enter+0x14/0x20
[ 20.556014] __handle_domain_irq+0x40/0xe0
[ 20.560114] gic_handle_irq+0xc0/0x140
[ 20.563867] el1_irq+0xcc/0x180
[ 20.567014] arch_cpu_idle+0x18/0x30
[ 20.570591] default_idle_call+0x24/0x6c
[ 20.574518] do_idle+0x230/0x2a0
[ 20.577749] cpu_startup_entry+0x24/0x70
[ 20.581675] rest_init+0xd8/0xe8
[ 20.584909] arch_call_rest_init+0x10/0x1c
[ 20.589007] start_kernel+0x4ac/0x4e4
[ 20.592682] Code: d2809c61 9b013129 f90004e9 d5033abf (b948c160)
[ 20.598788] ---[ end trace 5863192a640cb186 ]---
[ 20.603411] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[ 20.610290] SMP: stopping secondary CPUs
The "virtual address" is not always the same. The call trace also varies, but in most cases the last function is timer-related.
Sometimes, the panic message is different:
[ 20.901152] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000060
[ 20.909965] Mem abort info:
[ 20.912759] ESR = 0x96000004 ...
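For reference, the ESR value 0x96000004 that appears in both traces can be decoded by hand. The following is only a minimal sketch, assuming the standard AArch64 ESR_ELx layout for data aborts (EC in bits 31:26, IL in bit 25, ISS in bits 24:0); the function name decode_esr is made up for illustration and is not a kernel API.

```python
def decode_esr(esr: int) -> dict:
    """Split an AArch64 ESR_ELx value into the fields shown in the oops.

    Assumes the data-abort ISS encoding; decode_esr is a hypothetical helper.
    """
    iss = esr & 0x1FFFFFF               # ISS, bits [24:0]
    return {
        "EC": (esr >> 26) & 0x3F,       # exception class, bits [31:26]
        "IL": (esr >> 25) & 0x1,        # 1 = 32-bit instruction
        "ISS": iss,
        "ISV": (iss >> 24) & 0x1,       # instruction syndrome valid
        "FnV": (iss >> 10) & 0x1,       # FAR not valid
        "EA": (iss >> 9) & 0x1,         # external abort type
        "CM": (iss >> 8) & 0x1,         # cache maintenance
        "S1PTW": (iss >> 7) & 0x1,      # stage 1 page table walk fault
        "WnR": (iss >> 6) & 0x1,        # 0 = read, 1 = write
        "DFSC": iss & 0x3F,             # data fault status code
    }


fields = decode_esr(0x96000004)
for name, value in fields.items():
    print(f"{name} = {value:#x}")
```

With this value, EC is 0x25 (data abort at the current EL), WnR is 0 (a read), and DFSC is 0x04, which is a level 0 translation fault. That is consistent with the pgd=0000000000000000 line in the first trace: the faulting address is simply not mapped.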
Two complete boot logs are attached.
I use 2*512MB (1GB) of DDR3L memory. The DDR stress test ran successfully for 2 hours, which is why I think a hardware issue is unlikely.
The memory node in the device tree is:
memory@80000000 {
    device_type = "memory";
    reg = <0x00000000 0x40000000>;
};
We checked the DCD file several times and did not find any wrong configurations.
RAM-Config in U-Boot is the following:
#define CONFIG_SYS_SDRAM_BASE 0x80000000
#define PHYS_SDRAM_1 0x80000000
#define PHYS_SDRAM_2 0x880000000
#define PHYS_SDRAM_1_SIZE 0x40000000 /* 1 GB */
#define PHYS_SDRAM_2_SIZE 0x00000000 /* 0 GB */
and
CONFIG_NR_DRAM_BANKS=4
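As a sanity check on the numbers above, here is a small sketch of the arithmetic: it only confirms that 2*512MB equals the configured PHYS_SDRAM_1_SIZE and computes the last address of bank 1. The constants are copied from the U-Boot defines; the variable names are my own and purely illustrative.

```python
MIB = 1024 * 1024

# Values copied from the U-Boot board header shown above.
PHYS_SDRAM_1 = 0x80000000
PHYS_SDRAM_1_SIZE = 0x40000000

# Two 512 MB DDR3L devices, as fitted on the board.
total_dram = 2 * 512 * MIB

# Last byte address covered by bank 1.
bank1_end = PHYS_SDRAM_1 + PHYS_SDRAM_1_SIZE - 1

size_matches = total_dram == PHYS_SDRAM_1_SIZE

print(f"total DRAM   : {total_dram:#x}")
print(f"bank 1 range : {PHYS_SDRAM_1:#x} .. {bank1_end:#x}")
print(f"size matches : {size_matches}")
```

So bank 1 should cover 0x80000000 to 0xbfffffff, and the size matches the fitted memory. This does not prove that the DDR controller or the SCFW memory partitioning agrees with it, only that the U-Boot defines are self-consistent.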
The behavior improved a little after enabling CONFIG_DEBUG_PAGEALLOC=y, I think.
What could be the problem? What else could I try?
Regards,
Tobi