Hi everyone,
we have an i.MX8QM based board that runs into issues when the real-time (PREEMPT_RT) patch is applied.
Specifically, we discovered that the board sometimes becomes unresponsive during boot. We're running a Yocto build with a 5.15.72 kernel, but we've observed the same behavior on 5.15.32.
In our reboot tests it usually happens after 7 or 8 reboots, sometimes sooner.
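For reference, the reboot test itself is nothing fancy, just something along these lines run from an init/systemd service once boot has settled (the counter file path and the 30 s settle delay here are placeholders, not part of the problem):

  #!/bin/sh
  # Keep a boot counter on persistent storage and reboot in a loop
  # until the board hangs; read the counter after the next power cycle.
  COUNT_FILE=/var/reboot-test-count
  COUNT=$(cat "$COUNT_FILE" 2>/dev/null || echo 0)
  echo $((COUNT + 1)) > "$COUNT_FILE"
  sync
  sleep 30    # let boot finish and services settle
  reboot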
Here's an example of the kernel output we get at boot (RT throttling followed by an RCU stall):
[ 46.451338] sched: RT throttling activated
[ 56.720382] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 56.720391] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-5): P166/1:b..l
[ 56.720412] (detected by 2, t=5252 jiffies, g=917, q=318)
[ 56.720422] task:kworker/5:4 state:R running task stack: 0 pid: 166 ppid: 2 flags:0x00000008
[ 56.720436] Workqueue: events fec_time_keep
[ 56.720458] Call trace:
[ 56.720460]  __switch_to+0x108/0x160
[ 56.720475]  __schedule+0x240/0x690
[ 56.720490]  preempt_schedule_irq+0x48/0x150
[ 56.720498]  el1_interrupt+0x60/0x80
[ 56.720508]  el1h_64_irq_handler+0x1c/0x2c
[ 56.720517]  el1h_64_irq+0x78/0x7c
[ 56.720525]  fec_ptp_read+0x24/0x90
[ 56.720534]  timecounter_read+0x24/0x70
[ 56.720547]  fec_time_keep+0x7c/0x90
[ 56.720556]  process_one_work+0x1d0/0x354
[ 56.720565]  worker_thread+0x134/0x45c
[ 56.720573]  kthread+0x18c/0x1a0
[ 56.720585]  ret_from_fork+0x10/0x20
[ 56.720595] rcu: rcu_preempt kthread timer wakeup didn't happen for 1155 jiffies! g917 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 56.720604] rcu: Possible timer handling issue on cpu=1 timer-softirq=184
[ 56.720609] rcu: rcu_preempt kthread starved for 1156 jiffies! g917 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
[ 56.720618] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 56.720622] rcu: RCU grace-period kthread stack dump:
[ 56.720625] task:rcu_preempt state:I stack: 0 pid: 11 ppid: 2 flags:0x00000008
[ 56.720635] Call trace:
[ 56.720637]  __switch_to+0x108/0x160
[ 56.720646]  __schedule+0x240/0x690
[ 56.720657]  schedule+0xa0/0x154
[ 56.720667]  schedule_timeout+0x80/0xf0
[ 56.720678]  rcu_gp_fqs_loop+0x118/0x2e0
[ 56.720688]  rcu_gp_kthread+0x108/0x120
[ 56.720696]  kthread+0x18c/0x1a0
[ 56.720705]  ret_from_fork+0x10/0x20
[ 56.720714] rcu: Stack dump where RCU GP kthread last ran:
[ 56.720717] Task dump for CPU 1:
[ 56.720720] task:systemd-udevd state:R running task stack: 0 pid: 527 ppid: 396 flags:0x00000a00
[ 56.720732] Call trace:
[ 56.720734]  __switch_to+0x108/0x160
[ 56.720744]  down_write+0x18/0x24
[ 56.720751]  rt_spin_lock+0x48/0xa0
[ 56.720759]  wp_page_copy+0x124/0x7a0
[ 56.720771]  do_wp_page+0x98/0x3c0
[ 56.720780]  __handle_mm_fault+0x5c0/0x9a0
[ 56.720789]  handle_mm_fault+0xc4/0x1d0
[ 56.720799]  do_page_fault+0x14c/0x3b4
[ 56.720810]  do_mem_abort+0x44/0xbc
[ 56.720819]  el0_da+0x24/0x60
[ 56.720827]  el0t_64_sync_handler+0xec/0x130
[ 56.720836]  el0t_64_sync+0x1a0/0x1a4
After that, the board usually becomes unresponsive and everything else starts falling apart: communication with the SCU fails, dmesg output gets stuck halfway through a message, and so on.
We discovered that mounting a flash partition with a JFFS2 file system aggravates the issue, presumably because at mount time the JFFS2 driver scans the whole partition over SPI, generating lots of interrupts while it builds its file indexes.
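For completeness, triggering it is just a plain JFFS2 mount of the SPI NOR partition, something like the following (the mtdblock index depends on the partition layout, ours is only an example):

  cat /proc/mtd                               # list the MTD partitions
  mkdir -p /mnt/flash
  mount -t jffs2 /dev/mtdblock3 /mnt/flash    # index 3 is just an example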
We noticed that most of the kernel panics we were getting mention CPU 5, which is the second A72 core.
After disabling the A72 cores in the device tree, the board becomes much more stable and these types of issues seem to go away.
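For anyone who wants to check the same thing without touching the device tree, taking the A72 cores offline via CPU hotplug (or booting with maxcpus=4) should be a close approximation; on our board the A72s show up as cpu4 and cpu5, so adjust the numbers if your enumeration differs:

  # take both A72 cores offline
  echo 0 > /sys/devices/system/cpu/cpu4/online
  echo 0 > /sys/devices/system/cpu/cpu5/online
  # confirm which cores remain online
  cat /sys/devices/system/cpu/online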
Can anyone shed some light on what the cause of this could be?
We've seen that there's an erratum, ERR050104, for a cache coherency issue; could this be the culprit?
linux-imx has a TKT340553 workaround. Is this the same issue as ERR050104?
There's a patch for ERR050104 available here:
https://lore.kernel.org/linux-arm-kernel/ZDflS%2FCnEx8iCspk@FVFF77S0Q05N/T/
but it seems to do a similar thing to the TKT340553 workaround.
Thank you,
Darko
Hi, I'm no longer working on the i.MX8QM project, so I don't have any news apart from the fact that disabling the A72 cores makes the RT patch much more stable.
In the meantime, NXP has released the lf-6.1.y and lf-6.6.y LTS kernels, so it might be a good idea to try those out and see if the behavior changes.
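If anyone wants to give those a try, the branches live in NXP's linux-imx tree; something like this should fetch the sources (please double-check the URL and branch names against NXP's release notes):

  git clone https://github.com/nxp-imx/linux-imx.git -b lf-6.6.y
  # or, for the 6.1 LTS line:
  # git clone https://github.com/nxp-imx/linux-imx.git -b lf-6.1.y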