Ethernet NAPI Tx/Rx processing on iMX8mp (kthread starved for 4990 jiffies)

etienne_lorrain · ‎12-17-2021

Hello,

I get "rcu_preempt self-detected stall on CPU" with the following stack dump when trying to broadcast quite a lot of netconsole messages (to a non-local network) over a 100 kbytes/s Ethernet.

[ 69.619386] netpoll: netconsole: local port 6665
[ 69.624048] netpoll: netconsole: local IPv4 address 10.2.6.32
[ 69.629806] netpoll: netconsole: interface 'eth0'
[ 69.634533] netpoll: netconsole: remote port 6666
[ 69.639242] netpoll: netconsole: remote IPv4 address 10.255.255.254
[ 69.645518] netpoll: netconsole: remote ethernet address ff:ff:ff:ff:ff:ff
[ 70.708711] ddrc freq set to low bus mode
[ 90.688112] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 90.688116] rcu: 2-...!: (1 GPs behind) idle=58a/1/0x4000000000000002 softirq=3896/3897 fqs=114
[ 90.688118] (t=5250 jiffies g=6721 q=5614)
[ 90.688121] rcu: rcu_preempt kthread starved for 4990 jiffies! g6721 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=1
[ 90.688123] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 90.688125] rcu: RCU grace-period kthread stack dump:
[ 90.688127] task:rcu_preempt state:R running task stack: 0 pid: 11 ppid: 2 flags:0x00000028
[ 90.688133] Call trace:
[ 90.688134] __switch_to+0x100/0x160
[ 90.688135] __schedule+0x25c/0x6d0
[ 90.688137] schedule+0x70/0x104
[ 90.688138] schedule_timeout+0x84/0xfc
[ 90.688140] rcu_gp_kthread+0x4d8/0xaa0
[ 90.688141] kthread+0x154/0x160
[ 90.688143] ret_from_fork+0x10/0x30
[ 90.688144] Task dump for CPU 2:
[ 90.688145] task:modprobe state:R running task stack: 0 pid: 2801 ppid: 802 flags:0x00400002
[ 90.688151] Call trace:
[ 90.688153] dump_backtrace+0x0/0x1d0
[ 90.688154] show_stack+0x18/0x70
[ 90.688155] sched_show_task+0x144/0x170
[ 90.688157] dump_cpu_task+0x44/0x54
[ 90.688158] rcu_dump_cpu_stacks+0xb0/0xf0
[ 90.688160] rcu_sched_clock_irq+0x994/0xc9c
[ 90.688161] update_process_times+0x60/0xa0
[ 90.688163] tick_sched_handle+0x34/0x60
[ 90.688164] tick_sched_timer+0x4c/0xa4
[ 90.688167] __hrtimer_run_queues+0x140/0x1e0
[ 90.688169] hrtimer_interrupt+0xe8/0x2c0
[ 90.688171] arch_timer_handler_phys+0x38/0x50
[ 90.688172] handle_percpu_devid_irq+0x84/0x150
[ 90.688174] __handle_domain_irq+0x7c/0xe0
[ 90.688175] gic_handle_irq+0xc0/0x140
[ 90.688176] el1_irq+0xc4/0x180
[ 90.688178] net_rx_action+0x110/0x440
[ 90.688179] _stext+0x124/0x290
[ 90.688180] do_softirq+0x80/0x90
[ 90.688182] __local_bh_enable_ip+0x8c/0xa0
[ 90.688184] _raw_spin_unlock_bh+0x38/0x60
[ 90.688186] stmmac_napi_poll_tx+0x404/0x614
[ 90.688188] netpoll_poll_dev+0xfc/0x1c0
[ 90.688191] netpoll_send_skb+0x22c/0x290
[ 90.688192] netpoll_send_udp+0x210/0x3b0
[ 90.688195] write_msg+0xf0/0x120 [netconsole]
[ 90.688196] console_unlock+0x36c/0x460
[ 90.688197] register_console+0x174/0x2c0
[ 90.688199] init_netconsole+0x1ac/0x1000 [netconsole]
[ 90.688201] do_one_initcall+0x54/0x1b0
[ 90.688202] do_init_module+0x58/0x250
[ 90.688204] load_module+0x22a0/0x2914
[ 90.688206] __do_sys_init_module+0x210/0x24c
[ 90.688208] __arm64_sys_init_module+0x1c/0x2c
[ 90.688209] el0_svc_common.constprop.0+0x78/0x1a0
[ 90.688211] do_el0_svc_compat+0x1c/0x50
[ 90.688212] el0_svc_compat+0x14/0x20
[ 90.688213] el0_sync_compat_handler+0x90/0x150
[ 90.688215] el0_sync_compat+0x17c/0x180
[ 91.212534] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 2-... } 5312 jiffies s: 293 root: 0x4/.
[ 91.212553] rcu: blocking rcu_node structures:
[ 91.212557] Task dump for CPU 2:
[ 91.212561] task:modprobe state:R running task stack: 0 pid: 2801 ppid: 802 flags:0x00400002
[ 91.212569] Call trace:
[ 91.212583] __switch_to+0x100/0x160
[ 91.212589] 0x9f009a73dc14e400
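
As a rough cross-check of the timing in the log above, the jiffies counts can be converted to wall-clock time. This is only a sketch: it assumes HZ=250, which is not stated anywhere in this thread (check CONFIG_HZ in the kernel configuration). Under that assumption, "t=5250 jiffies" is 21 s, which lines up with the gap between the netconsole setup lines (around 69.6 s) and the stall report (90.688 s).

```shell
# Convert the jiffies counts from the stall report to milliseconds.
hz=250                # ASSUMPTION: CONFIG_HZ of this kernel, not given in the thread
stall_jiffies=5250    # from "(t=5250 jiffies g=6721 q=5614)"
starved_jiffies=4990  # from "rcu_preempt kthread starved for 4990 jiffies!"

stall_ms=$(( stall_jiffies * 1000 / hz ))
starved_ms=$(( starved_jiffies * 1000 / hz ))

echo "stall: ${stall_ms} ms, kthread starved: ${starved_ms} ms"
```

If the assumed HZ is right, CPU 2 would have been stuck in the netconsole send path from roughly the moment the console was registered.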

I am not sure I am interpreting those logs correctly, but it seems there is a mix of Tx and Rx Ethernet processing using NAPI on the same core, and that doesn't feel right. Is the Ethernet driver handling all the corner cases?

The result is usually a watchdog-triggered reboot or some kind of processor freeze.
Not enabling netconsole is a temporary workaround: without it, Ethernet doesn't show any problem for as long as I have tested (at least hours).

Best Regards, Etienne.

etienne_lorrain · ‎12-22-2021

I am using BSP version 5.10.35-2.0.0, on a board built in-house.

Hopefully you can reproduce it with any iMX8mp board that has working Ethernet, maybe even with the latest BSP.
Connect the board to a network at 100 Mb/s; the network does not need to be on the 10.255.255.255 range (mine is not connected to any such network).
Then check that the Linux kernel is built with "CONFIG_NETCONSOLE=m".

Then execute:

static_ip=10.45.62.149
syslogsrv_iface=eth0
syslogsrv_ip=10.255.255.254
syslogsrv_mac=ff:ff:ff:ff:ff:ff
modprobe netconsole netconsole="6665@$static_ip/$syslogsrv_iface,6666@$syslogsrv_ip/$syslogsrv_mac"
Then generate a bit of log:
{ logread 2>/dev/null || cat /var/log/messages; } | logger -t "rc.syslog" -p debug
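
For readers adapting the steps above to another setup, here is a minimal sketch of how the netconsole= parameter is assembled from its parts. The helper name build_netconsole_param is hypothetical (it is not part of netconsole or the BSP), and the block only prints the modprobe command instead of running it.

```shell
# Hypothetical helper: assemble the netconsole= parameter string in the form
#   <src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
build_netconsole_param() {
    printf '%s@%s/%s,%s@%s/%s' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Values taken from the commands above.
static_ip=10.45.62.149
syslogsrv_iface=eth0
syslogsrv_ip=10.255.255.254
syslogsrv_mac=ff:ff:ff:ff:ff:ff

param=$(build_netconsole_param 6665 "$static_ip" "$syslogsrv_iface" \
                               6666 "$syslogsrv_ip" "$syslogsrv_mac")

# Print the command rather than running it, so it can be reviewed first.
echo "modprobe netconsole netconsole=\"$param\""
```

With a broadcast MAC and a target IP outside any local subnet, every console line is sent as a broadcast frame, which is the condition described in the first post.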

That should be sufficient.
I am sorry, I will be unresponsive during the holiday season.

Have a good Xmas and new year.

etienne_lorrain · ‎01-18-2022

Any news about reproducing the problem? Does it only affect me?

etienne_lorrain · ‎01-05-2022

Are you able to reproduce, or is it a local problem of mine?

Note that my local board does not even have 10.45.62.149 configured on any interface:
# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:80:0f:2a:25:ac brd ff:ff:ff:ff:ff:ff
    inet 169.254.65.225/16 brd 169.254.255.255 scope link eth0
       valid_lft forever preferred_lft forever
    inet 192.168.1.38/24 brd 192.168.1.255 scope global eth0:dhcp
       valid_lft forever preferred_lft forever

Best Regards, and happy new year!

jimmychan · ‎12-21-2021

Hello,

Which version of BSP are you using?

Which board are you using?

How do you test it? With that information we can try to reproduce it on our board.

Best regards,

Jimmy

jobs · ‎02-18-2022

I had a similar problem, but in my case it is a CPU stall.

MPC85XX T2080 E6500 RCU detects stall on CPU X - NXP Community (nxp.com)