[88W9098] driver crash when scanning continuously

cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 

[88W9098] driver crash when scanning continuously

3,105 Views
martinetd
Contributor IV

Hello,


(note: this issue looks a bit like https://community.nxp.com/t5/Wireless-Connectivity/88W9098-firmware-bug-when-scanning-regularly-pack... and the root cause might be shared, but as the reproducer and behavior are different I'm opening a new thread. Feel free to consider this a non-question and keep interacting only in either thread if that helps.)


We've been testing an AW-XM458 wireless module (88W9098) on our i.MX8 MP board and have been chasing this station problem for a while with no success: transfers work for a while but after a few days nothing works anymore, scans are refused, adapter is disconnected, everything is down.
The reproducing rate is low so i did not get to try many workarounds, but unbinding then binding the pci devices in /sys/bus/pci/drivers/wlan_pcie worked at least once, so the adapter isn't completely dead? but that triggers quite a bit of init code so it is not clear what fixed the issue, nor if it works all the time.


We reproduced this in the following setup:

- wpa_supplicant connection to AP on mlan0 (in this case 2.4GHz, confirmed similar bug with 5GHz), no wpa_supplicant option (config just defines network block with WPA2/CCMP TKIP available in attachment)

- running curl in a loop to simulate network activity, we've also reproduced with just pings. Did not try to not send anything as there will always be at least arp noise.

- `iw mlan0 scan` in a tight loop (there is a "cooldown" during which a new scan doesn't work immediately, but running every 5 or 10 seconds worked). I've tried restricting the scan to a single 2.4GHz frequency or a single 5GHz frequency and both reproduce.

(the difference with the other thead here is the scan interval - the other problem happens when we scan every 300s while this happens with scans every few seconds; I've confirmed it with up to 30s but did not try any value between 30 and 300)


With that setup the board can run transfers successfully for anywhere from a few hours to a few weeks, but generally after 2-3 days we see a crash such as this one:

 

[99438.023197] wlan: mlan0 START SCAN
[99459.413474] Timeout cmd id (99459.410831) = 0x107, act = 0x0
[99459.419152] 07 
[99459.419155] 01 
[99459.420909] cc 
[99459.422694] 00 
[99459.424446] 0a 
[99459.426215] 00 
[99459.430825] 00 
[99459.435433] 00 
[99459.438307] 00 
[99459.440057] 00 
[99459.441815] 00 
[99459.443566] 00 
[99459.445316] 12 
[99459.447074] 01 
[99459.448824] 01 
[99459.450585] 00 

[99459.455587] Dump Rx CMD Response Buf:
[99459.459263] CmdResp Buf:
[99459.459271] 14 00 01 00 07 01 10 00 09 00 00 00 00 00 00 00 
[99459.459278] c6 01 00 00 10 00 74 00 dd 69 00 50 f2 04 10 4a 
[99459.459290] 00 01 10 10 3a 00 01 00 10 08 00 02 31 08 10 47 
[99459.459297] 00 10 bd 41 c1 02 c9 5f 5e 31 8f 01 56 0a 08 c5 
[99459.459305] BSS type = 0 BSS role= 0
[99459.462893] ------------Dump info-----------
[99459.467173] Commmand Timeout
[99459.470064] mlan_processing =0
[99459.473117] main_lock_flag =0
[99459.476102] main_process_cnt =36113599
[99459.479873] delay_task_flag =0
[99459.482941] mlan_rx_processing =0
[99459.486267] rx_pkts_queued=0
[99459.489147] more_task_flag = 0
[99459.492215] num_cmd_timeout = 1
[99459.495370] last_cmd_index = 3
[99459.498439] last_cmd_id = 
[99459.498443] 0x107 
[99459.501149] 0xf2 
[99459.503213] 0x107 
[99459.505141] 0x107 
[99459.507178] 0x107 
[99459.509191] 0x107 
[99459.511217] 0x107 
[99459.513230] 0x107 
[99459.515254] 0x107 
[99459.517267] 0x107 

[99459.522801] last_cmd_act = 
[99459.522803] 0x0 
[99459.525608] 0x1 
[99459.527447] 0x0 
[99459.529284] 0x0 
[99459.531136] 0x0 
[99459.532973] 0x0 
[99459.534826] 0x0 
[99459.536663] 0x0 
[99459.538514] 0x0 
[99459.540351] 0x0 

[99459.545542] last_cmd_resp_index = 2
[99459.549028] last_cmd_resp_id = 
[99459.549030] 0x8107 
[99459.552182] 0x80f2 
[99459.554295] 0x8107 
[99459.556393] 0x8107 
[99459.558500] 0x8107 
[99459.560600] 0x8107 
[99459.562713] 0x8107 
[99459.564811] 0x8107 
[99459.566923] 0x8107 
[99459.569022] 0x8107 

[99459.574731] last_event_index = 1
[99459.577971] last_event = 
[99459.577975] 0x58 
[99459.580594] 0x58 
[99459.582533] 0x58 
[99459.584458] 0x58 
[99459.586396] 0x42 
[99459.588322] 0x42 
[99459.590260] 0x58 
[99459.592187] 0x58 
[99459.594125] 0x42 
[99459.596050] 0x58 

[99459.601401] num_data_h2c_failure = 0
[99459.605026] num_cmd_h2c_failure = 0
[99459.608569] num_alloc_buffer_failure = 0
[99459.612512] num_pkt_dropped = 0
[99459.615665] num_no_cmd_node = 0
[99459.618818] num_event_deauth = 1
[99459.622057] num_event_disassoc = 0
[99459.625468] num_event_link_lost = 0
[99459.628956] num_cmd_deauth = 0
[99459.632026] num_cmd_assoc_success = 2
[99459.635701] num_cmd_assoc_failure = 0
[99459.639372] num_cons_assoc_failure = 0
[99459.643133] cmd_resp_received=0
[99459.646291] event_received=0
[99459.649171] max_tx_buf_size=4096
[99459.652413] tx_buf_size=0
[99459.655044] curr_tx_buf_size=3103
[99459.658369] data_sent=0 cmd_sent=1
[99459.661783] ps_mode=1 ps_state=0
[99459.665010] wakeup_dev_req=0 wakeup_tries=0 wakeup_timeout=0
[99459.670679] hs_configured=0 hs_activated=0
[99459.674789] pps_uapsd_mode=0 sleep_pd=0
[99459.678639] tx_lock_flag = 0
[99459.681530] scan_processing = 1
[99459.684669] bypass_pkt_count=0
[99459.687737] txbd_rdptr=0x9c6 txbd_wrptr=0x9c6
[99459.692107] rxbd_rdptr=0x10a5 rxbd_wrptr=0x1125
[99459.696648] evtbd_rdptr=0x164d evt_wrptr=0x1655
[99459.701190] last_wr_index:70
[99459.704081]  txrx_bd_size = 128
[99459.707234] Tx pkt size:
[99459.709806] 0096 
[99459.709812] 0096 
[99459.711737] 0096 
[99459.713697] 0096 
[99459.715624] 0096 
[99459.717567] 0096 
[99459.719495] 0096 
[99459.721419] 0096 
[99459.723357] 0096 
[99459.725284] 0096 
[99459.727221] 0096 
[99459.729147] 0096 
[99459.731084] 0096 
[99459.733010] 0096 
[99459.734946] 0096 
[99459.736872] 0096 

[99459.742231] 0096 
[99459.742233] 0096 
[99459.744157] 0096 
[99459.746095] 0096 
[99459.748022] 0096 
[99459.749957] 0096 
[99459.751882] 0096 
[99459.753818] 0096 
[99459.755744] 0096 
[99459.757681] 0096 
[99459.759605] 0096 
[99459.761542] 0096 
[99459.763468] 0096 
[99459.765392] 0096 
[99459.767331] 0096 
[99459.769256] 0096 

[99459.774620] 0096 
[99459.774623] 0096 
[99459.776546] 0096 
[99459.778483] 0096 
[99459.780409] 0096 
[99459.782348] 0096 
[99459.784273] 0096 
[99459.786210] 0096 
[99459.788136] 0096 
[99459.790073] 0096 
[99459.791995] 0096 
[99459.793932] 0096 
[99459.795856] 0096 
[99459.797795] 0096 
[99459.799719] 0096 
[99459.801657] 0096 

[99459.807005] 0096 
[99459.807007] 0096 
[99459.808930] 0096 
[99459.810872] 0096 
[99459.812797] 0096 
[99459.814764] 0096 
[99459.816691] 0096 
[99459.818667] 0096 
[99459.820595] 0096 
[99459.822542] 0096 
[99459.824467] 0096 
[99459.826405] 0096 
[99459.828331] 0096 
[99459.830272] 0096 
[99459.832200] 0096 
[99459.834137] 0096 

[99459.839487] 0096 
[99459.839489] 0096 
[99459.841412] 0076 
[99459.843352] 0076 
[99459.845276] 0076 
[99459.847215] 0080 
[99459.849139] 0128 
[99459.851079] 0128 
[99459.853005] 0128 
[99459.854943] 0128 
[99459.856868] 0128 
[99459.858805] 0128 
[99459.860729] 0128 
[99459.862667] 0128 
[99459.864593] 0128 
[99459.866529] 0128 

[99459.871879] 0128 
[99459.871881] 0128 
[99459.873821] 0128 
[99459.875746] 0128 
[99459.877685] 0128 
[99459.879609] 0128 
[99459.881544] 0128 
[99459.883467] 0128 
[99459.885390] 0128 
[99459.887328] 0128 
[99459.889256] 0128 
[99459.891196] 0128 
[99459.893120] 0096 
[99459.895059] 0096 
[99459.896984] 0096 
[99459.898924] 0096 

[99459.904276] 0096 
[99459.904278] 0096 
[99459.906217] 0096 
[99459.908141] 0096 
[99459.910079] 0096 
[99459.912002] 0096 
[99459.913942] 0096 
[99459.915864] 0096 
[99459.917805] 0096 
[99459.919730] 0096 
[99459.921696] 0096 
[99459.923624] 0096 
[99459.925590] 0096 
[99459.927520] 0096 
[99459.929457] 0096 
[99459.931386] 0096 

[99459.936736] 0096 
[99459.936738] 0096 
[99459.938676] 0096 
[99459.940601] 0096 
[99459.942539] 0096 
[99459.944464] 0096 
[99459.946400] 0096 
[99459.948324] 0096 
[99459.950262] 0096 
[99459.952185] 0096 
[99459.954124] 0096 
[99459.956049] 0096 
[99459.957986] 0096 
[99459.959910] 0096 
[99459.961846] 0096 
[99459.963770] 0096 

[99459.969121] bss_index = 0, tx_pkts_queued = 0
[99459.973496] bss_index = 1, tx_pkts_queued = 0
[99459.977862] bss_index = 2, tx_pkts_queued = 0
[99459.982228] -------- Dump info End---------
[99459.986428] Driver version = PCIE9098--17.92.5.p11-MM5X17283.p2-GPL-(FP92) 
[99459.993405] main_state = 4
[99459.996121] ioctl_pending = 1
[99459.999105] tx_pending = 0
[99460.001825] wmm_tx_pending[0] = 0
[99460.005138] wmm_tx_pending[1] = 0
[99460.008463] wmm_tx_pending[2] = 0
[99460.011788] wmm_tx_pending[3] = 0
[99460.015114] rx_pending = 0
[99460.017832] lock_count = 62
[99460.020624] malloc_count = 62
[99460.023602] mbufalloc_count = 136
[99460.026968] malloc_cons_count = 4
[99460.030325] hs_skip_count = 0
[99460.033294] hs_force_count = 0
[99460.036362] Media state = "Connected"
[99460.040038] carrier on
[99460.042410] tx queue 0: started
[99460.045564] tx queue 1: started
[99460.048704] tx queue 2: started
[99460.051857] tx queue 3: started
[99460.055010] mlan0: num_tx_timeout = 0
[99460.058683] uap0: num_tx_timeout = 0
[99460.062274] wfd0: num_tx_timeout = 0
[99460.065864] Start to process hanging
[99460.069543] Cancel all pending cmd!
[99460.069559] Block woal_cfg80211_del_key in abnormal driver state
[99460.073039] IOCTL failed: 0000000035f51683 id=0x10000, sub_id=0x10003 action=1, status_code=0x80000007
[99460.079261] Block woal_cfg80211_del_key in abnormal driver state
[99460.094527] Block woal_cfg80211_del_key in abnormal driver state
[99460.100564] Block woal_cfg80211_del_key in abnormal driver state
[99460.106588] Block woal_cfg80211_del_key in abnormal driver state
[99460.112948] Block woal_cfg80211_del_key in abnormal driver state
[99460.120045] get fw info failed! status=-1, error_code=0x0
[99460.125502] 11D: Error setting domain info in FW
[99460.130176] 11D: Error setting domain info in FW
[99460.134857] get fw info failed! status=-1, error_code=0x0
[99460.140306] 11D: Error setting domain info in FW
[99460.144942] 11D: Error setting domain info in FW
[99460.149647] Fail to set beacon proberesp assoc probereq IES
[99460.155293] woal_do_scan fails!
[99460.161495] Try configure remain on channel again, ret=-14
[99460.264361] Fail to set beacon proberesp assoc probereq IES
[99460.269968] woal_do_scan fails!
[99460.273265] Fail to set EXTCAP IE
[99460.276647] Fail to set beacon proberesp assoc probereq IES
[99460.282256] Fail to set scan request IE
[99460.379406] Fail to set beacon proberesp assoc probereq IES
[99460.385017] woal_do_scan fails!
[99460.494454] Fail to set beacon proberesp assoc probereq IES
[99460.500061] woal_do_scan fails!

 

 

This happens on our board, but I've been able to reproduce it on imx8mp-evk with BSP userland and almost-BSP kernel (our evk is old and requires a patch to have pci working, we reproduce it without that patch on our board):

kernel: https://github.com/atmark-techno/linux-5.10-at/commits/nxp_5.10.72-2.2.0-repro
wifi driver: http://source.codeaurora.org/external/imx/mwifiex lf-5.10.72-2.2.2
firmware: https://github.com/NXP/imx-firmware/tree/lf-5.10.72_2.2.0/nxp/FwImage_9098_PCIE ; in dmesg PCIE9098--17.92.5.p11-MM5X17283.p2-GPL-(FP92)


I've attached all the traces I could think of; you'll find in the archive:

- setup_mlan_scan_test.sh and curl_scan_parallel.sh are our reproducer script. running this for a few days trigger the issue reliably for us.
- dmesg.log : output of dmesg -w for the whole duration of the test
- curl.log : output of curl_scan_parallel.sh for the whole duration of the test
- serial_link.log output of the serial connection for the whole duration of the test; it's mostly dmesg/curl output mixed in but at the start you'll also have the setup logs and our wpa_supplicant config being output there.
- proc/ is a copy of /proc/mwlan after having produced the bug. proc/adapter0/drv_dump might be useful. I was surprised to see fw_dump is empty as dmesg talks about dumping the firmware, but I could not get any firmware data on this including trying to force a dump with mlanutl (hostcmd fails with config in other topic), so this is not as simple.

 

Looking at the code/driver state, once we get there all driver reqests just fail on woal_request_ioctl() with「IOCTL is not allowed while the device is not present or hang」 because driver_Status is set.
It appears to have been set in the following sequence: wlan_cmd_timeout_func() -> wlan_recv_event() -> moal_recv_event(MLAN_EVENT_ID_DRV_DBG_DUMP) -> priv>phandle->driver_status = MTRUE

 

Note that I could also get similar crashes to happen on a 5GHz AP with no scan happening at all, with the following dmesg this time:

 

[ 8164.497278] ------------[ cut here ]------------
[ 8164.497302] NETDEV WATCHDOG: mlan0 (wlan_pcie): transmit queue 2 timed out
[ 8164.497370] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:485 dev_watchdog+0x34c/0x354
[ 8164.497373] Modules linked in:
[ 8164.497383] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.72-00057-g14d950bce0d7-dirty #14
[ 8164.497387] Hardware name: Atmark-Techno Armadillo-IoT Gateway G4 Board (DT)
[ 8164.497392] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
[ 8164.497398] pc : dev_watchdog+0x34c/0x354
[ 8164.497403] lr : dev_watchdog+0x34c/0x354
[ 8164.497406] sp : ffff80001300bd60
[ 8164.497411] x29: ffff80001300bd60 x28: ffff800011145160
[ 8164.497419] x27: 00000000ffffffff x26: ffff800012611000
[ 8164.497429] x25: ffff00000825a440 x24: ffff000004aa7d00
[ 8164.497436] x23: ffff000004aa7c40 x22: ffff00000825a39c
[ 8164.497446] x21: ffff800012bf7000 x20: ffff00000825a000
[ 8164.497453] x19: 0000000000000002 x18: ffffffffffffffff
[ 8164.497460] x17: 0000000000000000 x16: 0000000000000000
[ 8164.497469] x15: ffffffffffffffff x14: ffff800012f745e8
[ 8164.497477] x13: fffffffffffd4017 x12: ffff800012c84058
[ 8164.497486] x11: ffffffffffff4018 x10: ffff800012c84058
[ 8164.497494] x9 : ffff8000100ab898 x8 : 00000000ffffefff
[ 8164.497502] x7 : ffff800012c84058 x6 : 0000000000000000
[ 8164.497509] x5 : 0000000000000000 x4 : 0000000000000000
[ 8164.497518] x3 : 00000000ffffffff x2 : 0000000000000000
[ 8164.497525] x1 : 0000000000000000 x0 : ffff800012c0e7c0
[ 8164.497533] Call trace:
[ 8164.497539]  dev_watchdog+0x34c/0x354
[ 8164.497547]  call_timer_fn+0x3c/0x1f4
[ 8164.497553]  __run_timers.part.0+0x22c/0x2f0
[ 8164.497558]  run_timer_softirq+0x44/0x7c
[ 8164.497563]  __do_softirq+0x128/0x3dc
[ 8164.497569]  irq_exit+0xe8/0x110
[ 8164.497575]  __handle_domain_irq+0x88/0xe0
[ 8164.497580]  gic_handle_irq+0xc8/0x148
[ 8164.497584]  el1_irq+0xcc/0x180
[ 8164.497590]  cpuidle_enter_state+0xbc/0x444
[ 8164.497594]  cpuidle_enter+0x40/0x5c
[ 8164.497601]  do_idle+0x220/0x2b0
[ 8164.497608]  cpu_startup_entry+0x2c/0x60
[ 8164.497614]  rest_init+0xdc/0xec
[ 8164.497621]  arch_call_rest_init+0x18/0x24
[ 8164.497628]  start_kernel+0x588/0x5c0
[ 8164.497631] ---[ end trace 3abe7692cf9abf0c ]---
[ 8164.497638] 4296933312 : mlan0 (bss=0): Tx timeout (1)
[ 8175.504919] 4296936064 : mlan0 (bss=0): Tx timeout (2)
[ 8186.256576] 4296938752 : mlan0 (bss=0): Tx timeout (3)
[ 8196.496269] 4296941312 : mlan0 (bss=0): Tx timeout (4)
[ 8207.503920] 4296944064 : mlan0 (bss=0): Tx timeout (5)
[ 8218.255578] 4296946752 : mlan0 (bss=0): Tx timeout (6)
[ 8228.495255] 4296949312 : mlan0 (bss=0): Tx timeout (7)
[ 8239.502919] 4296952064 : mlan0 (bss=0): Tx timeout (8)
[ 8250.254581] 4296954752 : mlan0 (bss=0): Tx timeout (9)
[ 8260.493956] 4296957312 : mlan0 (bss=0): Tx timeout (10)
[ 8260.499224] ------------mlan_debug_info-------------
[ 8260.504205] mlan_processing =0
[ 8260.507265] main_lock_flag =0
[ 8260.510240] main_process_cnt =1402184
[ 8260.513900] delay_task_flag =0
[ 8260.516962] mlan_rx_processing =0
[ 8260.520282] rx_pkts_queued=0
[ 8260.523173] tx_pkts_queued=0
[ 8260.526058] fw_hang_report = 0
[ 8260.529110] num_cmd_timeout = 0
[ 8260.532263] Timeout cmd id = 0x0, act = 0x0
[ 8260.536451] last_cmd_index = 7
[ 8260.539517] last_cmd_id = 0x107 0x107 0x107 0x107 0xa4 0xb 0x16 0x7f 0x107 0x107
[ 8260.547006] last_cmd_act = 0x1 0x1 0x1 0x1 0x0 0x0 0x0 0x0 0x1 0x1
[ 8260.553282] last_cmd_resp_index = 7
[ 8260.556778] last_cmd_resp_id = 0x8107 0x8107 0x8107 0x8107 0x80a4 0x800b 0x8016 0x807f 0x8107 0x8107
[ 8260.566006] last_event_index = 8
[ 8260.569236] last_event = 0x7f 0x55 0x55 0x58 0x55 0x55 0x58 0x7f 0x55 0x58
[ 8260.569239] num_data_h2c_failure = 0
[ 8260.579783] num_cmd_h2c_failure = 0
[ 8260.583280] num_alloc_buffer_failure = 0
[ 8260.587206] num_pkt_dropped = 0
[ 8260.590355] txbd_rdptr=0xbbf txbd_wrptr=0xc3f
[ 8260.594716] rxbd_rdptr=0x1283 rxbd_wrptr=0x1303
[ 8260.599253] eventbd_rdptr=0x33e event_wrptr=0x346
[ 8260.603961] last_wr_index:63
[ 8260.606850] TxRx BD size:128
[ 8260.609729] num_event_deauth = 0
[ 8260.612964] num_event_disassoc = 0
[ 8260.616370] num_event_link_lost = 0
[ 8260.619865] num_cmd_deauth = 1
[ 8260.622922] num_cmd_assoc_success = 2
[ 8260.626590] num_cmd_assoc_failure = 0
[ 8260.630256] num_cons_assoc_failure = 0
[ 8260.634015] cmd_resp_received = 0
[ 8260.637327] event_received = 0
[ 8260.640389] max_tx_buf_size = 4096
[ 8260.643793] tx_buf_size = 3103
[ 8260.646855] curr_tx_buf_size = 3103
[ 8260.650347] bypass_pkt_count=672
[ 8260.653573] data_sent=1 cmd_sent=0
[ 8260.656984] data_sent_cnt=1554
[ 8260.660043] ps_mode=1 ps_state=0
[ 8260.663281] wakeup_dev_req=0 wakeup_tries=0 wakeup_timeout=0
[ 8260.668945] hs_configured=0 hs_activated=0
[ 8260.673050] pps_uapsd_mode=0 sleep_pd=0
[ 8260.676889] tx_lock_flag = 0
[ 8260.679779] port_open = 1
[ 8260.682403] scan_processing = 0
[ 8260.685542] txbd: rdptr=0xbbf wrptr=0xc3f
[ 8260.689559] rxbd: rdptr=0x1283 wrptr=0x1303
[ 8260.693745] eventbd: rdptr=0x33e wrptr=0x346
[ 8260.698024] ------------mlan_debug_info End-------------
[ 8260.703343] Driver version = PCIE9098--17.92.5.p11-MM5X17283.p2-GPL-(FP92)
[ 8260.710312] main_state = 4
[ 8260.713017] ioctl_pending = 0
[ 8260.715992] tx_pending = 800
[ 8260.718878] wmm_tx_pending[0] = 0
[ 8260.722201] wmm_tx_pending[1] = 0
[ 8260.725514] wmm_tx_pending[2] = 800
[ 8260.729011] wmm_tx_pending[3] = 0
[ 8260.732330] rx_pending = 0
[ 8260.735043] lock_count = 62
[ 8260.737834] malloc_count = 61
[ 8260.740809] mbufalloc_count = 136
[ 8260.744127] malloc_cons_count = 4
[ 8260.747450] hs_skip_count = 0
[ 8260.750421] hs_force_count = 0
[ 8260.753473] Media state = "Connected"
[ 8260.757143] carrier on
[ 8260.759508] tx queue 0: started
[ 8260.762657] tx queue 1: started
[ 8260.765796] tx queue 2: stopped
[ 8260.768945] tx queue 3: started
[ 8260.772089] mlan0: num_tx_timeout = 10
[ 8260.775846] uap0: num_tx_timeout = 0
[ 8260.779426] wfd0: num_tx_timeout = 0
[ 8260.783009] Start to process hanging
[ 8260.786827] Cancel all pending cmd!
[ 8260.786921] Block woal_cfg80211_del_key in abnormal driver state
[ 8260.793073] Block woal_cfg80211_del_key in abnormal driver state
[ 8260.799118] Block woal_cfg80211_del_key in abnormal driver state
[ 8260.805250] Block woal_cfg80211_del_key in abnormal driver state
[ 8260.811276] Block woal_cfg80211_del_key in abnormal driver state
[ 8260.817286] Block woal_cfg80211_del_key in abnormal driver state
[ 8260.824945] get fw info failed! status=-1, error_code=0x0
[ 8260.830377] 11D: Error setting domain info in FW
[ 8260.835010] 11D: Error setting domain info in FW
[ 8260.924665] Fail to set EXTCAP IE
[ 8260.928011] Fail to set beacon proberesp assoc probereq IES
[ 8260.933602] Fail to set scan request IE
[ 8261.939443] Fail to set EXTCAP IE
[ 8261.942799] Fail to set beacon proberesp assoc probereq IES

 

So the problem is definitely not limited to scans.

This all also reminded me of another thead I opened in the past:
https://community.nxp.com/t5/Wireless-Connectivity/PCI9098-mxm-mwifiex-trigger-dump-when-connected-o...
which asked me to test with sleep mode disabled by adding the following parameters: "ps_mode=2 auto_ds=2"

I've been running with 2 days with these parameters set and so far have not reproduced the problem with that workaround, so this is a likely culprit. I'll update this topic if it happens again anyway or in a few more days if it still didn't happen as 2 days aren't enough to be certain.

Thanks.

0 Kudos
16 Replies

2,004 Views
martinetd
Contributor IV

This unfortunately will not help actually reproduce the problem, but when testing a workaround for another problem I noticed that this crash no longer happens if the interface is regularly reset (ip link set mlan0 down / ip link set mlan0 up) so this looks like a leak in the firmware under some conditions (since you cannot reproduce, I would assume it's related to either something in the background or some options enabled on my AP).

Cycling the interface brings its own set of problems (the SSID we're connected to sometimes no longer appears in scans), I'll open a new thread for this...

0 Kudos

3,071 Views
weidong_sun
NXP TechSupport
NXP TechSupport

Hello @martinetd ,

 

Yes, we have several options for connecting STA to APs.

such as , wpa_supplicant / networkmanager / iwd / iw etc.

But when we use one of them , other services should be masked or stopped. otherwise, multiple service will send scan command to wifi driver & firmware.

 

Regards,

weidong

0 Kudos

3,068 Views
martinetd
Contributor IV
Hi @weidong_sun.

I do not use multiple services at the same time.

Networkmanager uses wpa_supplicant (or iwd) and the same API as iw in the background.
My reproducer above was based on activity I have seen networkmanager doing and reproduced without it for clarity, but the problem happens with networkmanager alone, just as it happens with wpa_supplicant alone (in 5GHz) or wpa_supplicant + iw scan (in 2.4GHz)

Since the problems do not seem to happen with "modprobe moal ps_mode=2 auto_ds=2" I am suspecting a firmware bug.

Thank you.
0 Kudos

3,057 Views
martinetd
Contributor IV

Well, that was quick this time; I've attached full logs for wifi failing on 5GHz with wpa_supplicant.

This times the only errors in dmesg are as follow (no backtrace, yet another pattern of error...)

[37486.903088] Error in packet length.
[37486.906588] Error in packet length: total_pkt_len = 67, pkt_len = 87
[37486.912992] Error in packet length.
[37486.916629] Error in packet length: total_pkt_len = 49, pkt_len = 64

 

(anything after the first driver dump can be ignored and was caused by me getting a fw_dump)

 

The failures coming from multiple places really don't make sense to me from a software perspective, it looks like a corruption in the PCI link but only when sleep mode is enabled.

Let me also remind you that these traces are with the imx8mp evk with the BSP kernel and a single PCI workaround patch to make our revision work, but this also happens on our board with production soc without that patch. We also passed PCI compliance testing and our hardware team report no issue on the signal, so there really must be a problem when the firmware goes to sleep as it only happens when sleep is left enabled.

Since the driver didn't crash this time, perhaps something can be obtained from proc-mwlan-postfwdump/adapter0/fw_dump in the archive? I cannot do any analysis on these without NXP internal tooling.

Thank you,

Dominique

0 Kudos

3,053 Views
martinetd
Contributor IV

hmm, unfortunately while the situation looked better with sleep mode disabled I got another crash just now with our standard software stack and sleep mode disabled, so it wasn't a sufficient workaround....

 

But with yet another signature in dmesg:

[60825.919526] Block woal_cfg80211_del_key in abnormal driver state
[60825.925558] Block woal_cfg80211_del_key in abnormal driver state
[60825.931593] Block woal_cfg80211_del_key in abnormal driver state
[60825.937613] Block woal_cfg80211_del_key in abnormal driver state
[60825.943649] Block woal_cfg80211_del_key in abnormal driver state
[60825.949661] Block woal_cfg80211_del_key in abnormal driver state
[60825.957921] get fw info failed! status=-1, error_code=0x0
[60825.963372] 11D: Error setting domain info in FW
[60825.968028] 11D: Error setting domain info in FW
[60825.972689] get fw info failed! status=-1, error_code=0x0
[60825.978099] 11D: Error setting domain info in FW
[60825.982728] 11D: Error setting domain info in FW
[60826.056690] Fail to set EXTCAP IE
... follows usual scan timeout + fw dump messages

 

in 5GHz networkmanager does not issue regular scans, so this should be identical to the wpa_supplicant reproducer except that networkmanager might be enabling more client features?... I am not sure how to check that.

This time a fw dump happened naturally, please find it 20220915_es3/proc-mwlan/adapter0/fw_dump in the archive.

This happened with our software stack so the kernel version is higher but the wireless driver has been copied as is and is identical.

 

I am sorry for the continuous messages but we would like to sell our product ASAP and I am out of ideas at this point...

Thank you.

0 Kudos

3,037 Views
weidong_sun
NXP TechSupport
NXP TechSupport

Hello @martinetd ,

 

I am testing 5G, could you tell me how long it will take for this problem to recur?

I am using configuration file: lib/firmware/nxp/wifi_mod_para.conf to be module parameters.

===================

[ 67.366144] mlan: loading out-of-tree module taints kernel.
[ 67.403666] wlan: Loading MWLAN driver
[ 67.408480] wlan: Driver loaded successfully
[ 67.408629] wlan: Register to Bus Driver...
[ 67.417408] wlan_pcie 0000:01:00.0: enabling device (0000 -> 0002)
[ 67.425076] Attach moal handle ops, card interface type: 0x206
[ 67.431938] PCIE9098: init module param from usr cfg
[ 67.437032] card_type: PCIE9098, config block: 0
[ 67.441692] cfg80211_wext=0xf
[ 67.444676] max_vir_bss=1
[ 67.447327] cal_data_cfg=none
[ 67.450353] ps_mode = 1
[ 67.452806] auto_ds = 1
[ 67.455310] host_mlme=enable
[ 67.458224] mac_addr=00:50:43:20:12:34
[ 67.462038] fw_name=nxp/pcieuart9098_combo_v1.bin
[ 67.466787] rx_work=1 cpu_num=6
[ 67.469971] Attach mlan adapter operations.card_type is 0x206.
[ 67.477222] Request firmware: nxp/pcieuart9098_combo_v1.bin
[ 68.063907] FW download over, size 620880 bytes
[ 69.577186] WLAN FW is active
[ 69.580159] on_time is 69575020250
[ 69.597347] fw_cap_info=0x81c3fa3, dev_cap_mask=0xffffffff
[ 69.602864] max_p2p_conn = 8, max_sta_conn = 64
[ 69.619031] wlan: version = PCIE9098--17.92.5.p11-MM5X17283.p2-GPL-(FP92)
[ 69.627444] wlan_pcie 0000:01:00.1: enabling device (0000 -> 0002)
[ 69.633950] Attach moal handle ops, card interface type: 0x206
[ 69.640740] PCIE9098: init module param from usr cfg
[ 69.645804] card_type: PCIE9098, config block: 1
[ 69.650445] cfg80211_wext=0xf
[ 69.653424] max_vir_bss=1
[ 69.656043] cal_data_cfg=none
[ 69.659040] ps_mode = 1
[ 69.661525] auto_ds = 1
[ 69.663980] host_mlme=enable
[ 69.666898] mac_addr=00:50:43:20:52:56
[ 69.670690] fw_name=nxp/pcieuart9098_combo_v1.bin
[ 69.675552] rx_work=1 cpu_num=6
[ 69.678783] Attach mlan adapter operations.card_type is 0x206.
[ 69.689195] Request firmware: nxp/pcieuart9098_combo_v1.bin
[ 69.695395] WLAN FW already running! Skip FW download
[ 69.700494] WLAN FW is active
[ 69.703478] on_time is 69698340375
[ 69.720917] fw_cap_info=0x81c3fa3, dev_cap_mask=0xffffffff
[ 69.726475] max_p2p_conn = 8, max_sta_conn = 64
[ 69.740120] wlan: version = PCIE9098--17.92.5.p11-MM5X17283.p2-GPL-(FP92)
[ 69.748458] wlan: Register to Bus Driver Done
[ 152.366586] wlan: mlan0 START SCAN
[ 155.498255] wlan: SCAN COMPLETED: scanned AP count=5
[ 160.510746] wlan: mlan0 START SCAN
[ 164.783428] wlan: SCAN COMPLETED: scanned AP count=5
[ 169.795588] wlan: mlan0 START SCAN
[ 174.071025] wlan: SCAN COMPLETED: scanned AP count=5
[ 174.077890] wlan: sched scan start
[ 220.118034] wlan: sched scan stop
[ 288.923298] audit: type=1006 audit(1616581801.420:4): pid=792 uid=0 old-auid=4294967295 auid=0 tty=(none) old-ses=4294967295 ses=3 res=1
[ 563.037423] wlan: mlan0 START SCAN
[ 567.318205] wlan: SCAN COMPLETED: scanned AP count=6
[ 567.388489] mlan0:
[ 567.388505] wlan: HostMlme Auth received from dc:XX:XX:XX:a4:7b
[ 567.418198] wlan: HostMlme mlan0 Connected to bssid dc:XX:XX:XX:a4:7b successfully
[ 568.608179] mlan0:
[ 568.608196] wlan: Send EAPOL pkt to dc:XX:XX:XX:a4:7b
[ 568.618420] mlan0:
[ 568.618431] wlan: Send EAPOL pkt to dc:XX:XX:XX:a4:7b
[ 568.626990] IPv6: ADDRCONF(NETDEV_CHANGE): mlan0: link becomes ready

===============================

 

Regards,

weidong

0 Kudos

2,496 Views
martinetd
Contributor IV

hi @weidong_sun 

 

For what it's worth I've tried again with our kernel and the lf-5.15.52-2.1.0 driver release, which does not help.

 

Everything seems to fail from here:

[80784.534747] Block woal_cfg80211_del_key in abnormal driver state
[80784.540783] Block woal_cfg80211_del_key in abnormal driver state
[80784.546811] Block woal_cfg80211_del_key in abnormal driver state
[80784.552841] Block woal_cfg80211_del_key in abnormal driver state
[80784.558864] Block woal_cfg80211_del_key in abnormal driver state
[80784.564876] Block woal_cfg80211_del_key in abnormal driver state
[80784.570916] IOCTL failed: 00000000229b08ae id=0x10000, sub_id=0x10003 action=1, status_code=0x80000007
[80784.580408] get fw info failed! status=-1, error_code=0x0
[80784.585849] 11D: Error setting domain info in FW
[80784.590496] 11D: Error setting domain info in FW

I've attached logs as usual, this time a fw_dump happened naturally so it is also included.

I've also just obtained access to the NDA version of the driver and will be trying that next.

 

(Meanwhile I've also been running for 30 days with power savings disabled and this only happened once (instead of once every 1-3 days), so we will keep going with power savings disabled but I would really like to see this fixed as it doesn't look like that is 100% safe even then)

 

Perhaps running with more debug logs on my side would help if you cannot reproduce?

 

Thank you

Tags (1)
0 Kudos

3,031 Views
martinetd
Contributor IV

hi @weidong_sun 

> I am testing 5G, could you tell me how long it will take for this problem to recur?

Thank you. This is quite hard to reproduce, these crashes takes 2 or 3 days to happen.

For the traces I gave in the two updates it happened after 10 and 16 hours respectively but that was rather fast; when I first tried to reproduce the issue we have had no crash for 16 days in a row so this is quite variable... I think it depends on noise in the environment as it seems to happen more often now I have started using 3 devices in parallel to test. Depending on your environment (e.g. AP capabilities) it is possible you will not be able to reproduce at all, my AP reportedly has wifi 6 capabilities enabled.

If there is any test firmware available that could produce more information from the fw_dump I would be happy to run any extra test for you

Thank you

0 Kudos

3,026 Views
weidong_sun
NXP TechSupport
NXP TechSupport

Hello @martinetd ,

 

According to your description, it is really hard to reproduce it. up to now I have tested it for 4 hours.

would you please try to test the latest bsp of i.mx8mp? 

the driver is released together with linux bsp.

Regards,

weidong

 

0 Kudos

3,022 Views
martinetd
Contributor IV
I will not have time to set BSP up before this weekend but I will test next week and report.
0 Kudos

3,001 Views
weidong_sun
NXP TechSupport
NXP TechSupport

Hello @martinetd ,

 

1. About ps_mode & auto_ds

ps_mode=1 & auto_ds=1, means that power saving mode & auto deep sleep mode are both enabled.

ps_mode=2 & audo_ds=2 meant that power saving mode & auto deep sleep mode are both disabled.

2. parameters that I used in my test 

PCIE9098_0 = {
cfg80211_wext=0xf
max_vir_bss=1
cal_data_cfg=none
ps_mode=1
auto_ds=1
host_mlme=1
mac_addr=00:50:43:20:12:34
fw_name=nxp/pcieuart9098_combo_v1.bin
}

PCIE9098_1 = {
cfg80211_wext=0xf
max_vir_bss=1
cal_data_cfg=none
ps_mode=1
auto_ds=1
host_mlme=1
mac_addr=00:50:43:20:52:56
fw_name=nxp/pcieuart9098_combo_v1.bin
}

3. My test result

So far, I have been testing 5G for 72 hours without any problems.

 

Regards,

weidong

 

0 Kudos

2,954 Views
martinetd
Contributor IV

Thank you for running the test. It looks like it might be related to something in my office environment...

I have just setup our evk board with the latest BSP (by date) which is 5.10.72-2.2.2, please tell me if using the 5.15.32-2.0.0 would be better.

I will report back on Monday with the test result so far, unless I can observe an earlier crash.

Thanks,

Dominique

0 Kudos

2,941 Views
martinetd
Contributor IV

I was able to reproduce overnight with the BSP. My guess would be that our AP has some wifi6 features that you might not have available on your test environment, or noise in our office makes these problems much easier to hit?

I've attached 20220921_evk.tar.gz with the usual traces but here's the only interesting part form dmesg (anything past "=== START DRIVER INFO DUMP===" has been caused by me)

[12869.619569] Error in packet length.
[12869.623069] Error in packet length: total_pkt_len = 66, pkt_len = 86
[12950.417213] Error in packet length.
[12950.420787] Error in packet length: total_pkt_len = 91, pkt_len = 101
[12980.848354] Error in packet length.
[12980.851854] Error in packet length: total_pkt_len = 67, pkt_len = 87
[13000.911763] Error in packet length.
[13000.915263] Error in packet length: total_pkt_len = 64, pkt_len = 84

 

This is with the latest (in time) BSP, 5.10.72-2.2.2:

 - uboot 2021.04-lf_v2021.04+g8372631b28

 - linux https://github.com/atmark-techno/linux-5.10-at/commits/nxp_5.10.72-2.2.2-repro which is exactly lf-5.10.72-2.2.2 + pci workaround for older soc revision. We do not need that extra patch for our boards with a released chip, so it must not be related.

 - moal.ko, firmware and userspace used are exact tarball content from BSP archive from NXP site: LF_v5.10.72-2.2.2_images_IMX8MPEVK/imx-image-full-imx8mpevk.tar.bz2

Thank you

0 Kudos

2,850 Views
martinetd
Contributor IV

hi @weidong_sun 

Have you been able to look at the traces I provided?

I'd be happy to run any other test if required; it's really unfortunate that this is so hard to reproduce but if this happens here it is bound to happen for some of our clients as well...

Thanks,

Dominique

0 Kudos

3,079 Views
weidong_sun
NXP TechSupport
NXP TechSupport

Hello @martinetd ,

 

If you only use wpa_supplicant, don't use iw command, does the issue occur?

 

Regards,

weidong

0 Kudos

3,075 Views
martinetd
Contributor IV

The last backtrace on a 5GHz AP happens with just wpa_supplicant and no scan at all (only curl commands), yes. I will set the evk up to record traces for this one if you'd like, it usually takes 3-4 days.

The issue I provided stack traces for requires scans, but these naturally happen when we use networkmanager so not running scan is not convenient for us.

 

Thank you

0 Kudos