[88W9098] firmware bug when scanning regularly: packet replies do not come back after a while

martinetd · ‎09-08-2022

Hello,

We've been testing an AW-XM458 wireless module (88W9098) on our i.MX8 MP board and have been chasing this station problem for a while with no success: transfers work for a while but after a few days connections are lost and we cannot establish new connections. The adapter still shows connected to the AP and ip link set mlan0 down / up restores functionality, but this is not acceptable in production.

The problem happens naturally with networkmanager after a while, but we were able to reproduce it without network manager with the following setup:

- wpa_supplicant connection to AP on mlan0 (in this case 2.4GHz, confirmed similar bug with 5GHz), no wpa_supplicant option (config just defines network block with WPA2/CCMP TKIP available in attachment)
- running curl in a loop to simulate network activity.
- `iw mlan0 scan` on 2.4GHz frequencies in a loop, sleeping 300 seconds (5minutes) between each. The frequency list is to emulate what networkmanager has been doing (OBSS scans to avoid conflicts with neighbor frequencies), but when I reproduced with the 5GHz AP I kept the same scan so I believe that has no impact.

With that setup the board can run transfers successfully for anywhere from a few hours to a few weeks, but generally after 2-3 days transfers start to fail with either "host isunreachable" or "Connection timed out" depending on the variant (see below)

This happens on our board, but I've been able to reproduce it on imx8mp-evk with BSP userland and almost-BSP kernel (our evk is old and requires a patch to have pci working, we reproduce it without that patch on our board):

kernel: https://github.com/atmark-techno/linux-5.10-at/commits/nxp_5.10.72-2.2.0-repro
wifi driver: http://source.codeaurora.org/external/imx/mwifiex lf-5.10.72-2.2.2
firmware: https://github.com/NXP/imx-firmware/tree/lf-5.10.72_2.2.0/nxp/FwImage_9098_PCIE ; in dmesg PCIE9098--17.92.5.p11-MM5X17283.p2-GPL-(FP92).

I've attached all the traces I could think of; you'll find in the archive:

- setup_mlan_scan_test.sh and curl_scan_parallel.sh are our reproducer script. running this for a few days trigger the issue reliably for us.
- dmesg.log : output of dmesg -w for the whole duration of the test. The error happened at around 01:30 on the 8th, which translates to around 60000 in dmesg log but there are no errors anywhere anyway. The messages towards the end are when I was taking traces in /proc.
- setup.log : output of setup script including wpa_supplicant config used
- curl.log : output of curl_scan_parallel.sh for the whole duration of the test
- ip-a : output of ip -a to illustrate 172.16.69.235 is assigned to mlan0 for below test. This setup is a bit weird with the same subnet used for both eth0 and mlan0, but all commands properly specify --interface mlan0 and work correctly. I've reproduced this without that weird setup in the first place so this has no impact on e.g. driver behaviour.
- tcpdump-remote-host.pcap and tcpdump-remote-host.txt are pcap traces from another host illustrating the problem: we receive the SYN packets but the SYNACK reply never comes and linux keeps sending SYN. The later arp/icmp requests worked well.
- proc-mwlan/ is a copy of /proc/mwlan after having produced the bug.
- mlanctl/ since I suspect a firmware bug I wanted to dump the firmware for traces, this is the sources of the utility along with the fwdump.conf I used. There was no example provided, but it seemed to work.
- proc-mwlan-adapter0-dump/ is a new copy of fw_dump and drv_dump after the dump

More in depth analysis:

- When I first reproduced this issue, I had the exact opposite behaviour: tcp (ipv6, ipv4) would come back correctly but ARP and ICMP would not. This is the "host unreachable" pattern, which arp going through and tcp not working is the "connection timed out" pattern.
- I added debug to our kernel for analysis and determined that the normal packet reception mechanism is as follow:

woal_main_work_queue+80 
  -> mlan_main_process+1908
  -> mlan_queue_rx_work+168
  -> moal_recv_event
  -> queue_work(priv->phandle->rx_workqueue,
                &priv->phandle->rx_work);
woal_rx_work_queue+88
  -> mlan_rx_process+464
  -> wlan_handle_rx_packet+316
  -> wlan_ops_sta_process_rx_packet+520
  -> mlan_11n_rxreorder_pkt+936
  -> wlan_process_rx_packet

- When the error happened, I could confirm mlan_main_process() is called regularly but mlan_queue_rx_work is not called when the problematic packets should be incoming. It is properly called on other packets (e.g. tcp when arp does not pass). This makes me suspect that this is a bug in the firmware that does not transmit the packets it should to the driver.
- as said earlier toggling interface down and up seems to fix the issue, but in some case I've also seen the firmware recover on its own after stopping all transfers and scans for a few hours. In that case I could see curl failing to connect for a whole night so I do not believe it would recover automatically normally, but this might explain why this might not be such a big issue in normal use? It is nevertheless a showstopper for us, as we cannot stop scans easily with networkmanager and some workloads require constant transfers (e.g. video feed)

Extra notes:

- I also tried firmware 17.92.5.p8 as its release note in git noted "PCIE9098: Command timeout 0x107 observed while running Tx UDP traffic to two stations with scan triggered for every 30s in background" which might have been related, to no difference.
- As I do not believe you will be able to easily reproduce the problem, I would be more than glad to test a different firmware if you can provide one to us, we also have a NDA agreement with NXP so more details in firmware would be appreciated.
- We have another problem when the time between scans is much shorter (e.g. 5s instead of 5mins) where we get hard errors in dmesg (tx timeout or scan timeout) and the connection completely drops, I'm in the process of taking better traces for it and will open a separate thread for that as I believe the issues are different even if reproducer is close. But if you have any known issues about scans I definitely am interested.

Thank you,

Dominique Martinet

ocourson · ‎10-12-2022

Dear @martinetd

So issue seems linked with power saving, as disabling power saving is fixing it.

Can you keep power saving disabled (ps_mode=2 auto_ds=2) as a workaround, or do you really need to have power saving enabled ?

Regards,

Olivier

martinetd · ‎10-17-2022

hi @ocourson

- I haven't reproduced yet on evk with new driver/firmware, but I had taken down the http server used as remote for curl loop over the weekend so I'd like to let it run a bit longer before saying anything

- ... on the other hand, I just got "lucky" on one of my test board with power savings disabled and could apparently reproduce this after 30 days.......... This is going to be impossible to debug, but we can probably safely say that disabling power savings just made the issue harder to hit and didn't fix the root of the problem.

I'm going to convert my two other test boards to use the new firmware, with power savings left on their default, and see how it goes.

Thanks.

martinetd · ‎10-12-2022

Our current plan is to start selling our product with power saving disabled for now, but we've measured about 1.2W difference in connected idle when power saving kicks in so in the long term we'd like to enable it back when this is fixed.

ocourson · ‎10-12-2022

Dear @martinetd

Could you please test with latest Wifi driver/FW releases ?

- Wifi driver now moved from codeaurora to github:

git clone http://github.com/nxp-imx/mwifiex.git
git checkout lf-5.15.52-2.1.0

- And NXP Wifi/BT firmware:

git clone https://github.com/NXP/imx-firmware
git checkout lf-5.15.52-2.1.0

Regards,

Olivier

martinetd · ‎10-12-2022

thanks for the reply,

I've updated just the wifi driver (keeping kernel on 5.10.72-2.2.2) and firmware, and restarted the test.

It can take a few days to reproduce so I'll report back early next week if it hasn't reproduced until then.

By the way, I had been running with ps_mode=2 auto_ds=2 for the past month and haven't run into issues with that. I've kept the default values (modprobe moal without any argument or config file) for this test.

[1147417.626507] wlan: version = PCIE9098--17.92.1.p116.1-MM5X17344.p3-GPL-(FP92)

Thanks,

--

Dominique

martinetd · ‎12-14-2022

hi @ocourson

sorry for the (very) late reply; I've currently upgraded to PCIE-WLAN-UART-BT-9098-U16-X86-17.68.1.p136.13-17.26.1.p136.13-MXM5X17366_P3_V0V1-MGPL (wlan: version = PCIE9098--17.68.1.p136.13-MXM5X17366.p3-GPL-(FP68)) which is the latest version available as far as I'm aware.

With ps_mode=2 auto_ds=2, I've hit this same bug after 15 days of continuous testing, so as written above (reproduced after 30 days the other time) I think we can safely say this is not resolved.

The symptoms are exactly the same: wlan listed as connected in iw mlan0 link output, no error in logs (nothing in dmesg this time) but curl outputs "operation timed out" trying to connect as tcp -- this is the ARP working pattern -- for 5 hours until I reset the connection.

I also cannot seem to be able to reproduce it as easily with default power savings due to other bugs there and also because the connection seems less stable, I also have traces in logs that look similar but networkmanager reconnects to the AP every few hours so I haven't been able to actually catch it here. It also seems less likely than before, so perhaps something has been fixed, or maybe the conditions under which the AP I use is have changed over time, I honestly cannot say...

I understand I haven't provided enough information here to properly address the issue unless you find a way of reproducing it, and given the hurdle (multiple days of continuous testing, probably dependent on which AP we connect to), would there be some level of logs that would be interesting if we managed to reproduce it on our side with these logs? Once the problem has happened all I can prove is that we're not getting any interrupt from the device, but that doesn't help identifying a root cause.

Thank you,

Dominique Martinet

[88W9098] firmware bug when scanning regularly: packet replies do not come back after a while

[88W9098] firmware bug when scanning regularly: packet replies do not come back after a while

OS LINUX

Product: WiFi 88X9098

Protocol: Wi-Fi