Our application has two types of devices. The nodes have a KW41Z processor and run custom firmware. The gateway uses the Host Controlled Device firmware from the SDK on a KW41Z to route the data through a Linux system to the Internet. Both firmwares were built with recent versions of the MCUXpresso SDK (from 16.04.2018). EDIT: Sorry, correction: the offending node firmware still used the SDK released on 19.01.2018.
The node firmware is in a deep sleep state most of the time. After a fixed interval it wakes up, initializes the Thread stack, attaches to the network, sends some data, disconnects from the network, resets the Thread stack, and goes back to sleep. The gateway is always on and manages the network as the leader. The network is precommissioned.
In one installation we found that, after running flawlessly for almost a month, the nodes could no longer communicate with the gateway. We drove to the installation and ran a 6LoWPAN sniffer. Here is the complete listing of the communication we saw:
The attachment to the precommissioned network works. The node then sends its first UDP packet from port 3002 to the gateway, port 61630 (packet 13). That packet is ACK'ed. It uses EIDs as IP addresses. The gateway then sends Thread Address Queries on port 61631, as described in the Thread specification. Those queries aren't answered, so the gateway cannot resolve the node's EID to an RLOC and can therefore never send the UDP response packet.
We then updated the node firmware. The new firmware also uses the 16.04.2018 SDK. The gateway was not reset or changed at all. Now the communication works and looks as follows:
The UDP request is in packet 15. The gateway then sends the Address Query with that EID. The node responds with an Address Notification that contains the EID and its RLOC. The gateway then sends the UDP response in packet 24.
I checked the contents of the address queries and notifications. The Queries are correct and the same in both communications (except the EID: Of course they always contain the EID of the sender of the previous UDP packet.)
The Notifications in the second log are also correct.
1. What could lead to the node not responding to the Address Queries? The radio connection definitely works and has worked before. The Thread stack in the node must receive the query and then either fail to respond or choose to ignore it. Why? It is interesting that the problem occurred only after several weeks, but then all at once for all four nodes in that installation. They were started at the same time and wake up with the same interval, so maybe some counter in the Thread stack is not reset properly when I call THR_SoftwareReset() and eventually overruns, leading to this behaviour? Is there some other subsystem that I have to reinitialize properly? Any idea will help.
2. In the second log, the gateway sends two more Address Queries even though the node has already answered the first one. Those two queries are therefore unnecessary. Why does the Thread stack send them?
Thanks for the details of your project. To answer your questions, it would be useful to have a better overview of the problem. Can you share a Wireshark capture with us?
One source of the issue could be related to security. The app_thread_config_config.h file contains some defines related to the key switch interval (THR_KEY_ROTATION_INTERVAL_HOURS, THR_KEY_SWITCH_GUARD_TIME_HOURS). The default values are 672 / 624 hours (roughly one month). Please check the Auxiliary Security Header of the Address Query packet to see the key index used by the gateway, and compare it with the value used by the node.
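To make that comparison concrete, here is a minimal sketch of the arithmetic. The macro and function names are local stand-ins, not SDK symbols; only the 672 / 624 hour values come from the defaults quoted above. It assumes, as I understand the Thread specification, that the key index in the Auxiliary Security Header is derived from the key sequence counter as (sequence & 0x7F) + 1, and that a device starts at sequence 0 and rotates purely on its own timer (a real network can also switch keys when it hears a higher sequence).

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the default values quoted above (not the SDK macros). */
#define KEY_ROTATION_INTERVAL_HOURS 672u
#define KEY_SWITCH_GUARD_TIME_HOURS 624u

/* Convert whole hours to whole days (integer division). */
uint32_t hours_to_days(uint32_t hours)
{
    return hours / 24u;
}

/* Assumption: key index = (key sequence & 0x7F) + 1, per the Thread spec. */
uint8_t key_index_from_sequence(uint32_t key_sequence)
{
    return (uint8_t)((key_sequence & 0x7Fu) + 1u);
}

/* Assumption: the device starts at sequence 0 and rotates only on its timer. */
uint32_t expected_key_sequence(uint32_t uptime_hours, uint32_t rotation_hours)
{
    assert(rotation_hours != 0u);
    return uptime_hours / rotation_hours;
}
```

With these assumptions, 672 hours is 28 days, so an always-on gateway would move from key index 1 to key index 2 after about four weeks, while a node that restarts from sequence 0 would still use key index 1. That is the mismatch to look for in the capture.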
It could also be an issue related to the MAC counters. This can easily be investigated by adding some code or a breakpoint in the MAC_MlmeCommStatusIndCB function (mac_abs-802.15.4.c file) to see whether you receive a packet from the MAC layer with status gMacAbsCounterError_c.
As far as I can see, your node acts as a REED device (FFD device type) with low power support. Is there any reason for not using an end device / sleepy end device? Acting as an end device, the node will use the Address Registration TLV (Child Update Request command) to register its unicast addresses with the parent (basically to tell the parent to use indirect transmission when forwarding packets destined for the addresses contained in the Address Registration TLV). In this way you avoid the use of Address Queries. (Please consider the latest KW41 SDK version released in August; it includes some library fixes for sleepy devices compared with the version from April.)
The out-of-band provisioning and rejoining mechanism that you are using could also be worth discussing. In some cases, synchronization after reset makes more sense than using disconnect and THR_SoftwareReset (which I assume is not used with the factoryReset flag set). Did you try using only the disconnect command and optionally ResetMCU (a function that physically resets the node)?
I will try to create a similar setup in my office with a reduced key rotation interval, and I will let you know about my results. In the meantime, please provide us with a Wireshark log so that we can investigate your setup further.
Thanks for your response.
I attached the two logs: one for version 1.1.1 of our software, which showed the erroneous behaviour, and one for version 1.1.8, which has not shown the behaviour (yet) after the firmware upgrade. The decryption key for IEEE 802.15.4 is the default, 0011...eeff.
The key indices seem to be okay. They are always 0x1.
I cannot recreate the issue on my desk; it appears in installations only after several weeks. Therefore, live debugging with breakpoints and variable inspection is not really an option, and I cannot check for gMacAbsCounterError_c that way. It is a good idea to add that check to the code, though. If that error appears in that callback, what should happen next? How can the software recover from that error? Should I reset the IEEE 802.15.4 stack? Should I inspect the global variables to see which counter overflowed?
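Since a breakpoint is not practical in the field, one option would be to count the error in the callback and report it with the next data packet. The following is only a sketch: every name in it is hypothetical and of my own making, APP_MAC_STATUS_COUNTER_ERROR merely stands in for gMacAbsCounterError_c, and the real SDK status type and callback signature are not reproduced here. It also does not answer the recovery question; it only records when the error first and last appeared.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the SDK status type; not the real definition. */
typedef enum {
    APP_MAC_STATUS_OK = 0,
    APP_MAC_STATUS_COUNTER_ERROR, /* stands in for gMacAbsCounterError_c */
    APP_MAC_STATUS_OTHER
} app_mac_status_t;

/* Diagnostic record; keep it in memory that survives deep sleep. */
typedef struct {
    uint32_t counter_errors;   /* number of counter errors seen so far */
    uint32_t first_wake_cycle; /* wake cycle in which the first error occurred */
    uint32_t last_wake_cycle;  /* wake cycle in which the latest error occurred */
} app_mac_diag_t;

/* Intended to be called from the comm-status callback with the reported status. */
void app_mac_diag_record(app_mac_diag_t *diag, app_mac_status_t status,
                         uint32_t wake_cycle)
{
    if (diag == NULL || status != APP_MAC_STATUS_COUNTER_ERROR) {
        return; /* only counter errors are recorded */
    }
    if (diag->counter_errors == 0u) {
        diag->first_wake_cycle = wake_cycle;
    }
    diag->last_wake_cycle = wake_cycle;
    if (diag->counter_errors < UINT32_MAX) {
        diag->counter_errors++; /* saturate instead of wrapping around */
    }
}
```

Knowing the wake cycle of the first error would at least show whether all four nodes hit it at the same point in time.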
I do call THR_SoftwareReset with the factoryReset flag set to TRUE. Resetting the MCU completely is not an option because the application has quite complex state, which would be very hard to reconstruct after a reset. In the call with Robert David, he explained to me that calling THR_SoftwareReset is not the right course of action anyway and that I should use THR_Detach and THR_NwkJoin instead. I will try that.