Our application has two types of devices: The nodes have a KW41Z processor and run a custom firmware. The gateway uses the Host Controlled Device firmware from the SDK on a KW41Z to route the data through a Linux system to the internet. Both firmwares were built with recent versions of the MCUXpresso SDK (from 16.04.2018). EDIT: Sorry, correction: The offending node firmware still used the SDK released on 19.01.2018.
The node firmware is in a deep sleep state most of the time. Only after some interval does it wake up, initialize the Thread stack, attach to the network, send some data, disconnect from the network, reset the Thread stack, and go back to sleep. The gateway is always on and manages the network as the leader. The network is precommissioned.
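To make the wake cycle concrete, here is a minimal sketch of it as a state machine. This is an illustration, not our actual firmware: the names `phase_t` and `next_phase` are made up for this post, and the rule that a failed step jumps straight to the stack reset is an assumption of the sketch. The only real SDK call involved is THR_SoftwareReset(), which the firmware calls in the reset phase.

```c
#include <assert.h>
#include <stdbool.h>

/* Phases of one wake cycle, in the order described above.
 * All names are hypothetical and only used for this illustration. */
typedef enum {
    PHASE_SLEEP,        /* deep sleep until the wake-up interval expires */
    PHASE_INIT_STACK,   /* initialize the Thread stack */
    PHASE_ATTACH,       /* attach to the precommissioned network */
    PHASE_SEND,         /* send the UDP data, wait for the response */
    PHASE_DETACH,       /* disconnect from the network */
    PHASE_RESET_STACK,  /* real firmware: THR_SoftwareReset() is called here */
    PHASE_COUNT
} phase_t;

/* Returns the phase that follows `current`.
 * `ok` tells whether the current phase succeeded. Assumption of this
 * sketch: any failed phase skips ahead to the stack reset, so the node
 * always resets the stack before it goes back to sleep. */
phase_t next_phase(phase_t current, bool ok)
{
    assert(current < PHASE_COUNT);

    if (current == PHASE_SLEEP) {
        return PHASE_INIT_STACK;      /* wake-up timer fired */
    }
    if (current == PHASE_RESET_STACK) {
        return PHASE_SLEEP;           /* cycle finished, success or not */
    }
    if (!ok) {
        return PHASE_RESET_STACK;     /* abort the cycle */
    }
    return (phase_t)(current + 1);    /* normal progression */
}
```

The point of writing it this way is that everything the Thread stack keeps between PHASE_RESET_STACK and the next PHASE_INIT_STACK is state that survives a cycle, which is where I suspect the problem is.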
In one installation we found that, after running flawlessly for almost a month, the nodes could no longer communicate with the gateway. We drove to the installation and ran a 6LoWPAN sniffer. Here is the complete listing of the communication we saw:
The attachment to the precommissioned network works. The node then sends its first UDP packet from port 3002 to the gateway, port 61630 (Packet 13). That packet is ACK'ed. It uses EIDs as IP addresses. The gateway then sends Thread Address Queries on port 61631 as described in the Thread specification. Those queries are not answered, so the gateway cannot resolve the node's EID to an RLOC and can therefore never send the response UDP packet.
We then updated the node firmware. The new firmware also uses the 16.04.2018 SDK. The gateway was not reset or changed at all. Now the communication works and looks as follows:
The UDP request is in Packet 15. The gateway then sends the Address Query with that EID. The node responds with an Address Notification that contains the EID and its RLOC. The gateway then sends the UDP response in Packet 24.
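The dependency between the Address Notification and the UDP response can be sketched as a small EID-to-RLOC cache on the gateway side. This is only my mental model of what the stack has to do, not code from the SDK: `eid_cache_t`, `cache_lookup`, `cache_on_notification`, and the cache size are all hypothetical. The response can only be sent once the lookup succeeds, and the lookup can only succeed after a notification has filled the cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EID_LEN     16  /* an EID is a full IPv6 address */
#define CACHE_SLOTS 8   /* arbitrary size chosen for this sketch */

typedef struct {
    bool     used;
    uint8_t  eid[EID_LEN];
    uint16_t rloc16;    /* short routing locator reported for that EID */
} cache_entry_t;

typedef struct {
    cache_entry_t slot[CACHE_SLOTS];
    size_t        next_victim;  /* round-robin eviction index */
} eid_cache_t;

void cache_init(eid_cache_t *c)
{
    assert(c != NULL);
    memset(c, 0, sizeof *c);
}

/* Returns true and writes *rloc16 if the EID is resolved. A false result
 * means the sender has to issue an Address Query and hold the packet. */
bool cache_lookup(const eid_cache_t *c, const uint8_t eid[EID_LEN],
                  uint16_t *rloc16)
{
    assert(c != NULL && eid != NULL && rloc16 != NULL);
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (c->slot[i].used && memcmp(c->slot[i].eid, eid, EID_LEN) == 0) {
            *rloc16 = c->slot[i].rloc16;
            return true;
        }
    }
    return false;
}

/* Called when an Address Notification for `eid` arrives. Updates an
 * existing entry, else takes a free slot, else evicts round-robin. */
void cache_on_notification(eid_cache_t *c, const uint8_t eid[EID_LEN],
                           uint16_t rloc16)
{
    assert(c != NULL && eid != NULL);
    cache_entry_t *target = NULL;
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        cache_entry_t *e = &c->slot[i];
        if (e->used && memcmp(e->eid, eid, EID_LEN) == 0) {
            target = e;             /* existing mapping wins */
            break;
        }
        if (!e->used && target == NULL) {
            target = e;             /* remember the first free slot */
        }
    }
    if (target == NULL) {
        target = &c->slot[c->next_victim];
        c->next_victim = (c->next_victim + 1) % CACHE_SLOTS;
    }
    target->used = true;
    memcpy(target->eid, eid, EID_LEN);
    target->rloc16 = rloc16;
}
```

In the failing log the equivalent of `cache_on_notification` is never triggered, so the lookup for the node's EID keeps failing and the response is never sent. In the working log the notification arrives and the response follows in Packet 24.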
I checked the contents of the Address Queries and Notifications. The queries are correct and identical in both logs, except for the EID: they always contain the EID of the sender of the preceding UDP packet.
The Notifications in the second log are also correct.
1. What could lead to the node not responding to the Address Queries? The radio connection definitely works and has worked before. The Thread stack in the node must receive the query and then either fail to respond or choose to ignore it. Why? It is interesting that the problem occurred after several days, but then all at once for all four nodes in that installation. They were started at the same time and wake up with the same interval, so maybe some counter in the Thread stack is not reset properly when I call THR_SoftwareReset() and eventually overflows, leading to this behaviour? Is there some other subsystem that I have to reinitialize properly? Any idea will help.
2. In the second log, the gateway sends two more Address Queries even though the node has already answered the first one. Those two queries seem unnecessary. Why does the Thread stack send them?
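Regarding the counter theory in question 1: a quick way to check whether it is plausible is to compute how long a counter of a given width survives if it is incremented on every wake cycle and never cleared by the reset. The sketch below does that arithmetic. It is pure speculation on my side: I do not know whether such a counter exists in the stack, and the function names and parameters are made up for this post.

```c
#include <assert.h>
#include <stdint.h>

/* Number of wake cycles after which a counter of `bits` width has
 * wrapped, if it advances by `increments_per_wake` on every cycle and
 * is never cleared in between (hypothetical scenario). */
uint64_t wakes_until_wrap(unsigned bits, uint64_t increments_per_wake)
{
    assert(bits >= 1 && bits <= 63);
    assert(increments_per_wake > 0);

    uint64_t range = 1ULL << bits;  /* number of distinct counter values */
    /* Ceiling division: first cycle whose total reaches the range. */
    return (range + increments_per_wake - 1) / increments_per_wake;
}

/* Same figure expressed in days for a given wake interval in seconds. */
double days_until_wrap(unsigned bits, uint64_t increments_per_wake,
                       double wake_interval_s)
{
    assert(wake_interval_s > 0.0);
    return (double)wakes_until_wrap(bits, increments_per_wake)
           * wake_interval_s / 86400.0;
}
```

With purely illustrative numbers (not our real interval), a 16-bit counter incremented once per wake at a 60 s interval would wrap after roughly 45.5 days. Plugging in the real wake interval would show whether "almost a month" lines up with a common counter width, which would at least explain why all four nodes failed at the same time.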