We've mkw01 based end nodes running the 802.15.4g stack that are mains powered (so, no need for low power modes) that are supposed to be in the RxOnWhenIdle mode all the time. But, every once in a week or sometimes even a month, one of the nodes stops receiving data. Once we force the node to Tx a packet, it starts receiving again until the next time we hit this bug.
After going through a lot of trouble trying to reproduce this issue, we finally have one node attached to the debugger that finally hit the bug! The problem, it seems, is that somehow the Phy/Radio went into idle mode even though at init we set RxOnWhenIdle to true. We went through a bunch of our event logs and figured that this node stopped receiving right after it sent a broadcast packet about its status. Almost all our communication is unicast and Ack-based but every once in a while (depending on a lot of factors related to the dynamics of the network topology) one of the nodes broadcasts a status. And right after this node sent the broadcast packet, it went into Idle. Note that over the last week or so the same node has been sending the same broadcast packet but it did not stop Rx until yesterday. So, the problem doesn't seem to happen every time a packet is broadcast.
Now, I went through some of the Phy code and found that the PhyTxPacketSentEvent *might* be suspicious. I could very well be wrong about it because I don't know the Phy code too well but it seems that if an Ack is not expected (as is the case when a packet is broadcast), the PhyState is set to Idle and the Radio never goes into Rx. I do understand that this code worked for over a week before incorrectly going into the Idle state, so, there is some other code in the Phy state machine that checks if RxOnWhenIdle is true and puts the radio in the Rx mode. But I suspect there is some unhandled race condition in/around this piece of code which is causing us a lot of pain.
Can anyone please help shed some light on this?
It's normal that after each sequence the PHY is put to idle. Can you please check thay PHY_EventSet and then OS event set never fail? After each sequence the PHY needs to run PHY task code to restart RX.
What you can do additionally is to check the next code that it's always in idle, then Rx on flag is true as this is the code that starts RxOnWhenIdle on PHY task.
/* Check if PHY can enter Idle state */
if( gIdle_c == PhyGetSeqState() )
Please let me know your results.
Thanks for taking a look at this! All of last week, we tried to reproduce the bug with the debugger attached to a device but we couldn't get it to happen. The device attached to the debugger did not fail but one other device did fail in some other testing. On inspection, we figured that this one did not stop after sending a broadcast packet. So, it seems that the issue isn't related to broadcast at all. I put a breakpoint in the PhyEnterIdle functions else where the radio stays in the Idle state. (The bold line in the code snippet below). I'll also set a breakpoint in Phy_EventSet to check for the failure case of OSA_EventSet. Please let me know if there are any other places in the code where adding a breakpoint would help.
void Phy_EnterIdle( void )
if( (TRUE == phyLocal.flags.rxOnWhenIdle) &&
(FALSE == phyLocal.flags.cslRxEnabled) &&
(FALSE == phyLocal.flags.tschEnabled) )
phyLocal.flags.idleRx = 1;
Phy_SetSequenceTiming( gPhySeqStartAsap_c, mPhyMaxIdleRxDuration_c, gRX_c, 0 );
(void)PhyPlmeRxRequest( gPhyUnslottedMode_c, (phyRxParams_t *) &phyLocal.rxParams );
phyLocal.flags.idleRx = 0;
While we really need to solve this issue, in the short term, can you suggest a work around that I can add to the application layer to detect and solve this issue at run time? For example, if I can somehow detect from the main application task that the Phy is incorrectly idle when it shouldn't be, I can tickle the Phy task by sending a dummy packet or setting a PIB or something like that. That might work because we know for certain that if a device in this bad state transmits a packet, the radio then correctly goes into RxWhenIdle. The problem is that we have an upcoming batch of a few thousand devices and we need this issue fixed before the new batch is ready to be flashed (in less than a week's time!). Updating the code on an entire batch is a pretty expensive process at this stage.
Is there a way to use UART or LEDs on your custom KW01 boards? I am thinking in send UART messages or toggling LEDs to catch what is happening at run time.
If some action on PHY task is triggered, either Tx or Rx, everything is back to normal. This tells me that RxOnWhenIdle flag is set, but the PHY task did not schedule when a PHY action/timeout ocurred.
Yes, we do have 4 LEDs that we can play around with but no UART. Where do you suggest that we toggle the LEDs?
Btw, we just happened to come across the bug with the debugger attached. I know this is going to sound very weird but rxOnWhenIdle was set to 0 and hence the Phy "correctly" went into the idle mode. Please find the attached screenshot. I'll restart the testing after setting a breakpoint in PhyPlmeSetRxOnWhenIdle, which seems to really be the only place rxOnWhenIdle is updated, to check why this value was updated. There are 2 places in the app code where rxOnWhenIdle is set but both of them set it to true, so I'm not sure how or if the app the could cause this issue.
So, however bad it does sound, for now, as a workaround, can I just check if rxOnWhenIdle is true and set it to true if it is false from the app's main loop? What could be the side effects of unnecessarily using the SAP_Handlers from the main loop?
The screenshot from the last reply was spurious. Turns out the watchdog got triggered and the breakpoint was hit from the Init code path. In that context, the rxOnWhenIdle being 0 makes complete sense. I'm extremely sorry for making you chase the wrong code path.
That said, the original problem still exists where the radio incorrectly goes into the Idle state. I'm having a hard time reproducing the bug with a debugger attached and have no additional information to share at this time. In the mean time, can you please help me with a patch where the app can check if the Phy is idle and somehow force a dummy event to retrigger the Phy task? This bug is now holding up the production and we need to urgently find a work around before flashing code on to the next batch of devices.
To catch the issue you should check where the rxOnWhenIdle is set to 0, this is a PIB in the MAC which you have R/W access to.
Let me check if a patch is feasible with the suggested behavior and I will let you know asap.
I've put a breakpoint in PhyPlmeSetRxOnWhenIdle to catch where the rxOnWhenIdle flag is updated but have had no luck with reproducing the issue. In the app code we set that PIB in 2 different code paths (both during initialization) and both of them set it to true. Also, if somehow the PIB was set to false, triggering a TX would probably not fix the problem like we've noticed every single time.
I'll look forward to the patch. Also, we're still actively trying to reproduce the problem and will let you know as soon as we find anything!
Could you please put a breakpoint at the PHY value that is 0 and should be 1? If everything is configured properly then you can read the MAC PIB to see if is also 0. If MAC is 1 and the PHY is 0 then we have a PHY issue that will be revealed by the data breakpoint.
A patch could be feasible if we detect that it's a PHY layer issue where we don't have access to.
We've been able to reproduce the problem with a debugger again! First of all, the problem isn't that rxOnWhenIdle is 0. It is 1, as expected but the mPhyState is still 0 (gIdle_c) like noted before. I've tried to gather as much PHY state as I could and I'd love to dig into this further and get you any other state that you'd like (fortunately, the debugger is still attached! And I'll try to keep it that way until we are sure we don't need it anymore).
Here's what I think I've found so far:
1. rxData was:
rxData <array>" " 0x20001A00 uint8_t
 '.' (0x02) 0x20001A00 uint8_t
 '\0' (0x00) 0x20001A01 uint8_t
 'Œ' (0x8C) 0x20001A02 uint8_t
 '\r' (0x0D) 0x20001A03 uint8_t
 '.' (0x10) 0x20001A04 uint8_t
 '€' (0x80) 0x20001A05 uint8_t
 '\0' (0x00) 0x20001A06 uint8_t
 '\0' (0x00) 0x20001A07 uint8_t
 '.' (0x0F) 0x20001A08 uint8_t
 '\0' (0x00) 0x20001A09 uint8_t
 'z' (0x7A) 0x20001A0A uint8_t
 '.' (0x1E) 0x20001A0B uint8_t
 '.' (0x04) 0x20001A0C uint8_t
 '\r' (0x0D) 0x20001A0D uint8_t
 '.' (0x10) 0x20001A0E uint8_t
 '\0' (0x00) 0x20001A0F uint8_t
 '.' (0x01) 0x20001A10 uint8_t
 '\0' (0x00) 0x20001A11 uint8_t
 'z' (0x7A) 0x20001A12 uint8_t
 '.' (0x1E) 0x20001A13 uint8_t
 '.' (0x04) 0x20001A14 uint8_t
Note that the frame control lsb is 0x02 which means that the last packet received was an ACK from the device with dst address = 0x041E7A000F000080 and src address = 0x041E7A000100100D. The src is the co-ordinator in the network that the "hung" device belongs to and its address is 0x041E7A00040003B5. Its very likely that everything in the rxData after index 2 is from the previous packet, but from our application logs on the co-ordinator, it looks like the radio hung right after the device 0x041E7A000F000080 sent a status packet to the co-ordinator, followed by the co-ordinator replying with an ACK.
2. Just to note again, mPhyState '\0' (0x00) 0x20002AD7 uint8_t
3. The phyLocal state:
phyLocal <struct> 0x1FFFF1D8 Phy_PhyLocalStruct_t
PD_MAC_SapHandler 0x0000B521 0x1FFFF1D8 PD_MAC_SapHandler_t
PLME_MAC_SapHandler 0x0000B543 0x1FFFF1DC PLME_MAC_SapHandler_t
phyTaskEventId <struct> 0x1FFFF1E0 event_t
macPhyInputQueue <struct> 0x1FFFF204 list_t
head 0x00000000 0x1FFFF204 struct listElement_tag *
tail 0x00000000 0x1FFFF208 struct listElement_tag *
size 0 0x1FFFF20C uint16_t
max 0 0x1FFFF20E uint16_t
maxFrameWaitTime 7610 0x1FFFF210 uint32_t
txParams <struct> 0x1FFFF214 phyTxParams_t
numOfCca '\0' (0x00) 0x1FFFF214 uint8_t
ackRequired gPhyRxAckRqd_c 0x1FFFF215 phyAckRequired_t
<union> 0x1FFFF218 union <Unnamed 98>
rxParams <struct> 0x1FFFF218 phyRxParams_t
timeStamp 856552102 0x1FFFF218 uint64_t
psduLength '.' (0x05) 0x1FFFF220 uint8_t
linkQuality '8' (0x38) 0x1FFFF221 uint8_t
headerLength '.' (0x03) 0x1FFFF222 uint8_t
macDataIndex '.' (0x02) 0x1FFFF223 uint8_t
fifoBlockLen '.' (0x05) 0x1FFFF224 uint8_t
phyHeader <struct> 0x1FFFF226 phyPHR_t
<union> 0x1FFFF226 union <Unnamed 69>
mask 24 0x1FFFF226 uint16_t
byteAccess <array>" " 0x1FFFF226 uint8_t
 '.' (0x18) 0x1FFFF226 uint8_t
 '\0' (0x00) 0x1FFFF227 uint8_t
<struct> 0x1FFFF226 struct <Unnamed 68>
modeSwitch '\0' (0x00) 0x1FFFF226 uint8_t
reserved '\0' (0x00) 0x1FFFF226 uint8_t
fcsType '.' (0x01) 0x1FFFF226 uint8_t
dataWhitening '.' (0x01) 0x1FFFF226 uint8_t
frameLengthRsvd '\0' (0x00) 0x1FFFF226 uint8_t
frameLength '\0' (0x00) 0x1FFFF227 uint8_t
channelParams <struct> 0x1FFFF218 phyChannelParams_t
<union> 0x1FFFF218 union <Unnamed 97>
channelStatus '¦' (0xA6) 0x1FFFF218 phyStatus_t
energyLeveldB '¦' (0xA6) 0x1FFFF218 uint8_t
ccaParam 'ò' (0xF2) 0x1FFFF219 uint8_t
flags <struct> 0x1FFFF228 phyFlags_t
<union> 0x1FFFF228 union <Unnamed 91>
mask 2053 0x1FFFF228 uint32_t
<struct> 0x1FFFF228 struct <Unnamed 89>
rxOnWhenIdle 1 0x1FFFF228 uint32_t
rxFramePending 0 0x1FFFF228 uint32_t
idleRx 1 0x1FFFF228 uint32_t
ccaBfrTX 0 0x1FFFF228 uint32_t
rxAckRqd 0 0x1FFFF228 uint32_t
autoAck 0 0x1FFFF228 uint32_t
panCordntr 0 0x1FFFF228 uint32_t
promiscuous 0 0x1FFFF228 uint32_t
activePromiscuous 0 0x1FFFF228 uint32_t
cslRxEnabled 0 0x1FFFF228 uint32_t
rxEnhAckRqd 0 0x1FFFF228 uint32_t
ccaComplete 1 0x1FFFF228 uint32_t
tschEnabled 0 0x1FFFF228 uint32_t
filterFail 0 0x1FFFF228 uint32_t
rxIsListen 0 0x1FFFF228 uint32_t
reserved 0 0x1FFFF228 uint32_t
startTime 18446744073709551615 0x1FFFF230 uint64_t
phyUnavailableQueuePos 0 0x1FFFF238 uint16_t
phyIndirectQueue <array> 0x1FFFF23A uint16_t
 0 0x1FFFF23A uint16_t
 0 0x1FFFF23C uint16_t
 0 0x1FFFF23E uint16_t
 0 0x1FFFF240 uint16_t
 0 0x1FFFF242 uint16_t
 0 0x1FFFF244 uint16_t
 0 0x1FFFF246 uint16_t
 0 0x1FFFF248 uint16_t
 0 0x1FFFF24A uint16_t
 0 0x1FFFF24C uint16_t
fcs 0x33B0 0x1FFFF24E uint16_t
macPanID <array>" þÿµ " 0x1FFFF250 uint8_t
 '\r' (0x0D) 0x1FFFF250 uint8_t
 '.' (0x10) 0x1FFFF251 uint8_t
macShortAddress <array>"þÿµ " 0x1FFFF252 uint8_t
 'þ' (0xFE) 0x1FFFF252 uint8_t
 'ÿ' (0xFF) 0x1FFFF253 uint8_t
macLongAddress <array>"µ " 0x1FFFF254 uint8_t
 'µ' (0xB5) 0x1FFFF254 uint8_t
 '.' (0x03) 0x1FFFF255 uint8_t
 '\0' (0x00) 0x1FFFF256 uint8_t
 '.' (0x04) 0x1FFFF257 uint8_t
 '\0' (0x00) 0x1FFFF258 uint8_t
 'z' (0x7A) 0x1FFFF259 uint8_t
 '.' (0x1E) 0x1FFFF25A uint8_t
 '.' (0x04) 0x1FFFF25B uint8_t
currentMacInstance '\0' (0x00) 0x1FFFF25C uint8_t
4. To put it all together: (This part I'm not very sure about!)
- The radio started receiving an ACK sent by the co-ordinator and meant for another device in the network. (because frameControlLsb = 0x02)
- While receiving and parsing the incoming packet not meant for this device, something bad happened right after the 2nd byte was received or when Phy_RxFrameFilter was called for macDataIndex == 2 (I think so because macDataIndex is stuck at 0x02 where as the frameLength is 0x05)
- I also noted that mSeqNumber is 0xAC on the debugged device but rxData == 0x8C, which indicates that whatever last ACK/packet was received was not meant for this device.
- Maybe the current packet reception was correctly aborted but the radio did not go into Rx as expected?
The device is paused in the debugger, so, feel free to ask for any additional details that you may require to debug this further. I can try for as long as possible, but cannot guarantee, to keep the debugger attached (because I've had issues where our debugger gets a mind of its own and gets disconnected once in a while). Also, if I'm on the right track with my analysis, you may want to take a look at the part where bad packets are discarded and RxRestart is called. The chances of you finding the bug, if there is one, are way higher than me! :-)
Also, we've almost run out of time and would like a fix/patch/workaround as soon as possible because our next batch is being held up for this bug and I'm not sure if and when I can reproduce this again! And besides, we need to test the patch before we can deploy, so that'll need some time as well. So, please hurry! And thanks!
Hi Rohit, sorry for the delay, I was out of office.
I think the first step it's to reproduce the issue in my side and verify that it's a PHY issue and then provide a patch to you.
It's feasible to provide your code to me and doing the debugging? How different it's your custome KW01 board than FRDM-KW01? I will do my testing in FRDM-KW01 so it could be fine if I have your schematics.
What RF parameters are you configuring? frequency, rxbandwidth, fdev, etc.
How to test your application?
The app team provided a patch for the cause and we've been testing it for about 2 weeks with 10 devices. It does look like the issue has been fixed! Thanks a ton for working with us on this issue! :-)
For the record, here's what helped fix the bug (Note that we were never able to nail the exact cause but had enough detail to come up with a patch):
In PhyTimeHandleTimeoutEvent, added a case for gIdle_c which just falls through to the gRX_c case. And in PhyRxHandleFifoLevelEvent, if the RX filter fails, send a dummyEvent to the Phy_Task before breaking out of the loop. The problem was that if an event timeout occurred in the middle of an RX restart process, the radio would stay in Idle and no event would be raised to tickle the Phy task.
Since the exact cause was never found, we'll continue to try and reproduce with a debugger attached and will let you guys know if we find any more information about it.