Race condition in MCUXpresso USART Driver

otsl · ‎03-24-2020

The code contained in fsl_usart.c driver contains a race condition on transmit. The error will occur in systems that have non-blocking transmit and receive happening simultaneously.

Starting at line 588 of the file (function USART_TransferSendNonBlocking), the following is used to start a transfer:

handle->txData = xfer->data;
handle->txDataSize = xfer->dataSize;
handle->txDataSizeAll = xfer->dataSize;
handle->txState = (uint8_t)kUSART_TxBusy;
/* Enable transmiter interrupt. */
base->FIFOINTENSET |= USART_FIFOINTENSET_TXLVL_MASK;

If a USART interrupt (such as a receive interrupt, etc) happens between setting txDataSize, and enabling the interrupt, the ISR will execute. Inside the ISR, a boolean (sendEnable) is set based on txDataSize. The main 'while' loop that sends/receives data uses this flag in conjunction with the FIFOSTAT register to (correctly) transmit the requested byte, decrement txDataSize, and enable the completion interrupt (TX_IDLEEN).

The ISR then completes, and returns to USART_TransferSendNonBlocking that now enables the TXLVL interrupt.

When the TXLVL interrupt triggers, txDataSize is zero (because it was previously transmitted), and the ISR (as written) never clears it, because the sendEnable flag is never true. This causes the CPU to lock up servicing the USART IRQ. Depending on overall timing of my RTOS system, I've seen it lock up stuck with TXIDLEEN flag on (but txState not correctly set to kUSART_TxIdle) or with TXLVL interrupt enabled, but txDataSize == 0.

This whole mess comes about because the ISR uses multiple, non-atomic ways to determine what state the transmit engine is in (txDataSize, TXNOTFULL flag, txState). The correct way to sort this out would be to clean up the ISR to only move between states based on interrupt flags (which can be set/cleared atomically with the SET and CLR registers).

As a temporary fix, you can protect the critical lines above by wrapping them with:

regPrimask = DisableGlobalIRQ();
...
EnableGlobalIRQ(regPrimask);

Sabina_Bruce · ‎03-24-2020

Hello,

Could you please tell us the microcontroller you are using so we can assign it to the right person.

Thank you.

otsl · ‎03-25-2020

The microcontroller is the LPC55S69 (added in tags)

Sabina_Bruce · ‎03-26-2020

Thank you for providing the information on the microcontroller you are using as well as the behavior you are experiencing.

Could you provide an example that we can reproduce this bug you are reporting, on our end using the LPC55S69-EVK.

Also are you using the latest SDK version and IDE?

Best Regards,

Sabina

-----------------------------------------------------------------------------------------------------------------------

Note: If this post answers your question, please click the Correct Answer button. Thank you!

-----------------------------------------------------------------------------------------------------------------------

otsl · ‎03-27-2020

IDE v11.1.1 [Build 3241]

USART Driver version 2.1.1

I do not have a project for the EVK (the code is for a custom design). If you examine the code you can deduce the race condition. If you want to reproduce, you can create a simple project. Essentially, my project has the following key characteristics that I believe contribute to the issue:

- A slow main clock rate: 10MHz when the issue happens -- this provides for a longer opportunity for the critical section to be interrupted by an external source.

- The TX and RX portion of the UART are controlled by different tasks in FreeRTOS (although I don't believe this is a requirement).

- The baud rate is slow enough (9600bps) that the source code is essentially transmitting and receiving one byte a time (increasing the number of times the critical section is called).

Sabina_Bruce · ‎04-03-2020

Hello ,

Hope you are doing well.

I have reproduce the USARt example we provide using freertos. I've changed the parameters to work as you mention above. That is to say the main clock at 10 MHz, baudrate 9600 and two tasks for each send and receive function. However the behavior you describe does not occur. I would recommend to check the free_rtos example for usart in the SDK. In the example, you will find that the USART_TransferSendNonBlocking is wrapped by taking and giving semaphore. This has the effect that you are referring to, of disabling and enabling global interrupts.

Please let me know if this helps.

Best Regards,

Sabina

-----------------------------------------------------------------------------------------------------------------------

Note: If this post answers your question, please click the Correct Answer button. Thank you!

-----------------------------------------------------------------------------------------------------------------------

otsl · ‎04-03-2020

Actually those semphores do not have the effect I am looking for. SemaphoreGive and Take temporarily disable interrupts while they're executing. The semaphores correctly protect the uart from multiple tasks accessing the same UART. They do not protect against a task and an interrupt from accessing the UART variables at the same time.

There is a race condition in the code. Even without executing an example, if you carefully examine the code where I indicated you can see the opportunity for a race condition. I have observed it with my code (and verified with breakpoints that the TXLVL interrupt was enabled with DataSize of 0, causing continuous interrupts -- just as you would expect with careful examination of the code), and have 0% observed it when I made the indicated changes.

Due to the nature of race conditions that are spawned by external stimulus, you may have to tweak your external inputs to get the erroneous behavior. In my particular example, I had two tasks: one receiving characters, and one echoing the same characters back out one-at-a-time immediately after receiving them (I used a stream buffer to connect the two tasks, but that is irrelevant to the error). The external device generating the characters did so in small bursts (maybe 10-20 characters at a time) and had some inherent jitter between characters and bursts, and I had other interrupts executing at the same time on the microprocessor, reducing even more the deterministic behavior of the code. Even with these, It would sometimes take an hour for the issue to happen (after transferring 100's of thousands of bytes).

It doesn't surprise me that you may have to tweak your example to reproduce. It is a very particular fault. That is also why I so clearly explained the mechanism, so that you could deduce the race condition and see where the vulnerability lay without needing to execute code.

If you *must* reproduce it, you may have to get creative. You can change your driver to increase the likelihood of it happen (but not changing the basic functionality) by doing something like putting a NOP delay loop inside the critical section (just after the handle->txDataSize = xfer->dataSize line). This would not affect the functionality of the driver, just slow it down to increase the likelihood of an RX interrupt occurring during the critical part of the code. You may also use hardware to "cause" the error -- eg, you could toggle an I/O right before the NOP loop, and that I/O could trigger a separate device to transmit a character, guaranteeing you an RX interrupt in that location. For that matter, you could also try to generate a fake RX character received interrupt right after that line.

rongs · ‎05-28-2020

I'm seeing the same issue on a different chipset (i.MX RT CPU) with a similar driver. Your analysis and fix recommendation were simply spot-on. Thank you!

Before jumping into a major overhaul of the interrupt handler, I did a quick change as below, which seems to fix the continuous TXLVL interrupt issue for me. Would you be able to give it a quick try to see if it fixes the other type of lockup issue you were seeing (stuck with TXIDLEEN flag on)?

--- a/drivers/fsl_usart.c
+++ b/drivers/fsl_usart.c
@@ -953,7 +953,7 @@ void USART_TransferHandleIRQ(USART_Type *base, usart_handle_t *handle)
assert((NULL != base) && (NULL != handle));

bool receiveEnabled = ((handle->rxDataSize != 0U) || (handle->rxRingBuffer != NULL));
- bool sendEnabled = (handle->txDataSize != 0U);
+ bool sendEnabled = (handle->txDataSize != 0U) && (base->FIFOINTENSET & USART_FIFOINTENSET_TXLVL_MASK);

Also I think the SPI driver (fsl_spi.c) is done in a similar way, so I wouldn't be surprised that it has the same race condition issue.

otsl · ‎05-28-2020

Hi Rong.

(Disclaimer -- this project has been complete for us for about 2 months, so I'm going a bit from memory here).

When I looked at fixing this bug, I considered doing something similar to what you did. I'm not sure if the drivers are exactly the same, but in my particular instance the fix you suggest would never properly disable the TXLVL interrupt. It's quite possible that the drivers have diverged for our two parts, and your driver is written a little more robustly (although it appears not robustly enough)....

Here's why it wouldn't fix my issue:

In the ISR, there are two conditional blocks in my driver that handle the sending functions. The first is below. Since sendEnable in your fix is 0, this block will never execute:

if (sendEnabled && ((base->FIFOSTAT & USART_FIFOSTAT_TXNOTFULL_MASK) != 0U))
{
   base->FIFOWR = *handle->txData;
   handle->txDataSize--;
   handle->txData++;
   sendEnabled = handle->txDataSize != 0U;
   if (!sendEnabled)
   {
      base->FIFOINTENCLR = USART_FIFOINTENCLR_TXLVL_MASK;
      handle->txState = (uint8_t)kUSART_TxIdle;

base->INTENSET |= USART_INTENSET_TXIDLEEN_MASK;
}
}

The second is just below it. This one requires handle->txState to be kUSART_TxIdle, however txState is set to kUSART_TxBusy during the race condition, so this block never gets executed either.

if ((0U != (base->INTENSET & USART_INTENSET_TXIDLEEN_MASK)) &&
   (0U != (base->INTSTAT & USART_INTSTAT_TXIDLE_MASK)) && (handle->txState ==    (uint8_t)kUSART_TxIdle))
{
   /* Disable tx idle interrupt */
   base->INTENCLR |= USART_INTENCLR_TXIDLECLR_MASK;
   /* Trigger callback. */
   if (handle->callback != NULL)
   {
      handle->callback(base, handle, kStatus_USART_TxIdle, handle->userData);
   }
}

This all means that in my driver, the interrupt flags would remain enabled with your fix, and therefore the processor would remain locked-up.

rongs · ‎05-28-2020

Hi otsl,

There are some minor differences between my and your driver: in the first block, my driver doesn't modify txState, and the if-statement of the 2nd block doesn't require txState to be IDLE. Seems mine driver is a newer version as txState should only be modified in the 2nd block where callback is called.

With these two differences in mind, I think my fix might work (I'm yet to fully test it) and it's much light-weighted as it doesn't require disabling interrupts globally.

Sabina_Bruce · ‎05-28-2020

Hope you are well. I apologize for the delayed response, there was a mixup and I misplaced the thread of your post. A ticket was created regarding this situation with our SDK team. They are currently working on it and hopefully will be fixed in the upcoming release in June. Any further updates I will let you know through this medium.

Best Regards,

Sabina