LPUART driver bug

kevin_haake · ‎08-10-2020

There is a UART driver issue that is present in SDK_2.7.0. While I have not tested it, I believe it is still present in SDK_2.8. I am using a MKE14F512VLL16. In my case I am also using FreeRTOS and have one task handling sending data out the UART and another task reading data from the UART, concurrently.

Things usually appear to work fine. However, we have witnessed a small number of mysterious device lockups and so I was tasked with looking into it. What I found was that in this bad state, the device would repeatedly receive a Tx interrupt. Inside of LPUART_TransferHandleIRQ() , the kLPUART_TxDataRegEmptyFlag would be set in STAT, and the kLPUART_TxDataRegEmptyInterruptEnable flag would be set in CTRL. However, the handle->txDataSize field would unexpectedly be zero. This caused an endless interrupt loop.

I believe the root cause of this problem is a small window of opportunity where a new Rx request would be made by the application (via LPUART_TransferReceiveNonBlocking()). Among other things, this request calls LPUART_DisableInterrupts() and LPUART_EnableInterrupts() to clear/set various interrupt flags. Normally this is fine. However, there is a problem in the following scenario:

1. IF within LPUART_EnableInterrupts (for example), the following line starts to be executed:
base->CTRL |= mask;

2. A Tx interrupt immediately occurs causing the last byte of a transfer to be written out. Note that in this case, the LPUART_TransferHandleIRQ() will do the following:
base->CTRL = (base->CTRL & ~LPUART_CTRL_TIE_MASK);
which (correctly) clears the Tx interrupt flag.

But, because the operation in #1 is not atomic (*), if the interrupt occurred after the read instruction of the read/modify/write instruction sequence in #1, but before the write instruction, when the ISR returns and the application call continues within LPUART_EnableInterrupts(), it will write stale data back to CTRL, with the Tx interrupt flag still set. This causes the lockup.

There might be a more elegant way to fix this, but one fix for this case is to disable the UART IRQs at the beginning of LPUART_DisableInterrupts() and LPUART_EnableInterrupts() and then reenable them before the function returns. For example:

void LPUART_EnableInterrupts(LPUART_Type *base, uint32_t mask)

{

// First NEW CODE block start

uint32_t instance = LPUART_GetInstance(base);

#if defined(FSL_FEATURE_LPUART_HAS_SEPARATE_RX_TX_IRQ) && FSL_FEATURE_LPUART_HAS_SEPARATE_RX_TX_IRQ

DisableIRQ(s_lpuartTxIRQ[instance]);

DisableIRQ(s_lpuartRxIRQ[instance]);

#else

DisableIRQ(s_lpuartIRQ[instance]);

#endif

// First NEW CODE block end

base->BAUD |= ((mask << 8U) & (LPUART_BAUD_LBKDIE_MASK | LPUART_BAUD_RXEDGIE_MASK));

#if defined(FSL_FEATURE_LPUART_HAS_FIFO) && FSL_FEATURE_LPUART_HAS_FIFO

base->FIFO = (base->FIFO & ~(LPUART_FIFO_TXOF_MASK | LPUART_FIFO_RXUF_MASK)) |

((mask << 8U) & (LPUART_FIFO_TXOFE_MASK | LPUART_FIFO_RXUFE_MASK));

#endif

mask &= 0xFFFFFF00U;

base->CTRL |= mask;

// Second NEW CODE block start

#if defined(FSL_FEATURE_LPUART_HAS_SEPARATE_RX_TX_IRQ) && FSL_FEATURE_LPUART_HAS_SEPARATE_RX_TX_IRQ

EnableIRQ(s_lpuartRxIRQ[instance]);

EnableIRQ(s_lpuartTxIRQ[instance]);

#else

EnableIRQ(s_lpuartIRQ[instance]);

#endif

// Second NEW CODE block end

}

-------------------------------------------------------------------------------------

(*) The disassembly of base->CTRL |= mask shows the following (non-atomic):

00066a58: ldr r3, [r7, #4]

00066a5a: ldr r2, [r3, #24] // copies base->CTRL into temporary register

00066a5c: ldr r3, [r7, #0] // WINDOW OF OPPORTUNITY exists here if interrupt

00066a5e: orrs r2, r3 // occurs and ISR modifies CTRL flag. If it does,

00066a60: ldr r3, [r7, #4] // our CTRL copy will now be stale!

00066a62: str r2, [r3, #24] // writes potentially stale result of OR back to base->CTRL,

666 } // potentially losing ISR change

alexmarin · ‎11-17-2020

Any updates on this? We've hit the same bug on SDK 2.8.0.

We're communicating with our ble chip via lpuart, sending data from one task and receiving from another.

Randomly we hit the exact lockup described in this topic.

kevin_haake · ‎11-17-2020

I haven't had time to put together an example project that reproduces the problem, but I can confirm it still exists. The 'fix' I suggested above seems to work in my use case: LEVEL_1 interrupts should be disabled in LPUART_EnableInterrupts() and LPUART_DisableInterrupts() before modifying base->CTRL, and then the LEVEL_1 interrupts can be reenabled before these functions return. (via DisableIRQ() and EnableIRQ()).

One thing that I didn't outline above is that the FreeRTOS driver functions LPUART_RTOS_Send() and LPUART_RTOS_Receive() should also be modified so that the send and receive non-blocking calls occur in a critical section, like the following:

taskENTER_CRITICAL();
LPUART_TransferSendNonBlocking(handle->base, handle->t_state, &handle->txTransfer);
taskEXIT_CRITICAL();

and

taskENTER_CRITICAL();
LPUART_TransferReceiveNonBlocking(handle->base, handle->t_state, &handle->rxTransfer, &n);
taskEXIT_CRITICAL();

This prevents the LEVEL_1 interrupts being disabled for a potentially long time due to a low priority task doing the disabling and then getting preempted by a higher priority task before reenabling.

The reason the LEVEL_1 interrupts need to be disabled before modifying base->CTRL is that LPUART_TransferHandleIRQ() also modifies base->CTRL, and this modification is not an atomic operation. This can cause an interrupt flag to get stuck on (ex: LPUART_CTRL_TIE).

alexmarin · ‎02-05-2021

Hello,

This was fixed by the SDK team, it should be released in SDK 2.10.

Kind regards,

Alex

kevin_haake · ‎02-05-2021

Super - thank you.

danielchen · ‎09-09-2020

could you please send me your simple project for me to reproduce?

Regards

Daniel

kevin_haake · ‎09-10-2020

My current project is too complex to split up easily, but I'll try to put something together for you that exhibits the issue and post it.

I could be wrong, but I think as long as you have two freeRTOS tasks, one reading a byte at a time from a UART (that has data being written to it) and the other writing to the same UART, repeatedly, the problem will eventually show up.

In my case, one of my tasks is repeatedly calling LPUART_RTOS_Receive() and the other is repeatedly calling LPUART_RTOS_Send().

Note that LPUART_RTOS_Receive() eventually calls LPUART_TransferReceiveNonBlocking() and waits for the transfer to finish. The latter calls both LPUART_DisableInterrupts and LPUART_EnableInterrupts while it is setting up the transfer. (I am using a 32 byte ring buffer).

LPUART_RTOS_Send() eventually calls LPUART_TransferSendNonBlocking() and waits for the transfer to finish.

The trick is that the start of a Rx transfer needs to be initiated immediately before a previously started Tx transfer finishes. The latter has to interrupt the Rx task and finish the transfer (i.e. modify base->CTRL) just as the former is in the middle of setting base->CTRL inside of LPUART_DisableInterrupts() or LPUART_EnableInterrupts(). This window is small, but it exists.

danielchen · ‎09-02-2020

Hi Cavin:

Thank you very much for your sharing. I will try to reproduce your issue on my side.

Regards

Daniel

kevin_haake · ‎09-08-2020

Thank you, Daniel. For what it is worth, I was able to able to reproduce the issue somewhat reliably by simply increasing the frequency of the reads and writes to multiple transfers a second. Even so, this would produce a lockup only once every 3 to 4 hours or so on average.

In my case, the transmit transfers were generally short (generally less than 100 bytes each), and the read transfers were only a single byte at a time. (Many more read transfers than write transfers).

LPUART driver bug

LPUART driver bug

General