UART gets FIFO underflow in DMA mode

scottm
Senior Contributor II

I just lost a few hours to this quirk and I'm hoping someone can explain it to me so I don't get bitten by this again - or at least maybe I'll save someone else some trouble if they come looking for an answer.

I'm receiving packetized data on an MK22FN1M0's UART0 at 2 Mbps.  As discussed in previous threads, the IDLE interrupt is unusable in DMA mode (and not much use in general) because it can't be cleared safely.  To handle high-speed data safely with a minimum of interrupts while keeping latency low, I set up a DMA channel that writes incoming data continuously into a circular buffer.
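
For concreteness, the setup looks roughly like this (a sketch rather than my actual code - the channel number, buffer size, and DMAMUX source number are placeholders to check against the K22 reference manual; names are from the Kinetis CMSIS headers):

    #define RX_DMA_CH            0
    #define RX_BUF_SIZE          1024                 /* power of two for modulo addressing */
    #define DMAMUX_SRC_UART0_RX  2                    /* assumed request source number */

    static uint8_t rxBuf[RX_BUF_SIZE] __attribute__((aligned(RX_BUF_SIZE)));

    static void rx_dma_init(void)
    {
        /* route UART0 Rx requests to the channel */
        DMAMUX->CHCFG[RX_DMA_CH] = DMAMUX_CHCFG_ENBL_MASK |
                                   DMAMUX_CHCFG_SOURCE(DMAMUX_SRC_UART0_RX);

        DMA0->TCD[RX_DMA_CH].SADDR = (uint32_t)&UART0->D;    /* fixed source */
        DMA0->TCD[RX_DMA_CH].SOFF  = 0;
        /* 8-bit transfers; DMOD wraps DADDR on the buffer-size boundary */
        DMA0->TCD[RX_DMA_CH].ATTR  = DMA_ATTR_SSIZE(0) | DMA_ATTR_DSIZE(0) |
                                     DMA_ATTR_DMOD(10);      /* 2^10 = 1024 */
        DMA0->TCD[RX_DMA_CH].NBYTES_MLNO = 1;                /* one byte per request */
        DMA0->TCD[RX_DMA_CH].SLAST = 0;
        DMA0->TCD[RX_DMA_CH].DADDR = (uint32_t)rxBuf;
        DMA0->TCD[RX_DMA_CH].DOFF  = 1;
        DMA0->TCD[RX_DMA_CH].CITER_ELINKNO = 4;              /* idle state: 4-byte header */
        DMA0->TCD[RX_DMA_CH].BITER_ELINKNO = 4;
        DMA0->TCD[RX_DMA_CH].DLAST_SGA = 0;                  /* modulo handles the wrap */
        DMA0->TCD[RX_DMA_CH].CSR = DMA_CSR_INTMAJOR_MASK;    /* 'done' per packet */

        UART0->C5 |= UART_C5_RDMAS_MASK;                     /* RDRF generates DMA requests */
        UART0->C2 |= UART_C2_RIE_MASK;
        DMA0->SERQ = DMA_SERQ_SERQ(RX_DMA_CH);               /* enable hardware requests */
    }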

The major loop counter is used only to generate 'done' signals for each received packet - thankfully with this protocol the application always knows how much data to expect - and the transfer runs continuously regardless.  The new major loop iteration count is set at the start of each packet (if the expected packet is not entirely in the buffer already) and has to take into account the number of bytes already received.

To do this safely, following instructions I found here, I have it disable ERQ for the channel and wait for the ACTIVE flag to clear before reading DADDR to find the current position.  It then sets CITER and BITER, re-enables INTMAJOR, and finally sets ERQ again.
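
In code form the update sequence is roughly this (a sketch; packet_bytes_remaining() stands in for the protocol-specific length calculation, and RX_DMA_CH/rxBuf are from the setup above):

    DMA0->CERQ = DMA_CERQ_CERQ(RX_DMA_CH);              /* stop new requests */
    while (DMA0->TCD[RX_DMA_CH].CSR & DMA_CSR_ACTIVE_MASK) {
        /* wait for the in-flight minor loop (1 byte) to finish */
    }
    uint32_t pos = DMA0->TCD[RX_DMA_CH].DADDR;          /* current write position */
    uint16_t remaining = packet_bytes_remaining(pos);   /* placeholder: expected length minus bytes already received */

    DMA0->TCD[RX_DMA_CH].CITER_ELINKNO = remaining;
    DMA0->TCD[RX_DMA_CH].BITER_ELINKNO = remaining;
    DMA0->TCD[RX_DMA_CH].CSR |= DMA_CSR_INTMAJOR_MASK;  /* interrupt on completion */
    DMA0->SERQ = DMA_SERQ_SERQ(RX_DMA_CH);              /* re-enable - this is where it goes wrong */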

The problem with this is that there's apparently some kind of race condition in the DMA hardware.  If a new byte came into the UART FIFO before ERQ was re-enabled, it would (at least sometimes) cause a FIFO underflow error.  As far as I know, DMA operation should never cause an underflow.

I tried disabling RIE in the UART so it wouldn't generate DMA requests, I tried disabling the request in the DMAMUX, and I even tried polling S1 for RDRF and reading pending bytes out before restarting the DMA channel, and none of that worked.

What did finally work (so far, anyway - I'll need to do more testing) was to set HALT in DMA_CR before clearing ERQ, and then clear HALT again before setting ERQ.
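
So the working order of operations, roughly:

    DMA0->CR |= DMA_CR_HALT_MASK;            /* stall the eDMA engine first */
    DMA0->CERQ = DMA_CERQ_CERQ(RX_DMA_CH);   /* then disable the request */
    /* ... wait for ACTIVE, read DADDR, update CITER/BITER as before ... */
    DMA0->CR &= ~DMA_CR_HALT_MASK;           /* release the engine */
    DMA0->SERQ = DMA_SERQ_SERQ(RX_DMA_CH);   /* and finally re-enable the request */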

What's going on here?  Is this behavior documented somewhere?

Thanks,

Scott

mjbcswitzerland
Specialist V

Hi Scott

I never stop the DMA; I use DMA_TCD_CITER_ELINK to calculate the Rx progress instead.  In a product operating multiple UARTs at 2 Mb/s (in production in large quantities since 2013) there is no known issue in doing this - where did you read the requirement to disable ERQ?

Regards

Mark

scottm
Senior Contributor II

Hi Mark,

The point is to avoid a race condition where the DMA channel receives another byte between the time the progress is checked and when the new major loop count is set.  I'm using DADDR rather than CITER because it's possible multiple major loops may have finished since it was last serviced.

For example, in the idle state it's expecting a 4-byte header and BITER is 4.  The incoming packet could be hundreds of bytes, and if it doesn't get around to checking CITER within 20 microseconds the packet could have looped the counter multiple times.

But whether it's using DADDR or CITER, it seems to me that either way it needs to be sure the count doesn't change during the update.  I'm doing that by disabling the request, waiting for the minor loop (of 1 byte) to finish, and then reading and updating.  That's done in a byte time or less, so the UART FIFO has plenty of space.
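
In other words, the position only depends on DADDR (a sketch, assuming a power-of-two buffer so the arithmetic can be masked; rdIndex is wherever the application has consumed up to):

    /* DADDR always points at the next write position no matter how many
       major loops have completed, so the count can't alias. */
    uint32_t wrIndex = (DMA0->TCD[RX_DMA_CH].DADDR - (uint32_t)rxBuf)
                       & (RX_BUF_SIZE - 1);
    uint32_t fill    = (wrIndex - rdIndex) & (RX_BUF_SIZE - 1);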

The count needs to be exact because if BITER is set a byte too high (e.g., it didn't take into account a byte received after the receive count was checked) the transfer might never complete and would have to be cleaned up with a periodic interrupt.

How do you deal with the potential for a race condition in your code?

Scott

mjbcswitzerland
Specialist V

Scott

I just let the Rx DMA free-run all the time, and then there are no race conditions.  The code polls the state in the idle task and handles any reception that it happens to find.

Simplified (polled):

    static unsigned long ulDMA_progress = 0;                // CITER value at the last check
    static unsigned long charsWaiting = 0;                  // bytes available to the higher level

    unsigned long ulDMA_rx = ptrDMA_TCD->DMA_TCD_CITER_ELINK; // snap-shot of DMA reception progress
    if (ulDMA_progress >= ulDMA_rx) {                       // linear progress (or nothing) in the meantime
        charsWaiting += (ulDMA_progress - ulDMA_rx);        // the extra number of characters received by DMA since the last check
    }
    else {                                                  // CITER reloaded from BITER in the meantime
        charsWaiting += ulDMA_progress;                     // characters up to the reload point
        charsWaiting += (ptrDMA_TCD->DMA_TCD_BITER_ELINK - ulDMA_rx); // plus those received since the reload
    }
    ulDMA_progress = ulDMA_rx;                              // remember the check state

charsWaiting is decremented when the higher level reads waiting data.

This is general and doesn't adapt the DMA to match a protocol, because that introduces race conditions that then need to be handled specifically, as you are doing.  Instead I just give the Rx DMA buffer lots of memory so that there is no risk of losing reception data, and the higher levels can handle it as they wish (without any urgency).

Regards

Mark

scottm
Senior Contributor II

This is essentially how I've been doing it, but to get the performance I need, the general solution isn't going to work.  It doesn't let me interrupt a low-priority task to handle a packet right away, and it precludes the use of a low-power idle state.

I'm also working on making this as close to a zero-copy implementation as I can, and there are other quirks to deal with.  Incoming event packets (things like incoming TCP data) are sent by pointer to a queue that processes them in sequence, but tasks sending data have to block until they get an ACK response back, so the serial handler handles those separately rather than enqueueing them with the events.

It's not completely zero-copy right now because packets can wrap around the circular buffer; in that case they're copied to a linear buffer first.  The only really large ones are particular oversize UDP packets handled by one module, and as another application-specific optimization I'll probably rewrite that part to work with the circular buffer and skip copying them out, since the useful information in them is a small fraction of the packet size.

I figure I can also use another DMA channel with the source modulo configured the same as the RX buffer to do efficient memory-to-memory transfers without having to explicitly deal with the wraparound when I do have to copy packets.
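
Something like this, assuming the same power-of-two-aligned buffer (SMOD in the TCD ATTR field handles the source wraparound, and SSRT starts a software-initiated transfer; the channel number is a placeholder):

    #define COPY_DMA_CH  1    /* arbitrary free channel */

    /* copy len bytes out of the circular Rx buffer starting at offset,
       letting source-modulo addressing handle the wrap */
    static void rx_copy(uint8_t *dst, uint32_t offset, uint32_t len)
    {
        DMA0->TCD[COPY_DMA_CH].SADDR = (uint32_t)rxBuf + offset;
        DMA0->TCD[COPY_DMA_CH].SOFF  = 1;
        DMA0->TCD[COPY_DMA_CH].ATTR  = DMA_ATTR_SMOD(10) |   /* 2^10 = buffer size */
                                       DMA_ATTR_SSIZE(0) | DMA_ATTR_DSIZE(0);
        DMA0->TCD[COPY_DMA_CH].NBYTES_MLNO = len;            /* one minor loop does it all */
        DMA0->TCD[COPY_DMA_CH].SLAST = 0;
        DMA0->TCD[COPY_DMA_CH].DADDR = (uint32_t)dst;
        DMA0->TCD[COPY_DMA_CH].DOFF  = 1;
        DMA0->TCD[COPY_DMA_CH].CITER_ELINKNO = 1;
        DMA0->TCD[COPY_DMA_CH].BITER_ELINKNO = 1;
        DMA0->TCD[COPY_DMA_CH].DLAST_SGA = 0;
        DMA0->TCD[COPY_DMA_CH].CSR = 0;
        DMA0->SSRT = DMA_SSRT_SSRT(COPY_DMA_CH);             /* software start */
        while (!(DMA0->TCD[COPY_DMA_CH].CSR & DMA_CSR_DONE_MASK)) {
            /* busy-wait; could also take an interrupt instead */
        }
    }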

The real challenge is the requirement to wait for ACKs.  The system might get an HTTP request that requires sending back a bunch of data, each chunk of which gets an ACK - and meanwhile, requests can keep coming in.  Even with the ACKs extracted and dispatched immediately, the unprocessed requests pile up until the buffer fills and RTS asserts, and then there's no way to see the ACKs, so it falls back to a timeout, which kills the performance.

That's something inherent in the protocol and I don't think there's any 100% solution without unlimited RAM - I just have to process things as fast as possible, make the most efficient possible use of buffer space, and maybe defer sending of outbound packets that might generate lots of responses.

Scott

mjbcswitzerland
Specialist V

Scott

Yes, I disable low-power operation when using a high speed UART and Rx DMA.  However, it doesn't make a huge difference, since one can only use WAIT mode together with a high speed UART anyway - it takes too long to wake from VLPS, for example.  [The mentioned reference product also has dual Ethernet, so it is not the lowest-power HW either.]

If I needed to do this (support a low-power state) I would look into using an edge interrupt on the UART input to wake it (from WAIT) - noting that it is the pending interrupt that wakes the part, so it is not necessary to actually service it.  This would allow the idle task to immediately service newly arrived UART data out of the WAIT state.
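
Roughly like this (a sketch - PTD6 is just an example mapping for UART0 RX, and the port clock is assumed to be enabled already):

    /* wake from WAIT on a falling edge (start bit) on the Rx pin; the
       pending interrupt is what wakes the core, so the handler only has
       to clear the flag */
    static void rx_wake_init(void)
    {
        PORTD->PCR[6] |= PORT_PCR_IRQC(0xA);     /* 0xA = interrupt on falling edge */
        NVIC_EnableIRQ(PORTD_IRQn);
    }

    void PORTD_IRQHandler(void)
    {
        PORTD->PCR[6] |= PORT_PCR_ISF_MASK;      /* write-1-to-clear the flag */
    }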

Your case is application-specific, so you are obviously optimising a certain case instead of providing a general solution for a variety of requirements (which wouldn't achieve your level of customised performance).

Regards

Mark

scottm
Senior Contributor II

A general solution would be so much easier if the IDLE interrupt just worked in a usable way!  If you know you can use flow control to guarantee no incoming data for some microseconds, you can clear IDLE then, and only disable ILIE in the ISR.  That at least gives you a quick wake-up from long idle periods.
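
i.e. something like this (a sketch; IDLE is cleared by reading S1 with IDLE set and then reading D, and the dummy read of D on an empty FIFO can set RXUF in SFIFO, so that gets cleared too):

    /* called only while flow control guarantees a quiet window, so the
       dummy read of D can't steal a byte from the DMA stream */
    static void idle_rearm(void)
    {
        if (UART0->S1 & UART_S1_IDLE_MASK) {
            (void)UART0->D;                          /* S1-then-D read sequence clears IDLE */
            UART0->SFIFO = UART_SFIFO_RXUF_MASK;     /* clear any resulting underflow flag */
        }
        UART0->C2 |= UART_C2_ILIE_MASK;              /* re-arm the idle interrupt */
    }

    void UART0_RX_TX_IRQHandler(void)
    {
        if ((UART0->C2 & UART_C2_ILIE_MASK) && (UART0->S1 & UART_S1_IDLE_MASK)) {
            UART0->C2 &= ~UART_C2_ILIE_MASK;         /* disable only - don't touch D here */
            /* wake the handler task to drain the DMA buffer */
        }
    }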

My code is working well now.  Next project is SPI flash optimization.  DMA's working there, and with a bit of extra logic I ought to be able to have the DMA transaction continue after the requested LBA has been received and keep loading the entire sector into cache in anticipation of the next read request.

I don't actually need a lot of power-saving optimization on the current projects this code is going into - both control power-hungry components, and even shutting off the MCU entirely wouldn't make a huge difference - but this platform is probably going to be the basis of my next 5 years or more of designs and I'm trying to make it as efficient as possible.

Thanks,

Scott
