iMX6sx UART FiFo overrun

marco_reppenhag
Contributor II

Hi everyone!

We're facing a performance problem while using iMX6SX (ARM CPU only):

During the development of our custom board based on i.MX6SX under LINUX we tripped over a "bug" we have to worry about:

First, let me explain our general conditions:

We are using up to four UARTs as asynchronous interfaces (RS485 full duplex, without RTS/CTS) under Linux (linux-yocto-3.19-3.19 with some backports from 3.19.5).

According to the fact sheet, the ARM core of the i.MX6SX should be capable of handling UART speeds up to a few megabaud, so we planned to stream our data into it at 500 kBaud.

Depending on the bandwidth used, we do not get anywhere near this speed under Linux.

Now let me explain the problem:

We are using a terminal on one UART to communicate with the Linux system without any error at speeds of over 460800 Baud.

So "some" characters transmitted at very high speed over a UART do not seem to be a problem.

On the other hand, if we try to stream a lot of characters at high speed, the FIFO overruns and data gets lost.

First, we noticed that the UART driver did not provide SDMA compatibility for the i.MX6 family.

So we started without DMA, and the performance was far too low.

After fixing this bug, we are now able to use DMA in the usual way, and the performance has improved:

Now we can stream at up to 115200 Baud, tested with 3 UARTs simultaneously at maximum bandwidth, without any error. So far, so good.

But our goal is out of reach:

If we switch to 230400 Baud, we get significant error rates, even if only a few percent of the bandwidth is used and only one UART is connected.

So we had to investigate what is happening:

First, we found that all errors are "FIFO OVERRUNS". This means that the CPU is not able to drain the FIFO before it overflows.

Well, the system is mostly idle, no interrupts get lost and the schedulers are working without any sign of strain.

Using "polling" without DMA makes it worse. Playing around with watermark and burst level has an effect, but not a sufficient one...

So we tried a lot to fix this:

We optimized the FIFO threshold and the size of the DMA blocks, tuned IOSCHEDULE, turned off everything but the UART handling... and cut the driver down to its rudimentary task. We found out that even the one necessary step, copying the FIFO via DMA into DDR RAM, is too slow to run a UART at 230400 Baud or above error-free with more bandwidth than a console UART uses.

No matter what we tried to fix it: this bug still remains.

So: are we really facing the bottleneck of the bus while using a UART without CTS/RTS functionality?

Or is there someone out there, who solved such a problem on his board?

Thanks a lot!

Marco

13 Replies

oleksiy_slyshyk
Contributor I

Hi all.

I am encountering the same problem, but on the i.MX6ULL. Has anything changed since 2016?

marco_reppenhag
Contributor II

Well, in our scenario we do not use two i.MX6SX... only one, plus an external device based on a common DSP or similar.

But I will check the kernel from the recommended BSP as soon as possible...

(My guess: you are using TWO i.MX6SX, so the transmitter sends data the same way the receiver receives it.

The UART TX is (like RX) connected the same way, with the same FIFO buffer size and arbitration:

If you use two UARTs which work identically, one sends at most 32 bytes out of its FIFO and then waits for another scheduling time slice before it transmits the next chunk. THIS time is enough for RX to empty its FIFO. The chunk speed will be a few MBit/s, but the overall speed is reduced by the pauses. This is what I call "relative speed". It's the same with loopback performance checks: we do not get any problems during our checks with loopback devices, but we also tested with an oscilloscope: there are huge gaps during transmission from other peripherals. (My obsolete peripheral is also not able to stream the "full" bandwidth; it transmits a chunk of 16 bytes, then a pause, then another chunk. The TX FIFO has to be refilled as well, and there are fewer problems when the TX and RX FIFOs work the same way.)

But in our scenario the peripheral is able to stream without much waiting... it permanently uses over 90% of the bandwidth, so the "absolute speed" will be 230400.

Have you tried to transmit 1 Mbit/s from a PC UART to the i.MX6SX?

So I would like to distinguish between "REAL" speed and "RELATIVE" speed. If pauses are caused by HW flow control, SW flow control or the scheduled filling of the TX FIFO, the absolute speed will not reach the full bandwidth. But with a peripheral which is able to send chunks "nearly gapless" (real streaming), the requirement for the receiver is: empty the FIFO in the time slice between two chunks, which may be quite small. So I checked this with an oscilloscope, measuring the data stream. Our obsolete peripheral causes gaps of up to milliseconds between the "FIFO-sized" chunks, which are transmitted at 500 kBaud. So if you are interested in finding out what's going on: do some measurements, it's very interesting :-)  )
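
The difference between "relative" and "real" speed can be put into numbers with a small Python sketch. It assumes 10 bits per character (8N1 framing, which the post does not state); the function names are made up for illustration, and the 1 ms gap is only an example value.

```python
def chunk_time(baud, chunk_bytes, bits_per_char=10):
    """Seconds needed to send one chunk back to back (assumes 8N1 = 10 bits/char)."""
    return chunk_bytes * bits_per_char / baud


def line_utilization(baud, chunk_bytes, gap_s, bits_per_char=10):
    """Fraction of time the line is busy when each chunk is followed by an idle gap."""
    busy = chunk_time(baud, chunk_bytes, bits_per_char)
    return busy / (busy + gap_s)


def average_bytes_per_second(baud, chunk_bytes, gap_s, bits_per_char=10):
    """Average ("relative") payload rate that the receiver has to sustain."""
    busy = chunk_time(baud, chunk_bytes, bits_per_char)
    return chunk_bytes / (busy + gap_s)


if __name__ == "__main__":
    # Example: FIFO-sized chunks at 500 kBaud, gapless versus a 1 ms pause.
    for gap in (0.0, 0.001):
        util = line_utilization(500_000, 32, gap)
        rate = average_bytes_per_second(500_000, 32, gap)
        print(f"gap={gap * 1e3:.1f} ms  utilization={util:.0%}  rate={rate:.0f} B/s")
```

With a 1 ms pause after every 32-byte chunk, the line is busy only about 39% of the time at 500 kBaud, so the receiver gets a break after every chunk. A gapless sender gives it no such break.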

carlohwang
Contributor I

Hi! I found the same problem on my i.MX6DL custom board, which is based on the SabreSD. I have written a UART test program that receives 30 bytes from another board every 2 ms. If I turn on the debug info (which is displayed on tty0), the Linux kernel reports "rx fifo overflow" every 1 s. When I remove the debug info, the kernel no longer reports the overflow. So I think the Linux kernel can't process the RX/TX data in real time. Maybe adding RTS/CTS is the only way to resolve this problem!

DuanFugang
NXP Employee

hi, Marco,

The test setup is two boards connected directly via a UART port: one board sends raw data to the other, and the other board receives the data and double-checks its validity. No HW flow control is enabled.
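
The post does not say how the receiving board checks validity. One common approach, shown here only as an assumption, is to send an incrementing byte counter and count the gaps on the receiver. A minimal Python sketch with a hypothetical helper name:

```python
def count_missing_bytes(received: bytes) -> int:
    """Estimate dropped bytes in a stream where each byte should be previous + 1 (mod 256).

    Assumption: the sender transmits 0, 1, 2, ..., 255, 0, 1, ... without pause.
    A loss of an exact multiple of 256 bytes cannot be detected this way.
    """
    missing = 0
    for prev, cur in zip(received, received[1:]):
        # Distance from the expected next value, wrapped into 0..255.
        missing += (cur - prev - 1) % 256
    return missing


if __name__ == "__main__":
    pattern = bytes(i % 256 for i in range(600))
    damaged = pattern[:100] + pattern[105:]  # simulate a 5-byte overrun loss
    print(count_missing_bytes(pattern), count_missing_bytes(damaged))
```

Counting gaps like this shows whether data loss comes in FIFO-sized bursts, which would point to overruns rather than line noise.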

Regards,

Andy

marco_reppenhag
Contributor II

Hi Andy

HOW did you test the transfer?

(Raw data in streams of a few MByte? Or in intervals... what does your "stress test" consist of? Frame conditions? SW flow control?)

Again: we are not able to use HW or SW flow control.

The explanation is simple: the peripheral we are connecting is not able to handle SW/HW flow control.

(Until now a small ATX-Mega has done the job well, but not really comfortably. So we decided to use Linux on an ARM...)

I will take a look at the diff between the BSP kernel mentioned above and my kernel... merge the relevant diffs and test again.

(Give me a few days to do this... "I'll be back" :-)  )

Thanks: Marco.

DuanFugang
NXP Employee

I double-checked DMA mode without HW flow control on the i.MX6SX SDB board (two boards connected directly via the UART5 port, one sending, one receiving). It works fine at 1M/2M/3M/4M bps baud rates, with no FIFO overrun.

Regards,

Andy

DuanFugang
NXP Employee

hi,

We ran the UART function/stress tests on i.MX6SX: CPU and DMA mode pass at 115200/1M/2M/3M/4M bps baud rates with HW flow control.

FSL BSP versions you can try: 3.10.53 GA, 3.14.52 GA.

(The FSL BSP UART/DMA driver and DMA firmware have changes and improvements compared with the community version.)

We always suggest enabling HW flow control to avoid data loss/FIFO overrun in CPU/DMA mode.

Without HW flow control, I suggest using DMA mode, which is better. I will double-check this case on the FSL BSP today.

Regards,

Andy

marco_reppenhag
Contributor II

Hi Andy...

Let me ask you a question first:

Have you tried asynchronous higher-speed transfer (at least 230400 Baud, no HW/SW flow control, async... full-fire transfer) on any UART other than UART1, with any kind of i.MX6SX board and any kind of Linux running on top?

If so, and you do not get multiple RX errors, let me know which board it is, or which BSP you are using to make this possible.

I do not want to discuss BSP details, because we use our own build environment, our own root filesystem, bootloader, etc...

What we have in common is the kernel, which provides the hardware abstraction. In my case: linux-yocto-3.19-3.19 (out of the Yocto repository FSL pointed me to), built with the Ubuntu/Linaro 4.7.3-12ubuntu1 toolchain out of the official LTS Ubuntu repository. (I also tried 3.19.5 and earlier...)

I did some experiments, stripped the driver "imx.c" down to its rudimentary elements and found out: the first stage, copying from the peripheral into DDR memory (with or without DMA), takes too long, and exactly this is the problem I have to solve... and once again: this problem is located in the kernel driver's DMA or IRQ path, or in the physics of the i.MX6SX...

As mentioned above (thank you, Rodrigue!):

Maybe using FIQ can help to lower latency. Maybe special SDMA firmware can help to speed up transfers using "burst mode"... MAYBE!

But all of this is useless if the i.MX6SX is not able to provide the speed because arbitration and bus management are the bottleneck... In that case I want to know before I spend a lot of time.

My goal is not to do the same thing in just another way... I have to eliminate a show stopper.

So: is it possible to drive UART2..5 asynchronously at a speed of at least 230400 (high-bandwidth streaming, no HW/SW flow control, async), or are the physical limitations (arbitration, bus speed, width...) too big?

We will not know this until somebody else has tried it.

(In my environment, I was not able to use speeds above 115200 Baud... whatever I tried...)

Exactly this is what the question is all about... and once the answer is "YES! This is possible! I did it...", I would like to ask what I have to do to get there, too.

DuanFugang
NXP Employee

hi,

Which FSL BSP release version did you test?

Regards,

Andy

marco_reppenhag
Contributor II

Hi Rodrigue,

we're not using Thumb, and we noticed that UART1 is quite special. Its assigned IRQ number is very low, while those of the other UARTs are high... and: on UART1 it is possible to transfer data at 500000 Baud.

I noticed that the SDMA is bound to a 133 MHz bus and shares its bandwidth with other peripherals, but... well: that's not really slow. I could not find a map of the buses for this CPU. There's only a block diagram, without any information about bottlenecks.

We picked the i.MX6SX because it offers 6 UARTs with speeds of up to a few megabaud, and now we find that only one UART is able to provide higher speeds. So far, all signs suggest that we have backed the wrong horse?

I am wondering about one thing: even using one of these "slower bound" UARTs at a lousy 230400 Baud causes errors. And it's exactly the same without DMA: to lower latency it's usually a good idea to disable DMA and transfer the few bytes directly... but in a very calm system without much traffic on the buses (only the filesystem, and UART1 idling...) it was not possible to transfer data error-free at speeds above 115200, and this is really scary...

But there are still two chances:

  1. Reconfigure the IRQ controller, maybe use FIQ for the UARTs we have to use.
  2. As mentioned above: modify the SDMA to enable burst mode.

But this will take some time, because first we have to finish other work; then I will surely come back and try to fix this.

Thanks a lot,

Marco

Yuri
NXP Employee

Hello,

Basically, for DMA, hardware flow control (CTS/RTS) should be applied.

Nevertheless, you may try to use XON/XOFF software flow control.


Have a great day,
Yuri

marco_reppenhag
Contributor II

Hello, and thank you for the quick response.

Don't get me wrong: I have to explain more about what we are doing, and I have to ask for more help... urgently needed ;-)

So, let's start:

We are not using the UARTs as consoles, where sometimes only one character is sent and other times a few more...

No, we are streaming a lot of data over UART, so we fill the FIFO continuously, and the driver (imx.c) starts DMA from the moment the FIFO is half filled (16; we also tried other values... the best is much less than half the FIFO, about 8). So we do not use very high transfer rates, but a high workload.
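
Why a lower watermark helps can be estimated with a short Python sketch. It assumes a 32-byte RX FIFO (the post treats 16 as half filled), gapless input and 10 bits per character (8N1, not stated in the post); the function name is hypothetical.

```python
FIFO_DEPTH = 32  # bytes; inferred from "half filled (16)" in the post


def time_until_full(baud, watermark, fifo_depth=FIFO_DEPTH, bits_per_char=10):
    """Seconds from the watermark trigger until the FIFO is completely full.

    This is roughly the time DMA or the CPU has to start draining the FIFO
    before the next characters are lost (assumes gapless 8N1 input).
    """
    return (fifo_depth - watermark) * bits_per_char / baud


if __name__ == "__main__":
    for watermark in (16, 8):
        slack_us = time_until_full(230_400, watermark) * 1e6
        print(f"watermark={watermark:2d}  time until full = {slack_us:.0f} us")
```

Under these assumptions a watermark of 8 leaves about 1.04 ms to react at 230400 Baud, compared with about 0.69 ms for a watermark of 16, at the cost of more frequent DMA requests.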

We are also not facing the aging problem of "starting the transfer too late", because the watermark level is reached before the timers expire.

Some timestamp debugging shows clearly that even the scatter-gather copying alone (everything else skipped) slows everything down, so overruns may appear... well: of course, this may also be an indication of something else. But does copying via DMA itself really take more time than filling the FIFO at a meager 230 kBaud?

Of course a BREAK-like flow control may protect the i.MX6SX from overruns... but then the other side has to wait, and its transmit buffers will overflow. So this may only move the problem elsewhere?

We have to transfer a lot of data at a "not too high" speed and I'm wondering about the bottleneck, because we're talking about 230400 bit/s... so the FIFO will overrun after about 1.5 ms, which is quite a lot of CPU cycles, interrupt driven... that's what I'm wondering about.
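
The "about 1.5 ms" figure can be checked with a few lines of Python. This is a rough sketch that assumes a 32-byte FIFO and 10 bits per character (8N1 framing is an assumption, not stated in the post); the function name is made up for illustration.

```python
def fifo_fill_time(baud, fifo_depth=32, bits_per_char=10):
    """Seconds a gapless sender needs to fill an empty FIFO (assumes 8N1 framing)."""
    return fifo_depth * bits_per_char / baud


if __name__ == "__main__":
    for baud in (115_200, 230_400, 460_800, 500_000):
        print(f"{baud:>7} Baud: FIFO full after {fifo_fill_time(baud) * 1e3:.2f} ms")
```

Under these assumptions the FIFO fills in roughly 1.39 ms at 230400 Baud, in line with the estimate above, and in about 0.64 ms at 500 kBaud.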

BTW: to lower latency we tried transfers without DMA... no chance: this works much slower. So DMA makes it better, but not fast enough.

So is there something I am overlooking?

Last but not least: thank you for attending to my problem. Marco.

Rodrigue
NXP Employee

Hi Marco,

Which UART are you using?

First, UART1 is not on the same bus as the other 4, and may be faster.

Second, the UARTs are on sub-buses, and the data has to travel across several arbiters, with a lot of turnaround time being lost.

How are you compiling your code? In Thumb? If so, please try compiling for the normal ARM instruction set. This might explain why the number of CPU cycles sometimes exceeds by far what is actually necessary to transfer the data.

Note that the SDMA is a 133 MHz core, but still, it should be able to transfer the data in time.
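
A back-of-the-envelope Python sketch gives a feeling for how much time a 133 MHz core has per received character. It counts raw clock cycles only, assumes 10 bits per character (8N1), and ignores arbitration, bus latency and SDMA scheduling, which are exactly the unknowns discussed in this thread. The function name is hypothetical.

```python
def cycles_per_char(clock_hz, baud, bits_per_char=10):
    """Raw clock cycles that elapse while one character arrives on the line."""
    return clock_hz * bits_per_char / baud


if __name__ == "__main__":
    sdma_clock_hz = 133_000_000  # SDMA core clock as stated in the post
    for baud in (115_200, 230_400, 500_000):
        cycles = cycles_per_char(sdma_clock_hz, baud)
        print(f"{baud:>7} Baud: about {cycles:.0f} SDMA cycles per character")
```

At 230400 Baud this comes to several thousand cycles per character, so raw clock speed alone would not explain the overruns.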

What other interfaces are you using?

In the worst case, we need to have a look at the SDMA binary itself, and maybe add some code to enable burst DMA mode.

Do you think you could set up a test bench on our SoloX EVK using our BSP (3.14.52)? The BT UART is accessible over the J19 connector (for BT).

best regards,

Rodrigue