FEC ethernet packetloss

MarekVasut · ‎12-05-2013

Hello!

We have an issue when communicating over FEC Ethernet on the i.MX6. We would expect that if we run a continuous stream of packets, the MX6 would be able to transfer them without dropping any. The problem is that we observe sporadic packet loss when both ends of the link operate in 1000 Mbps / full duplex mode. Packets are occasionally not transmitted FROM the FEC Ethernet TOWARDS the host PC. We see the first few dropped packets after roughly 5 hours of continuous transfer. After that, we see a few dropped packets every 2-3 hours. We have worked out exact steps to reproduce this packet loss; maybe someone has an idea? Thank you!

Steps:

HOSTPC: We have an Intel i7 820QM with Intel i82577LM Ethernet (e1000e driver) and an Intel i7 3970X with i82579LM Ethernet (e1000e driver).

TARGET: We have an MX6Q SabreAuto, an MX6Q SabreLite and two custom boards, one with an MX6Solo and the other with an MX6Dual. All use FEC Ethernet for this test.

Any combination of the TARGET and HOSTPC systems above shows these symptoms. For your convenience, you can try with the SabreAuto as the TARGET platform.

1) Connect the TARGET directly to a HOSTPC through a 50 cm CAT6 Ethernet cable.

2) Boot Freescale Linux 3.0.35-4.1.0 (in default imx6_defconfig configuration for sabreauto, in slightly modified configuration for the custom mx6dual and mx6solo boards) on TARGET.

3) Boot the TARGET into userland from the SD card and install the "iperf" tool.

4) Make sure the link is in 1000/FD mode on HOSTPC:

$ ethtool -s eth0 speed 1000 duplex full

5) Make sure the link is in 1000/FD mode on TARGET:

$ ethtool -s eth0 speed 1000 duplex full

6) Disable any possibly interfering network managers etc. on both ends:

$ /etc/init.d/networking stop

$ /etc/init.d/network-manager stop

7) Bring up network interface on HOSTPC:

$ ifconfig eth0 192.168.1.1 netmask 255.255.255.0

8) Bring up network interface on TARGET:

$ ifconfig eth0 192.168.1.2 netmask 255.255.255.0

9) Start "iperf" on HOSTPC in UDP server mode:

$ iperf -u -s -l 4M -i 60

10) Start "iperf" on TARGET in UDP client mode:

$ iperf -u -c 192.168.1.1 -t 28800 -b 1000M -i 60
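With a 60-second report interval over an 8-hour run, the server prints several hundred report lines, so the few lossy intervals are easy to miss by eye. The following is a minimal sketch for picking them out of a saved server log. It assumes the iperf 2 UDP server report layout, with a "Lost/Total Datagrams" column at the end of each line; the function name `lossy_intervals` is hypothetical, and the exact layout may differ between iperf versions.

```python
import re

# One iperf 2 UDP server report line is assumed to look like:
# [  3] 60.0-120.0 sec  2.93 GBytes   419 Mbits/sec   0.015 ms    3/2138540 (0.00014%)
REPORT_RE = re.compile(
    r"(\d+\.\d+)\s*-\s*(\d+\.\d+)\s+sec"  # interval start and end in seconds
    r".*?(\d+)\s*/\s*(\d+)\s+\("           # lost / total datagrams
)


def lossy_intervals(lines):
    """Return (start, end, lost, total) for every report line with lost > 0."""
    hits = []
    for line in lines:
        m = REPORT_RE.search(line)
        if m is None:
            continue  # header, connection banner, or any other non-report line
        start, end = float(m.group(1)), float(m.group(2))
        lost, total = int(m.group(3)), int(m.group(4))
        if lost > 0:
            hits.append((start, end, lost, total))
    return hits
```

Feeding it the lines of a log file captured from the server in step 9 gives the intervals in which datagrams went missing. Note that the final summary line of a run also matches the pattern, so it shows up in the result if the run lost anything at all.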

cxh · ‎02-27-2014

I get the same error.

When I use HTTP to download large files, many packets are lost.

stability of iMX6 Ethernet

We configured the FEC in RMII mode.

YixingKong · ‎02-21-2014

Marek

Has your issue been resolved? If yes, we are going to close the discussion in 3 days. If you still need help, please feel free to reply with an update to this discussion.

Thanks,
Yixing

MarekVasut · ‎02-21-2014

No, it was not :-(

YixingKong · ‎02-27-2014

Marek

Could you please respond if there is any progress on your side after the suggestion from our engineer?

Regards,

YixingKong · ‎02-23-2014

Here is the response from internal engineer:

You use the HostPC as the slave (receiver) and the i.MX6 as the host (transmitter), so the packet loss may be caused by your HostPC.

Also, UDP communication cannot guarantee that there is no packet loss.

MarekVasut · ‎02-28-2014

We tried multiple host systems with multiple NICs (e1000e, tg3 ...). The problem is not in the host system.

In the meantime, we found a better testcase:

Use FSL 3.10.17-beta
On HostPC, run

=> iperf -s -i 1

On target, connect an SD card and a USB stick (or two SD cards) and run:

=> sha1sum /dev/mmcblk0 & sha1sum /dev/sda

=> iperf -c 192.168.1.1 -t 1000 -i 1

Here 192.168.1.1 is the IP address of the HostPC. This ensures that the target system (the MX6 board) is under heavy load and that the problem shows up quickly, in a matter of minutes.

YixingKong · ‎03-19-2014

Marek

Your issue seems to be a tough one. Our AE engineer has created a request for the R&D team. Someone from R&D is working on it and will get back to you as soon as anything is available.

Thank you for your patience.

Thanks,

Yixing

YixingKong · ‎02-23-2014

Marek

Ok, I will branch your issue to an internal group and assign an engineer to work on it.

Regards

Yixing

EricNelson · ‎12-09-2013

Hi Marek,

Have you looked at the performance counters (ENET_IEEE_x) to see if any reported failures correlate with the dropped packets?

I assume you've seen our results at http://boundarydevices.com/i-mx6-ethernet/, and I assume you know about the bandwidth limitations on the receive side.

We haven't done much testing with constant transmission, since the bigger issues tend to be on the receive side.

DuanFugang · ‎12-05-2013

hi,

Dropped UDP packets cannot be avoided.

Also, since the i.MX6 Ethernet hardware has a bandwidth limitation (400-700 Mbps), your HostPC network card may have more bandwidth than the i.MX6. If you use iperf to run a UDP bandwidth test at 1000M throughput, packets will be dropped.

Thanks,

Andy

detlevzundel · ‎12-09-2013

Hi Andy,

can you please elaborate why you think we "cannot avoid to drop UDP packets"? Your reference to the bandwidth limitation (that we of course know about) is not relevant, as our scenario is not a congestion problem at all.

If you read the report closely, you will find that (as far as we can see) these packets have been passed to the FEC by the driver successfully but are _not_ transmitted to the other side. We conclude this because the receiving side is faster and we have a full duplex point-to-point connection, so packet collisions cannot happen.

We fully acknowledge that in certain scenarios packet losses can happen at various points outside the MAC, but we consider a system that cannot even put a packet "on the wire" to be incorrect, and we are looking desperately for help on how to set this straight.

We also know that using an external PCIe network card instead of the FEC does _not solve_ the problem either. Using the same network cards in a regular PC without problems shows that they are not the source of the problem.

We really would like to use the i.MX6 in networking applications, but without a solution to this (and the PCIe problem) we see serious problems here.

So can you please elaborate why you think this behaviour is acceptable?

EricNelson · ‎12-09-2013

Hi Detlev,

Do you have any information about the type of failure from the host side?

In essence, I'm wondering if the problem indicates that the packet was received but rejected or not received at all.

Your PC side driver should report packets that fail a CRC check and it would be useful to know if that statistic increments when you detect packet loss.
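One way to check this on the host is to snapshot the driver statistics before and after a lossy interval and look at which counters moved. The following is a minimal sketch that assumes the usual `ethtool -S <iface>` output of one `name: value` pair per line. The helper names are hypothetical, and the counter names that appear (for example a CRC error counter) depend on the driver.

```python
def parse_ethtool_stats(text):
    """Parse `ethtool -S <iface>` output into {counter_name: int}."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        # Skip the "NIC statistics:" banner and anything without a numeric value.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def counter_deltas(before, after):
    """Return {name: after - before} for every counter whose value changed."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }
```

The two snapshots could be captured with `subprocess.run(["ethtool", "-S", "eth0"], capture_output=True, text=True).stdout`. If a loss is detected and only the plain packet and byte counters have moved, that suggests the frame never reached the host NIC at all, rather than being received and rejected.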

MarekVasut · ‎12-10-2013

The packet is not received at all. The packet is not emitted on the ethernet link.

TeleLaci · ‎12-09-2013

Hi Detlev,

I could be wrong because I'm not a huge network guru, but I don't understand what the problem is at all.

The UDP protocol doesn't guarantee that your packet will arrive. A packet might be lost anywhere on the route to the target (on the internet or on your local network, wherever).

Your program (the one that uses UDP) has to solve this dilemma.

The TCP protocol does guarantee that your packet will arrive. It retransmits lost packets.

Marek says: "few dropped packets every 2-3 hours"

If those packets went over TCP, they would be retransmitted: a few dropped (and therefore retransmitted) packets out of many thousands of successful ones. It simply has no impact on performance, not even 0.001%. This controller error (yes, it's probably a controller or Linux driver error) is totally negligible.

If you use the UDP protocol, you should not expect those packets to arrive anyway. It doesn't matter why they got lost. The UDP protocol doesn't check whether a packet has arrived at the target, but TCP does.

I could be wrong, and I'm sorry if so. I just don't understand what you are talking about, because this is such a tiny, negligible error; it has no impact on your Ethernet speed at all.

detlevzundel · ‎12-17-2013

Hi Tele Laci,

actually in one application we don't care at all about UDP but about EtherCAT master frames being sent out from the i.MX6. We use the UDP setup to allow everybody to reproduce the problem without "exotic" EtherCAT slaves. In the EtherCAT master setup we really do care about every individual packet, as it triggers one read cycle from all the slaves.

To state it explicitly again: the problem we want to solve is that the FEC controller receives packets from the Linux driver, but does _not_ put them on the wire without any indication to the contrary. Of course packet loss can occur somewhere "further down" the data path, but our test setup precludes that.

To make an analogy to our problem, think about an I2C controller that you instruct to send out a message, but it simply refuses to do this without signalling an error. I would consider this to be a hardware bug and I hope you would too ;-)
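On a direct point-to-point link with no other traffic, the symptom described here can be checked by comparing what the sender counted as transmitted with what the receiver counted as received. The following is a minimal sketch that assumes the standard Linux per-interface counters under `/sys/class/net/<iface>/statistics/`. The function names are hypothetical, and whether `tx_packets` is incremented before or after the MAC reports completion depends on the driver.

```python
from pathlib import Path


def read_counter(iface, name, root="/sys/class/net"):
    """Read one counter such as tx_packets or rx_packets from sysfs."""
    return int((Path(root) / iface / "statistics" / name).read_text())


def unaccounted_frames(tx_before, tx_after, rx_before, rx_after):
    """Frames the sender counted as transmitted that the receiver never counted."""
    return (tx_after - tx_before) - (rx_after - rx_before)
```

The idea is to sample `tx_packets` on the target and `rx_packets` on the host while the link is idle, run the test, stop the traffic, and sample both again. A positive result with no error or drop counters incremented on either side would match the behaviour described above.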

TeleLaci · ‎12-17-2013

Hi Detlev,

I think I understand the problem, but maybe you don't understand my point. The FEC probably has some kind of hidden problem. We don't know what. You want to fix it. As a consequence of this error you lose 1 packet in 1 million, so the error's impact on the speed (because of the repeated packet) is about 1E-6, i.e. 0.0001%. (I don't know the exact numbers, this is just an example, but it's very rare: "...first few dropped packets after roughly 5 hours of continuous transfer...") Sooner or later you are going to fix this bug, I'm sure. What will be the outcome? You will improve the average long-term speed by 0.0001%! Are you sure this is an important bug to fix? Do you want to spend precious hours hunting this error down? That is my only doubt. I think you are ignoring the proportions. I agree with you anyway, and I respect it when someone is as precise as you.
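For scale, the loss ratio implied by the original report can be estimated with a few lines of arithmetic. This is only a rough sketch: the 400 Mbit/s payload rate (the lower bound mentioned earlier in the thread), the 1470-byte datagram size (the usual iperf 2 UDP default) and the count of 3 lost packets in the first 5 hours are all assumptions, not measurements from this thread.

```python
def packets_sent(rate_bits_per_s, payload_bytes, seconds):
    """Approximate number of datagrams sent at a steady payload rate."""
    return rate_bits_per_s * seconds / (payload_bytes * 8)


def loss_ratio(lost, rate_bits_per_s, payload_bytes, seconds):
    """Fraction of datagrams lost over the whole run."""
    return lost / packets_sent(rate_bits_per_s, payload_bytes, seconds)


# Assumed figures: 400 Mbit/s, 1470-byte datagrams, 3 losses in 5 hours.
ratio = loss_ratio(lost=3, rate_bits_per_s=400e6, payload_bytes=1470, seconds=5 * 3600)
print(f"{ratio:.1e}")
```

Under these assumptions roughly 6E8 datagrams are sent in 5 hours, so the ratio comes out near 5E-9, well below the 1-in-a-million example figure.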

detlevzundel · ‎12-17-2013

Hi Tele Laci,

you are right, I _understand_ but I cannot follow your point. We have a deterministic CPU and deterministic functional blocks and I thus expect deterministic behaviour of the system as a whole. When we see a problem here, we need to diagnose it in order to be even sure how it will influence a final product. I believe your whole line of argument regarding the speed is "very problematic" to say the least. All we currently know is that we see a certain error rate in a certain test case. We have _no_ guarantee whatsoever that this is already the worst case as we do not even understand the problem. At this point it is perfectly feasible that somebody else will find a usage scenario where the failure rate will be much higher. Maybe it will then even surpass your current "subjective threshold" that you are obviously ready to accept?

Moreover, if I accepted such an "acceptable failure threshold", then where is the limit? Do we then also accept that the CPU may miscalculate once every billion operations? Do you accept other IP blocks exposing such (measurable) failure rates (e.g. SATA swallowing write operations)? I really do not want to go down that route of probabilistic computing and honestly, I have not read anything about that in the specification of the i.MX6 chip ;-)

And yes, I know that all electronic circuitry can malfunction because of radiation and other _external_ factors that we cannot control, but the chip as such still has to be deterministic in absence of such things.

TeleLaci · ‎12-17-2013

Hi Detlev,

You forgot something. Computer networks are always fault-tolerant systems; don't take my thoughts out of that context. I have no idea where the "just acceptable" thresholds and barriers are, but I'm absolutely sure that a 1E-6 loss in network speed is totally negligible, especially after I accepted the i.MX6 FEC erratum about the 50% speed deficiency: it can deliver about 400 Mbit instead of 800 Mbit.

Do I accept CPU miscalculations with 1E-9 probability? Of course I do, I'm not stupid. If the CPU were fault tolerant too and fixed its own errors automatically, why not? What I would see from the BLACK BOX is that its speed is slower by 1E-7 percent, but it works perfectly anyway. And who cares?

Of course, if you discover something more serious with that shady-looking FEC, or if he is just acting suspiciously or makes a bad move, that must be investigated; I agree with that. But if it is only a 1E-7 speed deficiency, then let him go, please. ;-)

detlevzundel · ‎12-17-2013

Hi Tele Laci,

our support request was (and still is) "FEC packet loss" and not about the speed of the network.

Thanks

Detlev

TeleLaci · ‎12-17-2013

Ok, I will run the tests on my machine as described above and then report back. I must help if I can. I always wanted to test my network.

It seems to take a long time, so I will run it while I sleep, and then I'll be back.

Laci

i.MX6_All

i.MX6Dual

i.MX6Quad

i.MX6S