i.MX53 FEC transmitting stale buffer.

RobertDaniels · ‎02-25-2014

I've run across an odd issue with our i.MX53 board where the fec driver seems to be sending an older packet from the ring buffer on transmit. I've done some testing with iperf3 with my tx ring buffer size set to 16 and get results looking like this:

Linux ic-ii-22 2.6.35.3 #8 PREEMPT Tue Feb 25 10:30:36 MST 2014 armv7l GNU/Linux

warning: this system does not seem to support IPv6 - trying IPv4

-----------------------------------------------------------

Server listening on 5201

-----------------------------------------------------------

Time: Tue, 25 Feb 2014 17:35:13 GMT

Accepted connection from 192.168.1.151, port 37166

Cookie: ic-ii-65.1393349687.935384.5ab881022

[ 5] local 192.168.1.193 port 5201 connected to 192.168.1.151 port 60511

Starting Test: protocol: UDP, 1 streams, 64 byte blocks, omitting 0 seconds, 10 second test

iperf3: OUT OF ORDER - incoming packet = 6 and received packet = 21 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 22 and received packet = 37 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 38 and received packet = 53 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 54 and received packet = 69 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 70 and received packet = 85 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 86 and received packet = 101 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 102 and received packet = 117 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 118 and received packet = 133 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 134 and received packet = 149 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 150 and received packet = 165 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 166 and received packet = 181 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 182 and received packet = 197 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 198 and received packet = 213 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 214 and received packet = 229 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 230 and received packet = 245 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 246 and received packet = 261 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 262 and received packet = 277 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 278 and received packet = 293 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 294 and received packet = 309 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 310 and received packet = 325 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 326 and received packet = 341 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 342 and received packet = 357 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 358 and received packet = 373 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 374 and received packet = 389 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 390 and received packet = 405 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 406 and received packet = 421 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 422 and received packet = 437 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 438 and received packet = 453 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 454 and received packet = 469 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 470 and received packet = 485 AND SP = 5

iperf3: OUT OF ORDER - incoming packet = 486 and received packet = 501 AND SP = 5

Notice that the packet numbers are 16 apart. The board does not consistently do this and will run correctly for a while but then fall into this type of error situation. I've even seen more than one buffer get into this state where the pattern will be every 6th and 10th packet. Also, if I pull the ethernet cable and then plug it back in the situation will be resolved for a while.
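To check the spacing without counting by hand, a log like the one above can be fed through a small script. This is only a sketch: the regular expression is written against the message format shown in this post, and the function name `out_of_order_stats` is made up for illustration.

```python
import re

# Matches the iperf3 message format shown in the log above.
LINE_RE = re.compile(
    r"OUT OF ORDER - incoming packet = (\d+) and received packet = (\d+)"
)

def out_of_order_stats(log_text):
    """Return (distinct spacings between late packets, distinct lateness values)."""
    events = [(int(inc), int(recv)) for inc, recv in LINE_RE.findall(log_text)]
    incoming = [inc for inc, _ in events]
    # Distance between consecutive late packets.
    spacing = sorted({b - a for a, b in zip(incoming, incoming[1:])})
    # How far the stream had advanced when each late packet finally arrived.
    lag = sorted({recv - inc for inc, recv in events})
    return spacing, lag

sample = """\
iperf3: OUT OF ORDER - incoming packet = 6 and received packet = 21 AND SP = 5
iperf3: OUT OF ORDER - incoming packet = 22 and received packet = 37 AND SP = 5
iperf3: OUT OF ORDER - incoming packet = 38 and received packet = 53 AND SP = 5
"""

spacing, lag = out_of_order_stats(sample)
print("spacing between late packets:", spacing)
print("lateness of each packet:", lag)
```

On the three sample lines this gives a single spacing of 16 (the tx ring size used in the test) and a single lateness of 15, i.e. one less than the ring size.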

Has anyone seen something like this before or have any ideas what could be causing this problem?

I've attached a Wireshark capture illustrating this problem. In the capture a ping request will be skipped until later when it goes out together with a later ping request.

Message was edited by: Robert Daniels

Original Attachment has been moved to: FECProblem.txt.zip

fabio_estevam · ‎03-12-2014

Robert,

Could you try a more recent kernel, such as 3.14-rc6? If the problem also occurs with this version please report it to the netdev mailing list.

Regards,

Fabio Estevam

RobertDaniels · ‎03-13-2014

Fabio,

I attempted to compile the mainline kernel for the i.MX53 Quick Start Board but I didn't see any board specific support for the QSB and when I tried to run the kernel it did not work. What steps do I need to follow to get a working mainline kernel for the i.MX53 Quick Start Board? Or is there a precompiled kernel somewhere I can test with?

Thanks!

Robert

fabio_estevam · ‎03-13-2014

Robert,

The mainline kernel no longer uses a board file for mx53qsb; it uses the device tree file arch/arm/boot/dts/imx53-qsb.dts instead.

Build the kernel:

make -j4 uImage LOADADDR=0x70008000

Build the dtb:

make imx53-qsb.dtb

Then load the kernel and dtb into RAM and boot.

I suggest you use a mainline U-Boot as well, as it is capable of booting with a dtb. Look at U-Boot's include/configs/imx53loco.h to find the addresses at which you need to load the uImage and dtb for booting.

Regards,

Fabio Estevam

RobertDaniels · ‎03-13-2014

Fabio,

I tested with the 3.14.0-rc6+ kernel on the i.MX53 Quick Start Board. The situation with this kernel is much improved. I do not get any out of order or dropped packets - however I am seeing some 'pauses' in the transfers. Periodically while running the test, transmission will stop for a few seconds and then restart.

- Robert

fabio_estevam · ‎03-13-2014

Robert,

Please report this to the netdev mailing list, with details on how to reproduce the 'pauses' with 3.14-rc6.

Also Cc some of the people who work on this driver, such as Frank Li, Fugang Duan, Marek Vasut, and myself.

To get the email address do a "./scripts/get_maintainer.pl -f drivers/net/ethernet/freescale/fec_main.c"

Hopefully someone will be able to provide some suggestions as to how to fix this 'pause' issue.

Regards,

Fabio Estevam

fabio_estevam · ‎03-14-2014

Robert,

Another test you could do: please select CONFIG_SMSC_PHY=y in your config file and give it a try. By default, imx_v6_v7_defconfig uses the generic ethernet PHY driver, so I would like to know whether you get the same 'pause' issue with the specific SMSC driver as well.

Regards,

Fabio Estevam

RobertDaniels · ‎03-14-2014

Fabio,

I enabled the SMSC phy driver and had the same issue. I noticed that the first time I get the 'pause' I got a report from the kernel about an fec transmit queue time out. The report/backtrace is as follows:

Linux Kernel 3.14-rc6 Report

------------[ cut here ]------------

WARNING: CPU: 0 PID: 0 at /home/robertd/Development/IC/Dev/BoardSupport/ic-ii/linux-mainline/net/sched/sch_generic.c:264 dev_watchdog+0x288/0x2ac()

NETDEV WATCHDOG: eth0 (fec): transmit queue 0 timed out

Modules linked in:

CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.14.0-rc6+ #3

Backtrace:

[<800121bc>] (dump_backtrace) from [<800124a0>] (show_stack+0x18/0x1c)

r6:8051f034 r5:00000000 r4:808edb9c r3:00000000

[<80012488>] (show_stack) from [<80651e30>] (dump_stack+0x84/0x9c)

[<80651dac>] (dump_stack) from [<80027d1c>] (warn_slowpath_common+0x70/0x94)

r5:00000009 r4:808c9d60

[<80027cac>] (warn_slowpath_common) from [<80027d78>] (warn_slowpath_fmt+0x38/0x40)

r8:ded37b40 r7:808c8000 r6:ded37b00 r5:dec34000 r4:00000000

[<80027d44>] (warn_slowpath_fmt) from [<8051f034>] (dev_watchdog+0x288/0x2ac)

r3:dec34000 r2:80832508

[<8051edac>] (dev_watchdog) from [<80031e68>] (call_timer_fn+0x74/0xf4)

r10:80031df4 r9:dec34000 r8:8051edac r7:808c8000 r6:00000100 r5:808c8000

r4:808c9dd0

[<80031df4>] (call_timer_fn) from [<80032598>] (run_timer_softirq+0x19c/0x234)

r10:8051edac r9:dec34000 r8:00200200 r7:00000000 r6:808c9e20 r5:80929fc0

r4:dec34284

[<800323fc>] (run_timer_softirq) from [<8002c2f4>] (__do_softirq+0x110/0x2b4)

r10:00000100 r9:00000001 r8:40000001 r7:808c8000 r6:808ca080 r5:808ca084

r4:00000000

[<8002c1e4>] (__do_softirq) from [<8002c7ac>] (irq_exit+0xb8/0x10c)

r10:8065b4cc r9:00000001 r8:00000000 r7:00000037 r6:808c8000 r5:808c4fe8

r4:808c8028

[<8002c6f4>] (irq_exit) from [<8000f2a0>] (handle_IRQ+0x5c/0xbc)

r5:808c4fe8 r4:808d0d24

[<8000f244>] (handle_IRQ) from [<80008590>] (tzic_handle_irq+0x78/0xa8)

r8:808c9f10 r7:00000001 r6:00000020 r5:80928fd8 r4:00000000 r3:00000080

[<80008518>] (tzic_handle_irq) from [<800130a4>] (__irq_svc+0x44/0x5c)

Exception stack(0x808c9f10 to 0x808c9f58)

9f00: 00000001 00000001 00000000 808d3e70

9f20: 808c8000 808d099c 808d0938 8092837d 00000000 808c8000 8065b4cc 808c9f64

9f40: 808c9f28 808c9f58 800638e0 8000f674 20000013 ffffffff

r9:808c8000 r8:00000000 r7:808c9f44 r6:ffffffff r5:20000013 r4:8000f674

[<8000f64c>] (arch_cpu_idle) from [<8006e874>] (cpu_startup_entry+0x108/0x160)

[<8006e76c>] (cpu_startup_entry) from [<8064cc24>] (rest_init+0xb4/0xdc)

r7:808b7358

[<8064cb70>] (rest_init) from [<80878b58>] (start_kernel+0x328/0x38c)

r6:ffffffff r5:808d0880 r4:808d0a30

[<80878830>] (start_kernel) from [<70008074>] (0x70008074)

---[ end trace cdbcbb8ba9a01909 ]---

We are currently using the Freescale 2.6.35 kernel and would like to have this problem fixed - is our best bet to move to the mainline kernel? If so, is it possible to use the mainline kernel and have full OpenGL & EGL support?

I will send a report in to the netdev mailing list.

Thanks!

Robert Daniels

karina_valencia · ‎03-25-2014

cyborgnegotiator Mar 21, 2014 8:34 AM (in response to jamesbone)

Hi,

I tried to reproduce your issue on the FSL Ubuntu image (L2.6.35_11.05.01_ER_images_MX5X + lucid_1108), but without success.

There are a few differences, and I would like to clarify some of your data.

Could you please correct your steps for reproducing the issue, and specify the exact versions on the customer system (kernel, OS, iperf, etc.)?

- step 1. do you need ipv6? -V parameter, server listening on tcp... (for udp, missing -u parameter)

- step 2. ipv6, udp (different from server settings)

- step 3. any public web source? I made test for download big file from http://gentoo.org/.... (~160MB)

Thanks,

Jozef

RobertDaniels · ‎03-26-2014

Jozef,

My i.MX53 setup is as follows:

OS: Linux version 2.6.35.3-1129-g691c08a (rogerio@b19259) (gcc version 4.4.4 (4.4.4_09.06.2010) ) #7 PREEMPT Wed Nov 16 14:33:06 BRST 2011

System: Ubuntu 10.04 LTS

iperf3: iperf-3.0.1

My Desktop linux system is Ubuntu 12.04 and the ip address is 192.168.1.101. I have webfs configured on it with a ~27 MB image file called test.bmp in /srv/ftp.

To reproduce this issue you also need the right network configuration. I've tried this same test with both devices plugged into the same router and everything runs just fine.

My network setup is as follows:

Desktop ---> Cisco ValetPlus Gigabit Router

i.MX53 ---> NetGear DS 108 10/100 Hub ---> Cisco ValetPlus Gigabit Router

Steps to reproduce:

1. On Desktop run: iperf3 -s -V

2. On i.MX53 ssh session 1 run: iperf3 -c 192.168.1.101 -u -l 64 -b 55M -V -t 1000

3. On i.MX53 ssh session 2 run: cd /tmp; while true; do date; wget http://192.168.1.101:8000/test.bmp; rm -fv /tmp/test.bmp; done

When I run this the iperf test starts off fine but when the wget command starts I start seeing packet loss. After running longer I see out of order packets which indicates the previously stated issue.

Thanks,

Robert

cyborgnegotiato · ‎04-01-2014

Hi,

Could you capture the network traffic on each side? You can capture it with tcpdump or a similar tool.

Thanks,

Jozef

RobertDaniels · ‎04-01-2014

Jozef,

Sure.

My test setup:

Client: i.MX53 QSB running Ubuntu Linux 10.04 Linux version 2.6.35.3-1129-g691c08a (rogerio@b19259) (gcc version 4.4.4 (4.4.4_09.06.2010) ) #7 PREEMPT Wed Nov 16 14:33:06 BRST 2011
Host: Ubuntu 12.04
Observer: Windows 7

The test:

1. Run the test as previously outlined until I start seeing the out-of-order packets.
2. Shut down all previous tests, including the two ssh connections.
3. Host: sudo tcpdump -nnvvXSs 0 -c100 icmp -w host.pcap
4. Client: sudo tcpdump -nnvvXSs 0 -c100 icmp -w imx53.pcap
5. Client: ping Host
6. Observer: Wireshark with icmp capture filter

I have attached the generated packet captures. These captures were run after running the same test previously so the results are impacted by this. The interesting thing to note in these captures is this:

i.MX53            Observer          Host
request: seq 1    request: seq 1    request: seq 1
                  request: seq 45   reply: seq 1
reply: seq 1      reply: seq 1      request: seq 45
reply: seq 45     reply: seq 45     reply: seq 45
request: seq 2
request: seq 3    request: seq 3    request: seq 3
reply: seq 3      reply: seq 3      reply: seq 3

Here we see that the i.MX53 thought it sent only sequences 1, 2, and 3, but what actually happened was that sequences 1, 45, 2, and 3 were sent: the packet from the previous test was still in the ring buffer and went out when the buffer descriptor from the previous position was sent.
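The same comparison can be done mechanically once the request sequence numbers have been pulled out of each capture. The sketch below hard-codes the request sequences from the i.MX53 and Observer columns of the table; the function name `compare` is hypothetical, and extracting the numbers from the pcap files is left out.

```python
# Echo request sequence numbers, copied from the table in this post.
claimed_by_imx53 = [1, 2, 3]    # what the i.MX53 capture says it sent
seen_by_observer = [1, 45, 3]   # what the Observer capture saw on the wire

def compare(claimed, seen):
    """Return (sequences on the wire that were never claimed,
    claimed sequences that never showed up on the wire)."""
    stale = [s for s in seen if s not in claimed]
    missing = [s for s in claimed if s not in seen]
    return stale, missing

stale, missing = compare(claimed_by_imx53, seen_by_observer)
print("stale on the wire:", stale)
print("not yet on the wire:", missing)
```

For the captured window this reports sequence 45 as stale and sequence 2 as not yet seen by the Observer.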

Once the i.MX53 gets into this state, it stays there until it is power-cycled or the ethernet cable is unplugged and then plugged back in.

DuanFugang · ‎04-22-2014

hi, Robert,

From your log, this really is a packet-ordering issue in the FEC tx ring buffer.

I set up the test case on the imx53 SMD platform, following your description:

Test condition:

linux PC <---------> NetGear switch <---------> imx53 SMD board

PC IP: 10.192.242.202

imx53 IP: 10.192.242.45

bin.tgz: size is 27MBytes, located at 10.192.242.233 server, can be downloaded: http://10.192.242.233/bin.tgz

Iperf3 version: iperf version 3.0.1 (10 January 2014)

kernel version:

root@freescale ~$ uname -r

2.6.35.3-01129-g691c08a

Test steps:

1. run iperf3 on PC: iperf3 -s -V

2. start one imx53 telnet terminal 1, and run the script: while true; do date; wget http://10.192.242.233/bin.tgz; rm -fv ./bin.tgz; done &

3. start one imx53 telnet terminal 2, ping PC: ping 10.192.242.202

4. run iperf3 on imx53 serial terminal: iperf3 -c 10.192.242.202 -u -l 64 -b 55M -V -t 1000

After five test runs (1000 seconds each), the PC iperf3 log shows no "iperf3: OUT OF ORDER" messages, and the ping log on imx53 telnet terminal 2 shows no ordering problems.

The environment matches your description, but I cannot reproduce the ordering issue.

Do you have any comments on my test steps?

I will continue testing, and I hope to get your response ASAP.

Thanks,

Andy

RobertDaniels · ‎04-22-2014

Andy,

The real key to duplicating this issue is to use an ethernet hub in place of the NetGear switch. The hub is half-duplex and seems to cause this problem to manifest very quickly. I've been using a 10/100 Mbps hub.

I believe the core problem is that the driver and the fec are getting out-of-sync in regard to the tx ring buffer. Not knowing how the fec actually works, it appears that the fec is erroneously advancing its cur_tx. When this occurs, the driver writes to its cur_tx (tx_bd_base[10]) which is one behind the fec cur_tx (tx_bd_base[11]) and tells the fec to transmit. Since the fec has already advanced past that buffer descriptor, nothing is transmitted. The driver then advances its cur_tx (tx_bd_base[11]) which then matches the one in the fec (tx_bd_base[11]). Once the two are in sync again, everything works as it should except that one packet did not get transmitted. When the fec gets to the buffer descriptor previous to the one that was ignored (tx_bd_base[9]), that packet is transmitted, then the fec checks the next one (the ignored buffer descriptor - tx_bd_base[10]) and sees that it should be transmitted and then transmits this packet (out-of-order.) This then causes the fec to be out of sync with the driver again (fec cur_tx = tx_bd_base[11] and driver cur_tx = tx_bd_base[10]) and the scenario is repeated.
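The sequence described above can be played out in a toy model. This is a sketch of the hypothesis only, not of the real hardware or driver: the ring is a plain list, "ready" is modelled as a slot holding a packet number, and the spurious advance of the fec pointer is injected by hand through the made-up `glitch_before` parameter.

```python
def simulate(ring_size, packets, glitch_before):
    """Toy tx ring: the driver queues packets 0..packets-1, the 'fec' sends
    every ready descriptor starting at its own pointer."""
    ring = [None] * ring_size      # None means "descriptor not ready"
    drv_cur = hw_cur = 0
    wire = []                      # packets in the order they hit the wire
    for pkt in range(packets):
        if pkt == glitch_before:
            # Assumed fault: the fec pointer advances without transmitting.
            hw_cur = (hw_cur + 1) % ring_size
        ring[drv_cur] = pkt        # driver marks its descriptor ready
        drv_cur = (drv_cur + 1) % ring_size
        # "Kick" the fec: it sends ready descriptors until it finds an idle one.
        while ring[hw_cur] is not None:
            wire.append(ring[hw_cur])
            ring[hw_cur] = None
            hw_cur = (hw_cur + 1) % ring_size
    return wire

def late_packets(wire):
    """Return (packet, highest packet already seen) for each late arrival."""
    late, highest = [], -1
    for pkt in wire:
        if pkt < highest:
            late.append((pkt, highest))
        highest = max(highest, pkt)
    return late

print(late_packets(simulate(ring_size=16, packets=40, glitch_before=6)))
# -> [(6, 21), (22, 37)]
```

With a 16-entry ring and a single injected glitch, the model reproduces the first two pairs from the iperf3 log in the original post (6 arriving after 21, 22 arriving after 37), and the same descriptor keeps misbehaving on every pass around the ring.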

I was able to add code to the driver to recognize when the driver and fec were out-of-sync and correct the situation. This results in occasional dropped packets but it avoids the case where the same buffer descriptor is always going to be sent out-of-order until the fec is reset. This change allows me to run my test with only minor issues rather than the catastrophic problems I've seen when too many of the buffer descriptors get out-of-sync.

Please let me know if there is anything else I can do to help.

Thanks,

Robert

DuanFugang · ‎04-23-2014

hi, Robert,

Thanks for your quick response.

I later spent a day trying to reproduce the issue, but failed.

I will get an ethernet hub and then test again.

Can you share the code for the kernel 2.6.35 fec driver that detects the out-of-sync issue and corrects the situation?

Thanks,

Andy

RobertDaniels · ‎04-25-2014

Andy,

Sure, I'll post my changes. I actually backported the fec driver from the 3.0.35 kernel so the driver I'm posting is derived from that. There are many tweaks in my file which didn't seem to affect the issue so you'll have to wade through those. The salient changes are in fec.c functions fec_enet_start_xmit and fec_enet_tx where I repurpose the BD_ENET_TX_INTR bit in the buffer descriptor to detect when a buffer descriptor has a next neighbor which has been transmitted while the current dirty buffer descriptor has not.
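The detection condition described here can be sketched in a few lines. This is not the posted driver change, which is in C and lives in the attachment; it is a simplified model in which `ready` stands for the hardware-owned ready state of a descriptor and `submitted` is a hypothetical driver-side marker playing the role of the repurposed bit.

```python
from dataclasses import dataclass

@dataclass
class Desc:
    ready: bool = False      # set by the driver, cleared once the packet is sent
    submitted: bool = False  # hypothetical driver-side marker, set when queued

def is_skipped(ring, dirty):
    """True if ring[dirty] is still waiting to be sent while its next
    neighbour, queued after it, has already gone out."""
    cur = ring[dirty]
    nxt = ring[(dirty + 1) % len(ring)]
    return cur.submitted and cur.ready and nxt.submitted and not nxt.ready

ring = [Desc() for _ in range(4)]
ring[1] = Desc(ready=True, submitted=True)    # queued, never transmitted
ring[2] = Desc(ready=False, submitted=True)   # queued later, already transmitted

print(is_skipped(ring, 1))  # True
print(is_skipped(ring, 2))  # False
```

What to do once the condition fires is a separate decision; the earlier post notes that the correction used in practice leads to occasional dropped packets.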

Thanks,

Robert

DuanFugang · ‎04-29-2014

hi, Robert,

Thanks.

karina_valencia · ‎03-27-2014

cyborgnegotiator please continue with the follow up

RobertDaniels · ‎03-11-2014

After investigating this issue more thoroughly, it is starting to look like an issue with the FEC. I've determined that the issue arises at some point when a buffer descriptor is prepared in the fec driver and marked as ready to transmit. Then the ENET_TDAR[TDAR] bit is set and instead of being transmitted nothing happens. When the next packet needs to be transmitted, the next buffer descriptor is prepared and is marked ready for transmit and again the ENET_TDAR[TDAR] bit is set and this time the newest packet will be transmitted. This will continue on until the buffer descriptor previous to the one that never was transmitted is prepared and marked as ready. When ENET_TDAR[TDAR] is set this time, the newest packet is transmitted and then the old packet that was marked as ready but wasn't transmitted will be sent resulting in an out of order packet. When this buffer descriptor is used again, the exact same thing will happen and will always behave in this manner.

To make matters worse, this can happen to any of the buffer descriptors and if enough of them get into this state then ethernet transmitting ceases altogether. To resolve, the FEC must be reset/restarted - the simplest approach is to unplug the ethernet cable and plug it back in.

I did some testing with a new i.MX53 Quick Start Board using the supplied version of Ubuntu on their SD card and was able to reproduce this issue with the following test:

1. On a linux desktop machine run iperf3 with this command:

iperf3 -s -V

2. On the i.MX53 (from ssh) run iperf3 with this command:

iperf3 -c <linux desktop ip address> -u -l 64 -b 55M -V -t 1000

3. On the i.MX53 (from ssh) run a shell script that continuously downloads a large image file from a web server such as:

while true; do date; wget http://path/to/test.bmp ; rm -fv /tmp/test.bmp; done

4. Watch the output on the linux desktop machine - after running for a while you should see many reports of out of order packets and if left to run long enough nothing at all will be transmitted.

Does anyone know anything about this?

jamesbone · ‎03-12-2014

We have already escalated your issue and will reply as soon as we get some input.