ls1012ardb - throughput measurement under high load fails, eth2 stops working

petervollmer · ‎04-24-2017

Hi,

I am currently evaluating the LS1012 processor using the ls1012ardb board and am trying to do some throughput measurements with our smartbits network performance analyzer. I tried my own LS1012A-SDK-20161230-yocto build and a prebuilt openwrt image (Build vls1012a_1.2.1 for LS1012A), with the same results. So far I have not been able to complete a full measurement because one of the two interfaces apparently stops working. Here is what I found.

My setup is always the same for the boards I test

# cat setup.sh
#!/bin/sh

LAN=eth0
WAN=eth2

# Bring up both ports and assign one /16 subnet to each
ip link set $LAN up
ifconfig $LAN 172.18.1.1 netmask 255.255.0.0 up
ip link set $WAN up
ifconfig $WAN 172.19.1.1 netmask 255.255.0.0 up

# Enable IPv4 forwarding between the two subnets
echo 1 > /proc/sys/net/ipv4/ip_forward

The smartbits analyzer sends UDP packets (src port 5000, dst port 5000) bidirectionally through the lan and wan ports for 30 seconds and searches for the maximum throughput at which fewer than 0.5% of the sent frames are lost:

smb (max 1Gbps) -> lan0 -> CPU -> wan -> smb
smb <- lan0 <- CPU <- wan <- smb (max 1 Gbps)

The iptables netfilter rules are empty (default policy ACCEPT); only the conntrack entries for the UDP connections are used for fast forwarding of the frames.

I checked that the network interfaces are set up correctly before the throughput measurement starts. ICMP in both the lan and wan subnets works. The ARP entries of my test peers look okay:

root@OpenWrt:/# ip neigh
172.19.1.101 dev eth0 lladdr 00:0c:be:01:58:46 REACHABLE
172.18.1.254 dev eth2 lladdr 00:10:18:bb:b4:da REACHABLE

The first measurement even shows a throughput of ~27400 frames per second (packet length 124 bytes; frames were sent at a rate of 370000 fps, and about 92 percent were lost).
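
As a quick cross-check of those figures, here is a minimal sketch that derives the loss percentage and the forwarded layer-2 rate from the frame rates. The function names are made up for illustration, and the rate calculation assumes the 124-byte length is the full frame with no preamble or inter-frame gap counted.

```sh
#!/bin/sh
# loss_percent SENT_FPS FORWARDED_FPS
# Prints the share of offered frames that were not forwarded, in percent.
loss_percent() {
    awk -v sent="$1" -v fwd="$2" \
        'BEGIN { printf "%.1f\n", (sent - fwd) * 100 / sent }'
}

# l2_mbps FPS FRAME_BYTES
# Prints the layer-2 rate in Mbit/s, counting frame bytes only
# (assumption: no preamble, FCS or inter-frame gap is added).
l2_mbps() {
    awk -v fps="$1" -v bytes="$2" \
        'BEGIN { printf "%.1f\n", fps * bytes * 8 / 1000000 }'
}
```

With the numbers above, `loss_percent 370000 27400` prints 92.6 and `l2_mbps 27400 124` prints 27.2, so the "92 percent" figure is consistent with the reported frame rates.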

After (or during) the first measurement, however, the eth2 interface stops working. All ARP entries relating to eth2 are stale, and ICMP to my test peer in the lan subnet (172.18.1.254) no longer works:

root@OpenWrt:/# ip neigh
172.19.13.1 dev eth0 lladdr 00:00:08:00:00:01 REACHABLE
172.19.1.101 dev eth0 lladdr 00:0c:be:01:58:46 STALE
172.19.12.1 dev eth0 lladdr 00:00:07:00:00:01 REACHABLE
172.19.11.1 dev eth0 lladdr 00:00:06:00:00:01 REACHABLE
172.18.13.1 dev eth2 lladdr 00:00:04:00:00:01 STALE
172.19.10.1 dev eth0 lladdr 00:00:05:00:00:01 REACHABLE
172.18.12.1 dev eth2 lladdr 00:00:03:00:00:01 STALE
172.18.11.1 dev eth2 FAILED
172.18.10.1 dev eth2 lladdr 22:33:44:00:00:00 STALE
172.18.1.254 dev eth2 lladdr 00:10:18:bb:b4:da STALE
fe80::be30:5bff:fee5:f3cc dev eth0 lladdr bc:30:5b:e5:f3:cc STALE
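
To spot this kind of per-interface breakdown at a glance, the neighbour table can be summarised by device and state. This is only a sketch: `neigh_summary` is a hypothetical helper name, and it assumes the `ip neigh` line layout shown above (state as the last field, interface name after `dev`).

```sh
#!/bin/sh
# neigh_summary: reads `ip neigh` output on stdin and prints
# "<dev> <state> <count>" lines, sorted by device and state.
neigh_summary() {
    awk '{
        dev = ""
        # the interface name is the field that follows the word "dev"
        for (i = 1; i < NF; i++) if ($i == "dev") dev = $(i + 1)
        # the neighbour state is the last field on the line
        if (dev != "") count[dev " " $NF]++
    }
    END { for (k in count) print k, count[k] }' | LC_ALL=C sort
}
```

On the board this would be called as `ip neigh | neigh_summary`. For the listing above it reports four REACHABLE and two STALE entries on eth0, and one FAILED and four STALE entries on eth2.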

The ARP request for this address is sent (checked with tcpdump on the peer) and the reply is sent back, but it never arrives on the eth2 interface of the ls1012ardb board.

The rx counter of "ifconfig eth2" does not increase; however, ethtool -S shows increasing counters, indicating arriving ARP packets (rx_broadcast) that are apparently not handed over by the PFE:

root@OpenWrt:/# ethtool -S eth2 | grep "rx_"
rx_packets: 5575508
rx_broadcast: 1178
rx_multicast: 34
rx_crc_errors: 102296
rx_undersize: 0
rx_oversize: 0
rx_fragment: 6
rx_jabber: 0
rx_64byte: 1288
rx_65to127byte: 76
rx_128to255byte: 5574138
rx_256to511byte: 0
rx_512to1023byte: 0
rx_1024to2047byte: 0
rx_GTE2048byte: 0
rx_octets: 713761741
IEEE_rx_drop: 157
IEEE_rx_frame_ok: 4982309
IEEE_rx_crc: 102296
IEEE_rx_align: 0
IEEE_rx_macerr: 410986
IEEE_rx_fdxfc: 0
IEEE_rx_octets_ok: 637652233

root@OpenWrt:/# ethtool -S eth2 | grep "rx_"
rx_packets: 5575546
rx_broadcast: 1213
rx_multicast: 36
rx_crc_errors: 102296
rx_undersize: 0
rx_oversize: 0
rx_fragment: 6
rx_jabber: 0
rx_64byte: 1324
rx_65to127byte: 77
rx_128to255byte: 5574139
rx_256to511byte: 0
rx_512to1023byte: 0
rx_1024to2047byte: 0
rx_GTE2048byte: 0
rx_octets: 713764278
IEEE_rx_drop: 157
IEEE_rx_frame_ok: 4982346
IEEE_rx_crc: 102296
IEEE_rx_align: 0
IEEE_rx_macerr: 410986
IEEE_rx_fdxfc: 0
IEEE_rx_octets_ok: 637654706
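
Comparing the two snapshots by eye is tedious, so here is a small sketch that prints only the counters that changed. It assumes each snapshot was saved to a file in the "name: value" form shown above; `counter_delta` and the file names are hypothetical.

```sh
#!/bin/sh
# counter_delta BEFORE_FILE AFTER_FILE
# Each file holds `ethtool -S <dev>` output ("name: value" lines).
# Prints "name +delta" for every counter whose value changed,
# in the order the counters appear in AFTER_FILE.
counter_delta() {
    awk -F': *' '
        { gsub(/^[ \t]+/, "", $1) }              # strip leading indentation
        NR == FNR { before[$1] = $2; next }      # first file: remember values
        ($1 in before) && ($2 != before[$1]) {
            printf "%s %+d\n", $1, $2 - before[$1]
        }
    ' "$1" "$2"
}
```

For example, `ethtool -S eth2 > before.txt`, wait a few seconds, `ethtool -S eth2 > after.txt`, then `counter_delta before.txt after.txt`. For the two listings above this shows rx_packets +38 and rx_broadcast +35, while rx_crc_errors and IEEE_rx_macerr do not appear because they did not change.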

Any clues as to what is happening here? The setup is really nothing special, and I have successfully tested a number of other network devices with it.

Thanks and with best regards

Peter

mdecandia · ‎10-12-2017

Hi all,

Any update on this issue? We are facing the same behaviour.

Michele

cyrilstrejc · ‎05-02-2017

Peter,

I also observed the same behaviour as you described (one of the Ethernet interfaces stopped working). The hardware was an LS1012A-RDB. I do not remember the software version, so I can't post a reasonable report. I'm just writing to let you know you are not alone with the issue. I used netperf to drive a bidirectional test.

bpe · ‎04-28-2017

Two questions:

1. Did you try your test with the most recent QorIQ SDK (1701) in the
   default build configuration for your target? It is fresher than
   the BSP you are working with and does support your board.

2. CRC error counts in your ethtool outputs are quite high, which is
   not normal. Did you try diagnosing your network, or testing against
   other equipment? Can you reproduce the lockup with lower CRC
   error counts?

Basically, we test the SDK for stability on supported targets, and
even less noticeable issues are logged, like QLINUX-5841 (see the
most recent SDK documentation, Table 13). There must be something
specific to your board, software configuration or test setup that
leads to the interface lockups.

Have a great day,
Platon


petervollmer · ‎04-28-2017

> 1. Did you try your test with the most recent QorIQ SDK (1701) in the
> default build configuration for your target? It is fresher than
> the BSP you are working with and does support your board.

Thanks for your input. Before doing anything else I will build the most recent QorIQ SDK 2.0 (1701) image and repeat my throughput test. I'll post my results here.

Thanks and with best regards

Peter

petervollmer · ‎05-02-2017

I have now repeated my tests with the most recent prebuilt image included in the QorIQ SDK 2.0 (1703) update (images/ls1012ardb/kernel-fsl-ls1012a-rdb-20170329204856.itb). Here is what I found:

First, there is still a warning about an incompatible PFE firmware version included in the image:

[ 5.473645] pe_load_ddr_section: load address(3fb0000) and elf file address(ffff00000039b000) rcvd
[ 5.507073] PFE binary version: pfe_ls1012a_00_3-3-g1fa4da1-dirty
[ 5.513178] pfe_firmware_init: class firmware loaded 0xa60 0xc3010000
[ 5.519635] pfe_load_elf
[ 5.523185] WARNING: PFE firmware binaries from incompatible version
[ 5.529558] pfe_firmware_init: tmu firmware loaded 0x200

The throughput test still fails with the same behaviour. However, this time I noticed some peculiarities while watching the output of "ip neigh" during the test. At the beginning, all 8 IP addresses involved in the test are listed correctly with their MAC addresses, as configured for the smartbits tester.

172.18.10.1 dev eth0 lladdr 00:00:01:00:00:01 DELAY
172.18.11.1 dev eth0 lladdr 00:00:02:00:00:01 DELAY
172.18.12.1 dev eth0 lladdr 00:00:03:00:00:01 DELAY
172.18.13.1 dev eth0 lladdr 00:00:04:00:00:01 DELAY
172.19.10.1 dev eth1 lladdr 00:00:05:00:00:01 DELAY
172.19.11.1 dev eth1 lladdr 00:00:06:00:00:01 DELAY
172.19.12.1 dev eth1 lladdr 00:00:07:00:00:01 DELAY
172.19.13.1 dev eth1 lladdr 00:00:08:00:00:01 DELAY

However, shortly after that some MAC addresses seem to get garbled:

172.19.11.1 dev eth1 lladdr 00:00:06:00:00:01 REACHABLE
172.19.13.1 dev eth1 lladdr 22:33:44:07:00:00 REACHABLE
172.19.10.1 dev eth1 lladdr 00:00:05:00:00:01 REACHABLE
172.19.12.1 dev eth1 lladdr 22:33:44:06:00:00 REACHABLE
172.18.10.1 dev eth0 lladdr 00:00:01:00:00:01 REACHABLE
172.18.11.1 dev eth0 lladdr 00:00:02:00:00:01 REACHABLE
172.18.12.1 dev eth0 lladdr 00:00:03:00:00:01 REACHABLE
172.18.13.1 dev eth0 lladdr 22:33:44:05:00:00 REACHABLE
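
One way to catch such changes automatically is to compare two saved neighbour-table snapshots and report every address whose MAC differs. The following is a sketch under the assumption that both snapshots come from `ip neigh` in the format shown above; `neigh_mac_changes` and the file names are hypothetical.

```sh
#!/bin/sh
# neigh_mac_changes BEFORE_FILE AFTER_FILE
# Each file holds `ip neigh` output. Prints "ip old_mac -> new_mac"
# for every address whose lladdr differs between the two snapshots.
neigh_mac_changes() {
    awk '
        # return the field that follows "lladdr", or "" if there is none
        function mac(   i) {
            for (i = 1; i < NF; i++) if ($i == "lladdr") return $(i + 1)
            return ""
        }
        NR == FNR { old[$1] = mac(); next }      # first file: remember MACs
        {
            new = mac()
            if (($1 in old) && old[$1] != "" && new != "" && new != old[$1])
                print $1, old[$1], "->", new
        }
    ' "$1" "$2"
}
```

A possible use is `ip neigh > before.txt` at the start of the test, `ip neigh > after.txt` a little later, then `neigh_mac_changes before.txt after.txt`. Applied to the two listings above, it flags 172.19.13.1, 172.19.12.1 and 172.18.13.1, the three entries that now carry a 22:33:44 prefix.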

I am pretty sure that the MACs starting with 22:33:44:-... are not sent out by the smartbits.

> Did you try diagnosing your network, or testing against
> other equipment? Can you reproduce the lockup with lower CRC
> error counts?

I did not configure the smartbits tester to produce any CRC errors. I cannot rule out that my test setup itself produces some CRC errors, but the count seems really high to me as well, compared to other tests I have run. The setup is used regularly to do throughput tests with other networking equipment.

With best regards,

Peter

laodzu · ‎05-16-2017

Hi Peter,

You should check whether you have an early revision of the RDB board. According to the errata, revisions before Rev D have clocking problems on the Ethernet path:

https://www.nxp.com/webapp/Download?colCode=LS1012ARDBE&Parent_nodeId=1462294874819702554554&Parent_...

E-00001 explicitly says that this leads to excessive CRC errors, just as you are seeing.

Best wishes

Detlev
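
To put a number on "excessive", the CRC error share can be computed from the ethtool counters posted earlier in the thread. This is a rough sketch: `crc_error_percent` is a hypothetical helper, it relies on the counter names `rx_packets` and `rx_crc_errors` from the output above, and it assumes `rx_packets` is a sensible denominator (whether that counter includes errored frames is not stated in the thread).

```sh
#!/bin/sh
# crc_error_percent: reads `ethtool -S <dev>` output on stdin and prints
# rx_crc_errors as a percentage of rx_packets, or "n/a" if rx_packets is 0.
crc_error_percent() {
    awk -F': *' '
        { gsub(/^[ \t]+/, "", $1) }              # strip leading indentation
        $1 == "rx_packets"    { pkts = $2 }
        $1 == "rx_crc_errors" { crc = $2 }
        END {
            if (pkts > 0) printf "%.2f\n", crc * 100 / pkts
            else print "n/a"
        }'
}
```

Run as `ethtool -S eth2 | crc_error_percent`. With rx_packets 5575508 and rx_crc_errors 102296 from the first snapshot, it prints 1.83.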
