I am using a T4240RDB as a network router with 10G ports (Linux 3.12 from SDK 1.7).
With an ingress bandwidth of ~9.6 Gbps, the T4240RDB's egress throughput is ~9.2 Gbps (measured on the T4240 with the dstat command and also at the next-hop router). CPU utilization is 17% on each core.
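For anyone who wants to cross-check the dstat figure, here is a minimal Python sketch that derives egress throughput from the TX byte counter in /proc/net/dev. It is not how the numbers above were produced. The interface name passed in is a placeholder, and the helper names (parse_tx_bytes, gbps, sample_tx_gbps) are my own, not part of any tool.

```python
import time


def parse_tx_bytes(proc_net_dev_text, iface):
    """Return the TX bytes counter for `iface` from /proc/net/dev contents."""
    for line in proc_net_dev_text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, counters = line.split(":", 1)
        if name.strip() == iface:
            fields = counters.split()
            # 8 receive counters come first; the 9th field is TX bytes.
            return int(fields[8])
    raise ValueError("interface not found: " + iface)


def gbps(bytes_before, bytes_after, interval_s):
    """Convert a byte-counter delta over `interval_s` seconds to Gbit/s."""
    return (bytes_after - bytes_before) * 8 / interval_s / 1e9


def sample_tx_gbps(iface, interval_s=1.0):
    """Sample /proc/net/dev twice and return the egress rate in Gbit/s."""
    with open("/proc/net/dev") as f:
        before = parse_tx_bytes(f.read(), iface)
    time.sleep(interval_s)
    with open("/proc/net/dev") as f:
        after = parse_tx_bytes(f.read(), iface)
    return gbps(before, after, interval_s)
```

Calling sample_tx_gbps("eth0") on the router (with "eth0" replaced by the actual egress interface) should give a figure comparable to what dstat reports.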
In our experiment, we need to create GRE tunnels from the T4240RDB to a remote router. With the GRE tunnels created, the egress throughput drops drastically to < 1 Gbps. The htop output shows all 24 cores at 100% utilization. The output of the 'perf top' command indicates that the CPU cycles are spent in _raw_spin_lock, following dpa_tx. Screenshots of the htop and perf top output are attached.
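For readers who want to reproduce the setup, a GRE tunnel of this kind could be created roughly as in the sketch below. It only wraps the standard iproute2 commands (ip tunnel add, ip link set, ip addr add). The tunnel name and all addresses are placeholders rather than the values from our experiment, and the helper names are hypothetical.

```python
import subprocess


def gre_tunnel_commands(name, local, remote, tunnel_addr, ttl=255):
    """Build the iproute2 commands that create and bring up one GRE tunnel."""
    return [
        # Create the GRE tunnel device between the two outer endpoints.
        ["ip", "tunnel", "add", name, "mode", "gre",
         "local", local, "remote", remote, "ttl", str(ttl)],
        # Bring the tunnel interface up.
        ["ip", "link", "set", name, "up"],
        # Assign the inner (tunnel) address.
        ["ip", "addr", "add", tunnel_addr, "dev", name],
    ]


def apply_commands(commands):
    """Run each command in order (needs root); stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Placeholder values for illustration only.
example = gre_tunnel_commands("gre1", "192.0.2.1", "192.0.2.2", "10.0.0.1/30")
```

Passing `example` to apply_commands as root would create the tunnel; traffic routed through gre1 is then GRE-encapsulated in software on the CPU cores.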
The ip_gre and ip_tunnel code in the Linux kernel does not use spin locks. Moreover, the same GRE experiment on a router running on an i7 machine with the ixgbe driver for Intel 10G cards gave an egress throughput of ~7 Gbps.
Hence it looks like dpa_tx is causing issues when it schedules packets at 10 Gbps to the CPUs for GRE encapsulation.
1) What could be the reason for this behavior, and what would be a possible workaround?
2) Can DPAA be disabled entirely, to make the box behave like the i7 machine, while still keeping all 12 network interfaces available? (Disabling CONFIG_DPA_ETH removed all the network interfaces from the kernel.)
Thanks in advance for the help.