Hi,
I am using a T4240RDB as a network router with 10G ports (Linux 3.12 from SDK 1.7).
With an ingress bandwidth of ~9.6 Gbps, the T4240RDB router's egress throughput is ~9.2 Gbps (measured on the T4240 with the dstat command and confirmed at the next-hop router). CPU utilization is 17% on each core.
In our experiment, we need to create GRE tunnels from the T4240RDB to a remote router. With the GRE tunnels created, egress throughput drops drastically to < 1 Gbps. The htop output shows all 24 cores at 100% utilization, and the output of the 'perf top' command indicates that the CPU cycles are spent in _raw_spin_lock, following dpa_tx. Screenshots of the htop and perf top output are attached.
The ip_gre and ip_tunnel code in the Linux kernel does not use spin locks. Moreover, the same GRE experiment with a router running on an i7 machine, using the ixgbe driver for Intel 10G cards, gave an egress throughput of ~7 Gbps.
Hence it looks like dpa_tx is causing issues when it schedules packets at 10 Gbps to the CPUs for GRE encapsulation.
1) What could be the reason for this behavior, and what is a possible workaround?
2) Can DPAA be disabled entirely to make the box behave like the i7 machine, while still keeping all 12 network interfaces available? (Disabling CONFIG_DPA_ETH removed all the network interfaces from the kernel.)
Thanks in advance for the help.
Sareena.
The links given in the answer to the question are not accessible now.
Were these links given only temporary access?
Can you please check, or provide an alternative solution?
Thanks.
The profiling tool output you attached doesn't show actual calling
sequences, only the time the CPU(s) spend in different functions. For this
reason, it is not possible to establish where the spinlocks belong and
how, if at all, they are related to DPAA code. The DPAA drivers do not
directly interact with ip_gre or ip_tunnel; the interaction with the Linux
IP stack is at start_xmit() on transmit and netif_receive_skb() on
receive. Details can be found here:
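To make that boundary concrete, here is a minimal sketch (illustrative names, not the actual dpaa_eth sources) of where any Linux Ethernet driver, DPAA included, meets the IP stack: the stack hands down fully encapsulated skbs through ndo_start_xmit(), and the driver hands received frames up through netif_receive_skb().

#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/skbuff.h>

/* TX side: by the time this runs, ip_gre/ip_tunnel have already built
 * the GRE headers; the driver only sees a finished frame. */
static netdev_tx_t demo_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	/* hand the frame to the hardware transmit queues here */
	dev_kfree_skb_any(skb);
	return NETDEV_TX_OK;
}

/* Assigned to dev->netdev_ops when the device is registered. */
static const struct net_device_ops demo_netdev_ops = {
	.ndo_start_xmit = demo_start_xmit,
};

/* RX side: everything above this call (ip_tunnel, ip_gre, routing)
 * is generic kernel code, not driver code. */
static void demo_rx(struct net_device *dev, struct sk_buff *skb)
{
	skb->protocol = eth_type_trans(skb, dev);
	netif_receive_skb(skb);
}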
If you are benchmarking a packet-forwarding system that performs
flow-based encapsulation/decapsulation, results can be suboptimal
due to inappropriate packet distribution and the subsequent interlocking
between cores handling packets from the same flow.
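To illustrate that interlocking (a toy sketch, not taken from the kernel or the DPAA drivers): if a flow's state is protected by a single lock, spreading that flow's packets across all 24 cores only makes the cores serialize on the lock, which shows up in a profile as time spent in _raw_spin_lock.

#include <linux/spinlock.h>
#include <linux/types.h>

struct flow_state {
	spinlock_t lock;   /* one lock shared by every packet of this flow */
	u64 tx_packets;
};

static void encap_packet(struct flow_state *f)
{
	spin_lock(&f->lock);   /* all cores handling this flow wait here */
	f->tx_packets++;       /* per-flow counters, sequence numbers, etc. */
	spin_unlock(&f->lock);
}

Steering all packets of a flow to a single core, which is what the FMC distribution configuration in suggestion 2 controls, avoids this contention.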
Suggestions:
1. Study the optimization recommendations for the various use cases:
2. Try following the FMC configuration suggestions from the benchmarking
reproducibility guide for a similar, though different, type of tunnel test:
https://freescale.sdlproducts.com/LiveContent/content/en-US/QorIQ_SDK/GUID-3F158B7B-66EB-4D6D-BE1A-F...
3. Identify the module/routine that takes the spinlock (a sketch of one way to do this with perf follows this list).
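For point 3, assuming perf is present on the target and the kernel was built with the unwind information perf needs (an assumption; check your SDK kernel configuration), system-wide call-graph sampling along these lines should reveal the callers of _raw_spin_lock:

  perf record -a -g -- sleep 10   # sample all CPUs for 10 s while traffic is running
  perf report --stdio             # expand the call chains under _raw_spin_lock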
There is no way to exchange traffic with the network interfaces while bypassing DPAA.
The links are not accessible. Can you please help?
Access fails with the following message.
Thanks for the reply.
In an attempt to identify the routine that takes the spinlock, I used the perf tool.
But on the T4240, since this is kernel code, I was unable to get a kernel call trace during the workload while the system is at low performance.
Can you suggest any profiling tool that will enable me to do this?
Thanks,
Sareena.