LS1021A custom DSA driver performance issues

harrisonnoja · 05-21-2026

Hi NXP Support Team,

Current kernel version: NXP QorIQ 5.15

We are developing a custom Distributed Switch Architecture (DSA) driver for an external Ethernet switch IC (Analog Devices ADIN6310) directly connected to the LS1021A processor over an SGMII MAC-to-MAC link (eth1). The switch IC appends a custom 6-byte tail-tag to each frame in hardware, which our driver strips on the receive path via __skb_trim() before routing the frame to virtual network interfaces.
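For readers unfamiliar with tail-tagging, the strip step described above can be modelled in user space. This is only a sketch under stated assumptions: struct frame is a hypothetical stand-in for an sk_buff, strip_tail_tag is an illustrative name, and the tag contents are treated as opaque because the post does not describe the ADIN6310 tag layout. It is not the driver's actual trailer_rcv code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TAIL_TAG_LEN 6 /* tag length stated in the post */

/* Hypothetical stand-in for an sk_buff: a data pointer plus a length. */
struct frame {
    uint8_t *data;
    size_t len;
};

/*
 * Copy the trailing tag into tag_out and shrink the frame by TAIL_TAG_LEN.
 * Returns 0 on success, -1 if the frame is too short to hold a tag.
 * No payload bytes are moved; only the recorded length changes.
 */
static int strip_tail_tag(struct frame *f, uint8_t tag_out[TAIL_TAG_LEN])
{
    assert(f != NULL && f->data != NULL);

    if (f->len < TAIL_TAG_LEN)
        return -1;

    /* The tag occupies the last TAIL_TAG_LEN bytes of the frame. */
    memcpy(tag_out, f->data + f->len - TAIL_TAG_LEN, TAIL_TAG_LEN);
    f->len -= TAIL_TAG_LEN;
    return 0;
}
```

The model assumes the whole frame sits in one contiguous buffer. In the kernel that is not guaranteed for every skb, so it may be worth comparing the custom hook with the in-tree net/dsa/tag_trailer.c to see how it handles non-linear buffers and checksum state.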

1. Testing Outline
Our architecture operates reliably under standard operational loads (SSH, DHCP, and control frame routing function perfectly). To test the limits of the driver under heavy network pressure, we are running benchmarks using iperf3.
To isolate hardware variables, the ADIN6310 switch IC was previously validated using its dev board between two devices running LS1021A processors. In this test, the bitrates hit maximum wire limits and the TCP retransmission metrics remained consistently clean, proving the switch hardware and the standalone tagger logic are functionally sound.

2. The Core Issue We Are Experiencing
When executing the iperf3 test on the LS1021A platform, the remote sender logs a continuous storm of over 5,000 TCP retransmissions.
During these high-speed tests, we observe that the eTSEC rx-oversize-packets counter climbs rapidly by tens of thousands of counts. We isolated and fixed this counter increase by reducing the MTU on the remote sending side to 1400 bytes. While the rx-oversize-packets counter no longer increments under this test, the core issue of 5,000+ TCP retransmissions remains unchanged.
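The MTU observation is consistent with simple frame-size arithmetic, sketched below. The 14-byte Ethernet header and 4-byte FCS are standard; the 6-byte tag comes from the post. Two things are assumptions: that the tail tag counts toward the frame length seen by the MAC, and that the counter's threshold is the standard 1518-byte maximum untagged frame size. The thread confirms neither, and the function names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ETH_HEADER_LEN     14   /* dst MAC + src MAC + EtherType */
#define ETH_FCS_LEN         4   /* frame check sequence */
#define TAIL_TAG_LEN        6   /* tag length stated in the post */
#define MAX_UNTAGGED_FRAME 1518 /* assumed oversize threshold */

/* On-wire length of a full-MTU frame after the switch appends its tag. */
static size_t tagged_frame_len(size_t mtu)
{
    assert(mtu > 0);
    return ETH_HEADER_LEN + mtu + TAIL_TAG_LEN + ETH_FCS_LEN;
}

/* True if a full-MTU tagged frame is longer than the assumed threshold. */
static bool exceeds_untagged_max(size_t mtu)
{
    return tagged_frame_len(mtu) > MAX_UNTAGGED_FRAME;
}
```

Under these assumptions a 1500-byte MTU gives 1524 bytes on the CPU port, which is over 1518, while a 1400-byte MTU gives 1424 bytes, which is under it. That matches the counter going quiet at MTU 1400, and it also suggests the oversize counter and the retransmissions are separate problems.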

3. Temporary Workarounds (Undesirable for Production)
We have reduced the retransmissions to zero in only two specific scenarios, both of which are too restrictive for our production requirements:
Strict Core Pinning: Forcing the user-space iperf3 server strictly onto Core 0 (taskset -c 0) drops retransmissions to zero. This occurs because overloading Core 0 shrinks the TCP Receive Window, forcing the sender to slow down, which artificially hides the race condition at the cost of processing capacity and bandwidth.
Hardware Queue Group Removal: Commenting out the secondary hardware queue group node directly within the Device Tree file completely forces single-queue serialization on Core 0 and eliminates the retransmissions. However, this strips our system of multi-core network scaling and creates a permanent performance bottleneck.

In attempting to resolve the performance breakdown on this LS1021A target platform, we have executed an exhaustive matrix of low-level software and driver optimizations, including:

Queue Size Adjustments: Significantly increased both the rx and tx ring descriptor buffer sizes directly within the gianfar driver source code.
Process & Thread Prioritization: Manually elevated the real-time scheduling priorities of both the eth1 threaded IRQ handlers and the user-space testing processes.
Affinity Steering: Attempted to isolate execution boundaries by shifting the CPU affinity matrices of the application threads and eth1 registers across the cores.
Tagger Optimization: Refactored and streamlined our custom DSA receive hook function trailer_rcv to ensure zero-copy payload handling and minimal processing execution latency.

4. Assistance Requested
I am reaching out to see if you can provide any assistance on this issue. Is there anything I can look at to improve the reliability of this network driver? I have also verified that the enet interface definitions in the device tree contain dma-coherent. Are there any driver-level tracking or pipeline variables within the gianfar NAPI group ring context that we can adjust to enforce a strict chronological synchronization barrier when routing custom L2 tail-tagged DSA frames over split dual-core architectures?

Thank you for your assistance.

Bio_TICFSL · ‎05-22-2026

Hello,

Yes, there are a few driver/hardware knobs worth checking on LS1021A, but I could not find any “strict chronological synchronization barrier” in gianfar/NAPI that would serialize cross-core DSA RX processing for your custom tail-tagged traffic. The available documentation supports a different conclusion: LS1021A eTSEC/gianfar is designed to distribute Rx/Tx work across two interrupt groups/CPUs, and packet ordering/coherency is expected to be managed by queue steering, interrupt affinity, and software design, not by a special ring-context ordering barrier in gianfar.

I could not verify a gianfar “strict chronological synchronization barrier”; the documented levers are frame-size configuration, queue-group/IRQ affinity, interrupt coalescing, and Rx ring/free-BD management on LS1021A eTSEC.
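As a toy illustration of why queue steering matters for ordering: if the receive queue is chosen from a hash of the flow's addresses and ports, every segment of one TCP connection lands in the same queue and is handled by one core, so per-flow order is kept. This is a generic sketch, not gianfar's or the eTSEC filer's actual logic; struct flow_key, pick_rx_queue, the FNV-1a hash and the queue count are all hypothetical choices for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flow identifier: the usual TCP/IPv4 4-tuple. */
struct flow_key {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t src_port;
    uint16_t dst_port;
};

/* Fold n bytes into a running 32-bit FNV-1a hash. */
static uint32_t fnv1a_step(uint32_t h, const void *p, size_t n)
{
    const uint8_t *b = p;

    for (size_t i = 0; i < n; i++) {
        h ^= b[i];
        h *= 16777619u; /* FNV prime */
    }
    return h;
}

/*
 * Map a flow to one of num_queues receive queues. Fields are hashed
 * one at a time so struct padding never influences the result.
 */
static unsigned pick_rx_queue(const struct flow_key *k, unsigned num_queues)
{
    uint32_t h = 2166136261u; /* FNV offset basis */

    assert(k != NULL && num_queues > 0);
    h = fnv1a_step(h, &k->src_ip, sizeof k->src_ip);
    h = fnv1a_step(h, &k->dst_ip, sizeof k->dst_ip);
    h = fnv1a_step(h, &k->src_port, sizeof k->src_port);
    h = fnv1a_step(h, &k->dst_port, sizeof k->dst_port);
    return h % num_queues;
}
```

The same key always maps to the same queue, which is the property that keeps a single iperf3 stream in order. If frames of one flow are instead spread across both queue groups, the two cores can complete them out of order, and the sender may see that as loss.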

Regards