Hi Eile,
Thanks for your proposal. I have already checked the send syscalls on the socket, for both TCP and UDP, with the strace tool, and I increased the send buffer size to eliminate the impact of syscall delay. But I have run into another problem: there seems to be a bottleneck in the hardware, specifically on the PCI-E bus. The best throughput I can reach in the TCP test is only 425 Mbps with my high-speed network device, which sits on the PCI-E bus.
My suspicion that the PCI-E bus is the bottleneck is based on the following reasons.
1) I use the NAPI mechanism to reduce possible interrupt overload on the system, so I don't think interrupt overload is the key point.
2) I run the same test on the loopback device and on my device at the same time, to eliminate differences in task scheduling. The best TCP throughput I can reach on the loopback device is 1 Gbps, so kernel network stack handling is not the cause of this issue either.
3) My driver already offloads all CRC checks to the hardware, just as the loopback device does. So, apart from the transmission itself, all packet handling is the same as for loopback. In the end, it seems that the DMA transfer between the host board and my device is the bottleneck.
Have you tried a performance test of the PCI-E bus? What is the best throughput you can get? If you have, please share it with me.