The performance for sending udp packet

5,847 Views
qiancui
Contributor I

Hardware Platform: SABRE SD

Software: Android 4.4.2, Kernel: 3.0.35

I want to develop a high-speed wireless card driver on this platform.

At first I want to check the maximum UDP sending performance on this platform, so I made a simple socket application in C that just sends 7900-byte UDP packets from a single thread as fast as possible through the loopback device. But I found that the throughput only reaches about 800 Mbps, while CPU utilization is just 25%.

My question is: how can I increase the UDP sending performance in a single thread?
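
For reference, a minimal sketch of the kind of test program I mean (not the exact code; error handling trimmed, port 5001 is arbitrary):

#include <arpa/inet.h>   /* inet_addr, htons */
#include <netinet/in.h>  /* sockaddr_in */
#include <string.h>      /* memset */
#include <sys/socket.h>  /* socket, connect, send */

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in dst;
    static char payload[7900];   /* 7900-byte datagram, contents don't matter */

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(5001);                    /* arbitrary test port */
    dst.sin_addr.s_addr = inet_addr("127.0.0.1");  /* loopback */

    if (fd < 0 || connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0)
        return 1;

    for (;;)                                       /* send as fast as possible */
        send(fd, payload, sizeof(payload), 0);
}

The receiver is just another socket bound to the same port that reads and discards the datagrams.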

10 Replies

3,561 Views
eliedebrauwer
Contributor II

Hello Qian,

On a BoundaryDevices6x board with an i.MX6Q running Ubuntu 14.04 I installed iperf and ran "iperf -u -s -i 1" and "iperf -u -c 127.0.0.1 -i 1 -b 2G -l 8000 -t 100". Basically this sends 8000-byte UDP packets over loopback as fast as possible. The only modification I made to my system was to enlarge the socket buffers (see /proc/sys/net/core/, the files rmem_max, rmem_default, wmem_max and wmem_default; I put a size of 10 megabytes in there to rule out their effect). When running this test the generating iperf consumes a full core and is able to generate about 710 Mbit/s, while the receiving thread is able to accept this data using only half a core.
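
If you prefer to enlarge the buffers per socket instead of system-wide, something along these lines should work (note the requested size is still capped by rmem_max/wmem_max, so those still have to be raised first):

#include <stdio.h>
#include <sys/socket.h>

/* Sketch: request 10 MB send and receive buffers on an existing UDP
 * socket 'fd'. The kernel caps the request at net.core.wmem_max /
 * net.core.rmem_max, so read the option back to see what you really got. */
static void enlarge_socket_buffers(int fd)
{
    int size = 10 * 1024 * 1024;
    socklen_t len = sizeof(size);

    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size));
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));

    getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, &len);
    printf("effective SO_SNDBUF: %d bytes\n", size);
}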

Could it be that the reported CPU usage on your system actually means that 1 out of 4 cores (when using a quad) is fully occupied, i.e. your sending thread?

But in my case it is the iperf sender that is the bottleneck. You could consider sending the data in a more intelligent way, e.g. by doing fewer system calls (see man sendmmsg() and the sketch below) or by tightening your sending loop.
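
A rough sketch of what a sendmmsg()-based loop could look like (untested; assumes a connected UDP socket and that your libc exposes the wrapper, otherwise it can be reached through syscall(2)):

#define _GNU_SOURCE           /* for sendmmsg() */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH    16           /* arbitrary batch size */
#define PKT_SIZE 7900         /* same datagram size as your single-send test */

/* Sketch: hand BATCH datagrams to the kernel in one system call on a
 * connected UDP socket 'fd'. Returns how many were actually sent, or -1. */
static int send_batch(int fd, char payload[BATCH][PKT_SIZE])
{
    struct mmsghdr msgs[BATCH];
    struct iovec   iovs[BATCH];

    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < BATCH; i++) {
        iovs[i].iov_base = payload[i];
        iovs[i].iov_len  = PKT_SIZE;
        msgs[i].msg_hdr.msg_iov    = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    return sendmmsg(fd, msgs, BATCH, 0);
}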

The comments made by Igor relate to UDP performance when using the onboard MAC and do not apply when operating over loopback.

hth.


3,561 Views
qiancui
Contributor I

Hi, Elie

Thx for your reply.

I ran my UDP test in a single thread, both with my device and with loopback. When using my device, the CPU occupancy is only about 25% of one CPU, so I think the key point is that each sendmsg() call has a big delay. When I create more sending threads on different CPUs, the performance increases, but it only reaches about 980 Mbps, so it is strange that four cores cannot do four times what one core does. The TCP numbers are much poorer, only about 470 Mbps; maybe the per-call delay of send() is even bigger there.

PS: my driver already offloads all the CRC checking to hardware.


3,561 Views
eliedebrauwer
Contributor II

Hello Qian,

Obviously doing a system call implies some overhead. What you could attempt is to simply strace your application; strace has the -T parameter, which shows the time spent in each system call. That way you can calculate how many sendmsg() calls you can do per second (and hence your rate). To speed this up you could then look at sendmmsg(), which allows the transmission of multiple datagrams through a single system call.
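
As a worked example with made-up numbers: if strace -T showed each sendmsg() taking around 70 µs, a single thread would top out near 14,000 calls per second, and at 7900 bytes per datagram that is roughly 14,000 × 7900 × 8 ≈ 885 Mbit/s, which is in the ballpark of what you measured.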

That said, I would expect you to be able to saturate that one CPU. Are you sure you are counting both the time spent in kernel space and in user space for that CPU?


3,561 Views
qiancui
Contributor I

Hi, Elie.

Thx for your proposal. I have checked the send syscalls for both TCP and UDP with the strace tool, and I increased the send buffer size to eliminate the impact of the per-syscall delay. But I have run into another problem: there seems to be a bottleneck in the hardware, specifically on the PCI-E bus. The maximum performance I can reach is only 425 Mbps in the TCP test with my high-speed network device, which sits on the PCI-E bus.

My suspicion of a PCI-E bus bottleneck is based on the following reasons:

1) I use the NAPI mechanism to reduce the interrupt load on the system, so I don't think interrupt overhead is the key point.

2) I ran the same test against the loopback device and against my device at the same time, to eliminate differences in task scheduling. The best performance I can reach on the loopback device is 1 Gbps in TCP, so the kernel network stack is not the cause of this issue either.

3) My driver already offloads all CRC checks to the hardware, just like the loopback device does, so apart from the transmission itself the packet handling is the same as for loopback. In the end it seems that the DMA transfer between the board and my device is the bottleneck.

Have you tried a performance test of the PCI-E bus? What is the best throughput you can get? If you have such numbers, please share them with me.


3,562 Views
igorpadykov
NXP Employee

Hi Qian

You can try the kernel boot parameter enable_wait_mode=off.
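
If your board boots through U-Boot and the kernel command line is taken from the bootargs variable (this depends on your boot setup), the parameter can be appended at the U-Boot prompt, for example:

setenv bootargs ${bootargs} enable_wait_mode=off
saveenv
boot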

Also see the chip errata document IMX6DQCE, ERR004512 "ENET: 1 Gb Ethernet MAC (ENET) system limitation".

Best regards

igor


3,562 Views
qiancui
Contributor I

Thx very much for your reply.

I have now changed some kernel parameters to enlarge both the TCP and UDP buffers, and the maximum speed I can get is 470 Mbps for TCP and 700 Mbps for UDP (almost the same as the loopback device). I did not set the kernel boot parameter enable_wait_mode=off, because it seems to affect only the ENET interface, and I am using my own network card, so I don't think that hardware is the bottleneck (my card can reach 2.8 Gbps UDP and 1.0 Gbps TCP in a PC). So I suspect there is some limitation in the Freescale hardware or software that prevents going much faster. My questions are:


1) Is there any other way to improve the network transmission performance?

2) Do you have maximum-performance test results for this platform? If so, please tell me your results, so that I can continue to optimize my design.


3,562 Views
igorpadykov
NXP Employee

Hi Qian

The parameter enable_wait_mode is described in i.MX_6Dual6Quad_SABRE-SD_Linux_Release_Notes.pdf, Table 6 "Kernel Boot Parameters", in the L3.0.35_4.1.0_LINUX_DOCS package.

Best regards

igor


3,562 Views
qiancui
Contributor I

Thx for your reply.

Now I know how to set the kernel parameter. Can you also answer the other two questions from my previous post about the maximum performance?

""

And now I changed some parameter to larger both the TCP and UDP buffer in the kernel, and now I can get the maximum speed of TCP is 470Mbps, and UDP is 700Mbps(almost as the same as what the loopback dev does). But I didn't change the "kernel boot parameter enable_wait_mode=off", because it seems that it only affects the ENET card, and I used my own net card, so I thought the hardware will not be the bottleneck (my hardware can reach 2.8Gbps UDP and 1.0Gbps TCP in PC). So I suspect maybe there are some limitations in the freescale hardware or software to do much faster. My question is:


1) Is there any other way to promote the performance of the net transmission?

2) Do you have the maximum performance test in this platform? If have, please tell me your result, so that I can continue to optimize my design.

""

0 Kudos
