Hardware Platform: SABRE SD
Software: Android 4.4.2, Kernel: 3.0.35
I want to develop a high-speed wireless card driver on this platform.
At first I want to check the maximum UDP sending performance of this platform, so I wrote a simple socket application in C that sends 7900-byte UDP packets in a single thread as fast as possible over the loopback device. I found that the throughput only reaches about 800 Mbps while CPU utilization is just 25%.
My question is: how can I increase the UDP sending performance in a single thread?
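For reference, a minimal sketch of the kind of single-thread UDP blast loop described above; the 7900-byte payload and the loopback target come from the description, while the port number is only an example (the receiving side, e.g. an iperf UDP server, would measure the rate):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* 7900-byte UDP payload, as in the test described above */
    static char payload[7900];
    memset(payload, 0xAB, sizeof(payload));

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in dst;
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(5001);                   /* example port only */
    dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK); /* 127.0.0.1 */

    /* Send as fast as possible in a single thread; stop with Ctrl-C,
     * the receiving side measures the achieved rate. */
    for (;;) {
        if (sendto(fd, payload, sizeof(payload), 0,
                   (struct sockaddr *)&dst, sizeof(dst)) < 0) {
            perror("sendto");
            break;
        }
    }

    close(fd);
    return 0;
}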
Hello Qian,
On a BoundaryDevices6x board with an i.MX6Q running Ubuntu 14.04 I installed iperf and ran "iperf -u -s -i 1" on one side and "iperf -u -c 127.0.0.1 -i 1 -b 2G -l 8000 -t 100" on the other. Basically this sends 8000-byte UDP packets over loopback as fast as possible. The only modification I made to my system was to enlarge the socket buffers (see /proc/sys/net/core/, the files rmem_max, rmem_default, wmem_max and wmem_default; I put a size of 10 megabytes in there to rule out their effect). When running this test the generating iperf consumes a full core and can generate about 710 Mbit/s, while the receiving thread is able to accept this data using only about half a core.
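As a side note, the same enlargement can also be requested per socket from the sending application; a minimal sketch, assuming the 10 MB figure above (the kernel still caps the effective size at net.core.wmem_max / net.core.rmem_max, so those have to be raised as well):

#include <stdio.h>
#include <sys/socket.h>

/* Ask the kernel for larger per-socket buffers. The effective size is still
 * limited by net.core.wmem_max / net.core.rmem_max. */
static int enlarge_socket_buffers(int fd)
{
    int size = 10 * 1024 * 1024;   /* 10 MB, matching the value used above */

    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size)) < 0) {
        perror("setsockopt(SO_SNDBUF)");
        return -1;
    }
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size)) < 0) {
        perror("setsockopt(SO_RCVBUF)");
        return -1;
    }
    return 0;
}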
Could it be that the reported CPU usage on your system actually means that 1 out of 4 cores (on a quad-core part) is occupied, i.e. your sending thread?
In my case, though, it is the iperf sender that is the bottleneck. You could consider sending the data in a more intelligent way, e.g. by making fewer system calls (see man sendmmsg()) or by tightening your sending loop.
The comments made by Igor relate to UDP performance using the onboard MAC and do not apply when operating over loopback.
hth.
-
Hi, Elie
Thx for your reply.
I run my UDP test in one thread, both over my device and over loopback. When using my device the CPU occupancy is only 25% of one core, so I think the key point is that every sendmsg() call has a big per-call delay. When I create more sending threads on different CPUs the performance increases, but it only reaches 980 Mbps, so it is strange that the quad-core CPU cannot do four times what one core does. The TCP side is much poorer, only 470 Mbps; maybe the delay of the send() call is even bigger.
PS: my driver already offloads all the CRC checking to hardware.
Hello Qian,
Obviously doing a system call implies some overhead. What you could attempt is simply to strace your application; strace has the -T parameter, which shows the time spent in each system call. This way you can calculate how many sendmsg() calls you can do per second (and hence your rate). To speed things up you could then look at sendmmsg(), which allows the transmission of multiple datagrams through a single system call.
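To put numbers on that: at the roughly 800 Mbps and 7900-byte datagrams mentioned earlier, that is on the order of 12,600 sendmsg() calls per second, i.e. about 79 µs per call, syscall overhead included. A minimal sketch of the batched alternative follows; the batch size and helper name are only illustrative, and sendmmsg() needs _GNU_SOURCE and appeared in Linux 3.0, so it should be available on the 3.0.35 kernel mentioned above:

#define _GNU_SOURCE
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH    16
#define PKT_SIZE 7900

/* Send BATCH datagrams to 'dst' with a single system call. Returns the
 * number of datagrams actually sent, or -1 on error. */
static int send_batch(int fd, struct sockaddr_in *dst)
{
    static char payload[BATCH][PKT_SIZE];
    struct iovec iov[BATCH];
    struct mmsghdr msgs[BATCH];

    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < BATCH; i++) {
        iov[i].iov_base = payload[i];
        iov[i].iov_len  = PKT_SIZE;
        msgs[i].msg_hdr.msg_iov     = &iov[i];
        msgs[i].msg_hdr.msg_iovlen  = 1;
        msgs[i].msg_hdr.msg_name    = dst;
        msgs[i].msg_hdr.msg_namelen = sizeof(*dst);
    }

    /* One syscall covers the whole batch instead of BATCH sendto() calls. */
    int sent = sendmmsg(fd, msgs, BATCH, 0);
    if (sent < 0)
        perror("sendmmsg");
    return sent;
}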
That said, I would expect you to be able to saturate that one CPU. Are you sure you are counting both the time spent in kernel space and in user space for that core?
Hi, Elie.
Thx for your proposal. I have already checked the send syscalls for both TCP and UDP sockets with the strace tool, and I increased the sending buffer size to eliminate the impact of the syscall delay. But I ran into another problem: there seems to be a bottleneck in the hardware, specifically on the PCI-E bus. The maximum performance I can reach is only 425 Mbps in the TCP test with my high-speed network device, which sits on the PCI-E bus.
My suspicion of a PCI-E bus bottleneck is based on the following reasons:
1) I use the NAPI mechanism to reduce possible interrupt overhead on the system, so I don't think interrupt load is the key point (a minimal sketch of this pattern follows the list below).
2) I run the same test against the loopback device and my device at the same time to eliminate the impact of task scheduling differences. The best loopback performance I can reach is 1 Gbps in TCP, so the kernel network stack handling is also not the cause of this issue.
3) My driver already offloads all CRC checks to the hardware, just as the loopback device does, so apart from the transmission itself all packet handling is the same as for loopback. In the end it appears that the DMA transfer between the board and my device is the bottleneck.
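For context, a minimal sketch of the NAPI pattern mentioned in point 1). The kernel calls (netif_napi_add, napi_schedule, napi_complete) are the standard ones in 3.0-era kernels; the driver-private structure and the mydev_* hardware helpers are purely hypothetical placeholders:

#include <linux/interrupt.h>
#include <linux/netdevice.h>

/* Hypothetical driver-private state; names are illustrative only. */
struct mydev_priv {
    struct net_device *ndev;
    struct napi_struct napi;
};

/* Hypothetical hardware helpers provided elsewhere in the driver. */
static void mydev_irq_disable(struct mydev_priv *priv);
static void mydev_irq_enable(struct mydev_priv *priv);
static bool mydev_rx_one(struct mydev_priv *priv);

/* Interrupt handler: mask device interrupts and defer the work to NAPI. */
static irqreturn_t mydev_isr(int irq, void *dev_id)
{
    struct mydev_priv *priv = dev_id;

    mydev_irq_disable(priv);
    napi_schedule(&priv->napi);
    return IRQ_HANDLED;
}

/* NAPI poll: process up to 'budget' packets, then re-enable interrupts. */
static int mydev_poll(struct napi_struct *napi, int budget)
{
    struct mydev_priv *priv = container_of(napi, struct mydev_priv, napi);
    int done = 0;

    while (done < budget && mydev_rx_one(priv))
        done++;

    if (done < budget) {
        napi_complete(napi);
        mydev_irq_enable(priv);
    }
    return done;
}

/* In probe(): netif_napi_add(priv->ndev, &priv->napi, mydev_poll, 64); */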
Have you tried a performance test of the PCI-E bus? What is the best performance you can get? If you have, please share it with me.
Thx very much for your reply.
I have now changed some kernel parameters to enlarge both the TCP and UDP buffers, and the maximum speed I can get is 470 Mbps for TCP and 700 Mbps for UDP (almost the same as the loopback device). I did not change the kernel boot parameter enable_wait_mode=off, because it seems to affect only the ENET controller, and I am using my own network card, so I thought the hardware would not be the bottleneck (my hardware can reach 2.8 Gbps UDP and 1.0 Gbps TCP on a PC). So I suspect there are some limitations in the Freescale hardware or software that prevent going much faster. My questions are:
1) Is there any other way to improve the network transmission performance?
2) Have you run a maximum performance test on this platform? If so, please share your result so that I can continue to optimize my design.
Hi Qian
The parameter enable_wait_mode is described in i.MX_6Dual6Quad_SABRE-SD_Linux_Release_Notes.pdf, Table 6, Kernel Boot Parameters.
Best regards
igor
Thx for your reply.
Now I know how to set that kernel parameter. Can you answer my other two questions about the maximum performance?
""
And now I changed some parameter to larger both the TCP and UDP buffer in the kernel, and now I can get the maximum speed of TCP is 470Mbps, and UDP is 700Mbps(almost as the same as what the loopback dev does). But I didn't change the "kernel boot parameter enable_wait_mode=off", because it seems that it only affects the ENET card, and I used my own net card, so I thought the hardware will not be the bottleneck (my hardware can reach 2.8Gbps UDP and 1.0Gbps TCP in PC). So I suspect maybe there are some limitations in the freescale hardware or software to do much faster. My question is:
1) Is there any other way to promote the performance of the net transmission?
2) Do you have the maximum performance test in this platform? If have, please tell me your result, so that I can continue to optimize my design.
""
Hi Qian
Below is some more performance data:
https://community.freescale.com/docs/DOC-98322
https://community.freescale.com/message/336114#336114
~igor