Multithreaded iperf3 #289
iperf3 is not multi-threaded, by design. Do you have any indication that your tests are CPU-bound?
iperf3 is not multi-threaded even if I set the -P option?
This probably explains it.
@wangyoucao577, @blochl: iperf3 is a complete rewrite and shares very little code in common with iperf2. The intended use case was to test high-rate single-stream performance, typical of science workflows on R&E networks. That use case doesn't require multi-threading for parallel streams. @blochl: You still haven't answered my question about whether the single-threaded design of iperf3 actually created a problem for you.
In our test scenario the CPU load is also measured. With iperf3, in contrast to iperf2, the CPU load is constantly ~100%, but on a single CPU, which does not give much indication of the CPU load as a function of, e.g., buffer size. This is the issue.
OK. I understand what you're seeing. I believe that multi-threading iperf3 would be a non-trivial amount of work (the design predates my involvement with this project), although I admit I haven't really thought too much about it.
I've also tried this case. If I use iperf and set the '-P' option, I can find more than one thread. But if I use iperf3 and set the '-P' option, there is only one thread for the client. So the behavior of '-P' also differs between iperf and iperf3. Can I ask why iperf3 gave up the multi-threaded implementation? Is there any difference between using multiple threads and just multiple sockets here?
Folks, the single-threaded behavior of iperf3 is not a mystery, and it doesn't require any investigation on anyone's part. It wasn't designed to be multi-threaded, and the implementation reflects that. I'm not sure exactly why it was designed this way. Maybe someone more closely involved with the initial iperf3 work can shed some light on this (I'll ask around a little more actively when I get back from vacation).
OK. Thanks. I just want to know whether multi-threading is necessary in some situations, or maybe it's just not important.
@wangyoucao577: Well, it is important if you would like to measure the CPU performance as the traffic is being generated. Maybe there are other cases.
But if a single thread hits the (~100%) bottleneck, I think multiple threads will also hit it, won't they?
@wangyoucao577: Well, no. Why would it be that way? If a certain load causes a single thread to use 100%, one can spread it across several threads, and each one will take less, imho.
Why? I can't understand it. Doesn't multi-threading mean more CPU cost, from switching between threads? Why would a single thread take more CPU? In my understanding, in a case like iperf2 each thread takes 20-60%, but the sum over all threads can exceed 100%, can't it? @blochl:
@wangyoucao577: I mean that ~100% of a single core is used. Multiple threads could each have used less time, but on multiple cores. The bottleneck issue is only speculation for now, but for testing the CPU and the network performance in parallel this is undoubtedly important.
The CPU-usage seems to be a bottleneck, yes.
If I fire up another set of iperf3 (different port) at the same time, we clearly see that the CPU is causing a bottleneck (and not the network stack, since we get double the bandwidth running two in parallel).
The other problem with this is that iperf2 used to spin up a connection for each thread, meaning different ephemeral ports. This causes hashing algorithms used in any network functionality to see many connections instead of a few, which can help or hurt performance depending on the implementation. Looks like I can go up to 8 - but not more right now.
I definitely agree that the option to span multiple cores -- which was present in iperf2 -- is useful in soak testing. One should note, however, that iperf2 achieved this only by using multiple client-server connections, with a thread affine to each socket. However, there was some absolutely dog-ugly logic in iperf2's implementation. Basically, it wrapped UDP sockets in its C++ implementation to work a little along the lines of how TCP accept() creates a new socket for an inbound flow. It could be done better, but it does strike me as a significant bit of work for iperf3 in its current incarnation.
Performance issues with the multiple flows used by iperf2 might have been related to the lack of bottom-up affinity, or possibly even cache-line effects. iperf2 wasn't aware of RSS or other mechanisms, and as far as I know, the only really portable way to pin socket workloads is by learning the core/thread topology using something like hwloc, and then using the appropriate setsockopt()/platform CPU-pinning APIs. The PCB hash[es] in the TCP/IP stacks themselves are usually pretty performant; unless you were cycling new connections, the hash management itself might not be a bump. I've whined about UDP in iperf2, but one advantage of its approach was that the socket each thread (per sub-flow in the measurement session) was using could explicitly bind() and 'listen()' (wrapped) on each ephemeral port, instead of using recvfrom() directly. That might have a modest cache benefit.
The single-threaded design is also a massive problem on embedded systems, like routers. I have a dual-core MIPS-based router that can do 4 threads (DIR-860L), and I see that iperf3 is maxing out a single core and not reaching gigabit speed. It is definitely limited by the single-threaded design.
I suggest you use iperf2 instead.
iperf(2) creates a thread per stream; iperf3, however, uses a single thread to send multiple streams.
Hi,
I've been using Iperf3 (latest code from GitHub) on a machine with 4 cores (8 threads) running Fedora 21.
I have noticed that even when testing with multiple streams (-P option) just two threads are used. The same, most basic, multistream test with Iperf2 utilizes all the cores.
Am I missing something? Any thoughts why it may behave this way?
Regards,
Leonid.