
Multithreaded iperf3 #289

Closed
blochl opened this issue Aug 4, 2015 · 21 comments

@blochl

blochl commented Aug 4, 2015

Hi,

I've been using iperf3 (latest code from GitHub) on a machine with 4 cores (8 threads) running Fedora 21.
I have noticed that even when testing with multiple streams (the -P option), just two threads are used. The same basic multi-stream test with iperf2 utilizes all the cores.
Am I missing something? Any thoughts on why it may behave this way?

Regards,
Leonid.

@bmah888
Contributor

bmah888 commented Aug 4, 2015

iperf3 is not multi-threaded, by design. Do you have any indication that your tests are CPU-bound?

@bmah888 bmah888 self-assigned this Aug 4, 2015
@wangyoucao577

iperf3 is not multi-threaded even if the -P option is set?

@blochl
Author

blochl commented Aug 5, 2015

This probably explains it.
+1 for wangyoucao577's question.
And I am curious: why is that so? iperf2 is multi-threaded, after all.

@bmah888
Contributor

bmah888 commented Aug 6, 2015

@wangyoucao577 , @blochl : iperf3 is a complete rewrite and shares very little code in common with iperf2. The intended use case was testing high-rate single-stream performance, typical of science workflows on R&E networks. That use case doesn't require multiple threads for parallel streams.

@blochl : You didn't answer my question yet about whether the single-threaded design of iperf3 actually created a problem for you.

@blochl
Author

blochl commented Aug 6, 2015

In our test scenario the CPU load is also measured. With iperf3, contrary to iperf2, the CPU load is constantly ~100%, but on a single core, which does not give much indication of the CPU load as a function of, e.g., buffer size. This is the issue.

@bmah888
Contributor

bmah888 commented Aug 6, 2015

OK. I understand what you're seeing. I believe that multi-threading iperf3 would be a non-trivial amount of work (the design predates my involvement with this project), although I admit I haven't really thought too much about it.

@bmah888 bmah888 changed the title Iperf3 CPU usage Multithreaded iperf3 Aug 6, 2015
@wangyoucao577

I've also tried this case. If I use iperf with the '-P' option, I can see more than one thread. But if I use iperf3 with the '-P' option, there is only one thread for the client. So the behavior of '-P' also differs between iperf and iperf3. Can I ask why iperf3 gave up the multi-threaded implementation? Is there any difference between using multiple threads and just multiple sockets here?

@bmah888
Contributor

bmah888 commented Aug 10, 2015

Folks, the single-threaded behavior of iperf3 is not a mystery, and it doesn't require any investigation on anyone's part. It wasn't designed to be multi-threaded, and the implementation reflects that. I'm not sure exactly why it was designed this way. Maybe someone more closely involved with the initial iperf3 work can shed some light on this (I'll ask around a little more actively when I get back from vacation).

@wangyoucao577

OK, thanks. I just want to know whether multi-threading is necessary in some situations, or maybe it's just not important.

@blochl
Author

blochl commented Aug 10, 2015

@wangyoucao577 : Well, it is important if you would like to measure CPU performance while the traffic is being generated. Maybe there are other cases.
And I wonder: if this single thread is saturated (~100%), does that mean it might be a bottleneck?

@wangyoucao577

But if a single thread hits the (~100%) bottleneck, I think multiple threads will also hit it, won't they?

@blochl
Author

blochl commented Aug 10, 2015

@wangyoucao577 : Well, no. Why would it be that way? If a certain load causes a single thread to use 100%, one can spread it across several threads, and each one will take less, imho.
Besides, experimentally, with the same parameters, iperf2 takes 20-60% on each thread and uses all of them, while iperf3 takes close to 100% on a single one.

@wangyoucao577

Why? I can't understand it. Doesn't multi-threading mean more CPU cost, switching from one thread to another? Why would a single thread take more CPU? In my understanding, in the iperf2 case each thread takes 20-60%, but the sum over all these threads will be over 100%, won't it?

@blochl :
Or maybe you mean that on a multi-core PC, multiple threads can use multiple CPU cores for the test, while a single thread can only use one core, so it may not be enough for a high network performance test?

@blochl
Author

blochl commented Aug 11, 2015

@wangyoucao577 : I mean that ~100% of a single core is used. Multiple threads could each have used less CPU time, but on multiple cores. The bottleneck issue is only speculation for now, but for testing CPU and network performance in parallel this is undoubtedly important.

@joachimtingvold

joachimtingvold commented Apr 26, 2016

The CPU-usage seems to be a bottleneck, yes.

root@foobar:~# iperf3 -c ::1 -i1 -t10 -w32M -P8
[SUM]   0.00-10.00  sec  30.9 GBytes  26.5 Gbits/sec                  receiver

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
14278 root       20   0  8948  3892  1208 R 101.  0.1  0:05.44 iperf3 -s
14279 root       20   0  8948  3868  1184 R 81.6  0.1  0:04.67 iperf3 -c ::1 -i1 -t10 -w32M -P8

If I fire up another set of iperf3 (different port) at the same time, we clearly see that the CPU is causing a bottleneck (and not the network stack, since we get double the bandwidth running two in parallel);

root@foobar:~# iperf3 -c ::1 -i1 -t10 -w32M -P8 -p5202
[SUM]   0.00-10.00  sec  31.3 GBytes  26.9 Gbits/sec                  receiver

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
14270 root       20   0  7412  2368  1216 R 101.  0.1  0:16.32 iperf3 -s -p 5202
14283 root       20   0  7412  2392  1220 R 101.  0.1  0:49.77 iperf3 -s
14287 root       20   0  7412  2396  1252 R 81.3  0.1  0:09.96 iperf3 -c ::1 -i1 -t10 -w32M -P8 -p5202
14286 root       20   0  7412  2308  1160 R 78.9  0.1  0:15.12 iperf3 -c ::1 -i1 -t10 -w32M -P8

@spsholleman

The other problem with this is that iperf2 used to spin up a connection for each thread, meaning different ephemeral ports. This causes hashing algorithms used in any network functionality to see many connections instead of a few, which can help or hurt performance depending on the implementation. Looks like I can go up to 8 — but not more right now.

@bms

bms commented Jan 6, 2017

I definitely agree that the option to span multiple cores -- which was present in iperf2 -- is useful in soak testing. One should note, however, that iperf2 achieved this only by using multiple client-server connections, with a thread being affine to each socket.

However, there was some absolutely dog ugly logic in iperf2's implementation. Basically, it wrapped UDP sockets in its C++ implementation to work a little along the lines of how TCP accept() creates a new socket for an inbound flow.

It could be done better, but it does strike me as a significant bit of work for iperf3 in its current incarnation.

@bms

bms commented Jan 6, 2017

Performance issues with the multiple flows used by iperf2 might have been related to the lack of bottom-up affinity, or possibly even cache line effects. iperf2 wasn't aware of RSS or other mechanisms, and as far as I know, the only really portable way to pin socket workloads is by learning about the core/thread topology using something like hwloc, and then appropriate setsockopt()/platform CPU pinning APIs.

The PCB hash[es] in the TCP/IP stacks themselves are usually pretty performant. Unless you were cycling new connections, the hash management itself might not be a bump.

I've whined about UDP in iperf2, but one advantage of its approach was that the socket each thread (per sub-flow in the measurement session) was using could explicitly bind() and 'listen()' (wrapped) on each ephemeral port, instead of using recvfrom() directly. That might have a modest cache benefit.

@dchard

dchard commented Feb 25, 2017

The single-threaded design is also a massive problem on embedded systems, like routers. I have a dual-core MIPS-based router which can run 4 threads (DIR-860L), and I see that iperf3 is maxing out a single core and not reaching gigabit speed. It is definitely limited by the single-threaded design.

@bltierney
Contributor

I suggest you use iperf2 instead.

@RongDongsheng

iperf(2) creates a thread per stream; iperf3 has only one thread sending multiple streams.
So in htop you can see multiple threads with iperf -P x, but only one thread with iperf3 -P x.
Both iperf and iperf3 use multiple send ports.
