
Connection between Peers Fails under higher Loads #2216

Open
joru1407 opened this issue Feb 3, 2024 · 6 comments
@joru1407

joru1407 commented Feb 3, 2024

We're seeing the network connection between peers on a ZT network fail under higher load with many simultaneous connections.

In our case we're using ZT to connect a Windows server with a backup storage host, and we're using Veeam for backups. In its default config, Veeam uses many simultaneous connections to exhaust the maximum bandwidth. The available bandwidth between the two peers is around 900 Mbit/s. When we run the job, the ZT connection degrades badly (ping times roughly 8x normal, packet loss), after a few minutes it drops completely, and it only recovers after the backup job finally fails.

If we configure Veeam to use only a single connection and limit the bandwidth to around 600 Mbit/s, everything works and the ZT network stays stable.

I'd expect ZT to handle high traffic loads correctly, but currently it seems that high loads, and especially many connections, are a problem for ZT.

@joseph-henry
Contributor

Interesting. Some questions:

(1) How many streams is Veeam creating and using during these congestion incidents?
(2) When the backup job seems to be failing, is the underlying ZT link still working? This is important: the Veeam job may fail because of its own internal timing logic when a ZT link is merely congested, and that's different from a ZT link that is unresponsive.

Try stopping the job when it appears congested, then try to send some data between the two ZT nodes by some other method, like iperf.
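For example, something like this (assuming iperf3 is installed on both nodes; the address is a placeholder for the peer's ZT-assigned IP):

```
# on one ZT node
iperf3 -s

# on the other node, over the peer's ZT address (placeholder)
iperf3 -c 10.147.17.2 -t 30
```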

@joseph-henry joseph-henry self-assigned this Feb 5, 2024
@joru1407
Author

joru1407 commented Feb 5, 2024

  1. The default setting of Veeam is 5 streams, which is enough to reproduce the issue.
  2. When the backup job is failing, the underlying ZT link is also down. We ran pings between the two hosts during a running backup job: the ping times first get worse, then the pings start failing completely, and only after that does the backup job fail because of a connection loss.
  3. If we run the same job directly (without ZT) it works like a charm, and we also tested it over a Tailscale network, which works without problems too.

I'll try to reproduce a similar behaviour without Veeam later. I'm not sure if Veeam is doing anything special, but it definitely takes down a ZT connection completely 😁
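For the reproduction attempt I'll probably start with parallel iperf3 streams to roughly mimic Veeam's behaviour (a sketch; the address is a placeholder and the stream count matches Veeam's default of 5):

```
# 5 parallel streams for 5 minutes against the peer's ZT address (placeholder)
iperf3 -c 10.147.17.2 -P 5 -t 300
```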

@joseph-henry
Contributor

Ok, thanks for the info. But when the backup job seems to be failing, the only way we can tell whether the ZT link is actually down is to stop the job and then try to use the link with something else. Otherwise a ping is just another packet competing for a scarce resource that it probably won't get. Can you try that?

I'll try to do some saturation testing on my end.

@joru1407
Author

joru1407 commented Feb 5, 2024

As soon as I stop the backup job, it takes about 10 seconds and the link starts working again, so it is only failing temporarily. I'll report back later with some test results.

@laduke
Contributor

laduke commented Feb 5, 2024

We've seen something similar. If you make two small, single-CPU VMs and then run iperf between them, they fall over.

Capping the bandwidth on the zt interface to some lower number seemed to avoid it. You'll need to experiment a little to find the best limit. I'm not familiar with how to do it on Windows, and it might be a little trickier if it's incoming traffic. Maybe Veeam has configuration options for that.
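On Linux, an egress cap could look roughly like this (a sketch; the interface name and rate are placeholders, and the burst/latency values would need tuning):

```
# cap egress on the ZT interface with a token bucket filter (values are placeholders)
sudo tc qdisc add dev ztabcdef12 root tbf rate 500mbit burst 1mbit latency 50ms

# remove the cap again
sudo tc qdisc del dev ztabcdef12 root
```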

You might also be able to run multiple instances of ZeroTier on your server, one per CPU core? We don't have instructions for that, as far as I'm aware. Someone should write them.

Sorry to interrupt. Just adding a little more context.

@joru1407
Author

joru1407 commented Feb 5, 2024

I'm not able to reproduce the issue between the same two hosts with iperf and multiple connections alone. Even in a test environment with 2 vCPUs and 50 iperf connections, I'm getting 1.38 Gbit/s without big congestion problems.

Somehow Veeam is triggering something that ZT doesn't like. It seems to be related to simultaneous streams, but not only to that. WireGuard and Tailscale are not affected in a similar way. Could it be related to ZT's MTU of 2800, jumbo frames, or Layer 2?

Sounds stupid, but to me it seems as if the ZT connection is telling Veeam: bring it on, I have a lot more bandwidth, I'll handle it, but then drops everything because the actual link is extremely congested and queuing is no longer possible. As if Veeam can't detect how congested the link already is.

Any idea how to test or debug it further?
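One thing I might try myself to rule the MTU in or out: don't-fragment pings sized near the 2800-byte ZT MTU (a sketch; the address is a placeholder, and 2772 = 2800 minus the 20-byte IP and 8-byte ICMP headers):

```
# Linux: don't fragment, payload sized to fill the ZT MTU (address is a placeholder)
ping -M do -s 2772 10.147.17.2

# Windows equivalent
ping -f -l 2772 10.147.17.2
```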
