
Connection between Peers Fails under higher Loads #2216

Open
joru1407 opened this issue Feb 3, 2024 · 6 comments
@joru1407

joru1407 commented Feb 3, 2024

We're seeing the network connection between peers on a ZT network fail under higher load with many simultaneous connections.

In our case we're using ZT to connect a Windows server with a backup storage host, and we're using Veeam for backups. In its default config, Veeam uses many simultaneous connections to exhaust the maximum bandwidth. The available bandwidth between the two peers is around 900 Mbit/s. When we run the job, the ZT connection degrades badly (ping times roughly 8x normal, packet loss), after a few minutes it drops completely, and it only recovers after the backup job finally fails.

If we configure Veeam to use only a single connection and limit the bandwidth to around 600 Mbit/s, everything works and the ZT network stays stable.

I'd expect ZT to handle high traffic loads correctly, but currently it seems that high loads, and especially many connections, are a problem for ZT.

@joseph-henry
Contributor

Interesting. Some questions:

(1) How many streams is Veeam creating and using during these congestion incidents?
(2) When the backup job seems to be failing, is the underlying ZT link still working? This is important: the Veeam job may fail because of its own internal timing logic when a ZT link is merely congested, and that's different from a ZT link that is unresponsive.

Try stopping the job when it appears congested, then try to send some data between the two ZT nodes by some other method, like iperf.
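For example, something like this (assuming iperf3 is installed on both nodes; the address is a placeholder for the peer's ZT-assigned IP):

```
# on one ZT node
iperf3 -s

# on the other node, over the peer's ZT address (placeholder)
iperf3 -c 10.147.17.2 -t 30
```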

@joseph-henry joseph-henry self-assigned this Feb 5, 2024
@joru1407
Author

joru1407 commented Feb 5, 2024

  1. The default setting of Veeam is 5 streams, which is enough to reproduce the issue.
  2. When the backup job is failing, the underlying ZT link is also down. We ran pings between the two hosts during a running backup job: the ping times first get worse, then the pings start failing completely, and only after that does the backup job fail because of a connection loss.
  3. If we run the same job directly (without ZT) it works like a charm, and we also tested it over a Tailscale network, which works without problems too.

I'll try to reproduce a similar behaviour without Veeam later. I'm not sure if Veeam is doing anything special, but it definitely takes down a ZT connection completely 😁
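For the reproduction attempt I'll probably start with parallel iperf3 streams to roughly mimic Veeam's behaviour (a sketch; the address is a placeholder and the stream count matches Veeam's default of 5):

```
# 5 parallel streams for 5 minutes against the peer's ZT address (placeholder)
iperf3 -c 10.147.17.2 -P 5 -t 300
```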

@joseph-henry
Contributor

Ok, thanks for the info. But when the backup job seems to be failing, the only way we can tell whether the ZT link is actually down is to stop the job and then try to use the link with something else. Otherwise a ping is just another packet competing for a scarce resource that it probably won't get. Can you try that?

I'll try to do some saturation testing on my end.

@joru1407
Author

joru1407 commented Feb 5, 2024

As soon as I stop the backup job, it takes about 10 seconds and the link starts working again, so it is only failing temporarily. I'll report back later with some test results.

@laduke
Contributor

laduke commented Feb 5, 2024

We've seen something similar. If you make two small, single-CPU VMs and then run iperf between them, they fall over.

Capping the bandwidth on the zt interface to some lower number seemed to avoid it. You'll need to experiment a little to find the best limit. I'm not familiar with how to do it on Windows, and it might be a little trickier if it's incoming traffic. Maybe Veeam has configuration options for that.
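On Linux, an egress cap could look roughly like this (a sketch; the interface name and rate are placeholders, and the burst/latency values would need tuning):

```
# cap egress on the ZT interface with a token bucket filter (values are placeholders)
sudo tc qdisc add dev ztabcdef12 root tbf rate 500mbit burst 1mbit latency 50ms

# remove the cap again
sudo tc qdisc del dev ztabcdef12 root
```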

You might also be able to run multiple instances of ZeroTier on your server, one per CPU core? We don't have instructions for that, as far as I'm aware. Someone should write them.

Sorry to interrupt. Just adding a little more context.

@joru1407
Author

joru1407 commented Feb 5, 2024

I'm not able to reproduce the issue between the same two hosts with iperf and multiple connections alone. Even in a test environment with 2 vCPUs and 50 iperf connections, I'm getting 1.38 Gbit/s without big congestion problems.

Somehow Veeam is triggering something that ZT doesn't like. It seems to be related to simultaneous streams, but not only to that. WireGuard and Tailscale are not affected in a similar way. Could it be related to ZT's MTU of 2800, jumbo frames, or Layer 2?

Sounds stupid, but to me it seems as if the ZT connection is telling Veeam: bring it on, I have a lot more bandwidth, I'll handle it, but then drops everything because the actual link is extremely congested and queuing is no longer possible. As if Veeam can't detect how congested the link already is.

Any idea how to test or debug it further?
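One thing I might try myself to rule the MTU in or out: don't-fragment pings sized near the 2800-byte ZT MTU (a sketch; the address is a placeholder, and 2772 = 2800 minus the 20-byte IP and 8-byte ICMP headers):

```
# Linux: don't fragment, payload sized to fill the ZT MTU (address is a placeholder)
ping -M do -s 2772 10.147.17.2

# Windows equivalent
ping -f -l 2772 10.147.17.2
```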
