refactor(snownet): only send relay candidates if hole-punching fails #4268

Closed
thomaseizinger wants to merge 14 commits into main from feat/connlib/delay-relay-candidates

Conversation

@thomaseizinger (Member) commented Mar 22, 2024

Previously, we sent all candidates as soon as we discovered them. Whilst that may appear great for initial connectivity, it is fairly wasteful on both ends of the connection. In almost all cases, we are able to hole-punch a connection, making the exchange of most candidates and the resulting binding of channels completely unnecessary.

To avoid this waste, we wait for 2 seconds before signalling relay candidates to the other party. If we nominate a socket within those two seconds, we discard all those relay candidates because they are unnecessary.
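For illustration, here is a minimal, self-contained sketch of this buffering idea. The `PendingRelayCandidates` type and its methods are invented for this example and are not the actual snownet API; only the 2-second grace period mirrors the value described above.

```rust
use std::time::{Duration, Instant};

/// Illustrative buffer that holds back relay candidates for a grace period.
/// If a socket gets nominated before the period elapses, the buffered
/// candidates are discarded instead of being signalled.
struct PendingRelayCandidates {
    created_at: Instant,
    grace_period: Duration,
    buffered: Vec<String>, // candidate SDP strings, for illustration
    nominated: bool,
}

impl PendingRelayCandidates {
    fn new(now: Instant) -> Self {
        Self {
            created_at: now,
            grace_period: Duration::from_secs(2),
            buffered: Vec::new(),
            nominated: false,
        }
    }

    /// Called when the ICE agent discovers a new relay candidate.
    fn add(&mut self, candidate: String) {
        if !self.nominated {
            self.buffered.push(candidate);
        }
    }

    /// Called when a socket is nominated; buffered candidates become unnecessary.
    fn on_nominated(&mut self) {
        self.nominated = true;
        self.buffered.clear();
    }

    /// Returns the candidates to signal once the grace period has elapsed.
    fn drain_due(&mut self, now: Instant) -> Vec<String> {
        if self.nominated || now.duration_since(self.created_at) < self.grace_period {
            return Vec::new();
        }
        std::mem::take(&mut self.buffered)
    }
}

fn main() {
    let start = Instant::now();
    let mut pending = PendingRelayCandidates::new(start);
    pending.add("candidate:... typ relay ...".to_owned());

    // Hole-punching succeeded within the grace period: nothing is ever signalled.
    pending.on_nominated();
    assert!(pending.drain_due(start + Duration::from_secs(3)).is_empty());
}
```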

Resolves: #4164.

github-actions bot commented Mar 22, 2024

Terraform Cloud Plan Output

Plan: 15 to add, 14 to change, 15 to destroy.


@thomaseizinger thomaseizinger marked this pull request as draft March 22, 2024 12:42
@thomaseizinger (Member, Author) commented:

Still need to test this properly.

@jamilbk jamilbk changed the title from "feat(snownet): only send relay candidates if hole-punching fails" to "refactor(snownet): only send relay candidates if hole-punching fails" on Mar 22, 2024
github-merge-queue bot pushed a commit that referenced this pull request Apr 10, 2024
This bump includes a fix that triggers a panic on unknown interfaces (algesten/str0m#493). The panic is what is currently blocking #4268 from proceeding.
@thomaseizinger thomaseizinger force-pushed the feat/connlib/delay-relay-candidates branch from 691f8d5 to 144ed32 on April 10, 2024 09:22

github-actions bot commented Apr 10, 2024

Performance Test Results

TCP

| Test Name | Received/s | Sent/s | Retransmits |
| --- | --- | --- | --- |
| direct-tcp-client2server | 245.5 MiB (+1%) | 248.0 MiB (+1%) | 482 (+56%) |
| direct-tcp-server2client | 239.6 MiB (-1%) | 240.7 MiB (-1%) | 241 (+40%) |
| relayed-tcp-client2server | 223.5 MiB (-0%) | 224.5 MiB (+0%) | 239 (-27%) |
| relayed-tcp-server2client | 239.8 MiB (+5%) | 240.4 MiB (+5%) | 272 (-14%) |

UDP

| Test Name | Total/s | Jitter | Lost |
| --- | --- | --- | --- |
| direct-udp-client2server | 50.0 MiB (-0%) | 0.04ms (+71%) | 0.00% (NaN%) |
| direct-udp-server2client | 50.0 MiB (+0%) | 0.01ms (+59%) | 0.00% (NaN%) |
| relayed-udp-client2server | 50.0 MiB (+0%) | 0.05ms (-4%) | 0.00% (NaN%) |
| relayed-udp-server2client | 50.0 MiB (-0%) | 0.03ms (+376%) | 0.00% (NaN%) |

@thomaseizinger thomaseizinger force-pushed the feat/connlib/delay-relay-candidates branch from 144ed32 to 2de7ade on April 11, 2024 04:47
@thomaseizinger thomaseizinger marked this pull request as ready for review April 11, 2024 04:47
@thomaseizinger thomaseizinger marked this pull request as draft April 11, 2024 04:51
@thomaseizinger thomaseizinger force-pushed the feat/connlib/delay-relay-candidates branch from 2de7ade to ebbd9f9 on April 11, 2024 04:55
@thomaseizinger thomaseizinger changed the base branch from main to chore/snownet/test-harness-improvements April 11, 2024 04:55
@thomaseizinger (Member, Author) commented:

Resolves: #4290.

@jamilbk I am optimistically marking this as resolving the above issue. Given that we can't reproduce it and don't know exactly why it happens, this is a bit of a gamble. We can always re-open the issue if it appears again.

@jamilbk (Member) left a comment:

I'm probably not understanding something, but can't we send relay candidates at the start and just wait 2 seconds to use them if holepunching fails?

I.e. do this in parallel rather than adding 2 seconds for every connection that's stuck on a Relay.

When that happens it'll likely be for an entire org or a user's home office or something.

@thomaseizinger (Member, Author) commented Apr 11, 2024

> I'm probably not understanding something, but can't we send relay candidates at the start and just wait 2 seconds to use them if holepunching fails?

The difference here is only the RTT of sending the candidate to the portal, right? I'd expect that to be on the order of 50ms or less in total. Is that really worth it?

Or what do you mean by "use" them? As soon as I send a relay candidate to the gateway, it will test its connectivity and might nominate it, which might lead to #4290 if the direct connection fails somehow. I can temporarily hold it on the gateway side, but that isn't any different from holding it on the client side, modulo the RTT of the websocket connection.

> I.e. do this in parallel rather than adding 2 seconds for every connection that's stuck on a Relay.

Do what in parallel?

Note that most connections take a second to be established (a good chunk of which is spent on setup with the portal, e.g. requesting gateways). We can reduce the delay to 1s, but that could lead to some connections still testing relay candidates even though the direct connection is about to succeed.

Also note the comments in #4164.

@thomaseizinger thomaseizinger force-pushed the feat/connlib/delay-relay-candidates branch from ebbd9f9 to 129e175 on April 11, 2024 05:49
@jamilbk (Member) commented Apr 11, 2024

Ah, got it. Candidates have already been gathered; this just pauses the final step.

Base automatically changed from chore/snownet/test-harness-improvements to main April 11, 2024 14:28
@thomaseizinger thomaseizinger force-pushed the feat/connlib/delay-relay-candidates branch from 129e175 to 7280b23 on April 12, 2024 00:12
@thomaseizinger thomaseizinger marked this pull request as ready for review April 12, 2024 00:12
@thomaseizinger (Member, Author) commented:

@conectado This is ready for review. I tried to make meaningful commits.

Comment on lines +150 to +152

```rust
for connection in self.connections.established.values_mut() {
    connection.created_at = now;
}
```
@thomaseizinger (Member, Author) commented:

This is kind of critical: upon reconnect, we need to reset this timestamp to ensure we get a new grace period and don't immediately emit all (new) relay candidates. Once #4585 is merged, I'll try to add a unit test for this.
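As a rough illustration of the invariant this reset protects (the `Connection` struct and the 2-second constant here are illustrative only, not the actual snownet types), consider:

```rust
use std::time::{Duration, Instant};

struct Connection {
    created_at: Instant,
}

impl Connection {
    /// Relay candidates are held back while the connection is within its grace period.
    fn within_grace_period(&self, now: Instant) -> bool {
        now.duration_since(self.created_at) < Duration::from_secs(2)
    }
}

fn main() {
    let t0 = Instant::now();
    let mut connection = Connection { created_at: t0 };

    // 3 seconds in, the original grace period has elapsed ...
    let t1 = t0 + Duration::from_secs(3);
    assert!(!connection.within_grace_period(t1));

    // ... but on reconnect we reset the timestamp, so newly gathered relay
    // candidates are again held back for a fresh grace period.
    connection.created_at = t1;
    assert!(connection.within_grace_period(t1 + Duration::from_secs(1)));
}
```

The planned unit test would essentially assert the second property above after driving a reconnect through the real state machine.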

@conectado (Collaborator) commented:

I don't understand how this fixes #4290. I remember that we have priorities set so that we always pick a direct connection over a relayed one, and str0m should test all candidate pairs regardless of the order in which they arrive?

@conectado (Collaborator) left a comment:

Code looks good, though I'm not sure if this will feel a bit less responsive when a direct connection isn't possible. Could 2 seconds be too high of a timeout?

@thomaseizinger (Member, Author) commented:

> I don't understand how this fixes #4290. I remember that we have priorities set so that we always pick a direct connection over a relayed one, and str0m should test all candidate pairs regardless of the order in which they arrive?

I can remove the "resolves" again and close the other one as not reproducible if you think that is more correct?

In theory, ICE should always pick the best one, yes. I don't know why we sometimes don't do that. What I do know is that we currently do a lot of unnecessary work, like repeatedly allocating channels on relays and testing their connectivity when, in most cases, a direct connection is possible.

We can optimise this in the future by detecting and remembering our NAT status eagerly, see #4060.

@jamilbk (Member) commented Apr 12, 2024

> I don't know why we sometimes don't do that.

FWIW, I haven't seen this happen ever since we fixed the Gateway and Phoenix Channel ref bugs. That could have fixed this by virtue of ensuring all candidates arrive reliably.

Another issue I haven't seen reproduce since the above was fixed is #4058

This PR may be fixing an issue that no longer exists.

@thomaseizinger (Member, Author) commented:

> This PR may be fixing an issue that no longer exists.

I've removed the 2nd "resolves".

With that taken into account, this PR is primarily an optimisation to avoid wasteful allocation of channels on relays. Not only does that create a lot of chatter in the logs, it also puts an unnecessary limit on how many connections a single gateway can handle, because there is no way of expiring a channel binding and we currently bind dozens of them on every connection.

With this in place, I would also like to remove the following optimisation:

```rust
match candidate.kind() {
    CandidateKind::Host => {
        // Binding a TURN channel for host candidates does not make sense.
        // They are only useful to circumvent restrictive NATs, in which case we are
        // either talking to another relay candidate or a server-reflexive address.
        return;
    }
    CandidateKind::Relayed => {
        // Optimistically try to bind the channel only on the same relay as the remote peer.
        if let Some(allocation) = self.same_relay_as_peer(&candidate) {
            allocation.bind_channel(candidate.addr(), now);
            return;
        }
    }
    CandidateKind::ServerReflexive | CandidateKind::PeerReflexive => {}
}
```

This currently makes assumptions about which candidates are allowed to talk to each other. I do think it is sound, but I would like to not make assumptions about network topologies. I think the optimisation in this PR makes much more sense: we should first try to hole-punch and, if that doesn't succeed, try the relays.

@conectado (Collaborator) commented:

> I can remove the "resolves" again and close the other one as not reproducible if you think that is more correct?

yeah :)

@thomaseizinger (Member, Author) commented:

> > I can remove the "resolves" again and close the other one as not reproducible if you think that is more correct?
>
> yeah :)

Done!

@thomaseizinger (Member, Author) commented:

> Could 2 seconds be too high of a timeout?

Yeah, doing this time-based isn't ideal. I'd like to work on #4060, which would allow us not to pay this time penalty when we've previously detected that we are behind a restrictive NAT.

@thomaseizinger thomaseizinger force-pushed the feat/connlib/delay-relay-candidates branch from 7280b23 to f601ed2 on April 15, 2024 01:13
@github-actions github-actions bot added the kind/refactor Code refactoring label Apr 15, 2024
@thomaseizinger (Member, Author) commented Apr 15, 2024

I think the tests are currently failing for the same reason as algesten/str0m#496. The non-roaming party doesn't know that the candidate was invalidated and thus keeps sending messages to it.

@thomaseizinger thomaseizinger marked this pull request as draft April 15, 2024 08:13
@algesten commented:

I obviously don't know the full story here, but from a bird's-eye perspective this appears to work against the spirit of the ICE agent.

The whole point of trickle ICE is to be able to start connectivity early, before all NICs and relays have been enumerated. It's a "throw it all at the wall and see what sticks" kind of approach where everything is thrown as soon as it's found. Putting in a deliberate delay appears a bit counter to that idea.

@thomaseizinger (Member, Author) commented:

> I obviously don't know the full story here, but from a bird's-eye perspective this appears to work against the spirit of the ICE agent.
>
> The whole point of trickle ICE is to be able to start connectivity early, before all NICs and relays have been enumerated. It's a "throw it all at the wall and see what sticks" kind of approach where everything is thrown as soon as it's found. Putting in a deliberate delay appears a bit counter to that idea.

I am aware that we are bending ICE here a bit and I am open to other solutions. The side-effect of testing connectivity over a relay candidate is that we need to bind a channel (our TURN server doesn't support DATA indications). Not just one channel, though: for each remote candidate that we receive, we need to bind a channel on each allocation! This multiplies up to dozens if not hundreds of channels for each connection as soon as you use a few relays (for redundancy reasons, for example).

It seems somewhat wasteful to do that, and it might present a bottleneck for how many connections we can establish per second: there are only 16k channels per allocation and they only expire after 10 minutes (plus a 5-minute grace period before they can be rebound). A node must therefore use at most ~17 channels per connection to be able to establish 1 connection per second.
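For reference, a back-of-the-envelope check of that figure (a sketch; it assumes the 16,384 channel numbers of the TURN channel range 0x4000–0x7FFF and the lifetimes quoted above):

```rust
fn main() {
    // Channel numbers available per allocation (0x4000..=0x7FFF).
    let channels_per_allocation: u32 = 16_384;

    // A channel binding lives 10 minutes and can only be rebound after a
    // further 5-minute grace period, so each channel number is "busy" for:
    let busy_seconds: u32 = (10 + 5) * 60; // 900 s

    // At a sustained rate of 1 new connection per second, each second must not
    // consume more channel numbers than become free again per second.
    let max_channels_per_connection = channels_per_allocation / busy_seconds;

    // Prints 18 (16_384 / 900, rounded down), i.e. the same ballpark as the
    // ~17 channels per connection quoted above.
    println!("max channels per connection at 1 conn/s: {max_channels_per_connection}");
}
```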

There might be other ways to optimise this, but it seems easiest to first try hole-punching and, if that doesn't succeed for a while, send relay candidates. In the future, I am planning to be more clever about this by detecting the NAT status early (#4060) and thus avoid this time penalty for relayed connections.

@thomaseizinger (Member, Author) commented:

Outdated.

Labels: kind/refactor (Code refactoring)
Projects: None yet
Successfully merging this pull request may close these issues: snownet: stagger sending of candidates based on connection state
4 participants