Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VPN server failure detection recovery #786

Merged
merged 44 commits into from Apr 30, 2024

Conversation

graeme
Copy link
Contributor

@graeme graeme commented Apr 17, 2024

Please review the release process for BrowserServicesKit here.

Required:

Task/Issue URL: https://app.asana.com/0/72649045549333/1206669340827392/f
iOS PR: duckduckgo/iOS#2779
macOS PR: https://github.com/duckduckgo/macos-browser/pull/2657
What kind of version bump will this require?: Major

Description:

  • Use the WireGuard handshakes to determine when a connection fails by checking that the last handshake time is less than 300 seconds (just over two handshakes).
  • When a failure is detected the client will call /register with server set to the name of the current (failed) server and mode set to failureRecovery. If the server is not in the "active" status on the controller a new registration will be returned otherwise the previous data will be returned (this data should be checked against the current tunnel data including allowedIPs).

Steps to test this PR:

  1. Launch the VPN and connect
  2. Block UDP traffic by following these instructions: https://app.asana.com/0/0/1207029295107667/f
  3. Open the Console.app and filter by Category: Network Protection. The failure recovery logs are indicated with a 🟢 so you can also look for them.
  4. After around 5 minutes you will see logs indicating that a register request has been made.
  5. It will error with no recovery required as there is not a real problem.

OS Testing:

  • iOS 14
  • iOS 15
  • iOS 16
  • macOS 10.15
  • macOS 11
  • macOS 12

Internal references:

Software Engineering Expectations
Technical Design Template


init(environment: VPNSettings.SelectedEnvironment = .default, isSubscriptionEnabled: Bool) {
init(settings: VPNSettings, isSubscriptionEnabled: Bool) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change allowed us to instantiate the client so that we don’t need to re-init the NetworkProtectionDeviceManager every time it's used. Should make testing a bit easier further down the line + also means we can guarantee the client always gets the most up-to-date environment setting

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I love this, though. I don't think we should support real-time updating of the environment while the VPN is connected.

If anything, we should discourage it.

If we want to change the environment the whole VPN should be stopped instead.

  1. How necessary is this change? Can we roll it back?
  2. Do we ever expect to support changing the selected environment while the connection is already established?

Note: this wasn't a problem before because it was possible to change environments, but we no longer can support this, and should in fact remove all the code that allows this to happen when the connection is up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also feel like replacing the environment param with settings should at least be considered outside the scope of this PR/branch. It already does a lot and refactoring this might lead to side effect.

Copy link
Contributor Author

@graeme graeme Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn’t fully explain the above. The reason why it’s important to not recreate the device manager every time is that we could end up with two register requests happening at the same time which could create racey behaviour. The NetworkProtectionDeviceManager is an actor and it makes sense to just have one instance which will queue up these requests rather than creating new instances every time start, stop or failure recovery happens.

Rreplacing the environment param with settings was so that we could still support changing the environment without needing to restart the PacketTunnelProvider, but @diegoreymendez and I had a chat and we think it’s actually probable more sensible to require restarting as the token we’re using would become invalid.

I’m going to revert the injection of settings instead of environment, but I will still create a single instance of the NetworkProtectionDeviceManager. This will mean that we need to manually restart the tunnel for the env change to take effect.

@graeme graeme marked this pull request as ready for review April 18, 2024 15:38
Copy link
Contributor

@diegoreymendez diegoreymendez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looking good, added some comments - some of which need to be addressed.

I'll re-review and do some extended testing once it's ready again. Since this changes several key elements of the tunnel I want to be extra careful that everything is in place, so I'll be sinking some hours into this.

@@ -312,7 +312,17 @@ open class PacketTunnelProvider: NEPacketTunnelProvider {
}
}()

public lazy var tunnelFailureMonitor = NetworkProtectionTunnelFailureMonitor(tunnelProvider: self)
lazy var deviceManager: NetworkProtectionDeviceManagement = NetworkProtectionDeviceManager(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicky, but can we make sure this is private to prevent any form of misuse of this?

Sources/NetworkProtection/PacketTunnelProvider.swift Outdated Show resolved Hide resolved
Sources/NetworkProtection/PacketTunnelProvider.swift Outdated Show resolved Hide resolved
Sources/NetworkProtection/PacketTunnelProvider.swift Outdated Show resolved Hide resolved
Sources/NetworkProtection/PacketTunnelProvider.swift Outdated Show resolved Hide resolved
private func handleFailureRecovery(tunnelConfig: TunnelConfiguration, server: NetworkProtectionServer) {
self.lastSelectedServer = server
Task {
try await self.updateAdapterConfig(tunnelConfiguration: tunnelConfig, reassert: true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens here if the adapter fails to update and throws an error?

Should we add handling? Should we fire pixels?

Copy link
Contributor Author

@graeme graeme Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! #791

} catch {
switch error {
case FailureRecoveryError.noRecoveryNecessary:
await reassertingControl?.stopReasserting()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes the bug I mentioned yesterday.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this type of thing it's really important to ensure calls are always balanced. If a method starts, it also should stop on every exit. The default error case doesn't stop reasserting again.

The reason why this is extremely important is because otherwise you can find issues like this one, where reasserting starts but never again stops.

I see two possible solutions to this:

  1. Add this right after starting the reasserting:
defer {
    await reassertingControl?.stopReasserting()
}
  1. Since we're recovering through all the duration of the checks, why not start reasserting right before calling incrementalPeriodicChecks and stop reasserting right after that call? I guess this could be an issue if something else also stops reassertions, though... so option 1 may be better.

includedRoutes: includedRoutes ?? [],
excludedRoutes: settings.excludedRanges,
regenerateKey: regenerateKey)
} catch {
providerEvents.fire(.tunnelUpdateAttempt(.failure(error)))
throw error
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means we’ll bubble up rekey failures as mentioned in https://app.asana.com/0/1203137811378537/1207149691501092/f

Copy link
Contributor

@diegoreymendez diegoreymendez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some testing. Works well so far but I'll be looking for edge cases.

Adding some comments to request changes.

The most important one is the one about stopping the reasserting.

Sources/NetworkProtection/PacketTunnelProvider.swift Outdated Show resolved Hide resolved
}

Task {
guard let server = await self.lastSelectedServer else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we obtain the server before spawning the task, so we avoid spawning the task if there's no server?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lastSelectedServer is a @MainActor var so this needs to happen in an async context.


init(environment: VPNSettings.SelectedEnvironment = .default, isSubscriptionEnabled: Bool) {
init(settings: VPNSettings, isSubscriptionEnabled: Bool) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I love this, though. I don't think we should support real-time updating of the environment while the VPN is connected.

If anything, we should discourage it.

If we want to change the environment the whole VPN should be stopped instead.

  1. How necessary is this change? Can we roll it back?
  2. Do we ever expect to support changing the selected environment while the connection is already established?

Note: this wasn't a problem before because it was possible to change environments, but we no longer can support this, and should in fact remove all the code that allows this to happen when the connection is up.

} catch {
switch error {
case FailureRecoveryError.noRecoveryNecessary:
await reassertingControl?.stopReasserting()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this type of thing it's really important to ensure calls are always balanced. If a method starts, it also should stop on every exit. The default error case doesn't stop reasserting again.

The reason why this is extremely important is because otherwise you can find issues like this one, where reasserting starts but never again stops.

I see two possible solutions to this:

  1. Add this right after starting the reasserting:
defer {
    await reassertingControl?.stopReasserting()
}
  1. Since we're recovering through all the duration of the checks, why not start reasserting right before calling incrementalPeriodicChecks and stop reasserting right after that call? I guess this could be an issue if something else also stops reassertions, though... so option 1 may be better.

}
count += 1
currentDelay = min((currentDelay * config.factor), config.maxDelay)
} while count < config.times
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the expected behavior if this is exceeded?

My understanding from what I'm seeing is that:

  1. We try recovery 5 times (Each time we duplicate the previous delay, but never wait more than 5 minutes)
  2. Once the attempts all fail, we bail out, and the failure monitor will keep checking periodically, but we won't do any more active recovery.

Is this correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is correct. I was initially throwing an error that gets bubbled up to the PacketTunnelProvider. But I wasn’t reacting to it in any way so decided it wasn’t necessary. I guess we could send another pixel in that case, but I’d rather not just now as it wasn’t in the original scope and will require another privacy triage etc.

return
} catch {
os_log("🟢 Failure recovery failed. Retrying...", log: .networkProtectionServerFailureRecoveryLog, type: .info)
try? await Task.sleep(interval: currentDelay)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be ignored as its throw implies the task has been cancelled.

var currentDelay = config.initialDelay
var count = 0
repeat {
do {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since task cancellation is cooperative, right before the do I'd suggest doing:

try Task.checkCancellation()

@@ -59,12 +63,12 @@ public actor NetworkProtectionDeviceManager: NetworkProtectionDeviceManagement {

private let isSubscriptionEnabled: Bool

public init(environment: VPNSettings.SelectedEnvironment,
public init(settings: VPNSettings,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned above this should be rolled back - we can't support updating the environment without restarting the connection. This change can give the reader the false impression that's possible.


init(environment: VPNSettings.SelectedEnvironment = .default, isSubscriptionEnabled: Bool) {
init(settings: VPNSettings, isSubscriptionEnabled: Bool) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also feel like replacing the environment param with settings should at least be considered outside the scope of this PR/branch. It already does a lot and refactoring this might lead to side effect.

Sources/NetworkProtection/PacketTunnelProvider.swift Outdated Show resolved Hide resolved
Comment on lines 102 to 103
eventHandler(.completed(.unhealthy))
await reassertingControl?.stopReasserting()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this snippet, the event is emitted before the stopReasserting() is invoked. In the catch block, it happens after the stopReasserting() call. The order of operations should be consistent, if only for readability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, can change that.

@graeme graeme merged commit 2681b52 into main Apr 30, 2024
7 checks passed
@graeme graeme deleted the graeme/vpn-server-failure-detection-recovery branch April 30, 2024 09:01
samsymons added a commit that referenced this pull request May 1, 2024
* main:
  Add workflow to create Asana PR subtask when GitHub reviewer is assigned (#805)
  VPN server failure detection recovery (#786)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants