New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VPN server failure detection recovery #786
Conversation
Sources/NetworkProtection/Diagnostics/NetworkProtectionTunnelFailureMonitor.swift
Show resolved
Hide resolved
|
||
init(environment: VPNSettings.SelectedEnvironment = .default, isSubscriptionEnabled: Bool) { | ||
init(settings: VPNSettings, isSubscriptionEnabled: Bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change allowed us to instantiate the client so that we don’t need to re-init the NetworkProtectionDeviceManager
every time it's used. Should make testing a bit easier further down the line + also means we can guarantee the client always gets the most up-to-date environment setting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I love this, though. I don't think we should support real-time updating of the environment while the VPN is connected.
If anything, we should discourage it.
If we want to change the environment the whole VPN should be stopped instead.
- How necessary is this change? Can we roll it back?
- Do we ever expect to support changing the selected environment while the connection is already established?
Note: this wasn't a problem before because it was possible to change environments, but we no longer can support this, and should in fact remove all the code that allows this to happen when the connection is up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also feel like replacing the environment
param with settings
should at least be considered outside the scope of this PR/branch. It already does a lot and refactoring this might lead to side effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn’t fully explain the above. The reason why it’s important to not recreate the device manager every time is that we could end up with two register requests happening at the same time which could create racey behaviour. The NetworkProtectionDeviceManager
is an actor and it makes sense to just have one instance which will queue up these requests rather than creating new instances every time start, stop or failure recovery happens.
Rreplacing the environment
param with settings
was so that we could still support changing the environment without needing to restart the PacketTunnelProvider
, but @diegoreymendez and I had a chat and we think it’s actually probable more sensible to require restarting as the token we’re using would become invalid.
I’m going to revert the injection of settings
instead of environment
, but I will still create a single instance of the NetworkProtectionDeviceManager
. This will mean that we need to manually restart the tunnel for the env change to take effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looking good, added some comments - some of which need to be addressed.
I'll re-review and do some extended testing once it's ready again. Since this changes several key elements of the tunnel I want to be extra careful that everything is in place, so I'll be sinking some hours into this.
@@ -312,7 +312,17 @@ open class PacketTunnelProvider: NEPacketTunnelProvider { | |||
} | |||
}() | |||
|
|||
public lazy var tunnelFailureMonitor = NetworkProtectionTunnelFailureMonitor(tunnelProvider: self) | |||
lazy var deviceManager: NetworkProtectionDeviceManagement = NetworkProtectionDeviceManager( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpicky, but can we make sure this is private
to prevent any form of misuse of this?
Sources/NetworkProtection/Diagnostics/NetworkProtectionTunnelFailureMonitor.swift
Show resolved
Hide resolved
private func handleFailureRecovery(tunnelConfig: TunnelConfiguration, server: NetworkProtectionServer) { | ||
self.lastSelectedServer = server | ||
Task { | ||
try await self.updateAdapterConfig(tunnelConfiguration: tunnelConfig, reassert: true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens here if the adapter fails to update and throws an error?
Should we add handling? Should we fire pixels?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! #791
} catch { | ||
switch error { | ||
case FailureRecoveryError.noRecoveryNecessary: | ||
await reassertingControl?.stopReasserting() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fixes the bug I mentioned yesterday.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this type of thing it's really important to ensure calls are always balanced. If a method starts, it also should stop on every exit. The default
error case doesn't stop reasserting again.
The reason why this is extremely important is because otherwise you can find issues like this one, where reasserting starts but never again stops.
I see two possible solutions to this:
- Add this right after starting the reasserting:
defer {
await reassertingControl?.stopReasserting()
}
- Since we're recovering through all the duration of the checks, why not start reasserting right before calling
incrementalPeriodicChecks
and stop reasserting right after that call? I guess this could be an issue if something else also stops reassertions, though... so option 1 may be better.
includedRoutes: includedRoutes ?? [], | ||
excludedRoutes: settings.excludedRanges, | ||
regenerateKey: regenerateKey) | ||
} catch { | ||
providerEvents.fire(.tunnelUpdateAttempt(.failure(error))) | ||
throw error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means we’ll bubble up rekey failures as mentioned in https://app.asana.com/0/1203137811378537/1207149691501092/f
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some testing. Works well so far but I'll be looking for edge cases.
Adding some comments to request changes.
The most important one is the one about stopping the reasserting.
} | ||
|
||
Task { | ||
guard let server = await self.lastSelectedServer else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we obtain the server before spawning the task, so we avoid spawning the task if there's no server?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lastSelectedServer
is a @MainActor
var so this needs to happen in an async context.
|
||
init(environment: VPNSettings.SelectedEnvironment = .default, isSubscriptionEnabled: Bool) { | ||
init(settings: VPNSettings, isSubscriptionEnabled: Bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I love this, though. I don't think we should support real-time updating of the environment while the VPN is connected.
If anything, we should discourage it.
If we want to change the environment the whole VPN should be stopped instead.
- How necessary is this change? Can we roll it back?
- Do we ever expect to support changing the selected environment while the connection is already established?
Note: this wasn't a problem before because it was possible to change environments, but we no longer can support this, and should in fact remove all the code that allows this to happen when the connection is up.
} catch { | ||
switch error { | ||
case FailureRecoveryError.noRecoveryNecessary: | ||
await reassertingControl?.stopReasserting() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this type of thing it's really important to ensure calls are always balanced. If a method starts, it also should stop on every exit. The default
error case doesn't stop reasserting again.
The reason why this is extremely important is because otherwise you can find issues like this one, where reasserting starts but never again stops.
I see two possible solutions to this:
- Add this right after starting the reasserting:
defer {
await reassertingControl?.stopReasserting()
}
- Since we're recovering through all the duration of the checks, why not start reasserting right before calling
incrementalPeriodicChecks
and stop reasserting right after that call? I guess this could be an issue if something else also stops reassertions, though... so option 1 may be better.
} | ||
count += 1 | ||
currentDelay = min((currentDelay * config.factor), config.maxDelay) | ||
} while count < config.times |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the expected behavior if this is exceeded?
My understanding from what I'm seeing is that:
- We try recovery 5 times (Each time we duplicate the previous delay, but never wait more than 5 minutes)
- Once the attempts all fail, we bail out, and the failure monitor will keep checking periodically, but we won't do any more active recovery.
Is this correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is correct. I was initially throwing an error that gets bubbled up to the PacketTunnelProvider. But I wasn’t reacting to it in any way so decided it wasn’t necessary. I guess we could send another pixel in that case, but I’d rather not just now as it wasn’t in the original scope and will require another privacy triage etc.
return | ||
} catch { | ||
os_log("🟢 Failure recovery failed. Retrying...", log: .networkProtectionServerFailureRecoveryLog, type: .info) | ||
try? await Task.sleep(interval: currentDelay) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not be ignored as its throw implies the task has been cancelled.
var currentDelay = config.initialDelay | ||
var count = 0 | ||
repeat { | ||
do { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since task cancellation is cooperative, right before the do
I'd suggest doing:
try Task.checkCancellation()
Sources/NetworkProtectionTestUtils/MockNetworkProtectionDeviceManagement.swift
Show resolved
Hide resolved
@@ -59,12 +63,12 @@ public actor NetworkProtectionDeviceManager: NetworkProtectionDeviceManagement { | |||
|
|||
private let isSubscriptionEnabled: Bool | |||
|
|||
public init(environment: VPNSettings.SelectedEnvironment, | |||
public init(settings: VPNSettings, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned above this should be rolled back - we can't support updating the environment without restarting the connection. This change can give the reader the false impression that's possible.
|
||
init(environment: VPNSettings.SelectedEnvironment = .default, isSubscriptionEnabled: Bool) { | ||
init(settings: VPNSettings, isSubscriptionEnabled: Bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also feel like replacing the environment
param with settings
should at least be considered outside the scope of this PR/branch. It already does a lot and refactoring this might lead to side effect.
eventHandler(.completed(.unhealthy)) | ||
await reassertingControl?.stopReasserting() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this snippet, the event is emitted before the stopReasserting()
is invoked. In the catch
block, it happens after the stopReasserting()
call. The order of operations should be consistent, if only for readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough, can change that.
Co-authored-by: Anh Do <18567+quanganhdo@users.noreply.github.com>
Co-authored-by: Anh Do <18567+quanganhdo@users.noreply.github.com>
…lure-detection-recovery
Please review the release process for BrowserServicesKit here.
Required:
Task/Issue URL: https://app.asana.com/0/72649045549333/1206669340827392/f
iOS PR: duckduckgo/iOS#2779
macOS PR: https://github.com/duckduckgo/macos-browser/pull/2657
What kind of version bump will this require?: Major
Description:
Steps to test this PR:
Network Protection
. The failure recovery logs are indicated with a 🟢 so you can also look for them.OS Testing:
—
Internal references:
Software Engineering Expectations
Technical Design Template