Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make VPN Network configurable #9597

Closed
wants to merge 19 commits into from

Conversation

timebertt
Copy link
Member

How to categorize this PR?

/area networking
/kind enhancement

What this PR does / why we need it:

This PR makes use of gardener/vpn2#78 to allow configuring the used VPN network on the seed-level via Seed.spec.networks.vpn.
The field is defaulted to the current hard-coded values and non-default network CIDRs must have the same size as the current hard-coded values. Network disjointedness with other seed and shoot networks is ensured in validation, admission, and scheduling.

See the umbrella issue and the doc commit for more details on motivation and conditions.

Which issue(s) this PR fixes:
Fixes #8987

Special notes for your reviewer:
Supersedes #8991

Release note:

Operators can configure the network range of the shoot VPN tunnel using the new `Seed.spec.networks.vpn` field. See [this document](https://github.com/gardener/gardener/blob/master/docs/usage/reversed-vpn-tunnel.md#configurable-vpn-network) for more details.

@gardener-prow gardener-prow bot added area/networking Networking related kind/enhancement Enhancement, improvement, extension labels Apr 16, 2024
@gardener-prow gardener-prow bot requested review from acumino and timuthy April 16, 2024 15:45
@gardener-prow gardener-prow bot added cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 16, 2024
@timebertt
Copy link
Member Author

TL;DR: I pushed a new commit which should fix the e2e test failures for HA clusters.


Explanation:
I figured out why the e2e tests for HA clusters were consistently failing.
The gardenlet logs pointed towards kube-apiserver being unable to talk to metrics-server:

2024-04-16T20:38:52.370616999Z stderr F {"level":"error","ts":"2024-04-16T20:38:52.370Z","msg":"Error","controller":"shoot","object":{"name":"e2e-rotate","namespace":"garden-local"},"namespace":"garden-local","name":"e2e-rotate","reconcileID":"4a09b9c2-0273-4f05-b899-a231bb2821ff","operation":"reconcile","flow":"Shoot cluster reconciliation","task":"Labeling resources after modification of encryption config or to encrypt them with new ETCD encryption key","error":"retry failed with context deadline exceeded, last error: error discovering server preferred resources: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request","stacktrace":"github.com/gardener/gardener/pkg/utils/flow.(*execution).runNode.func2\n\tgithub.com/gardener/gardener/pkg/utils/flow/flow.go:241"}

I could reproduce this locally:

$ k get apiservice v1beta1.metrics.k8s.io -oyaml
...
status:
  conditions:
  - lastTransitionTime: "2024-04-17T06:28:21Z"
    message: 'failing or missing response from https://10.3.1.18:8443/apis/metrics.k8s.io/v1beta1:
      Get "https://10.3.1.18:8443/apis/metrics.k8s.io/v1beta1": dial tcp 10.3.1.18:8443:
      connect: connection refused'
    reason: FailedDiscoveryCheck
    status: "False"
    type: Available

This was due to missing routes for the shoot networks in the kube-apiserver:

$ k -n shoot--local--e2e-rotate exec -it kube-apiserver-744ccc7b7c-cmcmc -c vpn-path-controller -- sh
~ # ip r
default via 169.254.1.1 dev eth0
10.5.0.0/26 dev tap0 proto kernel scope link src 10.5.0.8
10.5.0.64/26 dev tap1 proto kernel scope link src 10.5.0.72
10.5.0.192/26 dev bond0 proto kernel scope link src 10.5.0.214
169.254.1.1 dev eth0 scope link

$ k -n shoot--local--e2e-rotate logs kube-apiserver-744ccc7b7c-cmcmc vpn-path-controller
[Wed Apr 17 06:43:49 UTC 2024]: 194=err 195=err using
[Wed Apr 17 06:43:53 UTC 2024]: 194=err 195=err using

This in turn was due to the bonding device on the vpn-shoot-client side not using an IP from the configured VPN network but from the default network:

$ ks exec -it vpn-shoot-1 -c vpn-shoot-s1 -- sh
~ # ip a s dev bond0
6: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 56:1d:92:09:22:d6 brd ff:ff:ff:ff:ff:ff
    inet 192.168.123.194/26 scope global bond0
       valid_lft forever preferred_lft forever
    inet6 fe80::541d:92ff:fe09:22d6/64 scope link
       valid_lft forever preferred_lft forever

The bonding device was incorrectly configured because the VPN_NETWORK env var was missing in the vpn-shoot-init container of the vpn-shoot.
This is fixed in the latest commit.

@timebertt
Copy link
Member Author

/cc @axel7born @ScheererJ

pkg/apis/core/validation/seed_test.go Show resolved Hide resolved
pkg/apis/core/validation/seed.go Outdated Show resolved Hide resolved
pkg/apis/core/validation/seed.go Outdated Show resolved Hide resolved
pkg/apis/core/validation/seed.go Outdated Show resolved Hide resolved
pkg/apis/core/validation/seed_test.go Outdated Show resolved Hide resolved
pkg/gardenlet/operation/botanist/vpnshoot.go Outdated Show resolved Hide resolved
pkg/scheduler/controller/shoot/reconciler_test.go Outdated Show resolved Hide resolved
pkg/utils/validation/cidr/disjoint.go Outdated Show resolved Hide resolved
@axel7born
Copy link
Contributor

axel7born commented Apr 18, 2024

Thank you for your change.
To me it looks good.
I'm not so sure about the vpn vs VPN.

Also, I tested the change with a non-default range in the extension setup with HA VPN without issues.

@timebertt
Copy link
Member Author

Thanks @axel7born for your review. I addressed the suggestions and capitalized VPN consistently across the entire codebase.
PTAL :)

@axel7born
Copy link
Contributor

/lgtm

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Apr 19, 2024
@timebertt
Copy link
Member Author

/retest

Copy link
Contributor

@ScheererJ ScheererJ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the enhancement. I have a few questions/comments.

Comment on lines +22 to +37
## Configurable VPN Network

The CIDR of the VPN network can be configured using the `Seed.spec.networks.vpn` field.
The field configures the CIDR for all VPN tunnels of shoots running on the given seed.
It defaults to `192.168.123.0/24` for `IPv4` seeds and `fd8f:6d53:b97a:1::/120` for `IPv6` seeds.
The field is mutable and changing it leads to a temporary VPN disconnect during the next reconciliation of all shoots on this seed.

Similar to the other seed networks (see `Seed.spec.networks`), the configured VPN network must not overlap with any other seed network.
Also, shoot networks (see `Shoot.spec.networking`) must not overlap with any seed network including the VPN network.
This is required for packets to be routed without ambiguity in all components of the seed/shoot architecture.
Migrating the control plane of a shoot to a seed cluster with a different VPN network configuration works as long as network disjointedness is still fulfilled.

With the requirement for disjoint seed and shoot networks, Gardener users are limited in choosing network ranges for their shoots.
Gardener operators might want to use coherent ranges for their seed networks including the VPN network to give users more freedom of choice and make the limitations simpler to reason about.
For this, configuring a non-default VPN network on the seed-level allows Gardener operators to use for example the [IANA-Reserved IPv4 Shared Address Space](https://datatracker.ietf.org/doc/html/rfc6598#section-7) (`100.64.0.0/10`) for all seed networks (the ones that users can't choose for their shoots).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, the motivation to make the VPN range configurable seems to come from the desire to make as much private IP space as possible available to shoot owners. As the network used here is not routed within the seed/shoot, i.e. there are no packets on the wire with for example 192.168.123.1 as IP addresses, would it not be a better solution to simply switch the VPN to a pure IPv6 tunnel with potentially IPv4 or IPv6 workload to be transferred? Did you consider this approach?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the motivation to make the VPN range configurable seems to come from the desire to make as much private IP space as possible available to shoot owners

Correct.

As the network used here is not routed within the seed/shoot, i.e. there are no packets on the wire with for example 192.168.123.1 as IP addresses, would it not be a better solution to simply switch the VPN to a pure IPv6 tunnel with potentially IPv4 or IPv6 workload to be transferred?

We didn't consider this approach. Maybe we could have considered it if it had been suggested in the proposal issue 😄

Even if there are no packets on the wire with addresses from the VPN network, I assume that there must not be overlaps between seed/shoot networks and the VPN network. Otherwise, the routes will be ambiguous. Do I get this right?
If so, using an IPv6 VPN network might solve the problem for IPv4 shoots (obviously). Nevertheless, choosing a hard-coded IPv6 network will yield the same "problem" that it imposes additional restrictions (apart from seed networks) on what networks can be chosen for shoots. I quote "problem" because it will be less restrictive in IPv6.
I still think operators would like to put all "restricted" networks (seeds and VPN) into the same IP space, even for IPv6.
Feel free to correct me if my naive thinking doesn't make sense 😄

That being said, I would love to continue with the current approach as it's almost complete. I feel like it doesn't bring too much complexity that it's worth going back to the drawing board and implementing a totally different approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please continue with this approach!
Johannes is on vacation this week. But we had a conversation on Friday, where he said this idea should not block this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could have considered it if it had been suggested in the proposal issue 😄

As you know, feedback can come at any point in time. An idea that does not fit to your timeline is not automatically a bad one.

I disagree with your opinion that with IPv6 it would be a problem. If we would use the hard-coded ULA range as done now with IPv6-only clusters we would only have a very unlikely collision if someone else used the same range. (RFC 4193 requires the global id to be random.)
It would even be possible to reserve a separate global range or use some of the special IPv6 ranges to prevent any chance of collision at all. Therefore, I do believe that you can solve the problem of the scarce IPv4 private address space by using IPv6 for the VPN.
Feel free to convince me otherwise, though.

The overlap concerns between seed and shoot clusters with regards to node, pods and services have not a lot to do with the choice of the VPN range. In other words, the seed/shoot disjointness still needs to be ensured regardless what happens with the VPN.

As the VPN range is not routed in the network, I am not sure if operators are that much interested in it. Of course, if it causes other restrictions it may be relevant, but that is again only relevant in the context of scarce IPv4 space.

Feel free to continue with the approach. As @axel7born already mentioned, I have no hard feelings against this approach, but I do feel that we just use another set of band-aid instead of actually solving the problem.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An idea that does not fit to your timeline is not automatically a bad one.

Definitely not! I didn't want to say so.

I do feel that we just use another set of band-aid instead of actually solving the problem.

Agreed.

TBH, I do find your suggestion compelling but I lack some knowledge to comprehend it fully.
I would like to discuss the details of your idea. A meeting will probably be difficult this week, so I propose taking this to next week's hackathon :)

In the meantime, I will address the remaining comments in this PR so that we have a finished implementation to go back to if we decide on this approach after all.

pkg/apis/core/v1beta1/types_seed.go Outdated Show resolved Hide resolved
pkg/apis/core/validation/seed.go Show resolved Hide resolved
plugin/pkg/shoot/validator/admission.go Show resolved Hide resolved
pkg/gardenlet/operation/botanist/kubeapiserver.go Outdated Show resolved Hide resolved
pkg/gardenlet/operation/botanist/vpnseedserver.go Outdated Show resolved Hide resolved
example/gardener-local/gardenlet/values.yaml Outdated Show resolved Hide resolved
pkg/component/networking/vpn/shoot/shoot.go Outdated Show resolved Hide resolved
@gardener-prow gardener-prow bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 19, 2024
@gardener-prow gardener-prow bot requested a review from axel7born April 19, 2024 14:04
Copy link
Contributor

gardener-prow bot commented Apr 25, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from scheererj. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@timebertt
Copy link
Member Author

@axel7born I addressed all suggestions added by @ScheererJ. Can you take another look?

@gardener-prow gardener-prow bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 30, 2024
@gardener-prow gardener-prow bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 30, 2024
@timebertt
Copy link
Member Author

Rebased and re-generated pkg/apis/core/v1beta1/generated.pb.go to resolve conflicts.

@timebertt
Copy link
Member Author

/hold
We want to discuss the overall approach for solving the motivating problem, see #9597 (comment)

@gardener-prow gardener-prow bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 6, 2024
@gardener-ci-robot
Copy link
Contributor

The Gardener project currently lacks enough active contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:

  • After 15d of inactivity, lifecycle/stale is applied
  • After 15d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 7d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Mark this PR as rotten with /lifecycle rotten
  • Close this PR with /close

/lifecycle stale

@gardener-prow gardener-prow bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 21, 2024
@rfranzke
Copy link
Member

rfranzke commented May 21, 2024

@timebertt
Copy link
Member Author

Definitely! The pure IPv6-based tunnel is the superior solution to the problem.
/close

@gardener-prow gardener-prow bot closed this May 21, 2024
Copy link
Contributor

gardener-prow bot commented May 21, 2024

@timebertt: Closed this PR.

In response to this:

Definitely! The pure IPv6-based tunnel is the superior solution to the problem.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@timebertt timebertt deleted the configurable-vpn-network branch May 21, 2024 15:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/networking Networking related cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/enhancement Enhancement, improvement, extension lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make VPN Network Range Configurable
5 participants