Concurrent update of Cluster Add-ons can lead to downtime #35

Open
guettli opened this issue Dec 4, 2023 · 0 comments · May be fixed by #64
Labels: Container (Issues or pull requests relevant for Team 2: Container Infra and Tooling), epic (Issues that are spread across multiple sprints)
guettli commented Dec 4, 2023

/kind bug

What steps did you take and what happened:

Currently, the Cluster Add-ons are updated concurrently with the upgrade of the cluster itself.

See clusteraddon_controller.go L182-199.

This leaves the ordering to chance. Any of three outcomes is possible:

  • An Add-on is upgraded before a new Kubernetes version is rolled out.
  • An Add-on is upgraded after a new Kubernetes version is rolled out.
  • An Add-on is upgraded at the same time as the new Kubernetes version.

The third case, a simultaneous upgrade, is especially dangerous.

If a fundamental Add-on such as the Container Network Interface (CNI) is upgraded during the rollout of a new Kubernetes version, connectivity between the nodes could be lost.

What did you expect to happen:

As a creator of Cluster-Stacks, I want to configure whether an add-on is upgraded before or after the upgrade to a new Kubernetes version.

As a creator of Cluster-Stacks, I want to configure that add-on X is upgraded (or deleted) after add-on Y.

As a creator of Cluster-Stacks, I want to be able to add additional steps before or after an upgrade. For example, I want to automatically create a backup before the upgrade happens.

Example

The CNI Cilium requires that a pre-flight check be executed before the new Cilium version is applied.

Currently, it is not possible to automate these steps:

  1. Apply the pre-flight check.
  2. Wait until the pre-flight check signals "OK to upgrade."
  3. Apply the upgrade.
  4. Remove the pre-flight check.

Up to now, these steps have required manual work.

Cluster API Lifecycle Hooks

The Cluster API Lifecycle Hooks could be used to implement these steps.

@batistein batistein transferred this issue from SovereignCloudStack/cluster-stacks Dec 4, 2023
@jschoone jschoone added the Container Issues or pull requests relevant for Team 2: Container Infra and Tooling label Dec 12, 2023
@jschoone jschoone added the epic Issues that are spread across multiple sprints label Dec 12, 2023
@jschoone jschoone linked a pull request Feb 8, 2024 that will close this issue
5 tasks
@janiskemper janiskemper linked a pull request Feb 14, 2024 that will close this issue
14 tasks
@janiskemper janiskemper removed a link to a pull request Feb 14, 2024
5 tasks
Projects
Status: Backlog