Rework OperatingSystemConfigKey and WorkerPoolHash to allow considering kubeReserved #9699

MichaelEischer (Contributor) opened this issue May 2, 2024 · 1 comment
How to categorize this issue?

/area robustness
/kind enhancement

Suggested approach for implementing the "Rolling update of the worker pool when critical kubelet configuration changed" step from #2590.

Summary

To roll worker node pools when resource reservations managed via kubeReserved change, the calculation of the OperatingSystemConfig key and of the WorkerPoolHash must be versioned. Versioning ensures that worker pools are only rolled when kubeReserved actually changes, and not unnecessarily at the moment kubeReserved starts to be considered for node rolls.

Motivation

Changes of kubeReserved for existing clusters currently happen in-place: they are applied by restarting the kubelet on each node with the new resource reservations. This can cause immediate preemptions on already loaded nodes. In particular, PodDisruptionBudgets are not considered, which can lead to workload disruptions. To move existing workloads to new node resource reservations with minimal disruption, we want to roll the worker nodes and use the updated reservations only on new nodes. This requires rolling the worker pool and switching to a new OperatingSystemConfig (OSC), which includes the kubeReserved value. The new OSC must use a different name to prevent already existing nodes from applying the new kubeReserved values.

#2590 introduces a new way to calculate default kubeReserved values. Upgrading to these new resource reservations with minimal disruptions requires the previously mentioned mechanism. However, the first attempt in #9465 was unable to handle the initial rollout without disruptions.

Problem

Worker pool rolls are triggered if the WorkerPoolHash changes. To consider new fields in the WorkerPoolHash, the current approach is to add a new optional field to the extensionsv1alpha1.Worker objects. The field is then only included in the WorkerPoolHash if it is set. That way, a node pool roll is only triggered once the new feature/field is actually used.

A worker pool must only be rolled if required by changed settings of the worker pool, that is, it MUST NOT roll unnecessarily when upgrading the WorkerPoolHash calculation.

The optional-field approach does not work for kubeReserved: it always has a value, which may differ from the static defaults used by Gardener, so including it in the WorkerPoolHash would trigger an immediate node roll.

In addition, the OperatingSystemConfig key (OSCKey) must also change to ensure that only new workers pick up the new configuration. Currently, this requires manually keeping the OSCKey and the WorkerPoolHash in sync, such that each change of the OSCKey also coincides with a node pool roll. Instead, the WorkerPoolHash should include an OSC-specific hash as input to trigger a node roll whenever the OSC key changes.

As the OSC key must also change if kubeReserved changes, this shifts the problem of keeping the WorkerPoolHash stable to keeping the OSC key stable.

Goals

  • Extract provider-independent attributes from the WorkerPoolHash calculation to gardenlet.
  • Trigger a node roll if kubeReserved changes.
  • But do not roll all nodes when initially rolling out the new hash calculation.

Non-Goals

  • Introduce a pattern that extensions can follow to handle new node roll triggers in their providerConfig similar to kubeReserved.

Proposal

The central idea is to version the WorkerPoolHash and the OSCKey calculation. Already existing worker pools and OSCs must stick to the old hash version. If kubeReserved changes, the worker pool is upgraded to the new hash version. The necessary state to track the used hash version is stored in a single secret for each shoot.

As the Worker configuration and therefore the WorkerPoolHash are tied to a specific OSC, we'll start by discussing the OSCKey calculation and versioning.

OSCKey Hash Calculation

We propose to provide two OSCKey hash versions:

  • Version 1 (current behavior): calculate the hash based on worker.Name, minorKubernetesVersion, worker.CRI, and worker.Machine.Image.Name. The resulting value must be identical to the current result.
  • Version 2: use the format gardener-node-agent-<worker.Name>-hash(worker.CRI, machineType, volume type+size, worker.Machine.Image.Name+Version, minorKubernetesVersion, credentialsRotationStatus, nodeLocalDNS, kubeReserved)[:16]-<suffix> (see the sketch after this list).
    • This includes all provider-independent node roll triggers that were previously included in the WorkerPoolHash.
    • Note: the image name is no longer included in the OSC name.
    • Maximum length: 61 characters (worker.Name is limited to 15 characters, the suffix to at most 8).
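
For illustration, a minimal sketch of the version 2 key calculation in Go. The digest choice (SHA-256), the field separator, and the function name are assumptions of this sketch; only the overall key format follows the proposal.

package osckey

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// computeOSCKeyV2 assembles gardener-node-agent-<worker.Name>-<hash>-<suffix>
// from the provider-independent node roll triggers listed above.
func computeOSCKeyV2(workerName, suffix string, rollTriggers ...string) string {
	sum := sha256.Sum256([]byte(strings.Join(rollTriggers, "/")))
	hash := hex.EncodeToString(sum[:])[:16]
	// With worker.Name limited to 15 characters and a suffix of at most 8,
	// the resulting key stays within the documented maximum length of 61.
	return fmt.Sprintf("gardener-node-agent-%s-%s-%s", workerName, hash, suffix)
}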

OSCKey Versioning

gardenlet stores a secret called pool-hashes in the shoot namespace of the hosting seed. The secret's data field records, for each worker pool, the OSCKey hash version currently in use, together with the hash values calculated with both that version and the latest version supported by Gardener.

kind: Secret
metadata:
  name: pool-hashes
  namespace: shoot--project--shootname
  labels:
    persist: "true" # -> store and migrate during control plane migration
stringData:
  data: |
    pools:
    - name: a
      currentVersion: 1
      hashes:
        "1": fede
        "2": abcd
    - name: b
      currentVersion: 2
      hashes:
        "2": dada

The secret is read by gardenlet while reconciling OSCs for a shoot and is updated before writing the updated OSCs. The secret includes an entry for each worker pool in the shoot; worker pools are matched by name. An individual entry is updated as follows (see the sketch after this list):

  • If no entry exists for a worker pool or the secret as a whole does not exist, then create a new entry that uses the latest hash version as currentVersion.
  • Calculate the current hash value for each hash version included in the hashes field. If any of those hash values changes, then set currentVersion to the latest supported version.
  • Update the hashes field to include the calculated hash value using the currentVersion and the latest version supported by Gardener. Remove hashes for other versions.
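
A minimal sketch of these update rules in Go. The PoolHashes type, the latestVersion constant, and the calc callback are illustrative assumptions standing in for the real gardenlet code.

package poolhashes

const latestVersion = 2

// PoolHashes mirrors one pool entry of the secret's data field.
type PoolHashes struct {
	Name           string
	CurrentVersion int
	Hashes         map[int]string // hash version -> calculated value
}

// updateEntry applies the rules above to a single pool entry. calc stands
// in for the real per-version hash calculation; a nil entry models a
// missing secret or a pool without an entry.
func updateEntry(entry *PoolHashes, name string, calc func(version int) string) *PoolHashes {
	if entry == nil {
		// Unknown pool: start at the latest supported hash version.
		entry = &PoolHashes{Name: name, CurrentVersion: latestVersion}
	}
	// If any stored hash value changed, upgrade to the latest version.
	for version, stored := range entry.Hashes {
		if calc(version) != stored {
			entry.CurrentVersion = latestVersion
			break
		}
	}
	// Keep only the hashes for currentVersion and the latest version.
	entry.Hashes = map[int]string{
		entry.CurrentVersion: calc(entry.CurrentVersion),
		latestVersion:        calc(latestVersion),
	}
	return entry
}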

Currently, secrets with the persist label must also be labeled with managed-by: secrets-manager to be migrated during the control plane migration. To migrate the pool-hashes secret, the current managed-by: secrets-manager filter must be removed from computeSecretsToPersist.
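
A hypothetical sketch of how the selection could look with controller-runtime once the managed-by filter is dropped; the function and variable names are assumptions, not the actual computeSecretsToPersist code.

package migration

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// secretsToPersist lists all secrets carrying the persist label, without
// additionally requiring the managed-by: secrets-manager label.
func secretsToPersist(ctx context.Context, c client.Client, namespace string) ([]corev1.Secret, error) {
	secretList := &corev1.SecretList{}
	if err := c.List(ctx, secretList,
		client.InNamespace(namespace),
		client.MatchingLabels{"persist": "true"},
	); err != nil {
		return nil, err
	}
	return secretList.Items, nil
}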

For the initial rollout of this secret, gardenlet creates on startup a pool-hashes secret for each shoot, based on the worker pools that currently exist in the shoot spec. For each worker pool, only the name field is included and currentVersion is set to 1. The hashes field is not set; the next OSC reconciliation will add the missing hash values.
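
A short sketch of this startup migration, reusing the assumed PoolHashes type from the sketch above.

// initialPoolHashes builds the startup entries: only the pool name is set
// and currentVersion is pinned to 1; Hashes stays empty until the next OSC
// reconciliation fills it in.
func initialPoolHashes(poolNames []string) []PoolHashes {
	entries := make([]PoolHashes, 0, len(poolNames))
	for _, name := range poolNames {
		entries = append(entries, PoolHashes{Name: name, CurrentVersion: 1})
	}
	return entries
}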

The rationale for the fields is as follows:

  • kubeReserved is a property of each worker pool and thus must be stored at this granularity.
  • The currentVersion of the hash must be stored to prevent unnecessary changes of the OSCKey.
  • The previous hashes must be stored to allow fields that are only included in a new hash version to trigger a node roll. For example, kubeReserved is only included in hash version 2. However, changing the value should nevertheless trigger a hash version upgrade along with a node roll. A change of kubeReserved can only be detected by storing the hash (or its underlying information) as calculated with version 2.
  • When introducing a new hash version, the missing hashes are only added during OSC reconciliation. Consequently, changes to fields that are only included in the new hash will only trigger a node roll after the first successful OSC reconciliation.
  • The secret is marked with "persist" to ensure that it is migrated during a control plane migration.

WorkerPoolHash

The WorkerPool of an extensionsv1alpha1.Worker is extended with an oscHash field. This field is set to the current hash value of the corresponding OSC, unless the OSC still uses hash version 1.

The WorkerPoolHash calculation works differently depending on whether oscHash is set or not.

  • oscHash is empty: continue using the current WorkerPoolHash calculation.
    • This includes all provider-independent node roll triggers (as before).
  • oscHash is set: the WorkerPoolHash calculation only uses the oscHash and provider-extension-specific additional fields as input. The latter have to be passed in explicitly by the extension; the raw value of workerPool.ProviderConfig.Raw is no longer added to the hash.
    • Previously used fields like the Kubernetes minor version are already covered by the oscHash.

The OSC for previously existing worker pools uses hash version 1, so the WorkerPoolHash remains unchanged when this change is initially rolled out. A sketch of the new calculation follows the example below.

apiVersion: extensions.gardener.cloud/v1alpha1
kind: Worker
metadata:
  name: example
spec:
  pools:
  - name: "a"
    oscHash: "fede" # gardener hash part (same value as OSC name hash), empty if v1 is used for OSC
    kubernetesVersion: 1.28.9
    machineImage:
      name: coreos
      version: 3815.2.1
    kubeReserved:
      cpu: 80m
      memory: 8Gi
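
A minimal sketch of the branching described above. The Pool type, the digest choice, and workerPoolHashV1 are illustrative stand-ins, not the actual Gardener code.

package workerpool

import (
	"crypto/sha256"
	"encoding/hex"
	"strings"
)

// Pool is a reduced stand-in for extensionsv1alpha1.WorkerPool.
type Pool struct {
	Name    string
	OSCHash string
}

func digest(parts ...string) string {
	sum := sha256.Sum256([]byte(strings.Join(parts, "/")))
	return hex.EncodeToString(sum[:])[:5]
}

// WorkerPoolHash uses only the oscHash plus the extension's explicitly
// passed additionalData when oscHash is set, and falls back to the legacy
// calculation otherwise.
func WorkerPoolHash(pool Pool, additionalData ...string) string {
	if pool.OSCHash == "" {
		return workerPoolHashV1(pool, additionalData...) // current calculation
	}
	return digest(append([]string{pool.OSCHash}, additionalData...)...)
}

// workerPoolHashV1 is a placeholder for the existing version 1 calculation
// over all provider-independent roll triggers and the raw providerConfig.
func workerPoolHashV1(pool Pool, additionalData ...string) string {
	return digest(append([]string{pool.Name}, additionalData...)...)
}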

Removal of Legacy Hashes

Legacy hash versions can only be removed once we can guarantee that there are no more users. The only way to ensure this is to wait until all currently supported Kubernetes versions are no longer supported by Gardener. Then it is guaranteed that a node roll has happened since the introduction of the new hash version, and thereby that the hash version of all OSCs has been upgraded.

OSCKey Label for Shoots

The shoot health checks in botanist currently have to calculate the OSCKey based on information annotated on each node. This will no longer work with the aforementioned changes. As a replacement, each node is labelled with worker.gardener.cloud/operatingsystemconfig, which contains the name of the corresponding OSC. Thereby, the health checks no longer require knowledge of how to calculate the OSC name/key.

The label is included in the Worker extension object and therefore will be added to all nodes on the next reconciliation. For a smooth migration, the health check initially has to fall back to the current approach of calculating the OSCKey itself. This fallback can be removed after a transition period of a few Gardener versions.
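
A minimal sketch of the health check reading the label with the transition-period fallback; the label key is from the proposal, the function names are assumptions.

package healthcheck

import corev1 "k8s.io/api/core/v1"

const oscLabel = "worker.gardener.cloud/operatingsystemconfig"

// oscKeyForNode reads the OSC name from the node label and falls back to
// the legacy calculation for nodes that have not been relabelled yet; the
// legacy callback is an illustrative stand-in.
func oscKeyForNode(node *corev1.Node, legacy func(*corev1.Node) string) string {
	if key, ok := node.Labels[oscLabel]; ok {
		return key
	}
	return legacy(node) // transition-period fallback
}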

Alternatives

  • Add a flag for each worker pool that tracks whether kubeReserved still uses the default value (and is then ignored by the WorkerPoolHash calculation). This is rather ugly as it requires keeping an additional field for each worker pool.
  • Use the Gardener Node Agent to upgrade kubeReserved in-place. Changing kubeReserved requires a restart of the kubelet and results in immediate preemptions of pods if not enough resources are available. Existing mechanisms like maxSurge or PDBs would be ignored.
  • Only include kubeReserved in the WorkerPoolHash starting from Kubernetes >= 1.30. Rolling out the change to all clusters this way would take more than a year.

Implementation Steps

Draft:

  • Label nodes with OSCKey
  • Implement the OSCKey versioning with the pool-hashes secret, but only implement version 1 of the hash
  • Implement hash version 2 along with the new WorkerPoolHash
  • Bump Gardener version in all provider extensions
@timebertt (Member) commented:
cc @rfranzke @kon-angelo
This should reflect the results of our discussions :)
