Make the kubernetes.kubelet.cpuManagerPolicy field immutable #9265

Open
ialidzhikov opened this issue Feb 27, 2024 · 6 comments
Labels
area/quality Output qualification (tests, checks, scans, automation in general, etc.) related kind/bug Bug

Comments

@ialidzhikov
Member

How to categorize this issue?

/area quality
/kind bug

What happened:
Together with @adenitiu and @nickytd, we discovered that changing the kubernetes.kubelet.cpuManagerPolicy field of an existing worker pool breaks the kubelet.
After the change, the kubelet fails to start and logs the following errors:

E0227 12:31:16.556082   73381 kubelet.go:1466] "Failed to start ContainerManager" err="start cpu manager error: could not restore state from checkpoint: configured policy \"none\" differs from state checkpoint policy \"static\", please drain this node and delete the CPU manager checkpoint file \"/var/lib/kubelet/cpu_manager_state\" before restarting Kubelet"
E0227 12:31:16.556064   73381 cpu_manager.go:224] "Could not initialize checkpoint manager, please drain node and remove policy state file" err="could not restore state from checkpoint: configured policy \"none\" differs from state checkpoint policy \"static\", please drain this node and delete the CPU manager checkpoint file \"/var/lib/kubelet/cpu_manager_state\" before restarting Kubelet"
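
The mismatch reported in these logs can be detected before the kubelet is restarted. The following is a minimal sketch, assuming the checkpoint file is JSON with a `policyName` field; the helper `PolicyMismatch` and the sample payload are hypothetical and only illustrate the comparison, not the kubelet's actual implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// checkpoint models only the field this sketch needs from
// /var/lib/kubelet/cpu_manager_state (assumed to be JSON).
type checkpoint struct {
	PolicyName string `json:"policyName"`
}

// PolicyMismatch (hypothetical helper) reports whether the policy stored in
// the checkpoint differs from the policy the kubelet is configured with.
func PolicyMismatch(checkpointJSON []byte, configured string) (bool, error) {
	var cp checkpoint
	if err := json.Unmarshal(checkpointJSON, &cp); err != nil {
		return false, fmt.Errorf("parse checkpoint: %w", err)
	}
	return cp.PolicyName != configured, nil
}

func main() {
	// Illustrative payload only; the checksum value is made up.
	sample := []byte(`{"policyName":"static","defaultCpuSet":"0-3","checksum":0}`)
	mismatch, err := PolicyMismatch(sample, "none")
	fmt.Println(mismatch, err)
}
```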

The Node events show that the kubelet is being restarted continuously:

 % k describe no shoot--foo--bar-worker-z2-5544c-sd4nh

  Normal   Starting                 3m2s                   kubelet                        Starting kubelet.
  Normal   Starting                 2m57s                  kubelet                        Starting kubelet.
  Normal   Starting                 2m51s                  kubelet                        Starting kubelet.
  Normal   Starting                 2m46s                  kubelet                        Starting kubelet.
  Normal   Starting                 2m40s                  kubelet                        Starting kubelet.
  Warning  InvalidDiskCapacity      2m40s                  kubelet                        invalid capacity 0 on image filesystem
  Warning  kubelet                  2m39s (x78 over 122m)  healthcheck                    Kubelet is unhealthy for more than 1m0s, restarting it. Health check error: Get "http://127.0.0.1:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused
  Normal   Starting                 2m39s                  kubelet                        Starting kubelet.
  Normal   Starting                 2m33s                  kubelet                        Starting kubelet.
  Normal   Starting                 2m27s                  kubelet                        Starting kubelet.
  Warning  InvalidDiskCapacity      2m27s                  kubelet                        invalid capacity 0 on image filesystem
  Normal   Starting                 2m22s                  kubelet                        Starting kubelet.
  Normal   Starting                 2m16s                  kubelet                        Starting kubelet.
  Normal   Starting                 2m11s                  kubelet                        Starting kubelet.
  Normal   Starting                 2m5s                   kubelet                        Starting kubelet.
  Normal   Starting                 2m                     kubelet                        Starting kubelet.
  Normal   Starting                 114s                   kubelet                        Starting kubelet.
  Normal   Starting                 109s                   kubelet                        Starting kubelet.
  Warning  InvalidDiskCapacity      109s                   kubelet                        invalid capacity 0 on image filesystem
  Normal   Starting                 103s                   kubelet                        Starting kubelet.
  Warning  InvalidDiskCapacity      103s                   kubelet                        invalid capacity 0 on image filesystem
  Normal   Starting                 98s                    kubelet                        Starting kubelet.
  Warning  InvalidDiskCapacity      98s                    kubelet                        invalid capacity 0 on image filesystem
  Normal   Starting                 92s                    kubelet                        Starting kubelet.
  Warning  InvalidDiskCapacity      92s                    kubelet                        invalid capacity 0 on image filesystem
  Normal   Starting                 87s                    kubelet                        Starting kubelet.
  Warning  InvalidDiskCapacity      87s                    kubelet                        invalid capacity 0 on image filesystem
  Normal   Starting                 81s                    kubelet                        Starting kubelet.
  Normal   Starting                 76s                    kubelet                        Starting kubelet.
  Normal   Starting                 70s                    kubelet                        Starting kubelet.
  Warning  InvalidDiskCapacity      70s                    kubelet                        invalid capacity 0 on image filesystem
  Normal   Starting                 69s                    kubelet                        Starting kubelet.
  Normal   Starting                 63s                    kubelet                        Starting kubelet.
  Normal   Starting                 57s                    kubelet                        Starting kubelet.
  Warning  InvalidDiskCapacity      57s                    kubelet                        invalid capacity 0 on image filesystem
  Warning  FailedNetworkChecks      55s (x27 over 101m)    network-problem-detector-host  host network problems for jobID/destination combinations: tcp-n2p/shoot--hc-cc-us1--prod-cc-haas-hana-z1-549ff-7wlvp
  Normal   Starting                 52s                    kubelet                        Starting kubelet.
  Normal   Starting                 46s                    kubelet                        Starting kubelet.
  Normal   Starting                 41s                    kubelet                        Starting kubelet.
  Warning  InvalidDiskCapacity      41s                    kubelet                        invalid capacity 0 on image filesystem
  Normal   Starting                 35s                    kubelet                        Starting kubelet.
  Normal   Starting                 30s                    kubelet                        Starting kubelet.
  Normal   Starting                 24s                    kubelet                        Starting kubelet.
  Warning  InvalidDiskCapacity      24s                    kubelet                        invalid capacity 0 on image filesystem
  Normal   Starting                 19s                    kubelet                        Starting kubelet.
  Normal   Starting                 13s                    kubelet                        Starting kubelet.
  Warning  InvalidDiskCapacity      13s                    kubelet                        invalid capacity 0 on image filesystem
  Normal   Starting                 8s                     kubelet                        Starting kubelet.
  Normal   Starting                 2s                     kubelet                        Starting kubelet.

What you expected to happen:
The kubernetes.kubelet.cpuManagerPolicy field to be immutable.
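
One way the expected behaviour could look is an update validation that rejects any change to the policy. This is a sketch under stated assumptions: `KubeletConfig` and `ValidateCPUManagerPolicyUpdate` are hypothetical names, not Gardener's actual API types or validation functions. An unset policy is treated as `none`, which is the kubelet default.

```go
package main

import (
	"errors"
	"fmt"
)

// KubeletConfig is a hypothetical, trimmed-down stand-in for the
// kubelet settings of a worker pool.
type KubeletConfig struct {
	CPUManagerPolicy *string
}

var errImmutable = errors.New("kubernetes.kubelet.cpuManagerPolicy is immutable")

// policyOrDefault treats an unset policy as "none", the kubelet default.
func policyOrDefault(c *KubeletConfig) string {
	if c == nil || c.CPUManagerPolicy == nil {
		return "none"
	}
	return *c.CPUManagerPolicy
}

// ValidateCPUManagerPolicyUpdate (hypothetical) returns an error when the
// effective policy differs between the old and the new configuration.
func ValidateCPUManagerPolicyUpdate(oldCfg, newCfg *KubeletConfig) error {
	oldPolicy, newPolicy := policyOrDefault(oldCfg), policyOrDefault(newCfg)
	if oldPolicy != newPolicy {
		return fmt.Errorf("%w: %q -> %q", errImmutable, oldPolicy, newPolicy)
	}
	return nil
}

func main() {
	static, none := "static", "none"
	err := ValidateCPUManagerPolicyUpdate(
		&KubeletConfig{CPUManagerPolicy: &static},
		&KubeletConfig{CPUManagerPolicy: &none},
	)
	fmt.Println(err)
}
```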

How to reproduce it (as minimally and precisely as possible):

  1. Create a worker pool with kubernetes.kubelet.cpuManagerPolicy=static.

  2. Change the kubernetes.kubelet.cpuManagerPolicy field to none.

  3. Observe that the kubelet fails to start with:

E0227 12:31:16.556082   73381 kubelet.go:1466] "Failed to start ContainerManager" err="start cpu manager error: could not restore state from checkpoint: configured policy \"none\" differs from state checkpoint policy \"static\", please drain this node and delete the CPU manager checkpoint file \"/var/lib/kubelet/cpu_manager_state\" before restarting Kubelet"
E0227 12:31:16.556064   73381 cpu_manager.go:224] "Could not initialize checkpoint manager, please drain node and remove policy state file" err="could not restore state from checkpoint: configured policy \"none\" differs from state checkpoint policy \"static\", please drain this node and delete the CPU manager checkpoint file \"/var/lib/kubelet/cpu_manager_state\" before restarting Kubelet"

Anything else we need to know?:

Environment:

  • Gardener version: v1.88.0
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:
@gardener-prow gardener-prow bot added area/quality Output qualification (tests, checks, scans, automation in general, etc.) related kind/bug Bug labels Feb 27, 2024
@ialidzhikov
Member Author

/assign

@syy6

syy6 commented Feb 28, 2024

Hi @ialidzhikov, we still need to be able to change this field, so making it immutable would cause other issues for us. A better approach would be to allow the change only when the worker group has 0 nodes and to reject it otherwise.
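
The conditional rule proposed here could be sketched as follows. `CheckPolicyChange` and its `nodeCount` parameter are hypothetical; how the current node count of a worker group would be obtained is left open.

```go
package main

import "fmt"

// CheckPolicyChange (hypothetical) allows a cpuManagerPolicy change only
// when the worker group currently has no nodes.
func CheckPolicyChange(oldPolicy, newPolicy string, nodeCount int) error {
	if oldPolicy == newPolicy {
		return nil // nothing changes, nothing to reject
	}
	if nodeCount > 0 {
		return fmt.Errorf(
			"cannot change cpuManagerPolicy from %q to %q: worker group still has %d node(s)",
			oldPolicy, newPolicy, nodeCount,
		)
	}
	return nil
}

func main() {
	fmt.Println(CheckPolicyChange("static", "none", 3))
	fmt.Println(CheckPolicyChange("static", "none", 0))
}
```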

@rfranzke
Member

What about triggering a rolling update of the nodes when this field is changed?

@syy6

syy6 commented May 2, 2024

Hi @ialidzhikov, is there any update on this issue? Thanks!

@ialidzhikov
Member Author

Sorry, I won't have the capacity to look into this issue in the coming weeks due to other priorities.

/unassign

@rfranzke
Member

rfranzke commented May 8, 2024

In case we would like to pursue the node rolling approach, I assume we have to wait for #9699 first.

cc @MichaelEischer @timebertt @kon-angelo - perhaps you want to consider this cpuManagerPolicy field in the new "hash function" right away?
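
To illustrate the idea of including the field in a hash, here is a rough sketch. It is not the hash function discussed in #9699; `WorkerPool`, its fields, and `PoolHash` are hypothetical. The point is only that a policy change yields a different hash, which could then be used to trigger a node rollout.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// WorkerPool is a hypothetical subset of the settings that feed the hash.
type WorkerPool struct {
	Name             string
	MachineType      string
	CPUManagerPolicy string
}

// PoolHash (hypothetical) derives a short hash from the pool settings.
// An empty policy is normalised to "none" so that setting the default
// explicitly does not change the hash.
func PoolHash(p WorkerPool) string {
	policy := p.CPUManagerPolicy
	if policy == "" {
		policy = "none"
	}
	data := strings.Join([]string{p.Name, p.MachineType, policy}, "\x00")
	sum := sha256.Sum256([]byte(data))
	return hex.EncodeToString(sum[:])[:10]
}

func main() {
	before := WorkerPool{Name: "worker", MachineType: "m5.large", CPUManagerPolicy: "static"}
	after := before
	after.CPUManagerPolicy = "none"
	// Differing hashes indicate that the nodes would need to be rolled.
	fmt.Println(PoolHash(before) != PoolHash(after))
}
```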
