
Kubelet fails to authenticate to apiserver due to expired certificate #65991

Closed
tshaynik opened this issue Jul 9, 2018 · 24 comments
Labels: kind/bug, sig/auth


tshaynik commented Jul 9, 2018

/kind bug
/sig auth

What happened:
My team is having an issue with TLS bootstrap, running Kubernetes 1.10.5. We set --experimental-cluster-signing-duration to 24h on the kube-controller-manager. Some nodes are being deallocated overnight, and when they come back up, kubelet goes into a failed state. It appears that it recognizes that the certificate has expired and attempts to bootstrap using the token from bootstrap.kubeconfig (so far so good), but then reuses the expired certificate and cannot authenticate to the apiserver. Here are the relevant logs from kubelet:

bootstrap.go:204] Part of the existing bootstrap client certificate is expired: 2018-07-06 12:32:00 +0000 UTC
bootstrap.go:58] Using bootstrap kubeconfig to generate TLS client cert, key and kubeconfig file
certificate_store.go:117] Loading cert/key pair from "/var/lib/kubelet/pki/kubelet-client-current.pem".
server.go:549] Starting client certificate rotation.
certificate_manager.go:216] Certificate rotation is enabled.
certificate_manager.go:287] Rotating certificates
manager.go:154] cAdvisor running in container: "/sys/fs/cgroup/cpu,cpuacct/system.slice/kubelet.service"
certificate_manager.go:299] Failed while requesting a signed certificate from the master: cannot create certificate signing request: certificatesigningrequests.certificates.k8s.io is forbidden: User "system:anonymous" cannot create certificatesigningrequests.certificates.k8s.io at the cluster scope

After removing the generated cert at /var/lib/kubelet/pki/kubelet-client-current.pem, kubelet was able to bootstrap properly, obtain a new cert and join the cluster.

rm /var/lib/kubelet/pki/kubelet-client-current.pem
systemctl restart kubelet

Removing just the kubeconfig, or any of the files in /var/lib/kubelet/pki/ other than kubelet-client-current.pem, and restarting kubelet did not work. Removing the entire /var/lib/kubelet/pki/ directory and restarting kubelet works as well.
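
For reference, one way to confirm that the cert on disk has expired (this assumes openssl is available on the node; kubelet-client-current.pem bundles the cert and key, and openssl x509 should read the first certificate block):

# Print the expiry of the kubelet's current client certificate
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -enddate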

What you expected to happen:
I expect that after kubelet recognizes that its certificate has expired, it should remove its certificate and successfully bootstrap with the token in bootstrap.kubeconfig. It should obtain a new, valid, signed certificate from the control plane and successfully authenticate to the apiserver.

How to reproduce it (as minimally and precisely as possible):

  1. Set the RotateKubeletClientCertificate flag on the kubelet and the feature gate on the kube-controller-manager (see the flag sketch after this list).
  2. Set the --experimental-cluster-signing-duration flag on the kube-controller-manager to a small duration.
  3. Start kubelet with bootstrap.kubeconfig file containing a token (that is present in the token authentication file passed to the apiserver) -- kubelet bootstraps successfully and is Ready.
  4. Stop kubelet before it has attempted to renew its certificate.
  5. Wait until kubelet's certificate has expired
  6. Restart kubelet
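
A minimal sketch of the flags involved in steps 1-3 (the service layout and file paths are illustrative, not from the original report):

# kube-controller-manager: sign kubelet client certs with a short lifetime
kube-controller-manager --experimental-cluster-signing-duration=24h ...

# kubelet: rotation feature gate plus a bootstrap kubeconfig holding the token
kubelet --feature-gates=RotateKubeletClientCertificate=true \
  --bootstrap-kubeconfig=/var/lib/kubelet/bootstrap.kubeconfig \
  --kubeconfig=/var/lib/kubelet/kubeconfig ...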

Anything else we need to know?:
Let me know if there are more logs and information that would be useful. Thanks a lot!

Environment:

  • Kubernetes version (use kubectl version): 1.10.5
  • Cloud provider or hardware configuration: Azure
  • OS (e.g. from /etc/os-release): CentOS 7.4
  • Kernel (e.g. uname -a): 3.10.0-693.11.6.el7.x86_64
  • Install tools: custom
  • Others:
k8s-ci-robot added the kind/bug and sig/auth labels and removed the needs-sig label on Jul 9, 2018

liggitt commented Jul 9, 2018

/cc @smarterclayton @mikedanese @awly

@smarterclayton

I'm sad, because I thought this was fixed.

/assign

@mikedanese

At a certain point we talked about a mode for the CertificateManager where it always uses its bootstrap kubeconfig to request a certificate. This is the type of pickle I hoped to avoid in that mode. I'm worried that the current method is too fragile to be used practically with small rotation periods.


djsly commented Jul 16, 2018

Hello, we currently set cert rotation to happen every 24h.

Our users are deallocating their VM overnight to reduce the cloud cost.

Anything we could look into as a quick fix? Or is the current recommendation not to use Cert Rotation until it moves out of Beta?

thanks!


djsly commented Jul 20, 2018

@mikedanese / @smarterclayton your input would be appreciated :)


xjantoth commented Aug 1, 2018

I am trying to do the same:

  1. In Kubernetes v1.11.0, rotateCertificates: true comes out of the box:
    kubeadm config view
    ...
    rotateCertificates: true
    ...
  2. Set --feature-gates:
    vim /etc/sysconfig/kubelet
    KUBELET_EXTRA_ARGS=--fail-swap-on=false --feature-gates=RotateKubeletClientCertificate=true,RotateKubeletServerCertificate=true
  3. But how do I set the --experimental-cluster-signing-duration flag when kubeadm starts (an illustrative sketch follows below)? We want to make sure that our K8s cluster will still be up after one year.

Could you guys suggest the right approach?
thanks!
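
For illustration only (not an answer from the thread): on a kubeadm cluster the controller-manager runs as a static pod, so one way to set that flag is to edit its manifest, assuming the default path; the kubelet restarts the static pod when the file changes:

vim /etc/kubernetes/manifests/kube-controller-manager.yaml
# then add under the container's command list:
#   - --experimental-cluster-signing-duration=8760h   # e.g. one year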

@awslovato

How does one go about rotating the actual apiserver-kubelet-client.crt? The API server won't start on my cluster because it was created at the same time as the kubelet.crt.
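
For reference, a quick way to see which certs in the pki directory have expired (the default kubeadm path is assumed):

for c in /etc/kubernetes/pki/*.crt; do
  echo "$c: $(openssl x509 -in "$c" -noout -enddate)"
done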


mythi commented Aug 13, 2018

certificate_manager.go:299] Failed while requesting a signed certificate from the master: cannot create certificate signing request: certificatesigningrequests.certificates.k8s.io is forbidden: User "system:anonymous" cannot create certificatesigningrequests.certificates.k8s.io at the cluster scope

An anonymous user requesting CSRs looks strange?


mythi commented Aug 16, 2018

I've spent some free cycles on this and it seems to be working OK for me on a kubeadm-built cluster with kubelet version v1.11.2.

What does this mean:

Install tools: custom

I've been using --experimental-cluster-signing-duration=10m and bootstrap tokens with different --ttl values passed to kubeadm token.
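
For reference, the token side of that test would look like this (the TTL value is illustrative):

kubeadm token create --ttl 15m   # short-lived bootstrap token to exercise expiry
kubeadm token list               # shows each token's expiration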


djsly commented Aug 16, 2018 via email


mythi commented Aug 16, 2018

I had just done systemctl stop kubelet for 10+ minutes. If the bootstrap token has expired too, bootstrapping fails, but kubelet is able to recover automatically once a new token is placed in my bootstrap-kubelet.conf.


djsly commented Aug 16, 2018

OK, that's weird. Maybe something new in v1.11.2. I will let @tshaynik verify and reply, since he's the one who did most of the work and investigation on our side.

Regarding

Install tools: custom

We install Kubernetes the hard way using a salt based solution.


mythi commented Aug 17, 2018

We install Kubernetes the hard way using a salt based solution.

OK, I'm not familiar with this.

I got almost the same error message you're seeing if I drop RBAC from the api-server's --authorization-mode.

@AlbertoPeon

@awslovato did you manage to recover your apiserver? if so, can you share how?

I have the same situation, the apiserver won't run due to the certificate being expired and the certificate cannot be renewed due to the apiserver being down :(

@awslovato

@awslovato did you manage to recover your apiserver? if so, can you share how?

I have the same situation, the apiserver won't run due to the certificate being expired and the certificate cannot be renewed due to the apiserver being down :(

Hi Albert, I did! On the Kube Slack I found out about using kubeadm to recycle the expired certs with kubeadm alpha phase certs apiserver. I did move the old certs first, though, as kubeadm will not work if it sees existing certs, as per:
kubernetes/kubeadm#649
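
For reference, a minimal sketch of that recovery (kubeadm 1.11/1.12-era subcommand; default kubeadm pki paths assumed):

# Move the expired certs aside; kubeadm refuses to regenerate over existing files
mv /etc/kubernetes/pki/apiserver.crt /etc/kubernetes/pki/apiserver.crt.old
mv /etc/kubernetes/pki/apiserver.key /etc/kubernetes/pki/apiserver.key.old
# Recreate the apiserver serving cert
kubeadm alpha phase certs apiserver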

Now my issue is that my users filled the Docker volume. I have to start all over now, and I'm not keen on that.

@joberget

Any update on this bug? We are not using kubeadm. The only solution is to manually remove the certs.

@vistalba

I ran into this problem too:

~# kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-24T06:51:33Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}

Output from journalctl -u kubelet says:

Nov 21 11:03:37 srv1 systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Nov 21 11:03:37 srv1 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 38.
Nov 21 11:03:37 srv1 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Nov 21 11:03:37 srv1 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Nov 21 11:03:37 srv1 kubelet[4831]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kub
Nov 21 11:03:37 srv1 kubelet[4831]: Flag --resolv-conf has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubel
Nov 21 11:03:37 srv1 kubelet[4831]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kub
Nov 21 11:03:37 srv1 kubelet[4831]: Flag --resolv-conf has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubel
Nov 21 11:03:37 srv1 kubelet[4831]: I1121 11:03:37.234691    4831 server.go:408] Version: v1.12.2
Nov 21 11:03:37 srv1 kubelet[4831]: I1121 11:03:37.235049    4831 plugins.go:99] No cloud provider specified.
Nov 21 11:03:37 srv1 kubelet[4831]: E1121 11:03:37.236862    4831 bootstrap.go:205] Part of the existing bootstrap client certificate is expired: 2018-11-19 10:41:48 +0000 UTC
Nov 21 11:03:37 srv1 kubelet[4831]: F1121 11:03:37.237238    4831 server.go:262] failed to run Kubelet: unable to load bootstrap kubeconfig: stat /etc/kubernetes/bootstrap-kubelet.conf: no such file or directory
Nov 21 11:03:37 srv1 systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Nov 21 11:03:37 srv1 systemd[1]: kubelet.service: Failed with result 'exit-code'.

Is there any solution? My "pods" are still running... but I can't use kubectl, and if I restart my server, it won't start again :( (so I reverted to a snapshot)

@mosaicwang

maybe kubernetes/kubeadm#581 (comment) can solve your problem.


liggitt commented Jan 2, 2019

For the original issue,
3689593 was added to resolve that in 1.11

@vistalba if both the normal kubelet client certificate and the bootstrap client credential are expired, going back to your setup method to recreate a credential for the kubelet is the best recommendation

/close

@k8s-ci-robot

@liggitt: Closing this issue.

In response to this:

For the original issue,
3689593 was added to resolve that in 1.11

@vistalba if both the normal kubelet client certificate and the bootstrap client credential are expired, going back to your setup method to recreate a credential for the kubelet is the best recommendation

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


SaltedEggIndomee commented Apr 25, 2019

How is this resolved? I'm still experiencing this exact error on Kubernetes 1.12.7 running on EKS.

server.go:262] failed to run Kubelet: failed to create kubelet: failed to initialize certificate manager: failed to initialize server certificate manager: could not decode the first block from "/var/lib/kubelet/pki/kubelet-server-current.pem" from expected PEM format

Upon checking, kubelet-server-current.pem points to an empty file, and kubelet then fails to start.

Workaround:

  1. rm /var/lib/kubelet/pki/kubelet-server-current.pem
  2. systemctl restart kubelet

Surprisingly, this only happens on one specific node. All the nodes are deployed using the same configuration.


awly commented Apr 25, 2019

@SaltedEggIndomee your error message complains about the kubelet serving certificate, while your workaround removes the client one. Is that a typo in the workaround steps?

In any case, the serving certificate being empty and preventing kubelet startup is a separate issue; this one is about client certificates. Please file a new issue.


gdoctor commented May 6, 2019

I can see a very similar thing happening in 1.14.x. I joined a 1.14.x node to a cluster, which completes the TLS bootstrapping process successfully and stores current, signed certs at /var/lib/kubelet/pki/kubelet-client-current.pem.

I then delete the created /etc/kubernetes/kubelet.conf and join the node to a different 1.14.x cluster moments later. The kubelet fails to complete the TLS bootstrapping process for the new cluster. But by also deleting /var/lib/kubelet/pki/kubelet-client-current.pem before joining the new cluster, the whole process completes as intended.

Perhaps someone with more knowledge of this process can shed some light on the reasoning here. BTW, I did not experience this with 1.13.x.

@RahulMahale

I am on bare-metal Kubernetes version 1.15.3 and the steps below helped me solve the issue (a consolidated command sketch follows the list).

  1. SSH to one of the master nodes.
  2. Create a token using the command kubeadm token create --ttl 24h0m0s.
  3. Capture the token output from step 2.
  4. SSH to the worker (kubelet) node that is having issues connecting to the API server.
  5. Replace the token in the file /etc/kubernetes/bootstrap-kubelet.conf with the token from step 2.
  6. Restart the kubelet process using sudo service kubelet restart. It will generate a new kubelet.conf and attach the kubelet to the cluster.
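
A consolidated sketch of those steps (paths per the kubeadm defaults; the token value is whatever step 2 printed):

# on a master node
kubeadm token create --ttl 24h0m0s
# on the affected worker: swap the old token for the new one, then restart
vim /etc/kubernetes/bootstrap-kubelet.conf   # replace the token field
sudo service kubelet restart                 # regenerates /etc/kubernetes/kubelet.conf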
