
kubeadm reset success but this node ip still in kubeadm-config configmap #1300

Closed · Fixed by kubernetes/kubernetes#71945
pytimer opened this issue Dec 5, 2018 · 32 comments
Labels: help wanted · kind/bug · priority/important-soon

Comments

@pytimer

pytimer commented Dec 5, 2018

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version):

[root@k8s-211 ~]# kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.0", GitCommit:"ddf47ac13c1a9483ea035a79cd7c10005ff21a6d", GitTreeState:"clean", BuildDate:"2018-12-03T21:02:01Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version):
[root@k8s-211 ~]# kubectl version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.0", GitCommit:"ddf47ac13c1a9483ea035a79cd7c10005ff21a6d", GitTreeState:"clean", BuildDate:"2018-12-03T21:04:45Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.0", GitCommit:"ddf47ac13c1a9483ea035a79cd7c10005ff21a6d", GitTreeState:"clean", BuildDate:"2018-12-03T20:56:12Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
  • Kernel (e.g. uname -a):
Linux k8s-lixin-211 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
  • Others:

What happened?

I used kubeadm reset -f to reset this control-plane node, and the command ran successfully. But when I look at the kubeadm-config ConfigMap, it still has this node's IP in ClusterStatus.

I also have a question: why doesn't kubeadm reset delete this node directly from the cluster, instead of requiring kubectl delete node <node name> to be run manually?

What you expected to happen?

The kubeadm-config ConfigMap should have this node's IP removed.

How to reproduce it (as minimally and precisely as possible)?

  • kubeadm init --config=kubeadm.yml on the first node.
  • kubeadm join --experimental-control-plane --config=kubeadm.yml on the second node.
  • kubeadm reset -f on the second node.
  • kubectl -n kube-system get cm kubeadm-config -o yaml shows that the second node's IP is still in ClusterStatus.

Anything else we need to know?

The kubeadm-config ConfigMap YAML:
apiVersion: v1
data:
  ClusterConfiguration: |
    apiServer:
      extraArgs:
        authorization-mode: Node,RBAC
      timeoutForControlPlane: 4m0s
    apiVersion: kubeadm.k8s.io/v1beta1
    certificatesDir: /etc/kubernetes/pki
    clusterName: kubernetes
    controlPlaneEndpoint: 192.168.46.117:6443
    controllerManager: {}
    dns:
      type: CoreDNS
    etcd:
      local:
        dataDir: /var/lib/etcd
        extraArgs:
          cipher-suites: TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
        serverCertSANs:
        - 192.168.46.117
    imageRepository: k8s.gcr.io
    kind: ClusterConfiguration
    kubernetesVersion: v1.13.0
    networking:
      dnsDomain: cluster.local
      podSubnet: 10.244.0.0/16
      serviceSubnet: 10.96.0.0/12
    scheduler: {}
  ClusterStatus: |
    apiEndpoints:
      k8s-211:
        advertiseAddress: 192.168.46.211
        bindPort: 6443
      k8s-212:
        advertiseAddress: 192.168.46.212
        bindPort: 6443
    apiVersion: kubeadm.k8s.io/v1beta1
    kind: ClusterStatus
kind: ConfigMap
metadata:
  creationTimestamp: "2018-12-04T14:17:38Z"
  name: kubeadm-config
  namespace: kube-system
  resourceVersion: "103402"
  selfLink: /api/v1/namespaces/kube-system/configmaps/kubeadm-config
  uid: 5a9320c1-f7cf-11e8-868d-0050568863b3
@neolit123
Member

cc @fabriziopandini

@neolit123 added the help wanted and priority/awaiting-more-evidence labels on Dec 5, 2018
@danbeaulieu

Ideally there would be a way to "refresh" the ClusterStatus. We run clusters with chaos testing, it is entirely possible for a control plane node to be terminated without warning and without opportunity to run kubeadm reset. Ideally there would be a clean way to update the ClusterStatus explicitly to remove control plane nodes we know are no longer in the cluster. This is something that'd be done before running kubeadm join --control-plane ... or possibly it is built in?

@fabriziopandini
Member

Some comments here:

The kubeadm-config ConfigMap should have this node's IP removed.

@pytimer I know that having a node API address left in the cluster status is not ideal, but I'm interested in understanding whether this "lack of cleanup" generates problems or not. Have you tried to join the same control-plane again? Have you tried to join another control-plane? I don't expect problems, but getting confirmation on this point would really be appreciated.

I also have a question: why doesn't kubeadm reset delete this node directly from the cluster, instead of requiring kubectl delete node to be run manually?

@luxas might be able to add a little bit of historical context here.
My guess is that the node doesn't have the privilege to delete itself (but this applies to worker nodes, not to control-plane nodes...)

Ideally there would be a way to "refresh" the ClusterStatus / there would be a clean way to update the ClusterStatus explicitly

@danbeaulieu that's a good point. Having an explicit command for syncing the cluster status, and/or enforcing an automatic sync when kubeadm is executed, is a good idea.
However, since kubeadm has no continuously running control loop, I think there will always be the possibility of ClusterStatus being out of sync.
This should not be a problem; more specifically, having node IPs for nodes that no longer exist (lack of cleanup) should not be a problem.
On the other hand, if a node exists and the corresponding node IP is missing from ClusterStatus (wrong initialization), this could create problems, e.g. for updates.

Could you kindly report whether the above assumptions are confirmed by your chaos testing? Any feedback would be really appreciated.

@pytimer
Author

pytimer commented Dec 7, 2018

@fabriziopandini I tried to join the same control-plane node again, and it failed.

My join steps:

The second control-plane node's IP is 192.168.46.212.

  • remove the 192.168.46.212 node's etcd member from the etcd cluster.
  • kubectl delete node k8s-212
  • kubeadm reset -f on this control-plane node.
  • run kubeadm join --experimental-control-plane --config kubeadm.yaml -v 5 again.

kubeadm join logs:

...
[etcd] Checking Etcd cluster health
I1207 17:57:18.109993    8541 local.go:66] creating etcd client that connects to etcd pods
I1207 17:57:18.110000    8541 etcd.go:134] checking etcd manifest
I1207 17:57:18.119797    8541 etcd.go:181] etcd endpoints read from pods: https://192.168.46.211:2379,https://192.168.46.212:2379
I1207 17:57:18.131111    8541 etcd.go:221] etcd endpoints read from etcd: https://192.168.46.211:2379
etcd cluster is not healthy: context deadline exceeded

I looked at the kubeadm code, and I think this problem may be caused by 192.168.46.212 being left in the kubeadm-config ConfigMap.

Kubeadm reads the API endpoints from the kubeadm-config ConfigMap when joining a control-plane node, and the etcd endpoints are derived from those API endpoints. But the 192.168.46.212 control-plane node has been removed and has not joined yet, so the etcd cluster health check fails.
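
To illustrate the mapping described above, here is a minimal, self-contained Go sketch (the names and the port-2379 assumption are illustrative, not kubeadm's actual code): each address in ClusterStatus.apiEndpoints is assumed to also host a stacked etcd member, so a stale API endpoint becomes a stale etcd endpoint.

package main

import "fmt"

func main() {
	// Shape of ClusterStatus.apiEndpoints in the kubeadm-config ConfigMap:
	// node name -> advertise address. The .212 entry is stale after the reset.
	apiEndpoints := map[string]string{
		"k8s-211": "192.168.46.211",
		"k8s-212": "192.168.46.212",
	}

	// A stacked etcd member is assumed next to every API endpoint, so each
	// address is turned into an etcd client URL on port 2379.
	var etcdEndpoints []string
	for _, addr := range apiEndpoints {
		etcdEndpoints = append(etcdEndpoints, fmt.Sprintf("https://%s:2379", addr))
	}
	fmt.Println(etcdEndpoints) // still lists the removed .212 member, so the health check times out
}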

When I remove the 192.168.46.212 API endpoint from the kubeadm-config ConfigMap and join this control-plane node again, the join succeeds.

@fabriziopandini
Member

@pytimer thanks!
This should be investigated. There is already logic that tries to sync the assumed list of etcd endpoints with the real etcd endpoint list, but something seems to not be working properly.

@fabriziopandini added the kind/bug and priority/important-soon labels and removed priority/awaiting-more-evidence on Dec 7, 2018
@danbeaulieu

Yes, this does seem like a bug. We have a 3-node control plane ASG. If we terminate an instance, a new one will be created per the ASG rules. During this time the terminated node is listed as unhealthy in the etcd member list. When the new instance comes up, before running kubeadm join..., it removes the unhealthy member from etcd. By the time we run kubeadm join..., the etcd cluster is healthy with 2 nodes according to etcd. However, kubeadm uses the ClusterStatus as its source of truth, which still has the old instance listed.

The workaround for us, right after doing the etcd membership management, is to update the kubeadm-config ConfigMap with the truth of the cluster and then run kubeadm join....

Ideally kubeadm join... would use etcd as the source of truth and update the kubeadm-config ConfigMap accordingly.
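
As a rough sketch of that idea (this is not kubeadm code: the helper name is made up, error handling is minimal, and the TLS setup with the certs under /etc/kubernetes/pki/etcd is omitted), one could ask etcd for its live member list and flag any ClusterStatus entry that no member advertises:

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/clientv3"
)

// staleEndpoints returns the ClusterStatus node names whose advertise address
// is not backed by any live etcd member.
func staleEndpoints(clusterStatus map[string]string, etcdEndpoint string) ([]string, error) {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{etcdEndpoint},
		DialTimeout: 5 * time.Second,
		// TLS: the client certs from /etc/kubernetes/pki/etcd would be required here.
	})
	if err != nil {
		return nil, err
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	members, err := cli.MemberList(ctx)
	if err != nil {
		return nil, err
	}

	// Index the client URLs that etcd itself reports.
	known := map[string]bool{}
	for _, m := range members.Members {
		for _, u := range m.ClientURLs {
			known[u] = true
		}
	}

	// Anything in ClusterStatus that no member advertises is stale.
	var stale []string
	for node, addr := range clusterStatus {
		if !known[fmt.Sprintf("https://%s:2379", addr)] {
			stale = append(stale, node)
		}
	}
	return stale, nil
}

func main() {
	stale, err := staleEndpoints(
		map[string]string{"k8s-211": "192.168.46.211", "k8s-212": "192.168.46.212"},
		"https://192.168.46.211:2379",
	)
	fmt.Println(stale, err)
}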

@pytimer
Author

pytimer commented Dec 11, 2018

@fabriziopandini I may have found the cause of this problem.

When the etcd endpoints are synced with the real etcd endpoint list, the sync itself succeeds. But the real endpoints are assigned to the etcd client's Endpoints field on a client variable that is not a pointer, so when other code uses the client, its endpoints are still the old ones, not the real endpoints after the sync.
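
A tiny standalone Go example of the value-vs-pointer behaviour described above (the type and method names are illustrative, not the actual code in cmd/kubeadm/app/util/etcd):

package main

import "fmt"

// Client is a toy stand-in for kubeadm's etcd client wrapper.
type Client struct {
	Endpoints []string
}

// syncByValue updates a copy of the client, so the caller never sees the
// refreshed endpoint list - effectively what the pre-fix code did.
func (c Client) syncByValue(fromEtcd []string) {
	c.Endpoints = fromEtcd
}

// syncByPointer updates the caller's client, so later code dials only the
// endpoints etcd actually reported.
func (c *Client) syncByPointer(fromEtcd []string) {
	c.Endpoints = fromEtcd
}

func main() {
	fromConfigMap := []string{"https://192.168.46.211:2379", "https://192.168.46.212:2379"}
	fromEtcd := []string{"https://192.168.46.211:2379"} // what etcd actually reports

	c := Client{Endpoints: fromConfigMap}
	c.syncByValue(fromEtcd)
	fmt.Println(c.Endpoints) // still contains the removed .212 endpoint

	c.syncByPointer(fromEtcd)
	fmt.Println(c.Endpoints) // [https://192.168.46.211:2379]
}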

I fixed this problem in my fork repository; you can check pytimer/kubernetes@0cdf6ca. I tested the "join the same control-plane node again" use case, and the join succeeds.

@fabriziopandini
Member

@pytimer Looks great! Well spotted!
Could you kindly send a PR? IMO this will be eligible for cherry picking.

@neolit123 @timothysc ^^^

@pytimer
Author

pytimer commented Dec 11, 2018

@fabriziopandini The first PR is wrong; I forgot to confirm the CLA.

You can check this PR: kubernetes/kubernetes#71945. If anything is wrong, I hope you will point it out.

@zhangyelong

@pytimer wrote:
I tried to join the same control-plane node again, and it failed. […] When I remove the 192.168.46.212 API endpoint from the kubeadm-config ConfigMap and join this control-plane node again, the join succeeds.

I got the same error with kubeadm 1.13.2. I tried to remove the node manually and update kubeadm-config, but it doesn't work; the remaining etcd nodes still try to connect to the removed node.

@lvangool

When I remove the 192.168.46.212 API endpoint from the kubeadm-config ConfigMap and join this control-plane node again, the join succeeds.

@pytimer Can you please elaborate on how you manually removed the old api-server?

I am running 1.13.3; removing the old server manually via:

1. kubectl -n kube-system get cm kubeadm-config -o yaml > /tmp/conf.yml
2. manually edit /tmp/conf.yml to remove the old server
3. kubectl -n kube-system apply -f /tmp/conf.yml 

I'm still not able to join the cluster due to the error:

[etcd] Checking etcd cluster health
etcd cluster is not healthy: context deadline exceeded

I've then killed the API server pods and the etcd pods (2 of each).

They get recreated, but I still have the same error when trying to join the additional node.

@Halytskyi

Halytskyi commented Feb 19, 2019

I had the same issue in 1.13.3 (HA cluster setup: 3 master nodes + 3 workers). I successfully replaced the master node only after the following steps:

Delete node from cluster

kubectl delete node master03

Download etcdctl (for example, on master01)

mkdir /opt/tools && cd /opt/tools
wget https://github.com/etcd-io/etcd/releases/download/v3.3.12/etcd-v3.3.12-linux-arm64.tar.gz
tar xfz etcd-v3.3.12-linux-arm64.tar.gz

Remove master node from etcd

cd /opt/tools/etcd-v3.3.12-linux-arm64
./etcdctl --endpoints https://192.168.0.11:2379 --ca-file /etc/kubernetes/pki/etcd/ca.crt --cert-file /etc/kubernetes/pki/etcd/server.crt --key-file /etc/kubernetes/pki/etcd/server.key member list
./etcdctl --endpoints https://192.168.0.11:2379 --ca-file /etc/kubernetes/pki/etcd/ca.crt --cert-file /etc/kubernetes/pki/etcd/server.crt --key-file /etc/kubernetes/pki/etcd/server.key member remove 28a9dabfcfbca673

Remove from kubeadm-config

kubectl -n kube-system get cm kubeadm-config -o yaml > /tmp/conf.yml
manually edit /tmp/conf.yml to remove the old server
kubectl -n kube-system apply -f /tmp/conf.yml
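
If you would rather script that last step than hand-edit the YAML, here is a rough client-go sketch (assumptions: a recent client-go where the typed clients take a context, a readable /etc/kubernetes/admin.conf, and an illustrative node name; this is not an official kubeadm tool):

package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"sigs.k8s.io/yaml"
)

func main() {
	const nodeToRemove = "master03" // illustrative

	cfg, err := clientcmd.BuildConfigFromFlags("", "/etc/kubernetes/admin.conf")
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	cm, err := cs.CoreV1().ConfigMaps("kube-system").Get(context.TODO(), "kubeadm-config", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}

	// ClusterStatus is stored as a YAML string inside the ConfigMap data.
	var status map[string]interface{}
	if err := yaml.Unmarshal([]byte(cm.Data["ClusterStatus"]), &status); err != nil {
		log.Fatal(err)
	}
	if eps, ok := status["apiEndpoints"].(map[string]interface{}); ok {
		delete(eps, nodeToRemove)
	}
	out, err := yaml.Marshal(status)
	if err != nil {
		log.Fatal(err)
	}
	cm.Data["ClusterStatus"] = string(out)

	if _, err := cs.CoreV1().ConfigMaps("kube-system").Update(context.TODO(), cm, metav1.UpdateOptions{}); err != nil {
		log.Fatal(err)
	}
}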

@pytimer
Author

pytimer commented Feb 19, 2019

@zhangyelong Currently kubeadm reset can't remove the etcd member, which is why you found that the etcd cluster still tries to connect to the removed etcd node. For now you should manually remove the etcd member using etcdctl.

I sent a PR to remove the etcd member during reset; see kubernetes/kubernetes#74112.

@pytimer
Author

pytimer commented Feb 19, 2019

@lvangool You can follow @Halytskyi's steps. The PR kubernetes/kubernetes#71945 fixes syncing the etcd endpoints when joining a control-plane node; it cannot remove the etcd member.

For removing the etcd member from the etcd cluster during reset, see kubernetes/kubernetes#74112.

@danbeaulieu

danbeaulieu commented Mar 1, 2019

This seems to still be a bug in 1.13.4.

We still need to manually update the kubeadm-config ConfigMap as in #1300 (comment).

Is it not the case that the fix in
kubernetes/kubernetes#71945 would use the etcd cluster membership as the source of truth for cluster members? If not, what exactly did that PR fix?

Interestingly, it sporadically works because in Go, ranging over maps (like the ClusterStatus apiEndpoints) is non-deterministic. So if the first endpoint it finds is an old endpoint that no longer exists, things fail. If it finds a healthy endpoint, it will update the ClusterStatus from the etcd Sync...
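
A quick way to see that non-determinism (plain Go, nothing kubeadm-specific; the map mirrors the apiEndpoints shown earlier in this issue):

package main

import "fmt"

func main() {
	// ClusterStatus.apiEndpoints is a map, and Go deliberately randomizes map
	// iteration order, so the endpoint tried first changes from run to run.
	apiEndpoints := map[string]string{
		"k8s-211": "192.168.46.211", // healthy
		"k8s-212": "192.168.46.212", // terminated but still listed
	}
	for name, addr := range apiEndpoints {
		fmt.Printf("dialing %s (%s) first\n", name, addr)
		break // run it a few times: sometimes the dead member comes out first
	}
}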

I believe the root cause of this is a bug in the etcd clientv3 that causes the client not to retry the other endpoints if the first one fails: etcd-io/etcd#9949.

@fabriziopandini
Member

@danbeaulieu

@fabriziopandini There is at least one other issue in here that is unrelated to kubeadm reset.

If a node fails without the chance to perform kubeadm reset (instance termination, hardware failure, etc.), the cluster is left in a state where ClusterStatus.apiEndpoints still lists a node that is no longer in the cluster. This requires the workaround of reading, editing, and updating the ConfigMap before performing kubeadm join. Kubeadm probably has 2 options:

  1. Implement the etcd client retry itself if the dial fails (see the sketch after this list)
  2. Wait for the go-grpc bug to be fixed and then for the fix to make it to etcd client

This issue may be a good place to track either of those options.
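
For reference, a rough sketch of what option 1 could look like (this is not kubeadm code; the helper name is made up and the TLS setup with the certs under /etc/kubernetes/pki/etcd is omitted for brevity). The idea is to probe each known endpoint with its own single-endpoint client, so a dead peer cannot stall the whole health check:

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/clientv3"
)

// firstHealthyEndpoint probes the endpoints one by one and returns the first
// one that answers a Status call within the timeout.
func firstHealthyEndpoint(endpoints []string) (string, error) {
	for _, ep := range endpoints {
		cli, err := clientv3.New(clientv3.Config{
			Endpoints:   []string{ep}, // one endpoint per client, so failover does not depend on the balancer
			DialTimeout: 5 * time.Second,
		})
		if err != nil {
			continue
		}
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		_, err = cli.Status(ctx, ep)
		cancel()
		cli.Close()
		if err == nil {
			return ep, nil
		}
	}
	return "", fmt.Errorf("no healthy etcd endpoint among %v", endpoints)
}

func main() {
	ep, err := firstHealthyEndpoint([]string{
		"https://192.168.46.212:2379", // dead member listed first
		"https://192.168.46.211:2379",
	})
	fmt.Println(ep, err)
}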

@neolit123
Member

If a node fails without the chance to perform kubeadm reset (instance termination, hardware failure, etc.), the cluster is left in a state where ClusterStatus.apiEndpoints still lists a node that is no longer in the cluster. This requires the workaround of reading, editing, and updating the ConfigMap before performing kubeadm join.

that is true, without calling reset you will have to manually update the ClusterStatus.
we don't have a command that does that. if you feel this is a feature that kubeadm should support, please file a separate ticket.

@falken

falken commented Apr 30, 2019

Just experienced this today on 1.14.1

The instance running one of my master nodes failed, which kept it from being gracefully removed. When a new node tried to come in, it failed to join due to the error described in this ticket.

I had to manually remove the etcd member via etcdctl; then I could join a new node. I also manually removed the node from the kubeadm-config ConfigMap, but I am not sure if that was required.

@Nurlan199206

@Halytskyi Thank you, the etcdctl section helped me.

@zakkg3

zakkg3 commented Oct 30, 2019

Experienced this today in 1.15.5

In my case, I joined the cluster but with version 1.16, then deleted this node with kubectl delete node, downgraded to 1.15.5, and tried to rejoin (same IP, same hostname, different version), and got the etcd unhealthy error.

Solved it by the following (based on @Halytskyi's answer but with an updated etcdctl):

  • Delete the node from the kubeadm-config configmap
>: kubectl edit configmap  kubeadm-config -n kube-system
configmap/kubeadm-config edited
  • kubeadm reset -f on the problematic node && iptables -t -f -X and so on.

  • delete etcd member (this is the key):

root@k8s-nebula-m-115-2:wget https://github.com/etcd-io/etcd/releases/download/v3.4.3/etcd-v3.4.3-linux-amd64.tar.gz
root@k8s-nebula-m-115-2:tar xfz etcd-v3.4.3-linux-amd64.tar.gz
root@k8s-nebula-m-115-2:~/etcdctl/etcd-v3.4.3-linux-amd64# ./etcdctl --endpoints https://127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
289ed62da3c6e9e5, started, k8s-nebula-m-115-1, https://10.205.30.2:2380, https://10.205.30.2:2379, false
917e16b9e790c427, started, k8s-nebula-m-115-0, https://10.205.30.1:2380, https://10.205.30.1:2379, false
ad6b76d968b18085, started, k8s-nebula-m-115-2, https://10.205.30.0:2380, https://10.205.30.0:2379, false
root@k8s-nebula-m-115-2:~/etcdctl/etcd-v3.4.3-linux-amd64# ./etcdctl --endpoints https://127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove 289ed62da3c6e9e5
Member 289ed62da3c6e9e5 removed from cluster d4913a539ea2384e

And then the rejoin works.

@neolit123
Member

this can happen if kubeadm reset is interrupted and couldn't delete the node from the kubeadm CM.
in such a case you need to manually delete it from the kubeadm CM.

@zakkg3

zakkg3 commented Oct 31, 2019 via email

@neolit123
Member

"kubeadm reset" should delete it from the kubeadm CM, but calling "kubectl delete node" is also needed which deletes the Node API object.

@zakkg3

zakkg3 commented Oct 31, 2019 via email

@neolit123
Member

kubeadm reset should also remove the etcd member from the etcd cluster.
try executing it with e.g. --v=5 and see what it does.

however keep in mind that kubeadm reset is a best effort command so if it fails for some reason it might only print a warning.

@zakkg3

zakkg3 commented Oct 31, 2019

So kubectl delete node does not delete it from etcd; instead, running kubeadm reset on the node does it.
That sounds broken to me. I think kubectl delete node should delete it from etcd as well, or am I missing an obvious use case?
Maybe by asking whether it should also be deleted from there?
Anyway, thanks for the clarification @neolit123. I deleted it from the control plane first and then did the reset; I guess it was too late for the node to delete itself from etcd.

@neolit123
Member

neolit123 commented Oct 31, 2019

so there are different responsibilities.
kubectl delete node deletes the Node API object - you should do this when you are really sure that you no longer want the node around.
before that you should call kubeadm reset on that node. what it does is clean the state on disk and also remove the etcd member (if this is a control-plane node and if you are using the default option where an etcd instance runs on each control-plane node)

kubeadm reset resets the node, but it does not delete the Node object for a couple of reasons:

  • reset just resets the node and you can rejoin it; the Node name remains reserved.
  • the node itself does not have enough privileges to delete its Node object. this is the responsibility of the owner of the admin.conf (e.g. the administrator).

@Mikulas

Mikulas commented Oct 31, 2019

kubeadm reset is a best effort command

Regarding this: when kubeadm reset fails to complete for whatever reason (including a hard fail of the underlying server so that kubeadm reset is never executed in the first place), are there any options to manually reconcile the state besides manually editing the kubeadm-config ConfigMap object and removing the node?

@neolit123
Member

if the node has hard-failed and you cannot call kubeadm reset on it, manual steps are required. you'd have to:

  1. remove the control-plane IP from the kubeadm-config CM ClusterStatus
  2. remove the etcd member using etcdctl
  3. delete the Node object using kubectl (if you don't want the Node around anymore)

1 and 2 apply only to control-plane nodes.

@amadav

amadav commented May 9, 2020

Is there any way to automate this fail-over if kubeadm reset cannot be run?

@thyn

thyn commented May 22, 2020

Same problem on 1.9. Thanks for the solutions.
