kubeadm reset success but this node ip still in kubeadm-config configmap #1300
Comments
Ideally there would be a way to "refresh" the ClusterStatus. We run clusters with chaos testing, so it is entirely possible for a control plane node to be terminated without warning and without the opportunity to run `kubeadm reset`.
Some comments here:
@pytimer I know that having a node API address left in ClusterStatus is not ideal, but I'm interested in understanding whether this lack of cleanup actually causes problems. Have you tried to join the same control plane again? Have you tried to join another control plane? I don't expect problems, but a confirmation on this point would be really appreciated.
@luxas might have a little bit of historical context that can help here.
@danbeaulieu that's a good point. Having an explicit command for syncing the cluster status, and/or enforcing an automatic sync when kubeadm is executed, is a good idea. Could you kindly report whether the above assumptions are confirmed by your chaos testing? Any feedback will be really appreciated.
@fabriziopandini I tried joining the same control-plane node again, and it failed. My join steps are below; the second control-plane node's IP is 192.168.46.212.
kubeadm join logs:
...
[etcd] Checking Etcd cluster health
I1207 17:57:18.109993 8541 local.go:66] creating etcd client that connects to etcd pods
I1207 17:57:18.110000 8541 etcd.go:134] checking etcd manifest
I1207 17:57:18.119797 8541 etcd.go:181] etcd endpoints read from pods: https://192.168.46.211:2379,https://192.168.46.212:2379
I1207 17:57:18.131111 8541 etcd.go:221] etcd endpoints read from etcd: https://192.168.46.211:2379
etcd cluster is not healthy: context deadline exceeded
I read the kubeadm code, and I think this problem may be caused by 192.168.46.212 being left in the ClusterStatus: kubeadm builds the etcd endpoint list from the API endpoints recorded in the kubeadm-config ConfigMap. When I removed the stale 192.168.46.212 entry, the join got past this error.
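A quick way to check whether a stale endpoint is still recorded, assuming kubectl access from a working control-plane node (`ClusterStatus` is the ConfigMap key kubeadm uses for the apiEndpoints map):

```sh
# Print the ClusterStatus document stored in the kubeadm-config ConfigMap;
# any terminated node still listed under apiEndpoints is a stale entry.
kubectl -n kube-system get configmap kubeadm-config \
  -o jsonpath='{.data.ClusterStatus}'
```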
@pytimer thanks!
Yes, this does seem like a bug. We have a 3-node control plane ASG. If we terminate an instance, a new one will be created per the ASG rules. During this time the terminated node is listed as unhealthy in the member list of etcd. When the new instance comes up, before running `kubeadm join` we do the etcd membership management ourselves (removing the dead member). The workaround for us, right after doing that etcd membership management, is to update the kubeadm-config ConfigMap with the truth of the cluster, and then run `kubeadm join`. Ideally kubeadm would keep the ClusterStatus in sync itself.
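A minimal sketch of that ConfigMap surgery, assuming a stock kubeadm setup; the node name `old-master` and the address below are placeholders, not values from this thread:

```sh
# Open the kubeadm-config ConfigMap in an editor
kubectl -n kube-system edit configmap kubeadm-config

# Inside the editor, delete the terminated node's block under
# ClusterStatus.apiEndpoints, e.g.:
#
#   ClusterStatus: |
#     apiEndpoints:
#       old-master:                    # <- remove this whole entry
#         advertiseAddress: 10.0.0.12  #    (placeholder address)
#         bindPort: 6443
```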
@fabianofranz I may have found the cause of this problem. When kubeadm syncs the etcd endpoints with the real etcd endpoint list, the sync itself succeeds, but the synced endpoints are never assigned back to the etcd client, so later calls still use the stale list. I fixed this in my fork repository; you can check this PR: pytimer/kubernetes@0cdf6ca. I am testing it in my environment.
@pytimer Looks great! Well spotted! @neolit123 @timothysc ^^^
@fabianofranz The first PR is wrong; I forgot to confirm the CLA. Please check this PR instead: kubernetes/kubernetes#71945. If anything is wrong, I hope you'll point it out.
Got the same error on kubeadm version 1.13.2. I tried to remove the node manually and update kubeadm-config, but it doesn't work; the remaining etcd nodes still try to connect to the removed node.
@pytimer Can you please elaborate on how you manually removed the old api-server? I am running 1.13.3; removing the old server manually via:
I'm still not able to join the cluster due to the error:
I've then killed the API server pods and the etcd pods (2 of each). They get recreated, but I still have the same error when trying to join the additional node.
Had the same issue in 1.13.3 (HA cluster setup: 3 master nodes + 3 workers). Successfully replaced the master node only after the following steps (a command sketch follows the list):
1. Delete the node from the cluster.
2. Download etcdctl (for example, on master01).
3. Remove the master node from etcd.
4. Remove it from kubeadm-config.
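A sketch of the commands behind those steps, assuming stock kubeadm PKI paths, etcd serving on 127.0.0.1:2379 on a surviving master, and an etcd release matching the cluster (v3.3.10 here is an assumption); `master02` and `<member-id>` are placeholders:

```sh
# 1. Delete the failed node from the cluster
kubectl delete node master02

# 2. Download etcdctl on a surviving master (e.g. master01)
ETCD_VER=v3.3.10
wget https://github.com/etcd-io/etcd/releases/download/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz
tar xzf etcd-${ETCD_VER}-linux-amd64.tar.gz && cd etcd-${ETCD_VER}-linux-amd64

# 3. List the members, note the dead member's ID, then remove it
FLAGS="--endpoints https://127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key"
ETCDCTL_API=3 ./etcdctl $FLAGS member list
ETCDCTL_API=3 ./etcdctl $FLAGS member remove <member-id>

# 4. Remove the node's apiEndpoints entry from ClusterStatus
kubectl -n kube-system edit configmap kubeadm-config
```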
@zhangyelong Right now `kubeadm reset` can't remove the etcd member, which is why the etcd cluster still tries to connect to the removed etcd node. You have to remove the member manually using etcdctl for now. I sent a PR to remove the etcd member during reset; see kubernetes/kubernetes#74112.
@lvangool You can follow @Halytskyi's steps. The PR kubernetes/kubernetes#71945 fixes syncing the etcd endpoints when joining a control-plane node; it does not remove the etcd member. For removing the etcd member from the etcd cluster during reset, see kubernetes/kubernetes#74112.
This still seems to be a bug in 1.13.4. We still need to manually update the kubeadm-config ConfigMap, as in #1300 (comment). Is it not the case that the fix in kubernetes/kubernetes#71945 should have covered this? Interestingly, it sporadically works, because in Go ranging over a map (like the ClusterStatus apiEndpoints) is non-deterministic. So if the first endpoint it tries belongs to a node that no longer exists, things fail; if it hits a healthy endpoint first, it updates the ClusterStatus from the etcd Sync. I believe the root cause is a bug in etcd clientv3 where the client does not retry the other endpoints if the first one fails: etcd-io/etcd#9949.
Please use the following issue for tracking reset improvements.
@fabriziopandini There is at least one other issue here that is unrelated to `kubeadm reset`: if a node fails without the chance to perform `kubeadm reset` (instance termination, hardware failure, etc.), its stale entry stays in the ClusterStatus, with no supported way to clean it up short of the explicit sync command or automatic sync suggested above.
This issue may be a good one to use to track either of those options.
that is true; without calling reset you will have to manually update the ClusterStatus.
Just experienced this today on 1.14.1. The instance running one of my master nodes failed, which kept it from being gracefully removed. When a new node tried to come in, it failed to join due to the error described in this ticket. I had to manually remove the etcd member via etcdctl; then I could join the new node. I also manually removed the node from the kubeadm-config ConfigMap, but I am not sure whether that was required.
@Halytskyi Thank you, the etcdctl section helped me.
Experienced this today on 1.15.5. In my case, I had joined the cluster with a 1.16 node and then deleted that node. Solved it with the following (based on @Halytskyi's answer, but with an updated etcdctl):
```
>: kubectl edit configmap kubeadm-config -n kube-system
configmap/kubeadm-config edited

root@k8s-nebula-m-115-2:~# wget https://github.com/etcd-io/etcd/releases/download/v3.4.3/etcd-v3.4.3-linux-amd64.tar.gz
root@k8s-nebula-m-115-2:~# tar xfz etcd-v3.4.3-linux-amd64.tar.gz

root@k8s-nebula-m-115-2:~/etcdctl/etcd-v3.4.3-linux-amd64# ./etcdctl --endpoints https://127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
289ed62da3c6e9e5, started, k8s-nebula-m-115-1, https://10.205.30.2:2380, https://10.205.30.2:2379, false
917e16b9e790c427, started, k8s-nebula-m-115-0, https://10.205.30.1:2380, https://10.205.30.1:2379, false
ad6b76d968b18085, started, k8s-nebula-m-115-2, https://10.205.30.0:2380, https://10.205.30.0:2379, false

root@k8s-nebula-m-115-2:~/etcdctl/etcd-v3.4.3-linux-amd64# ./etcdctl --endpoints https://127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove 289ed62da3c6e9e5
Member 289ed62da3c6e9e5 removed from cluster d4913a539ea2384e
```

And then rejoin works.
this can happen if `kubeadm reset` is interrupted and couldn't delete the node from the kubeadm CM. in such a case you need to manually delete it from the kubeadm CM.
So if I delete the node with `kubectl delete node foobar` it does not delete it from the etcd member list? But if I do `kubeadm reset` on the node I want to delete, then it does? 🙄
"kubeadm reset" should delete it from the kubeadm CM, but calling "kubectl delete node" is also needed which deletes the Node API object. |
In my case, deleting the node from the ConfigMap did not delete it from the etcd cluster; I needed to manually remove it with `etcdctl member remove`.
kubeadm reset should also remove the etcd member from the etcd cluster. however, keep in mind that kubeadm reset is a best-effort command, so if it fails for some reason it might only print a warning.
so there are different responsibilities. kubeadm reset resets the node, but it does not delete the Node object, for a couple of reasons.
if the node is hard failed and you cannot call kubeadm reset on it, it requires manual steps. you'd have to (see the sketch after this list):
1. remove the stale etcd member from the etcd cluster
2. remove the node's entry from the ClusterStatus in the kubeadm-config ConfigMap
3. delete the Node object with `kubectl delete node`

1 and 2 apply only to control-plane nodes.
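A condensed sketch of that manual recovery, assuming it is run from a surviving control-plane node with stock kubeadm PKI paths; `failed-node` and `<member-id>` are placeholders:

```sh
# 1. Remove the dead etcd member (control-plane nodes only);
#    find its ID with "member list" first
ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  member remove <member-id>

# 2. Delete the node's entry under ClusterStatus.apiEndpoints
#    (control-plane nodes only)
kubectl -n kube-system edit configmap kubeadm-config

# 3. Delete the Node API object (any node type)
kubectl delete node failed-node
```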
Is there any way to automate this fail-over if kubeadm reset cannot be run?
Same problem on 1.9. Thanks for the solutions.
Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use `kubeadm version`):

Environment:
- Kubernetes version (use `kubectl version`):
- Kernel (e.g. `uname -a`): Linux k8s-lixin-211 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

What happened?

I used `kubeadm reset -f` to reset this control-plane node, and the command ran successfully. But when I look at the `kubeadm-config` ConfigMap, it still has this node's IP in ClusterStatus.

I also have a question: why doesn't `kubeadm reset` delete this node from the cluster directly, instead of requiring `kubectl delete node <node name>` to be run manually?

What you expected to happen?

The `kubeadm-config` ConfigMap should have this node's IP removed.

How to reproduce it (as minimally and precisely as possible)?

1. Run `kubeadm init --config=kubeadm.yml` on the first node.
2. Run `kubeadm join --experimental-control-plane --config=kubeadm.yml` on the second node.
3. Run `kubeadm reset -f` on the second node.
4. Run `kubectl -n kube-system get cm kubeadm-config -oyaml` and find the second node's IP still in ClusterStatus.

Anything else we need to know?

kubeadm-config ConfigMap yaml:
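The original ConfigMap contents were not preserved here; for illustration only, the relevant shape looks roughly like this. The first hostname and both addresses are taken from the logs above; the second hostname is a guess patterned on the first:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubeadm-config
  namespace: kube-system
data:
  ClusterConfiguration: |
    # ...cluster-wide settings elided...
  ClusterStatus: |
    apiVersion: kubeadm.k8s.io/v1beta1
    kind: ClusterStatus
    apiEndpoints:
      k8s-lixin-211:                      # first control-plane node
        advertiseAddress: 192.168.46.211
        bindPort: 6443
      k8s-lixin-212:                      # second node; this entry remains after reset
        advertiseAddress: 192.168.46.212
        bindPort: 6443
```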