
1.15 - kubeadm join --control-plane configures kubelet to connect to wrong apiserver #1955

Closed
blurpy opened this issue Dec 3, 2019 · 21 comments

@blurpy

blurpy commented Dec 3, 2019

Is this a BUG REPORT or FEATURE REQUEST?

/kind bug
/area HA

Versions

kubeadm version: v1.15.6

Environment: Dev

  • Kubernetes version: v1.15.6
  • Cloud provider or hardware configuration: Virtualbox
  • OS: CentOS 7.7
  • Kernel: 3.10.0-957.1.3.el7.x86_64
  • Others:

What happened?

kubelet.conf on additional control plane nodes created with kubeadm is configured to connect to the apiserver of the initial master instead of the one on localhost or the one behind the load balancer. As a consequence, all the kubelets become NotReady if the first master is unavailable.

Nodes used in the examples:

  • demomaster1test - 192.168.33.10 - initial master
  • demomaster2test - 192.168.33.20 - additional master
  • demomaster3test - 192.168.33.30 - additional master
  • demolb1test - 192.168.33.100 - load balancer

This example joins against the load balancer:

[root@demomaster2test ~]# kubeadm join --v 5 --discovery-token ... --discovery-token-ca-cert-hash sha256:... --certificate-key ... --control-plane --apiserver-bind-port 443 192.168.33.100:443
...
I1203 08:59:39.136338    7312 join.go:433] [preflight] Discovering cluster-info
I1203 08:59:39.136397    7312 token.go:199] [discovery] Trying to connect to API Server "192.168.33.100:443"
I1203 08:59:39.136875    7312 token.go:74] [discovery] Created cluster-info discovery client, requesting info from "https://192.168.33.100:443"
I1203 08:59:39.147704    7312 token.go:140] [discovery] Requesting info from "https://192.168.33.100:443" again to validate TLS against the pinned public key
I1203 08:59:39.156275    7312 token.go:163] [discovery] Cluster info signature and contents are valid and TLS certificate validates against pinned roots, will use API Server "192.168.33.100:443"
I1203 08:59:39.156294    7312 token.go:205] [discovery] Successfully established connection with API Server "192.168.33.100:443"
...
This node has joined the cluster and a new control plane instance was created

Checking the results:

[root@demomaster2test kubernetes]# grep ":443" *.conf
admin.conf:    server: https://192.168.33.100:443
bootstrap-kubelet.conf:    server: https://192.168.33.10:443
controller-manager.conf:    server: https://192.168.33.100:443
kubelet.conf:    server: https://192.168.33.10:443
scheduler.conf:    server: https://192.168.33.100:443

And this example joins directly against the initial master:

[root@demomaster3test ~]# kubeadm join --v 5 --discovery-token ... --discovery-token-ca-cert-hash sha256:... --certificate-key ... --control-plane --apiserver-bind-port 443 demomaster1test:443
...
I1203 10:43:05.585046    7232 join.go:433] [preflight] Discovering cluster-info
I1203 10:43:05.585107    7232 token.go:199] [discovery] Trying to connect to API Server "demomaster1test:443"
I1203 10:43:05.585473    7232 token.go:74] [discovery] Created cluster-info discovery client, requesting info from "https://demomaster1test:443"
I1203 10:43:05.595627    7232 token.go:140] [discovery] Requesting info from "https://demomaster1test:443" again to validate TLS against the pinned public key
I1203 10:43:05.604432    7232 token.go:163] [discovery] Cluster info signature and contents are valid and TLS certificate validates against pinned roots, will use API Server "demomaster1test:443"
I1203 10:43:05.604453    7232 token.go:205] [discovery] Successfully established connection with API Server "demomaster1test:443"
...
This node has joined the cluster and a new control plane instance was created

Checking the results:

[root@demomaster3test kubernetes]# grep ":443" *.conf
admin.conf:    server: https://192.168.33.100:443
bootstrap-kubelet.conf:    server: https://192.168.33.10:443
controller-manager.conf:    server: https://192.168.33.100:443
kubelet.conf:    server: https://192.168.33.10:443
scheduler.conf:    server: https://192.168.33.100:443

So in both cases kubelet.conf is configured against the initial master, while admin.conf + controller-manager.conf + scheduler.conf are all configured against the load balancer.

What you expected to happen?

kubelet.conf should have been configured to use the load balancer or local apiserver:

[root@demomaster3test kubernetes]# grep ":443" *.conf
kubelet.conf:    server: https://192.168.33.100:443 <-- like this
kubelet.conf:    server: https://192.168.33.30:443 <-- or maybe like this

I'm not sure what the best practice is here. Would it make sense for the kubelet on a master to stay Ready if the apiserver on localhost is unavailable (when configured to use the load balancer)?

How to reproduce it (as minimally and precisely as possible)?

I have not tested the guide at https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/ step by step, since we need to convert existing single-master clusters to multi-master, so it might work correctly when followed exactly on completely new clusters. These are the steps we use to add more masters to our existing cluster:

  1. Find an existing 1.15 cluster (possibly upgraded from older versions; ours started out a lot older)
  2. Update kubeadm-config.yaml to include certSANs and controlPlaneEndpoint for the load balancer:
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  certSANs:
    - "192.168.33.100"
controlPlaneEndpoint: "192.168.33.100:443"
  3. kubeadm config upload from-file --config /etc/kubernetes/kubeadm-config.yaml
  4. rm -rf /etc/kubernetes/pki/apiserver.*
  5. kubeadm init phase certs apiserver --config=/etc/kubernetes/kubeadm-config.yaml
  6. Restart the apiserver (steps 3-6 are sketched as a single command block after this list)
  7. Join 2 new --control-plane nodes as in the examples further up
  8. Shut down the initial master. Watch as the kubelets on the 2 new masters become NotReady.
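
For convenience, here are steps 3 to 6 as a single shell sketch (paths and config taken from the steps above; restarting the apiserver by moving its static Pod manifest is just one common way to do it):

# Step 3: upload the updated ClusterConfiguration
kubeadm config upload from-file --config /etc/kubernetes/kubeadm-config.yaml

# Steps 4-5: regenerate the apiserver serving certificate so it includes the new SANs
rm -rf /etc/kubernetes/pki/apiserver.*
kubeadm init phase certs apiserver --config=/etc/kubernetes/kubeadm-config.yaml

# Step 6: restart the apiserver by briefly moving its static Pod manifest away
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sleep 20
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/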

Anything else we need to know?

It's an easy manual fix: just edit the IP address in kubelet.conf (this needs to be done on the workers as well). But since kubeadm already configures the other .conf files correctly on the new masters, it seems reasonable to expect kubelet.conf to be configured correctly too. Or maybe there is some parameter I'm missing somewhere to get it right.
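
For illustration, in this example environment the manual fix would look roughly like this on each affected node (addresses are the ones used above):

# Repoint the kubelet from the initial master to the load balancer, then restart it
sed -i 's|https://192.168.33.10:443|https://192.168.33.100:443|' /etc/kubernetes/kubelet.conf
systemctl restart kubelet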

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. area/HA labels Dec 3, 2019
@neolit123
Member

neolit123 commented Dec 3, 2019

/assign
I will try to reproduce this tomorrow.

Although,

Shut down the initial master. Watch as the kubelets on the 2 new masters become NotReady.

I was testing something unrelated and doing the above, and the 2 other masters did not become NotReady. This could be an artifact of your older cluster upgrade.

In any case you might have to apply your manual fix, as we cannot backport this to < 1.18 releases; it does not match the k8s backport criteria.

@blurpy
Author

blurpy commented Dec 4, 2019

Thanks!

I've done some tests with a completely new 1.15 cluster, following the instructions from https://v1-15.docs.kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/ to create the initial master with controlPlaneEndpoint in kubeadm-config.yaml from the start, and it looks better:

[root@demomaster1test kubernetes]# grep ":443" *.conf
admin.conf:    server: https://192.168.33.100:443
controller-manager.conf:    server: https://192.168.33.100:443
kubelet.conf:    server: https://192.168.33.100:443
scheduler.conf:    server: https://192.168.33.100:443

[root@demomaster2test kubernetes]# grep ":443" *.conf
admin.conf:    server: https://192.168.33.100:443
bootstrap-kubelet.conf:    server: https://192.168.33.100:443
controller-manager.conf:    server: https://192.168.33.100:443
kubelet.conf:    server: https://192.168.33.100:443
scheduler.conf:    server: https://192.168.33.100:443

[root@demomaster3test kubernetes]# grep ":443" *.conf
admin.conf:    server: https://192.168.33.100:443
bootstrap-kubelet.conf:    server: https://192.168.33.100:443
controller-manager.conf:    server: https://192.168.33.100:443
kubelet.conf:    server: https://192.168.33.100:443
scheduler.conf:    server: https://192.168.33.100:443

[root@demoworker1test kubernetes]# grep ":443" *.conf
bootstrap-kubelet.conf:    server: https://192.168.33.100:443
kubelet.conf:    server: https://192.168.33.100:443

All references point to the load balancer, on all masters and also on the workers, including the bootstrap config.

I joined one master against the load balancer, and the other against the initial master. All the workers were joined against the initial master.

So there must be some state that kubeadm reads from the initial master to get the address of the load balancer, but it can't be the controlPlaneEndpoint in the kubeadm-config configmap in kube-system, right? In my previous test I uploaded a new version of that before adding a second master (otherwise kubeadm aborts, complaining about not having a stable control plane address), and it did actually use the value from controlPlaneEndpoint in the other config files.

@blurpy
Author

blurpy commented Dec 4, 2019

In any case you might have to apply your manual fix, as we cannot backport this to < 1.18 releases; it does not match the k8s backport criteria.

That's OK. What are the criteria for backports?

@blurpy
Author

blurpy commented Dec 4, 2019

I've done a couple more experiments with the first cluster.

  1. Updated the server field in the kube-proxy configmap to point to the load balancer and restarted kube-proxy before adding a second master. No difference.
  2. Updated all the references in the /etc/kubernetes/*.conf files on the initial master to point to the load balancer and rebooted the node before adding a second master. No difference.

@blurpy
Author

blurpy commented Dec 4, 2019

I managed to figure out where it got the kubelet address from.

[root@demomaster1test ~]# kubectl -n kube-public get cm cluster-info -o yaml | grep ":443"
        server: https://192.168.33.10:443
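
For illustration, a minimal sketch of one way to repoint it (using the addresses from this environment) is to rewrite the embedded server field and re-apply the ConfigMap:

kubectl -n kube-public get cm cluster-info -o yaml \
  | sed 's|server: https://192.168.33.10:443|server: https://192.168.33.100:443|' \
  | kubectl apply -f -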

I changed the configmap to point to the load balancer address before joining a new master, and this is the result:

[root@demomaster3test kubernetes]# grep ":443" *.conf
admin.conf:    server: https://192.168.33.100:443
bootstrap-kubelet.conf:    server: https://192.168.33.100:443
controller-manager.conf:    server: https://192.168.33.100:443
kubelet.conf:    server: https://192.168.33.100:443
scheduler.conf:    server: https://192.168.33.100:443

It would be nice if kubeadm handled all the little details when migrating from a single-master to a multi-master setup. But maybe that's a new feature rather than a bug.

@neolit123
Member

neolit123 commented Dec 4, 2019

All references point to the load balancer, on all masters and also on the workers, including the bootstrap config.

Yes, this was my assumption.

That's OK. What are the criteria for backports?

Critical, blocking bugs without known workarounds: Go panics, security bugs, etc.

It would be nice if kubeadm handled all the little details when migrating from a single-master to a multi-master setup. But maybe that's a new feature rather than a bug.

I would like more eyes on this problem, and we might be able to backport a fix, but no promises.

@ereslibre PTAL too.

@blurpy
Author

blurpy commented Dec 5, 2019

I would like more eyes on this problem, and we might be able to backport a fix, but no promises.

Nice!

@ereslibre
Contributor

Oh wow, thanks for this report @blurpy. I'm going to check if I can reproduce this issue. Thanks for the heads up @neolit123.

/assign

@neolit123 neolit123 added this to the v1.18 milestone Dec 5, 2019
@timfeirg

timfeirg commented Dec 6, 2019

I studied the code for a bit, and I think I see why this happens.

That being said, the current workaround is to:

  • run kubectl -n kube-public edit cm cluster-info and manually change the server address to the current controlPlaneEndpoint, or simply run kubeadm init phase bootstrap-token again
  • on each worker node, run kubeadm join phase kubelet-start

after controlPlaneEndpoint has been changed (a command sketch follows below).
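
A rough sketch of those commands, assuming the load balancer endpoint from the examples above (the discovery token and hash are placeholders):

# On a control-plane node: recreate the bootstrap token and refresh the
# cluster-info ConfigMap (the alternative suggested above to editing it by hand)
kubeadm init phase bootstrap-token

# On each worker node: re-render /etc/kubernetes/kubelet.conf against the new endpoint
kubeadm join phase kubelet-start 192.168.33.100:443 \
  --discovery-token <token> --discovery-token-ca-cert-hash sha256:<hash>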

@blurpy

@timfeirg

timfeirg commented Dec 6, 2019

I don't think kubeadm join phase kubelet-start should use cluster-info when rendering /etc/kubernetes/kubelet.conf. It doesn't seem properly maintained in the current kubeadm workflow (it's not in the official docs). Are there any design concerns behind this?

I'm not familiar with the code, but I think we can get a more accurate view of the current cluster configuration from the kubeadm-config ConfigMap. It has been mentioned in several community tutorials (like this one) that after a controlPlaneEndpoint change, we should re-upload the kubeadm config.
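
For reference, the endpoint currently recorded in that ConfigMap can be inspected with something like:

kubectl -n kube-system get cm kubeadm-config -o yaml | grep controlPlaneEndpoint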

Willing to draft a PR on this if the above suggestion makes any sense. @ereslibre @neolit123

@neolit123
Member

neolit123 commented Dec 6, 2019

The cluster-info ConfigMap doesn't get updated when a new control-plane node is added.

Changing the controlPlaneEndpoint is not really something that kubeadm supports; it also means that users need to re-sign certificates. So cluster-info seems like a valid source of truth for the server address, as long as the user knows how to set up their cluster properly for the long term.

Using an FQDN for the controlPlaneEndpoint, however, is encouraged even for single-control-plane scenarios.

https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/#considerations-about-apiserver-advertise-address-and-controlplaneendpoint
TL;DR: always use controlPlaneEndpoint with a DNS name.

Willing to draft a PR on this if the above suggestion makes any sense.

We should have a proper discussion before such a PR.

@timfeirg

timfeirg commented Dec 6, 2019

Changing the controlPlaneEndpoint is not really something that kubeadm supports.

And yet this seems very much needed: if one fails to notice this and uses the initial master's IP address as the controlPlaneEndpoint, there is no way to migrate to an HA setup afterwards, at least not with the workflows described in the official docs.

The community has already hacked its way around this, although with problems here and there. But I think the tutorial mentioned above, together with the workarounds described in this issue, already forms a viable way to migrate the controlPlaneEndpoint of a cluster.

@neolit123
Member

And yet this seems very much needed: if one fails to notice this and uses the initial master's IP address as the controlPlaneEndpoint, there is no way to migrate to an HA setup afterwards, at least not with the workflows described in the official docs.

With Kubernetes there are at least 100 ways for the cluster operator to shoot themselves in the foot, long term even.
The kubeadm maintainers are doing their best to document the good and bad practices.
Migrating to HA works nicely if one uses the CPE from the beginning.

The community has already hacked its way around this, although with problems here and there. But I think the tutorial mentioned above, together with the workarounds described in this issue, already forms a viable way to migrate the controlPlaneEndpoint of a cluster.

We might provide a way to "change the cluster" like that using the kubeadm operator, but that work is still experimental:
#1698

@timfeirg

timfeirg commented Dec 6, 2019

Consider adding a check somewhere in the kubeadm init process and warning the user if they are not using a CPE?

@neolit123
Member

I'm personally not in favor of adding such a warning, given we have this in the single-CP guide, but others should comment too.

@blurpy
Author

blurpy commented Dec 9, 2019

The kubeadm maintainers are doing their best to document the good and bad practices.
Migrating to HA works nicely if one uses the CPE from the beginning.

As far as I can see, mention of CPE was added to the docs for 1.16, so I'm guessing there are a lot more clusters out there missing this. There's also the case where you have to change the endpoint address for some reason.

We might provide a way to "change the cluster" like that using the kubeadm operator, but that work is still experimental.

Would it be a lot of work to add a phase for kubeadm to change the CPE (for the shorter term), based on the value in the kubeadm-config-file? It seems like a useful function to have in kubeadm. And then there's no need for all the warnings, because moving to HA is a supported strategy even if you never thought you needed it when you started off.

@neolit123
Member

neolit123 commented Dec 9, 2019

There's also the case where you have to change the endpoint address for some reason.

That is why one should use a domain name.

Would it be a lot of work to add a phase for kubeadm to change the CPE (for the shorter term), based on the value in the kubeadm-config-file? It seems like a useful function to have in kubeadm. And then there's no need for all the warnings, because moving to HA is a supported strategy even if you never thought you needed it when you started off.

Moving to HA is only supported if one was using the control-plane endpoint. If they have not used a control-plane endpoint, moving to HA is difficult, and I don't think we should add a phase for it. A "phase" in kubeadm terms is something that is part of the standard workflow.

@blurpy
Author

blurpy commented Dec 9, 2019

That is why one should use a domain name.

It's what I was thinking about. Domain names are not always forever.

If they have not used a control-plane endpoint, moving to HA is difficult, and I don't think we should add a phase for it

It is definitely difficult, which is exactly why help from kubeadm would be very welcome. Do you think the steps described in this issue are not safe for production use?

a "phase" in kubeadm terms is something that is part of the standard workflow.

Could moving between non-HA and HA be a standard workflow?

@neolit123
Member

Could moving between non-HA and HA be a standard workflow?

Not for the init or join phases; this might be a separate command.

But it might make more sense to have this as a guide in the docs, without introducing new commands.

@neolit123
Member

neolit123 commented Jan 20, 2020

Getting back to this.

It's what I was thinking about. Domain names are not always forever.

It's the responsibility of the operator to prepare the right infrastructure and guarantee connectivity in the cluster.

As can be seen in the discussion in #338, changing the master IP is not so simple and can cause a number of issues, including in Services and Pod network plugins. It can disrupt workloads, controllers, and the operation of the cluster as a whole. This is not a kubeadm-only problem, but a k8s problem: too many components can depend on a hardcoded IP.

New users should be really careful when they pick the cluster endpoint!

We document this here:
https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/#considerations-about-apiserver-advertise-address-and-controlplaneendpoint

Picking a domain name is highly advised. Running a local network DNS server with a CNAME record is one option; another is to add a mapping in /etc/hosts on every node (see the sketch below).
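
For illustration only, such a mapping could look like this (the name k8s-api.example.internal is hypothetical; the IP is the load balancer from this issue):

# /etc/hosts on every node
192.168.33.100  k8s-api.example.internal

The cluster would then be created with something like kubeadm init --control-plane-endpoint "k8s-api.example.internal:443" --upload-certs, so the endpoint can later be repointed by changing only name resolution.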

Existing users with single-CP-node IP endpoints who wish to move to a new IP, or to a control-plane-endpoint FQDN, can follow the manual guides the k8s community has created, but be aware that such a guide may not cover all the details of their clusters.

If someone is willing to work on a documentation PR for the k8s website on this topic, please log a new k/kubeadm issue with your proposal and let's discuss it first. Thanks!

/close

@k8s-ci-robot
Contributor

@neolit123: Closing this issue.


@neolit123 neolit123 added kind/documentation Categorizes issue or PR as related to documentation. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jan 20, 2020