
Kubeadm update to 1.10 fails on ha k8s/etcd cluster #837

Closed
brokenmass opened this issue May 20, 2018 · 13 comments
Labels: area/HA, area/upgrades, area/UX, documentation/improvement, kind/bug, priority/important-soon

Comments

brokenmass commented May 20, 2018

BUG REPORT

Versions

kubeadm version: 1.10.2

Environment:

  • Kubernetes version: 1.9.3
  • Cloud provider or hardware configuration: 3 x k8s master HA
  • OS: RHEL7
  • Kernel: 3.10.0-693.11.6.el7.x86_64

What happened?

A couple of months ago I created a Kubernetes 1.9.3 HA cluster using kubeadm 1.9.3, following the 'official' documentation (https://kubernetes.io/docs/setup/independent/high-availability/) and hosting the etcd HA cluster on the master nodes as static pods.

I wanted to upgrade my cluster to k8s 1.10.2 using the latest kubeadm. After updating kubeadm, running kubeadm upgrade plan produced the following error:

[root@shared-cob-01 tmp]# kubeadm upgrade plan
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade/plan] computing upgrade possibilities
[upgrade] Fetching available versions to upgrade to
[upgrade/versions] Cluster version: v1.9.3
[upgrade/versions] kubeadm version: v1.10.2
[upgrade/versions] Latest stable version: v1.10.2
[upgrade/versions] FATAL: context deadline exceeded

I investigated the issue and found two root causes:

1) kubeadm doesn't identify etcd cluster as TLS enabled

The guide instructs you to use the following command in the etcd static pod:

etcd --name <name> \
  --data-dir /var/lib/etcd \
  --listen-client-urls http://localhost:2379 \
  --advertise-client-urls http://localhost:2379 \
  --listen-peer-urls http://localhost:2380 \
  --initial-advertise-peer-urls http://localhost:2380 \
  --cert-file=/certs/server.pem \
  --key-file=/certs/server-key.pem \
  --client-cert-auth \
  --trusted-ca-file=/certs/ca.pem \
  --peer-cert-file=/certs/peer.pem \
  --peer-key-file=/certs/peer-key.pem \
  --peer-client-cert-auth \
  --peer-trusted-ca-file=/certs/ca.pem \
  --initial-cluster etcd0=https://<etcd0-ip-address>:2380,etcd1=https://<etcd1-ip-address>:2380,etcd2=https://<etcd2-ip-address>:2380 \
  --initial-cluster-token my-etcd-token \
  --initial-cluster-state new

kubeadm >= 1.10 checks (here: https://github.com/kubernetes/kubernetes/blob/release-1.10/cmd/kubeadm/app/util/etcd/etcd.go#L56) whether etcd has TLS enabled by looking for the presence of the following flags in the static pod command:

"--cert-file=",
"--key-file=",
"--trusted-ca-file=",
"--client-cert-auth=",
"--peer-cert-file=",
"--peer-key-file=",
"--peer-trusted-ca-file=",
"--peer-client-cert-auth=",

but as the instructions use the flags --client-cert-auth and --peer-client-cert-auth without any value (they are booleans), kubeadm did not recognise the etcd cluster as TLS enabled (see the sketch below).
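
Here is a minimal Go sketch approximating that check (it is not the actual kubeadm source): each required flag is matched as a prefix that already contains the '=', so the guide's bare boolean flags never match.

package main

import (
    "fmt"
    "strings"
)

// Flags kubeadm 1.10 looks for in the etcd static pod command (note the '=').
var requiredFlags = []string{
    "--cert-file=", "--key-file=",
    "--trusted-ca-file=", "--client-cert-auth=",
    "--peer-cert-file=", "--peer-key-file=",
    "--peer-trusted-ca-file=", "--peer-client-cert-auth=",
}

// hasTLS reports whether every required flag appears in the pod command.
func hasTLS(command []string) bool {
    for _, flag := range requiredFlags {
        found := false
        for _, arg := range command {
            if strings.HasPrefix(arg, flag) {
                found = true
                break
            }
        }
        if !found {
            return false
        }
    }
    return true
}

func main() {
    // The TLS flags exactly as the HA guide writes them: booleans without "=true".
    fromGuide := []string{
        "--cert-file=/certs/server.pem", "--key-file=/certs/server-key.pem",
        "--client-cert-auth", "--trusted-ca-file=/certs/ca.pem",
        "--peer-cert-file=/certs/peer.pem", "--peer-key-file=/certs/peer-key.pem",
        "--peer-client-cert-auth", "--peer-trusted-ca-file=/certs/ca.pem",
    }
    fmt.Println(hasTLS(fromGuide)) // false: "--client-cert-auth" lacks the '='
}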

PERSONAL FIX:
I updated my etcd static pod command to use --client-cert-auth=true and --peer-client-cert-auth=true.

GENERAL FIX:
Update the instructions to use --client-cert-auth=true and --peer-client-cert-auth=true, and relax the kubeadm checks to match flags like "--peer-cert-file" and "--peer-key-file" without the trailing equals (see the sketch below).
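
A drop-in variant of the matcher in the sketch above, showing the proposed relaxation (the name is mine, not kubeadm's): a boolean flag passes with or without a value.

// flagPresent accepts both "--client-cert-auth" and "--client-cert-auth=true".
func flagPresent(command []string, name string) bool {
    for _, arg := range command {
        if arg == name || strings.HasPrefix(arg, name+"=") {
            return true
        }
    }
    return false
}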

2) kubeadm didn't use the correct certificates

After fixing point 1 the problem still persisted, because kubeadm was not using the right certificates.
Following the kubeadm HA guide, the created certificates are ca.pem, ca-key.pem, peer.pem, peer-key.pem, client.pem and client-key.pem, but the latest kubeadm expects ca.crt, ca.key, peer.crt, peer.key, healthcheck-client.crt and healthcheck-client.key instead.
The kubeadm-config MasterConfiguration keys etcd.caFile, etcd.certFile and etcd.keyFile are ignored.

PERSONAL FIX:
I renamed the .pem certificates to their .crt and .key equivalents and updated the etcd static pod configuration to use them.

GENERAL FIX:
Use the kubeadm-config data.caFile, data.certFile and data.keyFile values, infer the right certificates from the etcd static pod definition (pod path + volumes hostPath), and/or create a new temporary client certificate to use during the upgrade; a possible inference helper is sketched below.
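
For the inference option, the certificate paths are already present as flag values in the static pod command; a hypothetical helper (continuing the Go sketch above; the name is mine, not kubeadm's) could read them instead of assuming fixed file names:

// flagValue extracts the value of a "--name=value" flag from the pod command;
// an empty string means the flag is absent.
func flagValue(command []string, name string) string {
    prefix := "--" + name + "="
    for _, arg := range command {
        if strings.HasPrefix(arg, prefix) {
            return strings.TrimPrefix(arg, prefix)
        }
    }
    return ""
}

// e.g. flagValue(etcdCommand, "trusted-ca-file") returns "/certs/ca.pem"
//      flagValue(etcdCommand, "cert-file") returns "/certs/server.pem"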

What you expected to happen?

The upgrade plan should have executed correctly.

How to reproduce it (as minimally and precisely as possible)?

Create a k8s HA cluster using kubeadm 1.9.3 following https://kubernetes.io/docs/setup/independent/high-availability/ and try to upgrade it to k8s >= 1.10 using the latest kubeadm.

@neolit123 neolit123 added kind/bug Categorizes issue or PR as related to a bug. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. documentation/improvement area/HA area/upgrades area/UX labels May 20, 2018

brokenmass commented May 22, 2018

This issue seems to be fixed in kubeadm 1.10.3, even though it will not automatically update the static etcd pods, as it recognises them as 'external'.


FloMedja commented May 22, 2018

I am using kubeadm 1.10.3 and have the same issue. My cluster is 1.10.2 with an external secure etcd.


FloMedja commented May 22, 2018

@brokenmass Do the values for your personal fix to the second cause you noted look like this:

  caFile: /etc/kubernetes/pki/etcd/ca.crt
  certFile: /etc/kubernetes/pki/etcd/healthcheck-client.crt
  keyFile: /etc/kubernetes/pki/etcd/healthcheck-client.key

FloMedja commented:

@detiber can you help, please?

brokenmass (Author) commented:

@FloMedja in my case the values look like:

  caFile: /etc/kubernetes/pki/etcd/ca.pem
  certFile: /etc/kubernetes/pki/etcd/client.pem
  keyFile: /etc/kubernetes/pki/etcd/client-key.pem

and 1.10.3 is working correctly
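
For anyone comparing configs, a standalone Go sketch that checks whether a given CA/client cert/key trio completes a TLS handshake with etcd. The paths come from the values above; the endpoint (127.0.0.1:2379) and the use of TLS on the client port are assumptions about the local setup.

package main

import (
    "crypto/tls"
    "crypto/x509"
    "fmt"
    "io/ioutil"
)

func main() {
    // Client certificate pair referenced in kubeadm-config (paths assumed).
    cert, err := tls.LoadX509KeyPair(
        "/etc/kubernetes/pki/etcd/client.pem",
        "/etc/kubernetes/pki/etcd/client-key.pem",
    )
    if err != nil {
        panic(err)
    }

    // Trust the etcd CA when verifying the server certificate.
    caPEM, err := ioutil.ReadFile("/etc/kubernetes/pki/etcd/ca.pem")
    if err != nil {
        panic(err)
    }
    pool := x509.NewCertPool()
    if !pool.AppendCertsFromPEM(caPEM) {
        panic("could not parse CA certificate")
    }

    // A failed handshake here means wrong certs, a non-TLS etcd, or a server
    // certificate whose SANs do not cover the dialed address.
    conn, err := tls.Dial("tcp", "127.0.0.1:2379", &tls.Config{
        Certificates: []tls.Certificate{cert},
        RootCAs:      pool,
    })
    if err != nil {
        panic(err)
    }
    defer conn.Close()
    fmt.Println("TLS handshake with etcd succeeded")
}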


@brokenmass So with kubeadm 1.10.3 everything works without any need for your personal fixes. In that case I am a little confused: I have kubeadm 1.10.3 but get the same error message that you mention in this bug report. I will double check my config; maybe I made a mistake elsewhere.


brokenmass commented May 23, 2018

Add here (or join the Kubernetes Slack and send me a direct message) your kubeadm-config, your etcd static pod YAMLs, and the full output of kubeadm upgrade plan.


detiber commented May 24, 2018

My apologies, I'm just now seeing this. @chuckha did the original work for the static-pod HA etcd docs; I'll work with him over the next couple of days to see if we can help straighten out the HA upgrades.

FloMedja commented:

@detiber thank you. The upgrade plan finally works, but I face some race condition issues when trying to upgrade the cluster. Sometimes it works, sometimes I get the same error as kubernetes/kubeadm/issues/850: kubeadm runs into a race condition when trying to restart a pod on one node.


detiber commented May 25, 2018

I ran into some snags getting a test env set up for this today, and I'm running out of time before my weekend starts. I'll pick back up on this early next week.

timothysc commented:

/assign @chuckha @detiber

@timothysc timothysc added this to the v1.11 milestone May 25, 2018
@timothysc timothysc removed the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label May 25, 2018
@timothysc timothysc added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label May 29, 2018

luxas commented Jun 12, 2018

@chuckha @detiber @stealthybox any update on this?

@timothysc timothysc modified the milestones: v1.11, v1.12 Jun 13, 2018
@timothysc timothysc assigned timothysc and unassigned chuckha Aug 21, 2018
timothysc commented:

So the 1.9->1.10 HA upgrade was not a supported or vetted path.

We are currently updating our docs for the 1.11->1.12 upgrade path, which we do plan to maintain going forward.
