kubeadm upgrade on arm from 1.8.5 -> 1.9.0 fails #599

Closed
brendandburns opened this issue Dec 18, 2017 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@brendandburns

What keywords did you search in kubeadm issues before filing this one?

upgrade, TLS

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version): 1.9.0

Environment:

  • Kubernetes version (use kubectl version): 1.8.5
  • Cloud provider or hardware configuration: arm (Raspberry Pi)
  • OS (e.g. from /etc/os-release): hypriot/raspbian
  • Kernel (e.g. uname -a): 4.4.50
  • Others:

What happened?

Tried kubeadm upgrade .., which timed out.

Manually copied in the kube-apiserver.yaml that kubeadm generated.
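
Roughly this (a sketch; the tmp directory name is the one kubeadm printed during this run, and the numeric suffix is random per run, so adjust to whatever exists on the node):

# Copy the manifest kubeadm generated into the static pod dir so the kubelet picks it up.
sudo cp /etc/kubernetes/tmp/kubeadm-upgraded-manifests105458021/kube-apiserver.yaml \
        /etc/kubernetes/manifests/kube-apiserver.yaml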

What you expected to happen?

Upgrade to 1.9.0 should work.

How to reproduce it (as minimally and precisely as possible)?

Install a 1.8.5 cluster, then upgrade to 1.9.0 using kubeadm.
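
Something like the following (a sketch, assuming a Debian-based ARM install from the official Kubernetes apt repo; package versions use the usual -00 suffix):

# Install and initialize a 1.8.5 cluster.
apt-get install -y kubelet=1.8.5-00 kubeadm=1.8.5-00 kubectl=1.8.5-00
kubeadm init
# ...join the workers, confirm the cluster is healthy, then upgrade kubeadm and apply:
apt-get install -y kubeadm=1.9.0-00
kubeadm upgrade plan
kubeadm upgrade apply v1.9.0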

Anything else we need to know?

Apiserver logs look like:

E1218 04:40:42.704397       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *rbac.Role: Get https://127.0.0.1:6443/apis/rbac.authorization.k8s.io/v1/roles?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1218 04:40:42.705841       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *core.ResourceQuota: Get https://127.0.0.1:6443/api/v1/resourcequotas?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1218 04:40:42.707026       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *rbac.ClusterRole: Get https://127.0.0.1:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1218 04:40:42.708110       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *rbac.RoleBinding: Get https://127.0.0.1:6443/apis/rbac.authorization.k8s.io/v1/rolebindings?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1218 04:40:42.709105       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *core.ServiceAccount: Get https://127.0.0.1:6443/api/v1/serviceaccounts?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1218 04:40:42.710080       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *core.Pod: Get https://127.0.0.1:6443/api/v1/pods?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1218 04:40:42.711157       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *core.PersistentVolume: Get https://127.0.0.1:6443/api/v1/persistentvolumes?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1218 04:40:42.712340       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *rbac.ClusterRoleBinding: Get https://127.0.0.1:6443/apis/rbac.authorization.k8s.io/v1/clusterrolebindings?limit=500&resourceVersion=0: net/http: TLS handshake timeout
I1218 04:40:42.717755       1 logs.go:41] http: TLS handshake error from 10.0.0.3:44016: EOF
I1218 04:40:42.746483       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39322: EOF
I1218 04:40:42.792235       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39326: EOF
I1218 04:40:42.873760       1 logs.go:41] http: TLS handshake error from 10.0.0.4:36825: EOF
I1218 04:40:42.887385       1 logs.go:41] http: TLS handshake error from 10.0.0.3:44010: EOF
I1218 04:40:42.906466       1 logs.go:41] http: TLS handshake error from 127.0.0.1:59682: EOF
I1218 04:40:42.961715       1 logs.go:41] http: TLS handshake error from 10.0.0.2:46824: EOF
I1218 04:40:42.983181       1 logs.go:41] http: TLS handshake error from 10.0.0.4:42166: EOF
I1218 04:40:43.035847       1 logs.go:41] http: TLS handshake error from 10.0.0.4:36844: EOF
I1218 04:40:43.073853       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39706: EOF
I1218 04:40:43.101099       1 logs.go:41] http: TLS handshake error from 10.0.0.3:43986: EOF
I1218 04:40:43.106547       1 logs.go:41] http: TLS handshake error from 10.0.0.2:46846: EOF
I1218 04:40:43.124883       1 logs.go:41] http: TLS handshake error from 10.0.0.2:59200: EOF
I1218 04:40:43.135636       1 logs.go:41] http: TLS handshake error from 10.0.0.2:38988: EOF
I1218 04:40:43.139734       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39344: EOF
I1218 04:40:43.276876       1 logs.go:41] http: TLS handshake error from 127.0.0.1:59676: read tcp 127.0.0.1:6443->127.0.0.1:59676: read: connection reset by peer
I1218 04:40:43.295881       1 logs.go:41] http: TLS handshake error from 10.0.0.4:36894: EOF
I1218 04:40:43.328730       1 logs.go:41] http: TLS handshake error from 10.0.0.2:39052: EOF
I1218 04:40:43.437586       1 logs.go:41] http: TLS handshake error from 127.0.0.1:59668: EOF
I1218 04:40:43.457870       1 logs.go:41] http: TLS handshake error from 127.0.0.1:59684: read tcp 127.0.0.1:6443->127.0.0.1:59684: read: connection reset by peer
I1218 04:40:43.463332       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39698: EOF
I1218 04:40:43.482961       1 logs.go:41] http: TLS handshake error from 10.0.0.2:40512: EOF
I1218 04:40:43.543943       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39312: EOF
I1218 04:40:43.598015       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39330: EOF
I1218 04:40:43.638007       1 logs.go:41] http: TLS handshake error from 10.0.0.4:36856: EOF
I1218 04:40:43.661470       1 logs.go:41] http: TLS handshake error from 10.0.0.3:58758: EOF
I1218 04:40:43.685554       1 logs.go:41] http: TLS handshake error from 10.0.0.3:44012: EOF
I1218 04:40:43.710389       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39711: EOF
I1218 04:40:43.714225       1 logs.go:41] http: TLS handshake error from 10.0.0.2:46822: EOF
I1218 04:40:43.720630       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39400: EOF
I1218 04:40:43.741250       1 logs.go:41] http: TLS handshake error from 127.0.0.1:59654: EOF
I1218 04:40:43.947767       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39404: EOF
E1218 04:40:43.949289       1 client_ca_hook.go:78] Post https://127.0.0.1:6443/api/v1/namespaces: net/http: TLS handshake timeout
F1218 04:40:43.950279       1 controller.go:133] Unable to perform initial IP allocation check: unable to refresh the service IP block: Get https://127.0.0.1:6443/api/v1/services: net/http: TLS handshake timeout
I1218 04:40:44.639152       1 logs.go:41] http: TLS handshake error from 10.0.0.2:40712: EOF
I1218 04:40:46.267009       1 logs.go:41] http: TLS handshake error from 10.0.0.4:42148: EOF
I1218 04:40:46.267803       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39664: EOF
I1218 04:40:46.268393       1 logs.go:41] http: TLS handshake error from 10.0.0.2:40482: EOF
I1218 04:40:46.268963       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39350: EOF
I1218 04:40:46.269512       1 logs.go:41] http: TLS handshake error from 10.0.0.4:36906: EOF
I1218 04:40:46.269994       1 logs.go:41] http: TLS handshake error from 10.0.0.2:40474: EOF
I1218 04:40:46.270533       1 logs.go:41] http: TLS handshake error from 127.0.0.1:59686: EOF
@luxas
Member

luxas commented Dec 18, 2017

Is etcd still working? Can you paste the output of kubeadm upgrade?
I could try to reproduce this as well on an ARM machine -- we have automated upgrade tests running for the normal case so I guess this might be something arm32-specific...?

@luxas luxas added kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Dec 18, 2017
@brendandburns
Author

etcd is still working (though I had to manually upgrade etcd to 3.1.10, because the kubeadm upgrade timed out before etcd came back up when it tried to upgrade it).
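
For reference, the manual etcd bump was just an image edit in the etcd static pod manifest; a sketch, assuming the standard manifest path and the arch-suffixed image name kubeadm uses on arm (exact names may differ on your cluster):

# /etc/kubernetes/manifests/etcd.yaml (excerpt) -- image name assumed, adjust to your setup
spec:
  containers:
  - name: etcd
    image: gcr.io/google_containers/etcd-arm:3.1.10   # bumped from the 3.0.x image in the 1.8 manifest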

When I revert to the old 1.8.5 apiserver, the whole cluster snaps back into correct operation.

I'll try the upgrade again this evening and I'll send in more detailed logs.

@brendandburns
Author

Here's the output from kubeadm:

[upgrade/version] You have chosen to change the cluster version to "v1.9.0"
[upgrade/versions] Cluster version: v1.8.5
[upgrade/versions] kubeadm version: v1.9.0
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.9.0"...
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests105458021"
[controlplane] Wrote Static Pod manifest for component kube-apiserver to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests105458021/kube-apiserver.yaml"
[controlplane] Wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests105458021/kube-controller-manager.yaml"
[controlplane] Wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests105458021/kube-scheduler.yaml"
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests586955648/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]

@brendandburns
Author

So I dug into this a little more. I think there are two underlying issues:

  1. The default "time-to-healthy" for the apiserver is too short (at least on my rpis...). It is set to 15 seconds, but on my node the apiserver takes longer than that to come up. Changing it to 300 fixed things; this should probably be configurable in kubeadm (see the manifest sketch at the end of this comment)...

  2. Kubernetes 1.9.0 appears to be right on the edge in terms of memory use for what an rpi stack can handle. At steady state, my master node has ~60 MB of RAM free, and when the apiserver is just coming up and under heavy load from various components, it drops even lower than that.

There's not much that can be done here: I pulled a profile, and though there are some improvements that could help, there's no low-hanging fruit...

The "right" answer would be to move etcd or some other component to a different node to relieve some of the memory pressure.

@0xmichalis
Contributor

Kubernetes 1.9.0 appears to be right on the edge in terms of memory use for what an rpi stack can handle. At steady state, my master node has ~60 MB of RAM free, and when the apiserver is just coming up and under heavy load from various components, it drops even lower than that.

Our docs already suggest machines with at least 2 GB of RAM. This is unfortunate for rpis, but there are other ARM options, like the Odroid C2, that meet the requirements and are known to run k8s (and to outperform rpis). I am waiting for two Rock64 machines with 4 GB of RAM each, hoping to get them working, too.

Closing in favor of #644

/close

@0xmichalis
Contributor

0xmichalis commented Jan 7, 2018

Also, this may be an issue with the OS you are running. I am also using Raspberry Pis, running the stock Raspbian Lite image, and have performed every upgrade since 1.7 successfully (up to the latest, 1.9.1). There are even more lightweight alternatives, like DietPi, which I can confirm works like a dream on an rpi as a k8s node.
