kubeadm upgrade on arm from 1.8.5 -> 1.9.0 fails #599

Closed
brendandburns opened this issue Dec 18, 2017 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@brendandburns

What keywords did you search in kubeadm issues before filing this one?

upgrade, TLS

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version): 1.9.0

Environment:

  • Kubernetes version (use kubectl version): 1.8.5
  • Cloud provider or hardware configuration: arm (Raspberry Pi)
  • OS (e.g. from /etc/os-release): hypriot/raspbian
  • Kernel (e.g. uname -a): 4.4.50
  • Others:

What happened?

Tried kubeadm upgrade .., which timed out.

Manually copied in the kube-apiserver.yaml that kubeadm generated.
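
Roughly this (a sketch; the tmp directory name is the one kubeadm printed during this run, and the numeric suffix is random per run, so adjust to whatever exists on the node):

# Copy the manifest kubeadm generated into the static pod dir so the kubelet picks it up.
sudo cp /etc/kubernetes/tmp/kubeadm-upgraded-manifests105458021/kube-apiserver.yaml \
        /etc/kubernetes/manifests/kube-apiserver.yaml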

What you expected to happen?

Upgrade to 1.9.0 should work.

How to reproduce it (as minimally and precisely as possible)?

Install a 1.8.5 cluster, then upgrade to 1.9.0 using kubeadm.
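
Something like the following (a sketch, assuming a Debian-based ARM install from the official Kubernetes apt repo; package versions use the usual -00 suffix):

# Install and initialize a 1.8.5 cluster.
apt-get install -y kubelet=1.8.5-00 kubeadm=1.8.5-00 kubectl=1.8.5-00
kubeadm init
# ...join the workers, confirm the cluster is healthy, then upgrade kubeadm and apply:
apt-get install -y kubeadm=1.9.0-00
kubeadm upgrade plan
kubeadm upgrade apply v1.9.0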

Anything else we need to know?

Apiserver logs look like:

E1218 04:40:42.704397       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *rbac.Role: Get https://127.0.0.1:6443/apis/rbac.authorization.k8s.io/v1/roles?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1218 04:40:42.705841       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *core.ResourceQuota: Get https://127.0.0.1:6443/api/v1/resourcequotas?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1218 04:40:42.707026       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *rbac.ClusterRole: Get https://127.0.0.1:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1218 04:40:42.708110       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *rbac.RoleBinding: Get https://127.0.0.1:6443/apis/rbac.authorization.k8s.io/v1/rolebindings?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1218 04:40:42.709105       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *core.ServiceAccount: Get https://127.0.0.1:6443/api/v1/serviceaccounts?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1218 04:40:42.710080       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *core.Pod: Get https://127.0.0.1:6443/api/v1/pods?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1218 04:40:42.711157       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *core.PersistentVolume: Get https://127.0.0.1:6443/api/v1/persistentvolumes?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1218 04:40:42.712340       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *rbac.ClusterRoleBinding: Get https://127.0.0.1:6443/apis/rbac.authorization.k8s.io/v1/clusterrolebindings?limit=500&resourceVersion=0: net/http: TLS handshake timeout
I1218 04:40:42.717755       1 logs.go:41] http: TLS handshake error from 10.0.0.3:44016: EOF
I1218 04:40:42.746483       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39322: EOF
I1218 04:40:42.792235       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39326: EOF
I1218 04:40:42.873760       1 logs.go:41] http: TLS handshake error from 10.0.0.4:36825: EOF
I1218 04:40:42.887385       1 logs.go:41] http: TLS handshake error from 10.0.0.3:44010: EOF
I1218 04:40:42.906466       1 logs.go:41] http: TLS handshake error from 127.0.0.1:59682: EOF
I1218 04:40:42.961715       1 logs.go:41] http: TLS handshake error from 10.0.0.2:46824: EOF
I1218 04:40:42.983181       1 logs.go:41] http: TLS handshake error from 10.0.0.4:42166: EOF
I1218 04:40:43.035847       1 logs.go:41] http: TLS handshake error from 10.0.0.4:36844: EOF
I1218 04:40:43.073853       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39706: EOF
I1218 04:40:43.101099       1 logs.go:41] http: TLS handshake error from 10.0.0.3:43986: EOF
I1218 04:40:43.106547       1 logs.go:41] http: TLS handshake error from 10.0.0.2:46846: EOF
I1218 04:40:43.124883       1 logs.go:41] http: TLS handshake error from 10.0.0.2:59200: EOF
I1218 04:40:43.135636       1 logs.go:41] http: TLS handshake error from 10.0.0.2:38988: EOF
I1218 04:40:43.139734       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39344: EOF
I1218 04:40:43.276876       1 logs.go:41] http: TLS handshake error from 127.0.0.1:59676: read tcp 127.0.0.1:6443->127.0.0.1:59676: read: connection reset by peer
I1218 04:40:43.295881       1 logs.go:41] http: TLS handshake error from 10.0.0.4:36894: EOF
I1218 04:40:43.328730       1 logs.go:41] http: TLS handshake error from 10.0.0.2:39052: EOF
I1218 04:40:43.437586       1 logs.go:41] http: TLS handshake error from 127.0.0.1:59668: EOF
I1218 04:40:43.457870       1 logs.go:41] http: TLS handshake error from 127.0.0.1:59684: read tcp 127.0.0.1:6443->127.0.0.1:59684: read: connection reset by peer
I1218 04:40:43.463332       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39698: EOF
I1218 04:40:43.482961       1 logs.go:41] http: TLS handshake error from 10.0.0.2:40512: EOF
I1218 04:40:43.543943       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39312: EOF
I1218 04:40:43.598015       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39330: EOF
I1218 04:40:43.638007       1 logs.go:41] http: TLS handshake error from 10.0.0.4:36856: EOF
I1218 04:40:43.661470       1 logs.go:41] http: TLS handshake error from 10.0.0.3:58758: EOF
I1218 04:40:43.685554       1 logs.go:41] http: TLS handshake error from 10.0.0.3:44012: EOF
I1218 04:40:43.710389       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39711: EOF
I1218 04:40:43.714225       1 logs.go:41] http: TLS handshake error from 10.0.0.2:46822: EOF
I1218 04:40:43.720630       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39400: EOF
I1218 04:40:43.741250       1 logs.go:41] http: TLS handshake error from 127.0.0.1:59654: EOF
I1218 04:40:43.947767       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39404: EOF
E1218 04:40:43.949289       1 client_ca_hook.go:78] Post https://127.0.0.1:6443/api/v1/namespaces: net/http: TLS handshake timeout
F1218 04:40:43.950279       1 controller.go:133] Unable to perform initial IP allocation check: unable to refresh the service IP block: Get https://127.0.0.1:6443/api/v1/services: net/http: TLS handshake timeout
I1218 04:40:44.639152       1 logs.go:41] http: TLS handshake error from 10.0.0.2:40712: EOF
I1218 04:40:46.267009       1 logs.go:41] http: TLS handshake error from 10.0.0.4:42148: EOF
I1218 04:40:46.267803       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39664: EOF
I1218 04:40:46.268393       1 logs.go:41] http: TLS handshake error from 10.0.0.2:40482: EOF
I1218 04:40:46.268963       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39350: EOF
I1218 04:40:46.269512       1 logs.go:41] http: TLS handshake error from 10.0.0.4:36906: EOF
I1218 04:40:46.269994       1 logs.go:41] http: TLS handshake error from 10.0.0.2:40474: EOF
I1218 04:40:46.270533       1 logs.go:41] http: TLS handshake error from 127.0.0.1:59686: EOF
@luxas
Member

luxas commented Dec 18, 2017

Is etcd still working? Can you paste the output of kubeadm upgrade?
I could try to reproduce this as well on an ARM machine -- we have automated upgrade tests running for the normal case so I guess this might be something arm32-specific...?

@luxas luxas added kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Dec 18, 2017
@brendandburns
Author

etcd is still working (though I had to manually upgrade etcd to 3.1.10, because the kubeadm upgrade timed out before etcd came back up when it tried to upgrade it).
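
For reference, the manual etcd bump was just an image edit in the etcd static pod manifest; a sketch, assuming the standard manifest path and the arch-suffixed image name kubeadm uses on arm (exact names may differ on your cluster):

# /etc/kubernetes/manifests/etcd.yaml (excerpt) -- image name assumed, adjust to your setup
spec:
  containers:
  - name: etcd
    image: gcr.io/google_containers/etcd-arm:3.1.10   # bumped from the 3.0.x image in the 1.8 manifest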

When I revert to the old 1.8.5 apiserver, the whole cluster snaps back into correct operation.

I'll try the upgrade again this evening and I'll send in more detailed logs.

@brendandburns
Author

Here's the output from kubeadm:

[upgrade/version] You have chosen to change the cluster version to "v1.9.0"
[upgrade/versions] Cluster version: v1.8.5
[upgrade/versions] kubeadm version: v1.9.0
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.9.0"...
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests105458021"
[controlplane] Wrote Static Pod manifest for component kube-apiserver to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests105458021/kube-apiserver.yaml"
[controlplane] Wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests105458021/kube-controller-manager.yaml"
[controlplane] Wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests105458021/kube-scheduler.yaml"
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests586955648/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]

@brendandburns
Author

So I dug into this a little more. I think there are two underlying issues:

  1. The default "time-to-healthy" for the apiserver is too short (at least on my rpis...). It is set to 15 seconds, but on my node the apiserver takes longer than that to come up. Changing it to 300 fixed things; this should probably be configurable in kubeadm (see the manifest sketch at the end of this comment)...

  2. Kubernetes 1.9.0 appears to be right on the edge in terms of memory use for what an rpi stack can handle. At steady state, my master node has ~60 MB of RAM free, and when the apiserver is just coming up and under heavy load from various components, it drops even lower than that.

There's not much that can be done here: I pulled a profile, and though there are some improvements that could help, there's no low-hanging fruit...

The "right" answer would be to move etcd or some other component to a different node to relieve some of the memory pressure.

@0xmichalis
Contributor

Kubernetes 1.9.0 appears to be right on the edge in terms of memory use for what an rpi stack can handle. At steady state, my master node has ~60 MB of RAM free, and when the apiserver is just coming up and under heavy load from various components, it drops even lower than that.

Our docs already suggest machines with at least 2 GB of RAM. This is unfortunate for rpis, but there are other ARM options, like the Odroid C2, that meet the requirements and are known to run k8s (and to outperform rpis). I am waiting for two Rock64 machines with 4 GB of RAM each, hoping to get them working, too.

Closing in favor of #644

/close

@0xmichalis
Contributor

0xmichalis commented Jan 7, 2018

Also, this may be an issue with the OS you are running. I am also using Raspberry Pis, running the stock Raspbian Lite image, and have performed every upgrade since 1.7 successfully (up to the latest, 1.9.1). There are even more lightweight alternatives, like DietPi, which I can confirm works like a dream on an rpi as a k8s node.
