kube-apiserver not given enough time to start #6702

Closed
dobesv opened this issue Mar 29, 2019 · 3 comments
Comments

dobesv commented Mar 29, 2019

1. What kops version are you running? The command kops version will display this information.

Version 1.11.1 (git-0f2aa8d30)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.0", GitCommit:"641856db18352033a0d96dbc99153fa3b27298e5", GitTreeState:"clean", BuildDate:"2019-03-25T15:53:57Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.8", GitCommit:"4e209c9383fa00631d124c8adcc011d617339b3c", GitTreeState:"clean", BuildDate:"2019-02-28T18:40:05Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops upgrade cluster --yes
kops update cluster --yes
kops rolling-update cluster --yes

5. What happened after the commands executed?

Masters did not reliably come online and become ready after a restart or on first start; sometimes they do and sometimes they don't.

6. What did you expect to happen?

Masters come online without problems.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

https://gist.github.com/dobesv/12c2826b9b658b2ead290eca0c63acdf

8. Please run the commands with the most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or into a gist and provide the gist link here.

Excerpt of kube-apiserver.log:

https://gist.github.com/dobesv/d1897bf839dc095c074cba6612ad246e

9. Anything else we need to know?

After some poking around, the issue seems to be that kube-apiserver sometimes takes more than 30 seconds to fully start up. The livenessProbe is configured with a 15 second initial delay and a 15 second timeout, so kube-apiserver gets marked for termination just before it actually finishes starting up. It then gets a 30 second grace period before it is terminated by the kubelet.

I am not quite sure whether it is normal for kube-apiserver to take more than 30 seconds to start up; perhaps there is something wrong with my setup that is slowing that process down unreasonably.

However, if I ssh into the master and raise both of those livenessProbe values for kube-apiserver from 15 to 60, the master comes online OK (sketched below).

I guess my proposal here is to increase the default livenessProbe initialDelay (maybe to 60 seconds) or make it configurable.
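
For reference, the manual change on the master looks roughly like this. This is only a sketch: the manifest path, port, and probe path shown here are assumptions and may differ depending on the kops version and cluster configuration.

```yaml
# Excerpt of the kube-apiserver static pod manifest on a master node,
# e.g. /etc/kubernetes/manifests/kube-apiserver.manifest (path is an assumption).
livenessProbe:
  httpGet:
    host: 127.0.0.1
    path: /healthz
    port: 8080
  initialDelaySeconds: 60   # default was 15
  timeoutSeconds: 60        # default was 15
```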


k8s-ci-robot (Contributor) commented

@alok87: I was wondering why the frisbee was getting bigger, then it hit me.

In response to this:

/joke


dobesv (Author) commented Mar 31, 2019

Update: I increased the EC2 instance size from m3.medium to m5.large and that fixed the issue. I guess the instance was a bit overloaded on startup, which is why kube-apiserver took longer to start. Although this is a potential issue, I wonder whether it is worth addressing or if people just need to use bigger instances. If the current default instance type for masters is m3.medium, perhaps just changing that to m4.large would be acceptable; m3.medium instances are no longer considered "current" by AWS.
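
For anyone else hitting this, the change is roughly the following, applied with kops edit ig and then kops update cluster --yes and kops rolling-update cluster --yes. The instance group name and sizes below are assumptions specific to my cluster layout; use whatever your master instance group is called.

```yaml
# Sketch of the master InstanceGroup after `kops edit ig master-us-west-2a`
# (the IG name, minSize, and maxSize here are placeholders).
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: master-us-west-2a
spec:
  role: Master
  machineType: m5.large   # previously m3.medium
  minSize: 1
  maxSize: 1
```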

justinsb (Member) commented Apr 7, 2019

Thanks, @dobesv - I think you're right here. (I was actually involved in the upstream issue as well, as you can see!)

I also think you're right that we should look at changing to a more "modern" instance type; it is technically a behavioural change, but it shouldn't be a breaking change, and I think it is in keeping with the idea that if you don't specify an instance type it means "choose for me", rather than "use t2.medium".

justinsb added a commit to justinsb/kops that referenced this issue Apr 7, 2019
Fix kubernetes#6702

Parallel to upstream issue #71054