kube-apiserver not given enough time to start #6702

Closed
dobesv opened this issue Mar 29, 2019 · 3 comments
Comments

dobesv commented Mar 29, 2019

1. What kops version are you running? The command kops version will display this information.

Version 1.11.1 (git-0f2aa8d30)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.0", GitCommit:"641856db18352033a0d96dbc99153fa3b27298e5", GitTreeState:"clean", BuildDate:"2019-03-25T15:53:57Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.8", GitCommit:"4e209c9383fa00631d124c8adcc011d617339b3c", GitTreeState:"clean", BuildDate:"2019-02-28T18:40:05Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops upgrade cluster --yes
kops update cluster --yes
kops rolling-update cluster --yes

5. What happened after the commands executed?

Masters did not reliably come online and become ready after a restart or on first start; sometimes they do and sometimes they don't.

6. What did you expect to happen?

Masters come online without problems.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

https://gist.github.com/dobesv/12c2826b9b658b2ead290eca0c63acdf

8. Please run the commands with the most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or into a gist and provide the gist link here.

Excerpt of kube-apiserver.log:

https://gist.github.com/dobesv/d1897bf839dc095c074cba6612ad246e

9. Anything else we need to know?

After some poking around, the issue seems to be that kube-apiserver sometimes takes more than 30 seconds to fully start up. The livenessProbe is configured with a 15 second initial delay and a 15 second timeout, so kube-apiserver gets marked for termination just before it actually finishes starting up. It then gets a 30 second grace period before it is terminated by the kubelet.

I am not quite sure whether it is normal for kube-apiserver to take more than 30 seconds to start up; perhaps there is something wrong with my setup that is slowing that process down unreasonably.

However, if I ssh into the master and raise both of those livenessProbe values for kube-apiserver from 15 to 60, the master comes online OK (sketched below).

I guess my proposal here is to increase the default livenessProbe initialDelay (maybe to 60 seconds) or make it configurable.
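
For reference, the manual change on the master looks roughly like this. This is only a sketch: the manifest path, port, and probe path shown here are assumptions and may differ depending on the kops version and cluster configuration.

```yaml
# Excerpt of the kube-apiserver static pod manifest on a master node,
# e.g. /etc/kubernetes/manifests/kube-apiserver.manifest (path is an assumption).
livenessProbe:
  httpGet:
    host: 127.0.0.1
    path: /healthz
    port: 8080
  initialDelaySeconds: 60   # default was 15
  timeoutSeconds: 60        # default was 15
```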


k8s-ci-robot (Contributor) commented

@alok87: I was wondering why the frisbee was getting bigger, then it hit me.

In response to this:

/joke


dobesv (Author) commented Mar 31, 2019

Update: I increased the EC2 instance size from m3.medium to m5.large and that fixed the issue. I guess the instance was a bit overloaded on startup, which is why kube-apiserver took longer to start. Although this is a potential issue, I wonder whether it is worth addressing or if people just need to use bigger instances. If the current default instance type for masters is m3.medium, perhaps just changing that to m4.large would be acceptable; m3.medium instances are no longer considered "current" by AWS.
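
For anyone else hitting this, the change is roughly the following, applied with kops edit ig and then kops update cluster --yes and kops rolling-update cluster --yes. The instance group name and sizes below are assumptions specific to my cluster layout; use whatever your master instance group is called.

```yaml
# Sketch of the master InstanceGroup after `kops edit ig master-us-west-2a`
# (the IG name, minSize, and maxSize here are placeholders).
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: master-us-west-2a
spec:
  role: Master
  machineType: m5.large   # previously m3.medium
  minSize: 1
  maxSize: 1
```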

justinsb (Member) commented Apr 7, 2019

Thanks, @dobesv - I think you're right here. (I was actually involved in the upstream issue as well, as you can see!)

I also think you're right that we should look at changing to a more "modern" instance type; it is technically a behavioural change, but it shouldn't be a breaking change, and I think it is in keeping with the idea that if you don't specify an instance type it means "choose for me", rather than "use t2.medium".

justinsb added a commit to justinsb/kops that referenced this issue Apr 7, 2019
Fix kubernetes#6702

Parallel to upstream issue #71054