Workarounds for the time before kubeadm HA becomes available #546

Closed
mbert opened this issue Nov 16, 2017 · 74 comments
Labels
area/HA, documentation/content-gap, documentation/improvement, help wanted, priority/important-soon, triaged

Comments

@mbert
Contributor

mbert commented Nov 16, 2017

The planned HA features in kubeadm are not going to make it into v1.9 (see #261). So what can be done to make a cluster set up by kubeadm sufficiently HA?

This is what it looks like now:

  • Worker nodes can be scaled up to achieve acceptable redundancy.
  • Without a working active/active or at least active/passive master setup, master failures are likely to cause significant disruptions.

Hence an active/active or active/passive master setup needs to be created (i.e. mimic what kubeadm would supposedly be doing in the future):

  1. Replace the local etcd pod by an etcd cluster of at least 2 x number-of-masters size. This cluster could run on the OS rather than in K8s.
  2. Set up more master instances. That's the interesting bit. The Kubernetes guide for building HA clusters (https://kubernetes.io/docs/admin/high-availability/) can help in understanding what needs to be done. In the end I'd like to have simple step-by-step instructions that take the particularities of a kubeadm setup into consideration.
  3. Not sure whether this is necessary: probably set up haproxy/keepalived on the master hosts and move the original master's IP address plus SSL termination to it (a keepalived sketch follows below).

This seems achievable if converting the existing master instance to a cluster of masters (2) can be done (the Kubernetes guide for building HA clusters seems to indicate so). Active/active would not be more expensive than active/passive.
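For illustration, a minimal keepalived configuration for such a virtual IP could look roughly like the sketch below. The interface name, virtual IP, password and priorities are made-up placeholders, not values from an actual setup:

```
# On each master; use a different VRRP priority per host (e.g. 110, 100, 90).
cat > /etc/keepalived/keepalived.conf <<'EOF'
vrrp_instance K8S_APISERVER {
    state BACKUP            # let VRRP elect the active node
    interface eth0          # assumption: name of the primary NIC
    virtual_router_id 51
    priority 100            # the highest priority claims the VIP
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass k8s-vip   # assumption: shared VRRP password
    }
    virtual_ipaddress {
        10.0.0.100/24       # assumption: the cluster's virtual IP
    }
}
EOF
systemctl enable --now keepalived
```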

I am currently working on this. If I succeed I shall share what I find out here.

@mbert
Contributor Author

mbert commented Nov 17, 2017

See also https://github.com/cookeem/kubeadm-ha - this seems to cover what I want to achieve here.

@luxas
Member

luxas commented Nov 17, 2017

@mbert we started implementing the HA features and chopped wood on the underlying dependency stack now in v1.9, but it's a short cycle for a big task, so the work will continue in v1.10 as you pointed out.

For v1.9, we will document what you're describing here in the official docs though: how to achieve HA with external deps like setting up an LB.

@mbert
Contributor Author

mbert commented Nov 17, 2017

Excellent. I am digging through all this right now. I am currently stuck at bootstrapping masters 2 and 3, in particular how to configure kubelet and apiserver (how much can I reuse from master 1?) and etcd (I am thinking of using a bootstrap etcd on a separate machine for discovery). The guide from the docs is a bit terse when it comes to this.

@kcao3

kcao3 commented Nov 17, 2017

@mbert I have been following your comments here and I just want to let you know I followed the guide in docs and was able to stand up a working HA k8s cluster using kubeadm (v1.8.x).

If you are following this setup and you need to bootstrap masters 2 and 3, you can reuse almost everything from the first master. You then need to fix up the following configuration files on masters 2 and 3 to reflect the current host: /etc/kubernetes/manifests/kube-apiserver.yaml, /etc/kubernetes/kubelet.conf, /etc/kubernetes/admin.conf, and /etc/kubernetes/controller-manager.conf.
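A rough sketch of that fix-up, assuming the files were copied from master 1 first (the host names and IPs below are placeholders of mine, not from an actual setup):

```
# Run on master 2 (analogously on master 3) after copying /etc/kubernetes from master 1.
OLD_IP=10.0.0.11;  OLD_NAME=master1   # assumption: master 1's IP and host name
NEW_IP=10.0.0.12;  NEW_NAME=master2   # assumption: this host's IP and host name
for f in /etc/kubernetes/manifests/kube-apiserver.yaml \
         /etc/kubernetes/kubelet.conf \
         /etc/kubernetes/admin.conf \
         /etc/kubernetes/controller-manager.conf; do
  sed -i "s/${OLD_IP}/${NEW_IP}/g; s/${OLD_NAME}/${NEW_NAME}/g" "$f"
done
```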

Regarding etcd: if you follow the guide in the docs, you should stand up an external 3-node etcd cluster that spans the 3 k8s master nodes.

There is also one 'gotcha' that has NOT yet been covered in the guide. See this issue for details: cookeem/kubeadm-ha#6

I also asked a few questions related to kubeadm HA from this post: cookeem/kubeadm-ha#7

I really hope you can give me some thoughts on these.

Thank you in advance for your time.

@srflaxu40

This is great - definitely need this, as I am sure 99% of kubeadm users have a nagging paranoia in the back of their heads about HA of their master(s).

@mbert
Contributor Author

mbert commented Nov 17, 2017

@kcao3 thank you. I will look into this all on coming Monday. So I understand that it is OK to use identical certificates on all three masters?

If yes, I assume the next thing I'll try will be bringing up kubelet and apiserver on masters 2 and 3 using the configuration from master 1 (with modified IPs and host names in there, of course) and then bootstrapping the etcd cluster by putting a modified etcd.yaml into /etc/kubernetes/manifests.

Today I ran into problems because the running etcd on master 1 already had cluster information in its data dir, which I had to remove first, but I was still running into problems after that. I guess some good nights of sleep will be helpful.

Once I've got this running I shall document the whole process and publish it.

@mbert
Contributor Author

mbert commented Nov 17, 2017

@srflaxu40 yep, and in particular if you have an application that indirectly requires apiserver at runtime (legacy application and service discovery in my case) you cannot afford to lose the only master at any time.

@luxas luxas added this to the v1.9 milestone Nov 17, 2017
@luxas luxas added the documentation/content-gap, documentation/improvement, priority/important-soon, area/HA and kind/enhancement labels Nov 17, 2017
@mbert
Contributor Author

mbert commented Nov 20, 2017

Convert the single-instance etcd to a cluster

I have been able to replace the single etcd instance with a cluster in a fresh K8s cluster. The steps are roughly these:

  1. Set up a separate etcd server. This etcd instance is only needed for bootstrapping the cluster. Generate a discovery URL for 3 nodes on it (see https://coreos.com/etcd/docs/latest/op-guide/clustering.html#etcd-discovery and the sketch after this list).
  2. Copy /etc/kubernetes from master 1 to masters 2 and 3. Substitute host name and IP in /etc/kubernetes/*.* and /etc/kubernetes/manifests/*.*
  3. Create replacements for /etc/kubernetes/manifests/etcd.yaml on all three masters: set all announcement URLs to the respective hosts' primary IPs, all listen URLs to 0.0.0.0, and add the discovery URL from step 1. I used the attached Jinja2 template file etcd.yaml.j2.txt together with Ansible.
  4. Copy the etcd.yaml replacements to /etc/kubernetes/manifests on all three master nodes.
  5. Now things get time-critical. Wait for the local etcd process to terminate, then move /var/lib/etcd/member/wal somewhere else before the new process comes up (otherwise it will ignore the discovery URL).
  6. When the new etcd comes up it will now wait for the remaining two instances to join. Hence, quickly launch kubelet on the other two master nodes.
  7. Follow the etcd container's logs on the first master to see if something went completely wrong. If things are OK, then after some minutes the cluster will be operational again.
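For step 1, the discovery URL can be generated against the bootstrap etcd roughly as described in the CoreOS clustering guide linked above; the host name and token below are made up:

```
# Ask the bootstrap etcd (etcd v2 keys API) to manage a discovery token for a 3-node cluster.
TOKEN=$(uuidgen)
curl -X PUT "http://bootstrap-etcd.example.com:2379/v2/keys/discovery/${TOKEN}/_config/size" -d value=3
# This is the discovery URL to put into the generated etcd.yaml files:
echo "http://bootstrap-etcd.example.com:2379/v2/keys/discovery/${TOKEN}"
```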

Step 5 is somewhat awkward, and I have found that if I miss the right time here or need too much time to get the other two masters to join (step 6), my cluster gets into a state from which it can hardly recover. When this happened, the simplest solution I found was to shut down kubelet on masters 2 and 3, run kubeadm reset on all masters and minions, clear the /var/lib/etcd directories on all masters and set up a new cluster using kubeadm init (roughly as sketched below).
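Spelled out, that recovery amounts to roughly this (run per node as indicated):

```
# On masters 2 and 3: stop kubelet so the half-formed etcd members stay down.
systemctl stop kubelet
# On all masters and minions: wipe the kubeadm-generated state.
kubeadm reset
# On all masters: clear the etcd data directory as well.
rm -rf /var/lib/etcd/*
# Then on master 1: bootstrap a fresh cluster and re-join the other nodes.
kubeadm init
```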

While this works, I'd be interested in possible improvements: Is there any alternative, more elegant and robust approach to this (provided that I still want to follow the approach of running etcd in containers on the masters)?

This comment aims to collect feedback and hints at an early stage. I will post updates on the next steps in a similar way before finally documenting this as a followable guide.

@KeithTt

KeithTt commented Nov 21, 2017

@mbert Why don't you use an independent etcd cluster instead of creating it inside k8s?

@mbert
Contributor Author

mbert commented Nov 21, 2017

@KeithTt Thank you for your feedback. These were my considerations:

  1. Not to use any data.
  2. Stay as close to kubeadm's setup as possible.
  3. Have it supervised by K8s and integrated in whatever monitoring I set up for my system.
  4. Keep the number of services running on the OS low.
  5. It wouldn't make things easier since I'd still have to deal with (4) above.

If an independent etcd cluster's advantages outweigh the above list, I shall be happy to be convinced otherwise.

@luxas
Member

luxas commented Nov 21, 2017

@mbert Please make sure you sync with @jamiehannaford on this effort; he's also working on this and committed to making these docs a thing in v1.9.

@mbert are you available to join our SIG meeting today at 9 PT or the kubeadm implementation PR tomorrow at 9 PT? I'd love to discuss this with you in a call 👍

@mbert
Contributor Author

mbert commented Nov 21, 2017

@luxas actually it was @jamiehannaford who asked me to open this issue. Once I have got things running and documented I hope to get lots of feedback from him.
9PT, that's in an hour, right? That would be fine. Just let me know how to connect with you.

@bitgandtter

Following guides here and there I managed to do it; here are my final steps.

@timothysc
Member

/cc @craigtracey

@dimitrijezivkovic

@mbert

Created - not converted - a 3-master-node cluster using kubeadm, with a 3-node etcd cluster deployed on Kubernetes.

Here's what I needed to do:

  1. Create a 3-master-node cluster using kubeadm on bare-metal servers
  2. Deploy etcd cluster on 3 master nodes using kubeadm
  3. Use non-default pod-network cidr /27

Problems:

  1. Using a non-default pod-network CIDR is impossible to set up with kubeadm init
  2. No documentation on creating a multi-master cluster on bare metal exists. Other docs are not as detailed as they could be

The way I did it was using kubeadm alpha phase steps; a short list follows:

on all master nodes:

  1. Start docker - not kubelet

on masternode1:

  1. Create CA certs
  2. Create apiserver certs with the --apiserver-advertise-address, --service-cidr and --apiserver-cert-extra-sans parameters. Here, really only --apiserver-cert-extra-sans is mandatory.
  3. Create rest of the certs needed
  4. Create kubeconfig and controlplane configs
  5. Edit the newly created yaml files in the /etc/kubernetes/manifests directory to add any extra options you need.
    For me, here's where I set the non-default pod-network CIDR of /27 in kube-controller-manager.yaml. Also, remove NodeRestriction from --admission-control
  6. Copy the previously prepared yaml file for the etcd cluster into the /etc/kubernetes/manifests directory
  7. Copy the /etc/kubernetes directory to the rest of the master nodes and edit all the files needed to configure them for masternode2 and masternode3.
  8. Once all files are reconfigured, start kubelet ON ALL 3 MASTER NODES.
  9. Once all nodes are up, taint all master-nodes
  10. Bootstrap all tokens
  11. Create token for joining worker nodes
  12. Edit previously created masterConfig.yaml and update token parameter
  13. Upload masterConfig to kubernetes
  14. Install addons
  15. Generate --discovery-token-ca-cert-hash and add worker nodes

This is a really short list of what I did, and it can be automated and reproduced in 5 minutes. Also, for me the greatest bonus was that I was able to set a non-standard pod-network CIDR, as I had the restriction of not being able to spare a class B IP address range.

If you're interested in a more detailed version, please let me know and I'll try and create some docs on how this was done; a rough sketch of the phase commands follows below.
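A rough sketch of the phase commands involved, as far as I can reconstruct them from the v1.9 CLI; subcommand names and flags may well differ between versions, and kubeadm-config.yaml is a made-up file name for the prepared MasterConfiguration:

```
# On masternode1 (after starting docker, but not kubelet):
kubeadm alpha phase certs all --config kubeadm-config.yaml
kubeadm alpha phase kubeconfig all --config kubeadm-config.yaml
kubeadm alpha phase controlplane all --config kubeadm-config.yaml
# ... edit the generated manifests, drop in the etcd manifest, copy /etc/kubernetes
# to the other masters and adjust it per host, then on ALL masters:
systemctl start kubelet
# Back on masternode1, once the control plane is up (node selection flags may be needed):
kubeadm alpha phase mark-master
kubeadm alpha phase upload-config --config kubeadm-config.yaml
kubeadm alpha phase addon all --config kubeadm-config.yaml
```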

@mbert
Contributor Author

mbert commented Nov 23, 2017

@dimitrijezivkovic thank you for your comment. I think it would make sense to put all the relevant information together so that one piece of documentation comes out.

I plan to set up a google docs document and start documenting what I did (which is pretty bare-bones). I would then invite others to join and write extensions, corrections, comments?

@mbert
Contributor Author

mbert commented Nov 23, 2017

I have now "documented" a very simple setup in form of a small ansible project: https://github.com/mbert/kubeadm2ha

It is of course still work in progress, but it already allows setting up a multi-master cluster without any bells and whistles. I have tried to keep it as simple as possible so that by reading it one should be able to find out pretty easily what needs to be done in which order.

Tomorrow I will start writing this up as a simple cooking recipe in a Google Docs document and invite others to collaborate.

@anguslees
Member

anguslees commented Nov 24, 2017

Just to call it out explicitly, there's a bunch of orthogonal issues mashed together in the above conversation/suggestions. It might be useful to break these out separately, and perhaps prioritise some above others:

  • etcd data durability (multi etcd. Requires 2+ etcd nodes)
  • etcd data availability (multi etcd+redundancy. Requires 3+ etcd nodes)
  • apiserver availability (multi apiserver. Requires a loadbalancer/VIP or (at least) DNS with multiple A records)
  • cm/scheduler availability (multi cm/scheduler. Requires 2+ master nodes, and replicas=2+ on these jobs)
  • reboot-all-the-masters recovery (a challenge for self-hosted - requires some form of persistent pods for control plane)
  • kubeadm upgrade support for multi-apiserver/cm-scheduler (varies depending on self-hosted vs non-self-hosted)

IMO the bare minimum we need is etcd durability (or perhaps availability), and the rest can wait. That removes the "fear" factor, while still requiring some manual intervention to recover from a primary master failure (ie: an active/passive setup of sorts).

I think the details of the rest depend hugely on self-hosted vs "legacy", so I feel like it would simplify greatly if we just decided now to assume self-hosted (or not?) - or we clearly fork the workarounds/docs into those two buckets so we don't confuse readers by chopping and changing.

Aside: One of the challenges here is that just about everything to do with install+upgrade changes if you assume a self-hosted+HA setup (it mostly simplifies everything because you can use rolling upgrades, and in-built k8s machinery). I feel that by continually postponing this setup we've actually made it harder for ourselves to reach that eventual goal, and I worry that we're just going to keep pushing the "real" setup back further while we work on perfecting irrelevant single-master upgrades :( I would rather we addressed the HA setup first, and then worked backwards to try to produce a single-host approximation if required (perhaps by packing duplicate jobs temporarily onto the single host), rather than trying to solve single-host and then somehow think that experience will help us with multi-host.

@KeithTt

KeithTt commented Nov 24, 2017

@mbert I have achieved the HA setup by generating the certs manually for each node and without deleting NodeRestriction. I use haproxy+keepalived as the load balancer now; maybe LVS+keepalived would be better. I will document the details this weekend and hope to share them with you.
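For reference, the haproxy part of such a setup typically boils down to a TCP frontend/backend along these lines (the port and the backend addresses are placeholders, not the actual configuration described above):

```
# Append a TCP pass-through for the apiservers to haproxy's configuration.
# Bind to 8443 so it does not collide with a local apiserver on 6443.
cat >> /etc/haproxy/haproxy.cfg <<'EOF'
frontend kube-apiserver
    bind *:8443
    mode tcp
    default_backend kube-apiservers

backend kube-apiservers
    mode tcp
    balance roundrobin
    server master1 10.0.0.11:6443 check
    server master2 10.0.0.12:6443 check
    server master3 10.0.0.13:6443 check
EOF
systemctl restart haproxy
```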


@luxas
Member

luxas commented Nov 25, 2017

FYI all, @mbert has started working on a great WIP guide for kubeadm HA manually that we'll add to the v1.9 kubeadm docs eventually: https://docs.google.com/document/d/1rEMFuHo3rBJfFapKBInjCqm2d7xGkXzh0FpFO0cRuqg/edit

Please take a look at the doc everyone, and provide your comments. We'll soon-ish convert this into markdown and send as a PR to kubernetes/website.

Thank you @mbert and all the others that are active in this thread, this will be a great collaboration!

@andybrucenet

Hi @mbert and others - over the past year or so I have set up several k8s clusters (kubeadm and otherwise) driven from Cobbler / Puppet on CoreOS and CentOS. However, none of these has been HA.

My next task is to integrate K8s HA and I want to use kubeadm. I'm unsure whether to go with @mbert's HA setup guide or @jamiehannaford's HA guide.

Also - this morning I read @timothysc's Proposal for a highly available control plane configuration for ‘kubeadm’ deployments, and I like the "initial etcd seed" approach he outlines. However, I don't see that same approach in either @mbert's or @jamiehannaford's work. @mbert appears to use a single, k8s-hosted etcd, while @jamiehannaford's guide documents the classic approach of an external etcd (which is exactly what I have used for my other non-HA POC efforts).

What do you all recommend? External etcd, single self-hosted, or locating and using the "seed" etcd (with pivot to k8s-hosted)? If the last - what guide or documentation do you suggest?

TIA!

@jamiehannaford
Contributor

jamiehannaford commented Feb 2, 2018

@andybrucenet External etcd is recommended for HA setups (at least at this moment in time). CoreOS has recently dropped support for any kind of self-hosted setup. It should only really be used for dev, staging or casual clusters.

@mbert
Contributor Author

mbert commented Feb 2, 2018

@andybrucenet Not quite - I am using an external etcd cluster just like @jamiehannaford proposes in his guide. Actually, the approaches described in our respective documents should be fairly similar. It is based on setting up the etcd cluster you feel you need and then having kubeadm use it when bootstrapping the Kubernetes cluster.
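The "have kubeadm use it" part is essentially a MasterConfiguration that points at the external etcd endpoints; a sketch for the v1alpha1 config format (addresses, SANs and certificate paths are placeholders):

```
cat > kubeadm-ha-config.yaml <<'EOF'
apiVersion: kubeadm.k8s.io/v1alpha1
kind: MasterConfiguration
apiServerCertSANs:
- 10.0.0.100                     # assumption: the virtual IP / load balancer address
etcd:
  endpoints:
  - https://10.0.0.11:2379
  - https://10.0.0.12:2379
  - https://10.0.0.13:2379
  caFile: /etc/kubernetes/pki/etcd/ca.pem
  certFile: /etc/kubernetes/pki/etcd/client.pem
  keyFile: /etc/kubernetes/pki/etcd/client-key.pem
EOF
kubeadm init --config kubeadm-ha-config.yaml
```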

I am currently more or less about to finish my guide and the ansible-based implementation by documenting and implementing a working upgrade procedure - that (and some bugfixes) should be done sometime next week.

Not quite sure whether there will be any need to further transfer my guide into yours - @jamiehannaford, what do you think?

@mbert
Contributor Author

mbert commented Feb 5, 2018

Actually the hostname-override was unnecessary. When running kubeadm upgrade apply, some default settings overwrite my adaptations, e.g. NodeRestriction gets re-activated (also my scaling of Kube DNS instances gets reset, but this was of course not a show stopper here). Patching the NodeRestriction admission rule out of /etc/kubernetes/manifests/kube-apiserver.yaml did the trick.
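A blunt sketch of that patch (it assumes NodeRestriction is not the last entry in the admission-control list; the kubelet restarts the static pod once the manifest changes):

```
# Drop NodeRestriction from the apiserver's admission plugins after an upgrade
# has re-activated it.
sed -i 's/NodeRestriction,//' /etc/kubernetes/manifests/kube-apiserver.yaml
```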

@mbert
Contributor Author

mbert commented Feb 5, 2018

I have now added a chapter on upgrading HA clusters to my HA setup guide.

I have also added code for automating this process to my Ansible project on GitHub. Take a look at the README.md file there for more information.

@mattkelly

@mbert for the upgrade process you've outlined, what are the exact reasons for manually copying the configs and manifests from /etc/kubernetes on the primary master to the secondary masters rather than simply running kubeadm upgrade apply <version> on the secondary masters as well?

@mbert
Contributor Author

mbert commented Feb 9, 2018

@mattkelly It seemed rather dangerous to me.
Since the HA cluster's masters use an active/passive setup but kubeadm knows about only one master, I found running it again on a different master risky.
I may be wrong though.

@mbert
Contributor Author

mbert commented Feb 9, 2018

Replying to myself: Having looked at Jamie's guide on kubernetes.io, running kubeadm on the masters may work, even when setting up the cluster. I'll try this out next week and probably make some changes to my documents accordingly.

@mattkelly

mattkelly commented Feb 9, 2018

FWIW, running kubeadm on the secondary masters seems to have worked just fine for me (including upgrade) - but I need to better understand the exact risks at each stage. I've been following @jamiehannaford's guide which is automated by @petergardfjall's hakube-installer (no upgrade support yet though, so I tested that manually).

Edit: Also important to note is that I'm only testing on v1.9+. Upgrade was from v1.9.0 to v1.9.2.

@mbert
Contributor Author

mbert commented Feb 12, 2018

I have now followed the guide on kubernetes.io that @jamiehannaford created, i.e. ran kubeadm init on all master machines (after having copied /etc/kubernetes/pki/ca.* to the secondary masters). This works just fine for setting up the cluster. In order to be able to upgrade to v1.9.2, I am setting up v1.8.3 here.

Now I am running into trouble when trying to upgrade the cluster: Running kubeadm upgrade apply v1.9.2 on the first master fails:

[controlplane] Wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests872757515/kube-controller-manager.yaml"
[controlplane] Wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests872757515/kube-scheduler.yaml"
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests647361774/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]

This step fails reproducibly (I always start from scratch, i.e. remove all configuration files plus etcd data from all nodes before starting a new setup).

I tried out several variations, but no success:

  • Have kubelet use the local API Server instance or the one pointed to by the virtual IP
  • Have kube-proxy use the local API Server instance or the one pointed to by the virtual IP

I have attached some logs. However I cannot really find any common pattern that would explain this problem to me. Maybe it is something I just don't know?

upgrade-failed-proxy-on-vip.log
upgrade-failed-proxy-and-kubelet-on-vip.log
upgrade-failed-proxy-and-kubelet-on-local-ip.log

@mbert
Contributor Author

mbert commented Feb 12, 2018

Having tried out a few more things, it boils down to the following:

  • Updating the master which was set up last (i.e. the one on which kubeadm init was run last when setting up the cluster) works.
  • I can get the other nodes working, too, if I edit configmap/kubeadm-config and change the value of MasterConfiguration.nodeName in there to the respective master's host name, or simply delete that line (see the sketch at the end of this comment).

Others like @mattkelly have been able to perform the upgrade without editing configmap/kubeadm-config, hence the way I set things up must be somehow different.

Anybody got a clue what I should change, so that upgrading works without this (rather dirty) trick?

I have tried upgrading from both 1.8.3 and 1.9.0 to 1.9.2, with the same result.
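Spelled out, the workaround from the list above looks roughly like this (assuming the kubeadm-config ConfigMap in kube-system that kubeadm creates at init time):

```
# Before upgrading on a secondary master: adjust or delete the recorded nodeName.
kubectl -n kube-system edit configmap kubeadm-config
# ... change or remove the 'nodeName: <first master>' line in MasterConfiguration, then:
kubeadm upgrade apply v1.9.2
```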

@mattkelly

mattkelly commented Feb 13, 2018

@mbert I'm now reproducing your issue from a fresh v1.9.0 cluster created using hakube-installer. Trying to upgrade to v1.9.3. I can't think of anything that has changed with my workflow. I'll try to figure it out today.

I verified that deleting the nodeName line from configmap/kubeadm-config for each subsequent master fixes the issue.

@mbert
Contributor Author

mbert commented Feb 13, 2018

Thank you, that's very helpful. I have now added patching configmap/kubeadm-config to my instructions.

@mattkelly

mattkelly commented Feb 13, 2018

@mbert oops, I figured out the difference :). For previous upgrades I had been providing the config generated during setup via --config (muscle memory I guess). This is why I never needed the workaround. I believe that your workaround is more correct in case the cluster has changed since init time. It would be great to figure out how to avoid that hack, but it's not too bad in the meantime - especially compared to all of the other workarounds.
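In other words, something along these lines on each master, if the configuration file from init time is still around (the file name is a placeholder; as far as I know kubeadm upgrade apply accepted --config in v1.9):

```
# Reuse the per-master configuration given to kubeadm init so that the
# nodeName recorded in the cluster does not have to be patched by hand.
kubeadm upgrade apply v1.9.3 --config kubeadm-config.yaml
```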

mbert added a commit to mbert/website that referenced this issue Feb 16, 2018
- In the provided configuration file for `kubeadm init` the value for `apiserver-count` needs to be put in quotes.
- In addition to /etc/kubernetes/pki/ca.* also /etc/kubernetes/pki/sa.* need to be copied to the additional masters. See [this comment](kubernetes/kubeadm#546 (comment)) by @petergardfjall for details.
mbert added a commit to mbert/website that referenced this issue Feb 19, 2018
k8s-ci-robot pushed a commit to kubernetes/website that referenced this issue Feb 20, 2018
bsalamat pushed a commit to bsalamat/kubernetes.github.io that referenced this issue Feb 23, 2018
@ReSearchITEng

Hello,
Will kubeadm 1.10 remove any of the pre-steps/workarounds currently required for HA in 1.9?
E.g. the manual creation of a bootstrap etcd, generation of etcd keys, etc?

tehut pushed a commit to tehut/website that referenced this issue Mar 8, 2018
@timothysc timothysc modified the milestones: v1.11, v1.10 Apr 7, 2018
@timothysc
Member

Closing this item as the 1.10 doc is out and we will be moving to further the HA story in 1.11.

/cc @fabriziopandini
