
Production Quality Deployment #340

Closed
4 of 18 tasks
colhom opened this issue Mar 22, 2016 · 36 comments

Comments

@colhom
Contributor

colhom commented Mar 22, 2016

The goal is to offer a "production ready" solution for provisioning a CoreOS Kubernetes cluster. These are the major functionality blockers that I can think of.

@pieterlange

All good points. Very happy to see you working on making it easier to deploy to existing VPCs!

I'm currently using https://github.com/MonsantoCo/etcd-aws-cluster/ to bootstrap a dedicated etcd cluster (discovery happens by specifying the ASG for the etcd cluster and assigning the appropriate IAM describe roles).
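
A minimal sketch of that ASG-based discovery using aws-sdk-go, for reference; the region, ASG name, and output format are illustrative assumptions, not how etcd-aws-cluster itself does it:

// etcd_discovery.go - resolve the members of a dedicated etcd ASG to their
// private IPs, assuming the instance role allows
// autoscaling:DescribeAutoScalingGroups and ec2:DescribeInstances.
package main

import (
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/autoscaling"
    "github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
    sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-west-2")}))
    asgSvc := autoscaling.New(sess)
    ec2Svc := ec2.New(sess)

    // Find the instances currently registered in the etcd autoscaling group.
    out, err := asgSvc.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
        AutoScalingGroupNames: []*string{aws.String("etcd-cluster")}, // placeholder ASG name
    })
    if err != nil || len(out.AutoScalingGroups) == 0 {
        log.Fatalf("describe ASG: %v", err)
    }

    var ids []*string
    for _, inst := range out.AutoScalingGroups[0].Instances {
        ids = append(ids, inst.InstanceId)
    }

    // Resolve each instance ID to its private IP to build an initial-cluster list.
    res, err := ec2Svc.DescribeInstances(&ec2.DescribeInstancesInput{InstanceIds: ids})
    if err != nil {
        log.Fatalf("describe instances: %v", err)
    }
    for _, r := range res.Reservations {
        for _, i := range r.Instances {
            fmt.Printf("%s=https://%s:2380\n", aws.StringValue(i.InstanceId), aws.StringValue(i.PrivateIpAddress))
        }
    }
}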

I'm not too sure about automatically provisioning an AWS Elasticsearch cluster. The AWS-native cluster is stuck on a very old ES version. Maybe this will become a whole lot easier once Kubernetes EBS support matures a bit, and we could just host it in the provisioned kube cluster.

@bfallik
Contributor

bfallik commented Mar 23, 2016

Very eager for this work and happy to help if I can. I can't advocate for deploying a k8s+coreos cluster in AWS at work until I have a good answer for many of the items on this list, especially the upgrade path and high availability.

@colhom
Contributor Author

colhom commented Mar 24, 2016

@bfallik do you want to work on any of the bullet points in particular?

@bfallik
Contributor

bfallik commented Mar 24, 2016

@colhom nothing in particular, though I suppose I'm most interested in the cluster upgrades and the ELB+ASG work.

@pieterlange

@colhom if you like the discovery method used for etcd, I think I can help with that.

@colhom
Contributor Author

colhom commented Mar 25, 2016

@pieterlange putting etcd in an autoscaling group worries me as of now. The MonsantoCo script seems kind of rickety: for example, it does not support scaling down the cluster, as far as I can tell.

@drewblas

drewblas commented Apr 5, 2016

This list is fantastic. It represents exactly what we need in order to consider Kubernetes+CoreOS production ready for our use. I can't wait to see these executed!

@krancour

krancour commented Apr 8, 2016

This is just what I have always wanted!

cgag added a commit to cgag/coreos-kubernetes that referenced this issue Apr 12, 2016
Currently it's on the user to create a record, via Route53 or otherwise,
in order to make the controller IP accessible via externalDNSName.  This
commit adds an option to automatically create a Route53 record in a given
hosted zone.

Related to: coreos#340, coreos#257
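
For illustration only, roughly what upserting such a record looks like through the aws-sdk-go Route 53 API; the hosted zone ID, record name, and controller IP are placeholders, and the commit itself may wire the record into the CloudFormation template rather than call the API directly:

// route53_record.go - upsert an A record for externalDNSName pointing at the
// controller. All identifiers below are placeholders.
package main

import (
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/route53"
)

func main() {
    sess := session.Must(session.NewSession())
    svc := route53.New(sess)

    _, err := svc.ChangeResourceRecordSets(&route53.ChangeResourceRecordSetsInput{
        HostedZoneId: aws.String("Z1EXAMPLE"), // placeholder hosted zone ID
        ChangeBatch: &route53.ChangeBatch{
            Changes: []*route53.Change{{
                Action: aws.String("UPSERT"),
                ResourceRecordSet: &route53.ResourceRecordSet{
                    Name: aws.String("kubernetes.example.com"), // externalDNSName
                    Type: aws.String("A"),
                    TTL:  aws.Int64(300),
                    ResourceRecords: []*route53.ResourceRecord{
                        {Value: aws.String("203.0.113.10")}, // controller IP
                    },
                },
            }},
        },
    })
    if err != nil {
        log.Fatalf("route53 upsert: %v", err)
    }
}
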
@pieterlange

@colhom I suggest adding #420 to the list as well, as even the deployment guidelines point it out as a production deficiency.

You are right about having etcd in an autoscaling group, of course. I'm running a dedicated etcd cluster across all available zones, which feels a little bit safer but is still a hazard, as I'm depending on a majority of the etcd cluster to stay up and reachable. Not sure what the answer is here.

I'm spending some time on HA controllers myself; I'll try to make whatever adjustments I make mergeable.

@mumoshu
Contributor

mumoshu commented Apr 26, 2016

@colhom Hi, thanks for maintaining this project :)

Enable decommissioning of kubelets when instances are rotated out of the ASG
Automatically remove nodes when instances are rotated out of ASG

Would you mind sharing what you think the requirements are for this, and how you'd approach it?
Does running something like kubectl drain against a node when it is scheduled to be detached/terminated from an ASG make sense to you?

If so, I guess I can contribute that (Auto Scaling lifecycle hooks + SQS + a tiny golang app container which runs kubectl drain or similar on each new SQS message).
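
A rough sketch of that idea, assuming an SQS queue subscribed to the ASG's termination lifecycle hook; the queue URL, node-name mapping, and kubectl flags below are illustrative assumptions, not a definitive implementation:

// drain_on_terminate.go - poll an SQS queue wired to the ASG's
// autoscaling:EC2_INSTANCE_TERMINATING lifecycle hook, drain the affected
// node, then complete the lifecycle action so termination can proceed.
package main

import (
    "encoding/json"
    "log"
    "os/exec"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/autoscaling"
    "github.com/aws/aws-sdk-go/service/sqs"
)

type lifecycleMsg struct {
    LifecycleTransition  string
    EC2InstanceId        string
    AutoScalingGroupName string
    LifecycleHookName    string
    LifecycleActionToken string
}

func main() {
    queueURL := "https://sqs.us-west-2.amazonaws.com/123456789012/asg-lifecycle" // placeholder
    sess := session.Must(session.NewSession())
    sqsSvc := sqs.New(sess)
    asgSvc := autoscaling.New(sess)

    for {
        out, err := sqsSvc.ReceiveMessage(&sqs.ReceiveMessageInput{
            QueueUrl:        aws.String(queueURL),
            WaitTimeSeconds: aws.Int64(20), // long poll
        })
        if err != nil {
            log.Printf("receive: %v", err)
            continue
        }
        for _, m := range out.Messages {
            var msg lifecycleMsg
            if json.Unmarshal([]byte(aws.StringValue(m.Body)), &msg) != nil ||
                msg.LifecycleTransition != "autoscaling:EC2_INSTANCE_TERMINATING" {
                continue
            }
            // Assumes the node can be addressed by instance ID; in practice you
            // would map it to the node's registered name (its private DNS name).
            node := msg.EC2InstanceId
            if err := exec.Command("kubectl", "drain", node, "--force", "--ignore-daemonsets").Run(); err != nil {
                log.Printf("drain %s: %v", node, err)
            }
            // Tell the ASG it may proceed with termination.
            asgSvc.CompleteLifecycleAction(&autoscaling.CompleteLifecycleActionInput{
                AutoScalingGroupName:  aws.String(msg.AutoScalingGroupName),
                LifecycleHookName:     aws.String(msg.LifecycleHookName),
                LifecycleActionToken:  aws.String(msg.LifecycleActionToken),
                LifecycleActionResult: aws.String("CONTINUE"),
            })
            sqsSvc.DeleteMessage(&sqs.DeleteMessageInput{
                QueueUrl:      aws.String(queueURL),
                ReceiptHandle: m.ReceiptHandle,
            })
        }
    }
}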

@colhom
Contributor Author

colhom commented Apr 26, 2016

I was thinking that nodes would trigger kubectl drain via a systemd service on shutdown.

@mumoshu
Contributor

mumoshu commented Apr 29, 2016

@colhom Sounds much better than my idea in regard to simplicity!

I'd like to contribute to that, but I'm not sure what to include in the kubeconfig used by the worker's `kubectl`.
Would you mind sharing your ideas?

(Also, we may want to create a separate issue for this)

@mumoshu
Contributor

mumoshu commented May 3, 2016

@colhom

Set up controller and worker AutoscalingGroups to recover from ec2 instance failures

I believe this is solved for workers by PR #439.
I'd appreciate it if you could update your original comment on this issue to reference that. (Just to track how things are going; I'm a huge fan of this project and am wishing for this issue to be solved 😄)

@colhom
Contributor Author

colhom commented May 3, 2016

@mumoshu that excerpt is referring to the fact that our controllers are not in an autoscaling group, so if the instance is killed, the control plane will be down pending human intervention. I do believe the worker pool ASG should recover from an instance failure on its own, though. Will edit that line to just reference the controller.

@mumoshu
Contributor

mumoshu commented May 5, 2016

@colhom I have just submitted #465 for #340 (comment).
I'd appreciate it if you could look into it 🙇

@mumoshu
Contributor

mumoshu commented May 5, 2016

FYI, regarding this:

Dedicated etcd cluster in an ASG, behind an ELB

In addition to the MonsantoCo/etcd-aws-cluster that @pieterlange mentioned, I have recently looked into crewjam/etcd-aws along with its blog post. It seems to be great work.

@enxebre
Member

enxebre commented Jun 7, 2016

We've been working on https://github.com/Capgemini/kubeform, which is based on Terraform, Ansible and CoreOS and is in line with some of the thinking here. Happy to help contribute to something here.

@harsha-y
Contributor

When multi-AZ support was announced, combined with #346 being checked off in the list above, we got excited and tried to deploy a kube-aws cluster without actually verifying that existing subnets are supported. Obviously we ran into issues. What we ended up doing was taking the CloudFormation template output after running kube-aws init / kube-aws render, editing the template to include our existing subnets, and launching the cluster with aws cloudformation. After a little bit of hacking we did end up with a working cluster in our existing subnets, but this solution seems brittle.

Here are a few things IMHO that would make the cluster launch more "productionized":

  • Support for existing subnets
  • Private subnets/instances with no public ip address
    • Leave AWS-managed NAT gateway attachment/routing to the end users?
    • This was partially accomplished in the cluster we launched but the controller still assigned itself an EIP in a private subnet
  • Clear upgrade paths
  • Docker pull through cache registry on the master
  • More granularity around the different addons
    • Would most definitely include SkyDNS and Kubernetes Dashboard by default but...
    • Folks might have alternate solutions around Calico, Heapster(Prometheus/Sysdig) and Fluentd-ELK(Fluentd-Graylog2) - these should be optional

Maybe this list should be split into must-haves vs. nice-to-haves? Or better, layers of CloudFormation templates? (I might be over-simplifying things here, but you get the idea.)
Something along the lines of:

  • kube-aws up cluster
  • kube-aws up calico
  • kube-aws up logging

When we initially launched our k8s clusters last year, there were very few solutions that solved some of the requirements we had. So we went ahead and wrote a lengthy but working CloudFormation template, and that solved most of our requirements. But we ended up with a template that was hard to maintain and a cluster that needed to be replaced whenever we wanted to upgrade/patch - which doesn't really work well when you're running production workloads unless you have some serious orchestration around the cluster. The current toolset (kargo/kube-aws) around CoreOS/Kubernetes still leaves much to be desired.

@igalbk

igalbk commented Jul 12, 2016

@harsha-y Thank you for this info.
We are trying to do the same and modify the template output so that the CloudFormation stack uses an existing subnet.
Can you please share exactly how to do it and what exactly to modify?

@sdouche

sdouche commented Jul 12, 2016

Hi @igalbk
Today I successfully installed CoreOS-Kubernetes on an existing subnet (only one, it's a POC) with existing IAM roles. What I did:

  • remove the creation of Subnet0 from the CF template
  • remove the creation of IAM* from the CF template
  • remove the creation of EIPController from the CF template
  • remove the creation of RouteTableAssociation from the CF template
  • add subnet and IAMInstanceProfile* as parameters
  • use the private IP of the controller for the DNS record

And in cluster.go (quick & dirty, I'm sorry):

-       if err := c.ValidateExistingVPC(*existingVPC.CidrBlock, subnetCIDRS); err != nil {
-               return fmt.Errorf("error validating existing VPC: %v", err)
-       }
+       //if err := c.ValidateExistingVPC(*existingVPC.CidrBlock, subnetCIDRS); err != nil {
+       //      return fmt.Errorf("error validating existing VPC: %v", err)
+       //}

        return nil
 }
@@ -266,7 +266,7 @@ func (c *Cluster) Info() (*Info, error) {
        cfSvc := cloudformation.New(c.session)
        resp, err := cfSvc.DescribeStackResource(
                &cloudformation.DescribeStackResourceInput{
-                       LogicalResourceId: aws.String("EIPController"),
+                       LogicalResourceId: aws.String("InstanceController"),
                        StackName:         aws.String(c.ClusterName),
                },
        )

It's good for the stack creation, but I have an error with the kubernetes-wrapper (need to investigate).

@sdouche

sdouche commented Jul 12, 2016

@igalbk I can send you patches if you want. Thanks for your support.

@igalbk

igalbk commented Jul 13, 2016

Thank you @sdouche
It would be great if you could send me or share more details.
I can successfully create a CloudFormation stack after modifying the `kube-aws up --export` output in the same VPC+subnet, but the cluster itself is not functional because flannel.service is not starting. Maybe I did something wrong, but I don't know what. BTW, we don't need to use an existing IAM role.
Is it a must? And why did you have to remove the EIPController?

@sdouche

sdouche commented Jul 13, 2016

  1. I can't create roles (they are created and validated by ops). Thanks to AWS for allowing roles to be created with more power than the creator of the role.
  2. Elastic IPs work only in public subnets. You can't use them for private clusters.
  3. Have you set a /16 for the podCIDR? Flannel allocates a /24 to each node, so a /16 covers up to 256 nodes. Hard to say what's wrong without logs. I think it's not a CF issue, more a k8s one.

@sdouche

sdouche commented Jul 20, 2016

Update: I can't create an ELB (see kubernetes/kubernetes#29298) with an existing subnet.

EDIT: You must add the service.beta.kubernetes.io/aws-load-balancer-internal annotation to create an internal ELB.

@detiber

detiber commented Jul 21, 2016

@dgoodwin not sure if you've seen this.

@colhom
Contributor Author

colhom commented Aug 2, 2016

Update here on work that is closing in on being ready for review:

@colhom colhom mentioned this issue Aug 9, 2016
@colhom
Contributor Author

colhom commented Aug 9, 2016

Cluster upgrade PR is in #608

@AlmogBaku

Heapster now fully supports the Elasticsearch sink (including hosted ES clusters on AWS):
kubernetes-retired/heapster#733 MERGED
kubernetes-retired/heapster#1260 MERGED
kubernetes-retired/heapster#1276 MERGED

Documentation
https://github.com/kubernetes/heapster/blob/master/docs/sink-configuration.md#aws-integration

@AlmogBaku

Fluentd integration depends on #650 plus making an image that is preconfigured for the cluster (we can contribute that; cc @Thermi).

@AlmogBaku

kubernetes-retired/heapster#1313: this PR will fix the ES sink compatibility. However, since AWS doesn't allow "scripted fields", it's still impossible to calculate resource usage as a percentage of capacity.

@aaronlevy
Contributor

The kube-aws tool has been moved to its own top-level repository at https://github.com/coreos/kube-aws

If this issue still needs to be addressed, please re-open the issue under the new repository.

@drewblas

No worries, nobody cares about production quality deploys. That'd be ridiculous...

@colhom
Contributor Author

colhom commented Nov 17, 2016

@drewblas the project has simply moved to a new repo, where significant progress has been made in the last few weeks on merging functionality towards these goals.

@aaronlevy
Contributor

aaronlevy commented Nov 17, 2016

@drewblas Sorry, I was copy/pasting a generic notice. As @colhom said -- a lot of the work towards these goals is being merged there.
