Setting up a high-performance, highly available Kubernetes cluster for production use

xogoodnow/Kubernetes_Cluster

Production-Ready K8s Cluster

A K8s cluster implementation ready for heavy production load.

Components Used

| Name | Version | Purpose | Alternatives | Advantages |
|---|---|---|---|---|
| Terraform | 1.5.4 | Hardware provisioner, initial setup | Salt, Ansible | 1. Easy syntax 2. Sufficient community and documentation 3. Much better suited for hardware provisioning |
| Hetzner Provider | 1.42.1 | Deploying servers | Vultr, DigitalOcean | 1. Cheaper :) 2. Good community overseeing the provider |
| Ansible | 1.5.6 | Automating tasks | Salt | 1. No footprint on target hosts |
| Helm | 3.12.2 | Resource control | None that I know of :) | |
| S3cmd | 2.3.0 | Backup to S3 | Cyberduck, Rclone | 1. Easy to set up 2. Huge community and documentation 3. Python (easy to customize if needed) |
| K8s | 1.25.0-00 | Orchestrator | Docker Swarm, Nomad | 1. K8s to Swarm is like an ocean to a puddle 2. Nomad is quite great (needs R&D) |
| CRI-O | 1.24.6 | Container Runtime Interface | Containerd | 1. Very efficient 2. Supported by Kubernetes SIGs 3. Very light (but lacks some functionality) |
| Nginx Ingress Controller | Chart: 0.18.1 | Ingress controller | Traefik, API Gateway | 1. Much faster than Traefik 2. Proven in production 3. Good community and documentation 4. API Gateway is quite amazing (needs R&D) |
| OpenEBS cStor | 3.8.0 | Storage solution | OpenEBS (Jiva), Ceph FS, Rook FS, Longhorn | 1. Much less complex than Ceph in both setup and management 2. Good community and documentation 3. Longhorn is quite nice (needs R&D) |
| Ubuntu | 22.04 | Operating system | Debian, CentOS | 1. Bigger community 2. Faster releases than Debian 3. Bigger community than any other OS 4. Not cash grabbing like CentOS (yet :)) |
| Cert Manager | Chart: v1.12.3 | Certificate controller | None that I know of :) | |
| Fluentbit | Chart: 0.37.1 | Log collector/shipper | Logstash, Fluentd | 1. No separate components for shipper and collector 2. No extra dependency 3. Very efficient (faster than Fluentd) 4. Almost zero footprint (compared to the alternatives) 5. Much easier to set up and manage 6. Good number of useful plugins |
| Elasticsearch | Chart: 2.9.0 | Log analysis | Loki | 1. More rigorous indexing 2. Loki needs more R&D |
| Kube Prometheus Stack | | Monitoring | Prometheus + Grafana | 1. One single chart (so easier to manage and set up) 2. Preconfigured for K8s components |
| HAProxy | latest | Control plane load balancer | CDN | 1. Easier to set up 2. Custom health check rules 3. Since the cluster is initiated on a domain, a CDN can be used too |
| Calico | 3.26.1 | Container Network Interface | Flannel, Cilium, Canal | 1. Support for network policy 2. Multi-AZ support 3. Quite easy to set up 4. Great documentation and community 5. Efficient L3 network 6. Configurable BGP (BIRD agent) |
| Kibana | 8.9.1 | Log visualizer | Grafana, Datadog | 1. Free (compared to Datadog, which is awesome) 2. Customized specifically for Elasticsearch, so they are much more compatible 3. Easier to set up 4. Very lightweight |

Before you begin

Note: Each Ansible role has a general and a specific README file. You are strongly encouraged to read them before starting.

P.S. Start with the README file of the main setup playbook.

  • Create an API token on Hetzner
  • Create a server to act as the Terraform and Ansible provisioner (Ansible and Terraform must be installed on it)
  • Clone the project
  • In the modular_terraform folder, create a terraform.tfvars file
    • The file must contain the following variables
      • hcloud_token = "APIKEY"
      • image_name = "ubuntu-22.04"
      • server_type = "cpx31"
      • location = "hel1"
  • Run terraform init to create the required lock file
  • Before applying, run terraform plan to check that everything is in order
  • Run terraform apply
  • Go drink a cup of coffee and come back in 30 minutes or so (hopefully everything will be up and running by then (: )
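
The terraform.tfvars step above can be scripted. The following is a minimal sketch, not part of the project: the TFVARS_DIR and HCLOUD_TOKEN environment variables are hypothetical names introduced here, and "APIKEY" is only the placeholder from the list above, to be replaced with a real Hetzner token.

```shell
#!/bin/sh
# Sketch: generate terraform.tfvars with the four variables listed above.
# TFVARS_DIR and HCLOUD_TOKEN are hypothetical overrides, not project settings.
TFVARS_DIR="${TFVARS_DIR:-modular_terraform}"
HCLOUD_TOKEN="${HCLOUD_TOKEN:-APIKEY}"   # placeholder value, replace with a real token

mkdir -p "$TFVARS_DIR" || exit 1

# The file holds a secret, so create it readable by the current user only.
umask 077

cat > "$TFVARS_DIR/terraform.tfvars" <<EOF
hcloud_token = "${HCLOUD_TOKEN}"
image_name   = "ubuntu-22.04"
server_type  = "cpx31"
location     = "hel1"
EOF
```

Passing the token through the environment keeps it out of shell history; the values for image_name, server_type and location are the ones given in the list and can be changed to any image, server type and location that Hetzner offers.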

Known issues

  • When creating the SDS, CoreDNS and the webhook admission controller must be deleted; otherwise the CSPC is not applied correctly
  • No Alertmanager
  • HAProxy could be a single point of failure (if there is no backup, namely a CDN)
  • The audit policy is far too general, which results in significant overhead
  • Terraform is limited to Hetzner
  • Communication is over the public network (encrypted, but still vulnerable to zero-day exploits since it is observable)
    • Firewall policies minimize the observable scope
  • Since the upgrade procedure on K8s differs from version to version, only the upgrade from v1.25 to v1.26 is currently supported

Workflow

  • Run the following command for terraform to install dependencies and create the lock file
terraform init

  • Run the following command and check if there are any problems with terraform
terraform plan

  • Apply terraform modules and get started
terraform apply
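
The three Terraform commands above can be chained so that a failure in one step stops the run. This is a minimal sketch that differs slightly from running a plain terraform apply: it saves the plan to a file and applies exactly that plan, which also means apply does not ask for confirmation. The function name provision and the TERRAFORM and PLAN_FILE variables are hypothetical names introduced for this example.

```shell
#!/bin/sh
# Sketch: init, save a plan, then apply exactly the saved plan.
# TERRAFORM and PLAN_FILE are hypothetical overrides, not project settings.
provision() {
  tf="${TERRAFORM:-terraform}"
  plan_file="${PLAN_FILE:-tfplan}"

  # Install providers/modules and create the lock file.
  "$tf" init || return 1

  # Write the execution plan to a file so it can be reviewed.
  "$tf" plan -out="$plan_file" || return 1

  # Apply the saved plan; no interactive approval is requested.
  "$tf" apply "$plan_file"
}
```

Call provision from the folder that holds the Terraform configuration. If you want to review the plan before anything is created, run the first two commands by hand and apply afterwards.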

Note: Add the HAProxy IP as the A record for the control plane domain. Add the worker IP addresses to the records for Grafana, Prometheus and Kibana.
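
Once the records are created, it can help to confirm that they resolve to the expected addresses before moving on. The helper below is a small sketch under the assumption that dig is installed; the function name check_a_record and the DIG variable are hypothetical, and the hostname and IP address in the usage comment are documentation placeholders, not values from this project.

```shell
#!/bin/sh
# check_a_record HOST EXPECTED_IP
# Succeeds when one of the A records returned for HOST equals EXPECTED_IP.
# DIG is a hypothetical override for the resolver command (defaults to dig).
check_a_record() {
  host="$1"
  expected_ip="$2"
  # -x matches whole lines only, so 203.0.113.1 does not match 203.0.113.10.
  "${DIG:-dig}" +short "$host" A | grep -qxF -- "$expected_ip"
}

# Usage (placeholder values):
#   check_a_record cp.example.com 203.0.113.10 && echo "control plane record OK"
```

The same call can be repeated for the Grafana, Prometheus and Kibana hostnames against each worker IP.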

  • Check if Prometheus works

Note: Check that all metrics are exposed properly.

  • Check if Grafana works

Note: All dashboards are provisioned through a ConfigMap. To load a custom dashboard on startup, add it to the dashboard folder as a .json file; Grafana loads it automatically.

  • Check if Elasticsearch is green
kubectl get elasticsearch -n elastic-system
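
For scripted checks, the health value can be read directly instead of inspecting the table by eye. This is a minimal sketch that assumes the elasticsearch resource reports its state in .status.health, as the Elastic operator (ECK) does; the function names es_health and all_green and the KUBECTL variable are hypothetical names introduced here.

```shell
#!/bin/sh
# Print "name<TAB>health" for every Elasticsearch resource in elastic-system.
# Assumes the resource exposes .status.health; KUBECTL is a hypothetical override.
es_health() {
  "${KUBECTL:-kubectl}" get elasticsearch -n elastic-system \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.health}{"\n"}{end}'
}

# Read "name<TAB>health" lines on stdin.
# Succeeds only if there is at least one line and every health value is "green".
all_green() {
  awk -F '\t' '
    NF { n++; if ($2 != "green") bad = 1 }
    END { exit (n == 0 || bad) }
  '
}

# Usage:
#   es_health | all_green && echo "Elasticsearch is green"
```

Treating an empty result as a failure is a deliberate choice: if no Elasticsearch resource exists yet, the check should not pass.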

  • Check if Kibana works

  • Check if Fluentbit works

  • To clean up everything (including the nodes themselves)
terraform destroy
