RKE2 Cluster with Autoscaling & API Server HA


Reason for Being

This Terraform plan creates a multi-node RKE2 cluster in vSphere, with machine pool autoscaling via the upstream K8s Cluster Autoscaler and API Server HA via a kube-vip DaemonSet manifest. Both are common asks, and together they bring the cluster some "cloud-provider-like" behaviors in the comfort of our own datacenter.

Environment Prerequisites

  • Functional Rancher Management Server with vSphere Cloud Credential

  • vCenter >= 7.x and credentials with appropriate permissions (see https://github.com/rancher/barn/blob/main/Walkthroughs/vSphere/Permissions/README.md)

  • Virtual Machine Hardware Compatibility at Version >= 15

  • Create the following in the files/ directory (a sketch of how the plan likely consumes these files follows this list):

    NAME                   PURPOSE
    .rancher-api-url       URL for the Rancher Management Server
    .rancher-bearer-token  API bearer token generated via the Rancher UI
    .ssh-public-key        SSH public key for the additional OS user
  • Since this plan leverages BGP for K8s Control Plane load balancing, a router capable of BGP is required. For lab/dev/test use, a small single-CPU Linux VM running the BIRD v2 daemon (sudo apt install bird2) with the following config would suffice:

protocol bgp kubevip {
        description "kube-vip for Cluster CP";
        local <router eth interface IP address> as 64513;
        neighbor range <network prefix of Control Plane subnet> as <AS value configured in kube-vip manifest>;
        graceful restart;
        ipv4 {
                import filter {accept;};
                export filter {reject;};
        };
        dynamic name "kubeVIP";
}
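
Returning to the files/ directory above: the sketch below shows one common way a plan like this consumes those files, reading them with Terraform's file() function to configure the rancher2 provider. The filenames match the table, but the provider wiring is an assumption and may differ from this repository's actual code.

# Illustrative sketch only - the plan's actual provider wiring may differ.
terraform {
  required_providers {
    rancher2 = {
      source = "rancher/rancher2"
    }
  }
}

provider "rancher2" {
  api_url   = trimspace(file("${path.module}/files/.rancher-api-url"))
  token_key = trimspace(file("${path.module}/files/.rancher-bearer-token"))
}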

Caveats

The cluster_autoscaler.tf plan includes the following values in ExtraArgs:

skip-nodes-with-local-storage: false
skip-nodes-with-system-pods: false

Those exist to make the autoscaler logic easier to demonstrate. Use them with caution in production or any other environment you care about, as they can lead to data loss or workload instability.
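
As an illustration of where those flags end up, the sketch below passes them as extraArgs values to the upstream cluster-autoscaler Helm chart through the Terraform helm provider. The resource and chart wiring shown here are assumptions; cluster_autoscaler.tf may install the chart differently.

# Sketch only - assumes the helm provider is pointed at the target cluster.
resource "helm_release" "cluster_autoscaler" {
  name       = "cluster-autoscaler"
  repository = "https://kubernetes.github.io/autoscaler"
  chart      = "cluster-autoscaler"
  namespace  = "kube-system"

  # Same values as the ExtraArgs shown above.
  set {
    name  = "extraArgs.skip-nodes-with-local-storage"
    value = "false"
  }
  set {
    name  = "extraArgs.skip-nodes-with-system-pods"
    value = "false"
  }
}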


The lifecycle block in cluster.tf is somewhat fragile:

lifecycle {
  ignore_changes = [
    rke_config[0].machine_pools[1].quantity
  ]
}

Terraform indexes the machine_pools list starting at [0], and the pools are ordered lexicographically by name - the "ctl_plane" pool is machine_pools[0] and the "worker" pool is machine_pools[1] for no other reason than "worker" sorts after "ctl_plane" in dictionary order. Because of this, if the "ctl_plane" pool were renamed to something like "x_ctl_plane", the wrong machine pool would occupy the machine_pools[1] index, causing undesired behavior. To prevent this, basic variable validation forces MachinePool names to begin with ctl-plane and worker; otherwise the error below is thrown:

Err: MachinePool names must begin with 'ctl-plane' for Control Plane Node Pool & 'worker' for Autoscaling Worker Node Pool.
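
A minimal sketch of what that validation can look like; the variable name and structure here are hypothetical and may not match the plan's actual variables:

# Illustrative only - variable name and shape are assumptions.
variable "machine_pools" {
  type = map(object({ quantity = number }))

  validation {
    condition = alltrue([
      for name, _ in var.machine_pools :
      can(regex("^(ctl-plane|worker)", name))
    ])
    error_message = "MachinePool names must begin with 'ctl-plane' for Control Plane Node Pool & 'worker' for Autoscaling Worker Node Pool."
  }
}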

To Run

terraform apply

Node pool min/max values are annotations that can be adjusted through the rancher_env.autoscale_annotations variable. Changing these values on a live cluster will not trigger a redeploy. Any node in the autoscaled pool that is selected for scale-down and/or deletion will have a Taint applied that is visible in the Rancher UI:

[Screenshot: Rancher UI showing the Taint applied to an autoscaled worker node selected for scale-down]
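
For illustration, the variable could carry the machine-pool min/max annotation keys documented by the Cluster Autoscaler's Rancher provider. The object shape and default values below are assumptions, not the plan's actual definition:

# Hypothetical shape of the variable - defaults are examples only.
variable "rancher_env" {
  type = object({
    autoscale_annotations = map(string)
  })

  default = {
    autoscale_annotations = {
      "cluster.provisioning.cattle.io/autoscaler-min-size" = "1"
      "cluster.provisioning.cattle.io/autoscaler-max-size" = "3"
    }
  }
}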

Tested Versions

SOFTWARE                    VERSION        DOCS
K8s Cluster Autoscaler      1.26.2         https://github.com/kubernetes/autoscaler/tree/master/charts/cluster-autoscaler#readme
kube-vip                    0.6.2          https://kube-vip.io/docs/
Rancher Server              2.7.6          https://rancher.com/docs/rancher/v2.6/en/overview
Rancher Terraform Provider  3.1.1          https://registry.terraform.io/providers/rancher/rancher2/latest/docs
RKE2                        1.26.8+rke2r1  https://docs.rke2.io
Terraform                   1.4.6          https://www.terraform.io/docs
vSphere                     8.0.1.00300    https://docs.vmware.com/en/VMware-vSphere/index.html