feat: enable auto-healing, update to Debian 10 (#119)
This patch adds an auto-healing policy to automatically re-create the
vault cluster instance if the vault server stops.  One of the nodes in
the instance group is active as per [Vault HA][ha].  The other nodes are
passive and forward requests to the active node.  Two different health
checks are used because passive nodes return non-200 status codes by
default.

In addition, this patch:

 * Updates Vault to 1.6.0 by default
 * Updates the image to Debian 10 by default
 * Defaults to e2-standard-2 instance types, which are less expensive
   and more performant than n1-standard-1
 * Improves startup (and auto-heal recovery) time by starting the vault
   service as quickly as possible in the startup-script

[ha]: https://www.vaultproject.io/docs/concepts/ha.html
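The two health checks differ only in how they treat the status codes that `/v1/sys/health` returns. A minimal shell sketch of the code-to-state mapping (codes per the Vault docs linked above; the helper name is illustrative):

```shell
# Map /v1/sys/health HTTP status codes to node states. The query parameters
# used by the health checks (uninitcode=200, standbyok=true) remap some of
# these codes to 200 so the relevant check passes.
vault_health_meaning() {
  case "$1" in
    200) echo "active" ;;
    429) echo "standby" ;;
    472) echo "disaster-recovery secondary" ;;
    473) echo "performance standby" ;;
    501) echo "not initialized" ;;
    503) echo "sealed" ;;
    *)   echo "unknown" ;;
  esac
}

vault_health_meaning 429
```

Because `standbyok=true` makes a standby return 200, the autohealing check passes on passive nodes while the load-balancer check (which omits it) still keeps traffic off them.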
jeffmccune committed Nov 16, 2020
1 parent 03259d2 commit 1d0b5db
Showing 13 changed files with 302 additions and 71 deletions.
6 changes: 3 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
@@ -209,12 +209,12 @@ done
| vault\_allowed\_cidrs | List of CIDR blocks to allow access to the Vault nodes. Since the load balancer is a pass-through load balancer, this must also include all IPs from which you will access Vault. The default is unrestricted (any IP address can access Vault). It is recommended that you reduce this to a smaller list. | list(string) | `<list>` | no |
| vault\_args | Additional command line arguments passed to Vault server | string | `""` | no |
| vault\_ca\_cert\_filename | GCS object path within the vault_tls_bucket. This is the root CA certificate. | string | `"ca.crt"` | no |
| vault\_instance\_base\_image | Base operating system image in which to install Vault. This must be a Debian-based system at the moment due to how the metadata startup script runs. | string | `"debian-cloud/debian-9"` | no |
| vault\_instance\_base\_image | Base operating system image in which to install Vault. This must be a Debian-based system at the moment due to how the metadata startup script runs. | string | `"debian-cloud/debian-10"` | no |
| vault\_instance\_labels | Labels to apply to the Vault instances. | map(string) | `<map>` | no |
| vault\_instance\_metadata | Additional metadata to add to the Vault instances. | map(string) | `<map>` | no |
| vault\_instance\_tags | Additional tags to apply to the instances. Note 'allow-ssh' and 'allow-vault' will be present on all instances. | list(string) | `<list>` | no |
| vault\_log\_level | Log level to run Vault in. See the Vault documentation for valid values. | string | `"warn"` | no |
| vault\_machine\_type | Machine type to use for Vault instances. | string | `"n1-standard-1"` | no |
| vault\_machine\_type | Machine type to use for Vault instances. | string | `"e2-standard-2"` | no |
| vault\_max\_num\_servers | Maximum number of Vault server nodes to run at one time. The group will not autoscale beyond this number. | string | `"7"` | no |
| vault\_min\_num\_servers | Minimum number of Vault server nodes in the autoscaling group. The group will not have less than this number of nodes. | string | `"1"` | no |
| vault\_port | Numeric port on which to run and expose Vault. | string | `"8200"` | no |
@@ -227,7 +227,7 @@ done
| vault\_tls\_kms\_key\_project | Project ID where the KMS key is stored. By default, same as `project_id` | string | `""` | no |
| vault\_tls\_require\_and\_verify\_client\_cert | Always use client certificates. You may want to disable this if users will not be authenticating to Vault with client certificates. | string | `"false"` | no |
| vault\_ui\_enabled | Controls whether the Vault UI is enabled and accessible. | string | `"true"` | no |
| vault\_version | Version of vault to install. This version must be 1.0+ and must be published on the HashiCorp releases service. | string | `"1.1.3"` | no |
| vault\_version | Version of vault to install. This version must be 1.0+ and must be published on the HashiCorp releases service. | string | `"1.6.0"` | no |

## Outputs

9 changes: 6 additions & 3 deletions modules/cluster/README.md
@@ -32,6 +32,7 @@ module "vault_cluster" {
| Name | Description | Type | Default | Required |
|------|-------------|:----:|:-----:|:-----:|
| domain | The domain name that will be set in the api_addr. Load Balancer IP used by default | string | `""` | no |
| hc\_initial\_delay\_secs | The number of seconds that the managed instance group waits before it applies autohealing policies to new instances or recently recreated instances. | number | `"60"` | no |
| host\_project\_id | ID of the host project if using Shared VPC | string | `""` | no |
| http\_proxy | HTTP proxy for downloading agents and vault executable on startup. Only necessary if allow_public_egress is false. This is only used on the first startup of the Vault cluster and will NOT set the global HTTP_PROXY environment variable. i.e. If you configure Vault to manage credentials for other services, default HTTP routes will be taken. | string | `""` | no |
| ip\_address | The IP address to assign the forwarding rules to. | string | n/a | yes |
@@ -40,6 +41,7 @@ module "vault_cluster" {
| kms\_protection\_level | The protection level to use for the KMS crypto key. | string | `"software"` | no |
| load\_balancing\_scheme | Options are INTERNAL or EXTERNAL. If `EXTERNAL`, the forwarding rule will be of type EXTERNAL and a public IP will be created. If `INTERNAL` the type will be INTERNAL and a random RFC 1918 private IP will be assigned | string | `"EXTERNAL"` | no |
| manage\_tls | Set to `false` if you'd like to manage and upload your own TLS files. See `Managing TLS` for more details | bool | `"true"` | no |
| min\_ready\_sec | Minimum number of seconds to wait before considering a new or restarted instance as updated. This value must be in the range [0, 3600]. | number | `"0"` | no |
| project\_id | ID of the project in which to create resources and add IAM bindings. | string | n/a | yes |
| region | Region in which to create resources. | string | `"us-east4"` | no |
| service\_account\_project\_additional\_iam\_roles | List of custom IAM roles to add to the project. | list(string) | `<list>` | no |
@@ -56,12 +58,12 @@ module "vault_cluster" {
| user\_startup\_script | Additional user-provided code injected after Vault is setup | string | `""` | no |
| vault\_args | Additional command line arguments passed to Vault server | string | `""` | no |
| vault\_ca\_cert\_filename | GCS object path within the vault_tls_bucket. This is the root CA certificate. | string | `"ca.crt"` | no |
| vault\_instance\_base\_image | Base operating system image in which to install Vault. This must be a Debian-based system at the moment due to how the metadata startup script runs. | string | `"debian-cloud/debian-9"` | no |
| vault\_instance\_base\_image | Base operating system image in which to install Vault. This must be a Debian-based system at the moment due to how the metadata startup script runs. | string | `"debian-cloud/debian-10"` | no |
| vault\_instance\_labels | Labels to apply to the Vault instances. | map(string) | `<map>` | no |
| vault\_instance\_metadata | Additional metadata to add to the Vault instances. | map(string) | `<map>` | no |
| vault\_instance\_tags | Additional tags to apply to the instances. Note 'allow-ssh' and 'allow-vault' will be present on all instances. | list(string) | `<list>` | no |
| vault\_log\_level | Log level to run Vault in. See the Vault documentation for valid values. | string | `"warn"` | no |
| vault\_machine\_type | Machine type to use for Vault instances. | string | `"n1-standard-1"` | no |
| vault\_machine\_type | Machine type to use for Vault instances. | string | `"e2-standard-2"` | no |
| vault\_max\_num\_servers | Maximum number of Vault server nodes to run at one time. The group will not autoscale beyond this number. | string | `"7"` | no |
| vault\_min\_num\_servers | Minimum number of Vault server nodes in the autoscaling group. The group will not have less than this number of nodes. | string | `"1"` | no |
| vault\_port | Numeric port on which to run and expose Vault. | string | `"8200"` | no |
@@ -76,7 +78,8 @@ module "vault_cluster" {
| vault\_tls\_kms\_key\_project | Project ID where the KMS key is stored. By default, same as `project_id` | string | `""` | no |
| vault\_tls\_require\_and\_verify\_client\_cert | Always use client certificates. You may want to disable this if users will not be authenticating to Vault with client certificates. | string | `"false"` | no |
| vault\_ui\_enabled | Controls whether the Vault UI is enabled and accessible. | string | `"true"` | no |
| vault\_version | Version of vault to install. This version must be 1.0+ and must be published on the HashiCorp releases service. | string | `"1.1.3"` | no |
| vault\_version | Version of vault to install. This version must be 1.0+ and must be published on the HashiCorp releases service. | string | `"1.6.0"` | no |
| zones | The zones to distribute instances across. If empty, all zones in the region are used, e.g. ['us-west1-a', 'us-west1-b', 'us-west1-c']. | list(string) | `<list>` | no |
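Taken together, the new inputs can be wired up as in this minimal sketch (the module source path, address resource, and values are illustrative; other required inputs are omitted):

```hcl
module "vault_cluster" {
  source = "./modules/cluster" # assumption: local checkout path

  project_id = var.project_id
  ip_address = google_compute_address.vault.address # hypothetical address resource
  # ... other required inputs omitted ...

  # Inputs added by this patch:
  hc_initial_delay_secs = 120
  min_ready_sec         = 30
  zones                 = ["us-west1-a", "us-west1-b", "us-west1-c"]
}
```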

## Outputs

44 changes: 43 additions & 1 deletion modules/cluster/main.tf
@@ -28,6 +28,19 @@ locals {
api_addr = var.domain != "" ? "https://${var.domain}:${var.vault_port}" : "https://${local.lb_ip}:${var.vault_port}"
host_project = var.host_project_id != "" ? var.host_project_id : var.project_id
lb_ip = local.use_external_lb ? google_compute_forwarding_rule.external[0].ip_address : var.ip_address
# LB and Autohealing health checks have different behavior. The load
# balancer shouldn't route traffic to a secondary vault instance, but it
# should consider the instance healthy for autohealing purposes.
# See: https://www.vaultproject.io/api-docs/system/health
hc_workload_request_path = "/v1/sys/health?uninitcode=200"
hc_autoheal_request_path = "/v1/sys/health?uninitcode=200&standbyok=true"
# Default to all zones in the region unless zones were provided.
zones = length(var.zones) > 0 ? var.zones : data.google_compute_zones.available.names
}

data "google_compute_zones" "available" {
project = var.project_id
region = var.region
}

resource "google_compute_instance_template" "vault" {
@@ -89,7 +102,7 @@ resource "google_compute_health_check" "vault_internal" {

https_health_check {
port = var.vault_port
request_path = "/v1/sys/health?uninitcode=200"
request_path = local.hc_workload_request_path
}
}

@@ -140,6 +153,7 @@ resource "google_compute_http_health_check" "vault" {
healthy_threshold = 2
unhealthy_threshold = 2
port = var.vault_proxy_port
request_path = local.hc_workload_request_path
}


@@ -181,6 +195,18 @@ resource "google_compute_region_instance_group_manager" "vault" {
base_instance_name = "vault-${var.region}"
wait_for_instances = false

auto_healing_policies {
health_check = google_compute_health_check.autoheal.id
initial_delay_sec = var.hc_initial_delay_secs
}

update_policy {
type = "OPPORTUNISTIC"
minimal_action = "REPLACE"
max_unavailable_fixed = length(local.zones)
min_ready_sec = var.min_ready_sec
}

target_pools = local.use_external_lb ? [google_compute_target_pool.vault[0].self_link] : []

named_port {
@@ -212,3 +238,19 @@ resource "google_compute_region_autoscaler" "vault" {
}

}

# Auto-healing
resource "google_compute_health_check" "autoheal" {
project = var.project_id
name = "vault-health-autoheal"

check_interval_sec = 10
timeout_sec = 5
healthy_threshold = 1
unhealthy_threshold = 2

https_health_check {
port = var.vault_port
request_path = local.hc_autoheal_request_path
}
}
1 change: 1 addition & 0 deletions modules/cluster/startup.tf
@@ -52,6 +52,7 @@ data "template_file" "vault-config" {
storage_bucket = var.vault_storage_bucket
vault_log_level = var.vault_log_level
vault_port = var.vault_port
vault_proxy_port = var.vault_proxy_port
vault_tls_disable_client_certs = var.vault_tls_disable_client_certs
vault_tls_require_and_verify_client_cert = var.vault_tls_require_and_verify_client_cert
vault_ui_enabled = var.vault_ui_enabled
16 changes: 12 additions & 4 deletions modules/cluster/templates/config.hcl.tpl
@@ -1,6 +1,7 @@
# Run Vault in HA mode. Even if there's only one Vault node, it doesn't hurt to
# have this set.
api_addr = "${api_addr}"
api_addr = "${api_addr}"
# LOCAL_IP is replaced with the eth0 IP address by the startup script.
cluster_addr = "https://LOCAL_IP:8201"

# Set debugging level
@@ -32,7 +33,14 @@ listener "tcp" {
tls_disable = 1
}

# Create an mTLS listener on the load balancer
# Create non-TLS listener for the HTTP legacy health checks. Make sure the VPC
# firewall doesn't allow traffic to this port except from the probe IP range.
listener "tcp" {
address = "${lb_ip}:${vault_proxy_port}"
tls_disable = 1
}

# Create an mTLS listener on the load balancer address
listener "tcp" {
address = "${lb_ip}:${vault_port}"
tls_cert_file = "/etc/vault.d/tls/vault.crt"
@@ -44,8 +52,8 @@ listener "tcp" {
}

# Create an mTLS listener locally. Clients shouldn't talk to Vault directly,
# but not all clients are well-behaved. This is also needed so the nodes can
# communicate with eachother.
# but not all clients are well-behaved. This is also needed so the cluster
# nodes can communicate with each other.
listener "tcp" {
address = "LOCAL_IP:${vault_port}"
tls_cert_file = "/etc/vault.d/tls/vault.crt"
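The comment on the new plaintext listener advises firewalling the proxy port to the probe IP range; one way to express that (a sketch, not part of this commit — the resource name and network are assumptions):

```hcl
resource "google_compute_firewall" "vault_hc_proxy" {
  name    = "vault-allow-hc-proxy" # hypothetical name
  project = var.project_id
  network = "default" # assumption: replace with your VPC

  allow {
    protocol = "tcp"
    ports    = [var.vault_proxy_port]
  }

  # Google Cloud legacy (network load balancer) health-check probe ranges.
  source_ranges = ["35.191.0.0/16", "209.85.152.0/22", "209.85.204.0/22"]
}
```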
84 changes: 49 additions & 35 deletions modules/cluster/templates/startup.sh.tpl
@@ -17,24 +17,23 @@ if [ ! -z '${custom_http_proxy}' ]; then
export https_proxy=$http_proxy
fi

# Get Vault up and running as quickly as possible to get the auto-heal health
# check passing. This results in faster recovery and faster rolling upgrades.

# Deps
export DEBIAN_FRONTEND=noninteractive
apt-get update -yqq
apt-get upgrade -yqq
apt-get install -yqq jq libcap2-bin logrotate unzip

# Install Stackdriver for logging and monitoring
curl -sSfL https://dl.google.com/cloudagents/install-logging-agent.sh | bash
curl -sSfL https://dl.google.com/cloudagents/install-monitoring-agent.sh | bash

# Download and install Vault
cd /tmp && \
curl -sLfO "https://releases.hashicorp.com/vault/${vault_version}/vault_${vault_version}_linux_amd64.zip" && \
unzip "vault_${vault_version}_linux_amd64.zip" && \
mv vault /usr/local/bin/vault && \
rm "vault_${vault_version}_linux_amd64.zip"
curl -sLfo /tmp/vault.zip "https://releases.hashicorp.com/vault/${vault_version}/vault_${vault_version}_linux_amd64.zip"
# Unzip without having to apt install unzip
(echo "import sys"; echo "import zipfile"; echo "with zipfile.ZipFile(sys.argv[1]) as z:"; echo ' z.extractall("/tmp")') | python3 - /tmp/vault.zip
install -o0 -g0 -m0755 -D /tmp/vault /usr/local/bin/vault
rm /tmp/vault.zip /tmp/vault

# Give Vault the ability to run mlock as non-root
if ! [[ -x /sbin/setcap ]]; then
apt install -qq -y libcap2-bin
fi
/sbin/setcap cap_ipc_lock=+ep /usr/local/bin/vault

# Add Vault user
@@ -83,7 +82,8 @@ touch /var/log/vault/{audit,server}.log
chmod 0640 /var/log/vault/{audit,server}.log
chown -R vault:adm /var/log/vault

# Add the TLS ca.crt to the trusted store so plugins dont error with TLS handshakes
# Add the TLS ca.crt to the trusted store so plugins don't error with TLS
# handshakes
cp /etc/vault.d/tls/ca.crt /usr/local/share/ca-certificates/
update-ca-certificates

Expand All @@ -94,6 +94,8 @@ Description="HashiCorp Vault"
Documentation=https://www.vaultproject.io/docs/
Requires=network-online.target
After=network-online.target
# Stop after the shutdown script stops.
Before=google-shutdown-scripts.service
ConditionFileNotEmpty=/etc/vault.d/config.hcl
[Service]
@@ -125,6 +127,9 @@ EOF
chmod 0644 /etc/systemd/system/vault.service
systemctl daemon-reload
systemctl enable vault
systemctl start vault

## AT THIS POINT VAULT HEALTH CHECKS SHOULD START PASSING

# Prevent core dumps - from all attack vectors
cat <<"EOF" > /etc/sysctl.d/50-coredump.conf
@@ -165,21 +170,6 @@ EOF
chmod 644 /etc/profile.d/vault.sh
source /etc/profile.d/vault.sh

if [ ${internal_lb} != true ]; then
# Add health-check proxy because target pools don't support HTTPS
apt-get install -yqq nginx

cat <<EOF > /etc/nginx/sites-available/default
server {
listen ${vault_proxy_port};
location / {
proxy_pass $VAULT_ADDR/v1/sys/health?uninitcode=200;
}
}
EOF
systemctl enable nginx
systemctl restart nginx
fi
# Pull Vault data from syslog into a file for fluentd
cat <<"EOF" > /etc/rsyslog.d/vault.conf
#
@@ -196,8 +186,18 @@ if ( $programname == "vault" ) then {
EOF
systemctl restart rsyslog

# Install Stackdriver for logging and monitoring
# Logging Agent: https://cloud.google.com/logging/docs/agent/installation
curl -sSfL https://dl.google.com/cloudagents/add-logging-agent-repo.sh | bash
# Monitoring Agent: https://cloud.google.com/monitoring/agent/installation
curl -sSfL https://dl.google.com/cloudagents/add-monitoring-agent-repo.sh | bash
apt-get update -yqq
# Install structured logs
apt-get install -yqq 'stackdriver-agent=6.*' 'google-fluentd=1.*' google-fluentd-catch-all-config-structured

# Start Stackdriver logging agent and setup the filesystem to be ready to
# receive audit logs
mkdir -p /etc/google-fluentd/config.d
cat <<"EOF" > /etc/google-fluentd/config.d/vaultproject.io.conf
<source>
@type tail
@@ -249,7 +249,11 @@ EOF
systemctl enable google-fluentd
systemctl restart google-fluentd

# Install logrotate
apt-get install -yqq logrotate

# Configure logrotate for Vault audit logs
mkdir -p /etc/logrotate.d
cat <<"EOF" > /etc/logrotate.d/vaultproject.io
/var/log/vault/*.log {
daily
@@ -260,25 +264,35 @@ cat <<"EOF" > /etc/logrotate.d/vaultproject.io
create 0640 vault adm
sharedscripts
postrotate
kill -HUP $(pidof vault)
/bin/systemctl reload vault 2> /dev/null
true
endscript
}
EOF

# Start Stackdriver monitoring
curl -sSfLo /opt/stackdriver/collectd/etc/collectd.d/statsd.conf https://raw.githubusercontent.com/Stackdriver/stackdriver-agent-service-configs/master/etc/collectd.d/statsd.conf
mkdir -p /opt/stackdriver/collectd/etc/collectd.d /etc/stackdriver/collectd.d
curl -sSfLo /etc/stackdriver/collectd.d/statsd.conf \
https://raw.githubusercontent.com/Stackdriver/stackdriver-agent-service-configs/master/etc/collectd.d/statsd.conf

# On GCE instances, swap is not enabled. The collectd swap plugin is enabled
# by default and generates frequent error messages trying to divide by zero
# when there is no swap. This perl command is an in-place edit to disable the
# swap plugin. The intent is to prevent the spurious log messages and avoid
# having to filter them in stackdriver.
#
# The error string related to this is:
# `wg_typed_value_create_from_value_t_inline failed for swap/percent/value`
# See https://issuetracker.google.com/issues/161054680#comment5
perl -i -pe 'BEGIN{undef $/;} s,LoadPlugin swap.*?/Plugin>,# swap plugin disabled by startup-script,smg' /etc/stackdriver/collectd.conf

systemctl enable stackdriver-agent
systemctl restart stackdriver-agent
service stackdriver-agent restart

#########################################
## user_startup_script ##
#########################################
${user_startup_script}


# Signal this script has run
touch ~/.startup-script-complete

# Reboot to pick up system-level changes
sudo reboot
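The startup script above avoids `apt-get install unzip` by feeding a tiny program to python3's zipfile module. The same trick in isolation, against a throwaway archive (self-contained; python3 ships with Debian 10):

```shell
# Build a small zip with stock python3, then extract it the same way the
# startup script extracts the Vault release archive.
workdir="$(mktemp -d)"
cd "$workdir"
printf 'hello\n' > payload.txt
python3 -c 'import zipfile; z = zipfile.ZipFile("a.zip", "w"); z.write("payload.txt"); z.close()'
rm payload.txt
python3 - a.zip <<'PY'
import sys, zipfile
with zipfile.ZipFile(sys.argv[1]) as z:
    z.extractall(".")
PY
cat payload.txt   # prints "hello"
```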
