
Enable auto-healing, update to Debian 10 #119

Merged

Conversation

Contributor

@jeffmccune jeffmccune commented Nov 7, 2020

This patch adds an auto-healing policy to automatically re-create the
vault cluster instance if the vault server stops. One of the nodes in
the instance group is active as per Vault HA
(https://www.vaultproject.io/docs/concepts/ha.html). The other nodes are
passive and forward requests to the active node. Two different health
checks are used because passive nodes return non-200 status codes by
default.

In addition, this patch:

  • Updates Vault to 1.6.0 by default
  • Updates the image to Debian 10 by default
  • Defaults to e2-standard-2 instance types, which are less expensive
    and more performant than n1-standard-1
  • Improves startup (and auto-heal recovery) time by starting the vault
    service as early as possible in the startup-script

@jeffmccune
Contributor Author

One thing to note on the health checks: the auto-heal health check behaves differently from the LB health check. The LB's health check works with HTTPS directly, so nginx is no longer needed. The load balancer marks the active Vault instance healthy and treats all other instances as unhealthy, because standby instances return HTTP 429 by default (ref). During a failover, whichever instance becomes active is then marked healthy.

For auto-healing, all nodes return 200 as long as Vault is running and unsealed (standbyok=true).
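As a rough illustration of how the two checks differ (the address below is hypothetical; the status-code behavior is Vault's documented default for /v1/sys/health):

```ruby
require "uri"

# Hypothetical load balancer address; substitute your own.
lb_ip = "10.0.0.2"

# LB health check: only the active node answers 200 here, so the load
# balancer steers traffic to it. Standby nodes answer 429 by default.
lb_check = URI("https://#{lb_ip}:8200/v1/sys/health")

# Auto-heal health check: standbyok=true makes standby nodes also
# answer 200, so instances are only recreated when Vault itself is
# stopped or sealed, not merely because they are standbys.
heal_check = URI("https://#{lb_ip}:8200/v1/sys/health?standbyok=true")

puts lb_check
puts heal_check
```

The key point is that the same endpoint serves both checks; only the query parameters differ.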

Backwards compatibility tested by applying this change to a v5.1.0 examples/shared_vpc_internal cluster.

@jeffmccune
Contributor Author

@onetwopunch Please ignore this for the time being; I'll update this PR once I get the tests passing. I can see test/fixtures/simple_external is failing.

@jeffmccune jeffmccune force-pushed the auto_healing branch 7 times, most recently from b0a4e88 to 9721783 on November 12, 2020 22:49
@jeffmccune jeffmccune changed the title Add auto-healing Enable auto-healing, update to Debian 10 Nov 12, 2020
@jeffmccune jeffmccune force-pushed the auto_healing branch 14 times, most recently from b740d8b to 3737f4e on November 14, 2020 00:00
@jeffmccune
Contributor Author

@onetwopunch Update on this: I've gotten it pretty close to ready, but there's still some additional testing I'd like to do to be sure. I'll pick it back up next week.

Here's the high level of what I changed:

  1. Added Cloud NAT to us-west1 to install vault
  2. Re-worked the shared vpc internal integration tests a bit, primarily to surface information in the build output. The instance console serial-output is logged for failures now.
  3. Added assertions against the health status as reported by the instance group manager and auto heal health check.

At this point I think a timing issue is all that remains; 180 seconds might be a tad too optimistic for everything to converge and become healthy:

Profile: shared_vpc_internal
Version: (not specified)
Target:  local://

  ×  Vault: Shared VPC Configuration (2 failed)
     ✔  ILB configuration should be internal
     ✔  ILB configuration exit_status should eq 0
     ✔  ILB configuration stderr should eq ""
     ✔  Managed instances instances should become stable in 180 seconds
     ✔  Managed instances instances should at least one instance in the group
     ✔  Managed instances instances should be running
     ×  Managed instances instances should be healthy
     expected {"currentAction"=>"VERIFYING", "instanceHealth"=>[{"detailedHealthState"=>"UNKNOWN"}], "instanceStatus"=>"RUNNING"} to have all instanceHealth detailedHealthState HEALTHY values.
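One way to express the wait is as a polling loop rather than a single fixed 180-second window. A hedged sketch in Ruby (matching the language of the inspec tests; `fetch_health` is a hypothetical helper standing in for the instance-group health query, not the module's actual test code):

```ruby
# Poll until every instance in the managed group reports HEALTHY, or a
# deadline passes. `fetch_health` is assumed to return an array of
# detailedHealthState strings, e.g. ["HEALTHY", "UNKNOWN"].
def wait_until_healthy(timeout: 300, interval: 15)
  deadline = Time.now + timeout
  loop do
    states = fetch_health
    return true if !states.empty? && states.all? { |s| s == "HEALTHY" }
    return false if Time.now > deadline
    sleep interval
  end
end
```

Instances pass through VERIFYING with an UNKNOWN health state before converging, so a loop like this tolerates slow convergence without inflating the happy-path runtime.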

@jeffmccune jeffmccune force-pushed the auto_healing branch 4 times, most recently from aeecada to 0b7a722 on November 16, 2020 17:07
@jeffmccune
Contributor Author

@onetwopunch This is ready for your review. The integration tests are stable now; the most recent three builds have been green.

EOF
systemctl enable nginx
systemctl restart nginx
fi
Contributor Author

Previously, nginx was used to redirect the google_compute_http_health_check, which always made a request for /.

Today, the google_compute_http_health_check can request a custom path, so nginx is no longer necessary.

variable "hc_initial_delay_secs" {
description = "The number of seconds that the managed instance group waits before it applies autohealing policies to new instances or recently recreated instances."
type = number
default = 60
Contributor Author

Typically takes ~45 seconds to hit all of the systemd targets.

uri = URI("https://#{lb_ip}:8200/v1/sys/health")
req = Net::HTTP::Get.new(uri.path)
uri = URI("https://#{lb_ip}:8200/v1/sys/health?uninitcode=200")
req = Net::HTTP::Get.new(uri)
Contributor Author

Necessary to preserve the query params.
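The distinction matters because Net::HTTP::Get.new drops the query string when it is given only uri.path; passing the full URI preserves it in the request line. A small sketch (the address is hypothetical):

```ruby
require "net/http"
require "uri"

uri = URI("https://198.51.100.1:8200/v1/sys/health?uninitcode=200")

# Passing only the path silently discards "?uninitcode=200".
req_path = Net::HTTP::Get.new(uri.path)

# Passing the full URI keeps the query string in the request target.
req_uri = Net::HTTP::Get.new(uri)

puts req_path.path  # "/v1/sys/health"
puts req_uri.path   # "/v1/sys/health?uninitcode=200"
```

Without the query parameter, an uninitialized Vault would return a non-200 status and the check would report a false failure.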

router = "cloud-nat-${var.subnet_region}-${random_id.name.hex}"

create_router = true
}
Contributor Author

Necessary to install vault in the shared vpc internal integration tests.

listener "tcp" {
address = "${lb_ip}:${vault_proxy_port}"
tls_disable = 1
}
Contributor Author

This HTTP listener replaces nginx. The plain HTTP health check remains for backwards compatibility, but it could be replaced with an HTTPS check.

Contributor

@onetwopunch onetwopunch left a comment

Just a few suggestions for cleanup, but other than that this looks great! Thanks for all the hard work on this.

# Signal this script has run
touch ~/.startup-script-complete

# Reboot to pick up system-level changes
sudo reboot
# sudo reboot
Contributor

Is a reboot no longer necessary? If not, let's just remove this line.

Contributor Author

No longer necessary, removed.

https://raw.githubusercontent.com/Stackdriver/stackdriver-agent-service-configs/master/etc/collectd.d/statsd.conf

# Fix `wg_typed_value_create_from_value_t_inline` log spam
# See https://github.com/openinfrastructure/platform/issues/44
Contributor

This is a dead link (or private repo). Could you please explain the purpose of the below perl script in comments or remove it?

Contributor Author

Good catch, will update with link to https://issuetracker.google.com/issues/161054680

curl -sSfL https://dl.google.com/cloudagents/install-logging-agent.sh | bash
curl -sSfL https://dl.google.com/cloudagents/install-monitoring-agent.sh | bash
# apt-get upgrade -yqq
# apt-get install -yqq jq libcap2-bin logrotate unzip
Contributor

Is logrotate no longer needed? It seems we're still using it below via the logrotate.d directory.

Also, if a line in the script shouldn't be executed, please just remove it rather than commenting it out.

Contributor Author

Good catch, I'll add that back in.

end
end

## Example list-instances output
Contributor

Examples shouldn't live in comments. Please remove.

Contributor Author

Removed

@jeffmccune jeffmccune requested review from onetwopunch and removed request for a team November 16, 2020 19:37
Contributor

@onetwopunch onetwopunch left a comment

LGTM. Will merge once tests pass

@onetwopunch onetwopunch merged commit 1d0b5db into terraform-google-modules:master Nov 16, 2020