
Enable auto-healing, update to Debian 10 #119

Merged

Conversation

Contributor

@jeffmccune jeffmccune commented Nov 7, 2020

This patch adds an auto-healing policy to automatically re-create the
vault cluster instance if the vault server stops. One of the nodes in
the instance group is active as per Vault HA
(https://www.vaultproject.io/docs/concepts/ha.html). The other nodes are
passive and forward requests to the active node. Two different health
checks are used because passive nodes return non-200 status codes by
default.

In addition, this patch:

  • Updates Vault to 1.6.0 by default
  • Updates the image to Debian 10 by default
  • Defaults to e2-standard-2 instance types, which are less expensive
    and more performant than n1-standard-1
  • Improves startup (and auto-heal recovery) time by starting the vault
    service as early as possible in the startup-script

@jeffmccune
Contributor Author

One thing to note on the health checks: the auto-heal health check behaves differently from the LB health check. The LB's health check works with HTTPS directly, so nginx is no longer needed. The load balancer marks the active Vault instance healthy and treats all other instances as unhealthy, because standby instances return HTTP 429 by default (ref). During a failover, whichever instance becomes active is then marked healthy.

For auto-healing, all nodes return 200 as long as Vault is running and unsealed (standbyok=true).
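As a rough illustration of how the two checks differ (the address below is hypothetical; the status-code behavior is Vault's documented default for /v1/sys/health):

```ruby
require "uri"

# Hypothetical load balancer address; substitute your own.
lb_ip = "10.0.0.2"

# LB health check: only the active node answers 200 here, so the load
# balancer steers traffic to it. Standby nodes answer 429 by default.
lb_check = URI("https://#{lb_ip}:8200/v1/sys/health")

# Auto-heal health check: standbyok=true makes standby nodes also
# answer 200, so instances are only recreated when Vault itself is
# stopped or sealed, not merely because they are standbys.
heal_check = URI("https://#{lb_ip}:8200/v1/sys/health?standbyok=true")

puts lb_check
puts heal_check
```

The key point is that the same endpoint serves both checks; only the query parameters differ.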

Backwards compatibility tested by applying this change to a v5.1.0 examples/shared_vpc_internal cluster.

@jeffmccune
Contributor Author

@onetwopunch Please ignore this for the time being; I'll update this PR once I get the tests passing. I can see test/fixtures/simple_external is failing.

@jeffmccune jeffmccune force-pushed the auto_healing branch 7 times, most recently from b0a4e88 to 9721783 on November 12, 2020 22:49
@jeffmccune jeffmccune changed the title Add auto-healing Enable auto-healing, update to Debian 10 Nov 12, 2020
@jeffmccune jeffmccune force-pushed the auto_healing branch 14 times, most recently from b740d8b to 3737f4e on November 14, 2020 00:00
@jeffmccune
Contributor Author

@onetwopunch Update on this: I've gotten it pretty close to ready, but there's still some additional testing I'd like to do to be sure. I'll pick it back up next week.

Here's the high level of what I changed:

  1. Added Cloud NAT to us-west1 to install vault
  2. Re-worked the shared vpc internal integration tests a bit, primarily to surface information in the build output. The instance console serial-output is logged for failures now.
  3. Added assertions against the health status as reported by the instance group manager and auto heal health check.

At this point I think a timing issue is all that remains; 180 seconds might be a tad too optimistic for everything to converge and become healthy:

Profile: shared_vpc_internal
Version: (not specified)
Target:  local://

  ×  Vault: Shared VPC Configuration (2 failed)
     ✔  ILB configuration should be internal
     ✔  ILB configuration exit_status should eq 0
     ✔  ILB configuration stderr should eq ""
     ✔  Managed instances instances should become stable in 180 seconds
     ✔  Managed instances instances should at least one instance in the group
     ✔  Managed instances instances should be running
     ×  Managed instances instances should be healthy
     expected {"currentAction"=>"VERIFYING", "instanceHealth"=>[{"detailedHealthState"=>"UNKNOWN"}], "instanceStatus"=>"RUNNING"} to have all instanceHealth detailedHealthState HEALTHY values.
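One way to express the wait is as a polling loop rather than a single fixed 180-second window. A hedged sketch in Ruby (matching the language of the inspec tests; `fetch_health` is a hypothetical helper standing in for the instance-group health query, not the module's actual test code):

```ruby
# Poll until every instance in the managed group reports HEALTHY, or a
# deadline passes. `fetch_health` is assumed to return an array of
# detailedHealthState strings, e.g. ["HEALTHY", "UNKNOWN"].
def wait_until_healthy(timeout: 300, interval: 15)
  deadline = Time.now + timeout
  loop do
    states = fetch_health
    return true if !states.empty? && states.all? { |s| s == "HEALTHY" }
    return false if Time.now > deadline
    sleep interval
  end
end
```

Instances pass through VERIFYING with an UNKNOWN health state before converging, so a loop like this tolerates slow convergence without inflating the happy-path runtime.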

@jeffmccune jeffmccune force-pushed the auto_healing branch 4 times, most recently from aeecada to 0b7a722 on November 16, 2020 17:07
@jeffmccune
Contributor Author

@onetwopunch This is ready for your review. The integration tests are stable now; the most recent three builds have been green.

EOF
systemctl enable nginx
systemctl restart nginx
fi
Contributor Author

Previously, nginx was used to redirect the google_compute_http_health_check, which always made a request for /.

Today, the google_compute_http_health_check can request a custom path, so nginx is no longer necessary.

variable "hc_initial_delay_secs" {
description = "The number of seconds that the managed instance group waits before it applies autohealing policies to new instances or recently recreated instances."
type = number
default = 60
Contributor Author

Typically takes ~45 seconds to hit all of the systemd targets.

uri = URI("https://#{lb_ip}:8200/v1/sys/health")
req = Net::HTTP::Get.new(uri.path)
uri = URI("https://#{lb_ip}:8200/v1/sys/health?uninitcode=200")
req = Net::HTTP::Get.new(uri)
Contributor Author

Necessary to preserve the query params.
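The distinction matters because Net::HTTP::Get.new drops the query string when it is given only uri.path; passing the full URI preserves it in the request line. A small sketch (the address is hypothetical):

```ruby
require "net/http"
require "uri"

uri = URI("https://198.51.100.1:8200/v1/sys/health?uninitcode=200")

# Passing only the path silently discards "?uninitcode=200".
req_path = Net::HTTP::Get.new(uri.path)

# Passing the full URI keeps the query string in the request target.
req_uri = Net::HTTP::Get.new(uri)

puts req_path.path  # "/v1/sys/health"
puts req_uri.path   # "/v1/sys/health?uninitcode=200"
```

Without the query parameter, an uninitialized Vault would return a non-200 status and the check would report a false failure.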

router = "cloud-nat-${var.subnet_region}-${random_id.name.hex}"

create_router = true
}
Contributor Author

Necessary to install vault in the shared vpc internal integration tests.

listener "tcp" {
address = "${lb_ip}:${vault_proxy_port}"
tls_disable = 1
}
Contributor Author

This HTTP listener replaces nginx. The plain HTTP health check remains for backwards compatibility, but it could be replaced with an HTTPS check.

Contributor

@onetwopunch onetwopunch left a comment

Just a few suggestions for cleanup, but other than that this looks great! Thanks for all the hard work on this.

# Signal this script has run
touch ~/.startup-script-complete

# Reboot to pick up system-level changes
sudo reboot
# sudo reboot
Contributor

Is a reboot no longer necessary? If not, let's just remove this line.

Contributor Author

No longer necessary, removed.

https://raw.githubusercontent.com/Stackdriver/stackdriver-agent-service-configs/master/etc/collectd.d/statsd.conf

# Fix `wg_typed_value_create_from_value_t_inline` log spam
# See https://github.com/openinfrastructure/platform/issues/44
Contributor

This is a dead link (or private repo). Could you please explain the purpose of the below perl script in comments or remove it?

Contributor Author

Good catch, will update with link to https://issuetracker.google.com/issues/161054680

curl -sSfL https://dl.google.com/cloudagents/install-logging-agent.sh | bash
curl -sSfL https://dl.google.com/cloudagents/install-monitoring-agent.sh | bash
# apt-get upgrade -yqq
# apt-get install -yqq jq libcap2-bin logrotate unzip
Contributor

Is logrotate no longer needed? It seems we're still using it below via the logrotate.d directory.

Also, if a line in the script shouldn't be executed, please just remove it rather than commenting it out.

Contributor Author

Good catch, I'll add that back in.

end
end

## Example list-instances output
Contributor

Examples shouldn't live in comments. Please remove.

Contributor Author

Removed

@jeffmccune jeffmccune requested review from onetwopunch and removed request for a team November 16, 2020 19:37
Contributor

@onetwopunch onetwopunch left a comment

LGTM. Will merge once tests pass

@onetwopunch onetwopunch merged commit 1d0b5db into terraform-google-modules:master Nov 16, 2020