-
Notifications
You must be signed in to change notification settings - Fork 127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable auto-healing, update to Debian 10 #119
Enable auto-healing, update to Debian 10 #119
Conversation
65802b4
to
7bca4d0
Compare
One thing to note on the health checks. The auto-heal health check behaves different from the LB health check. The LB's health check works with HTTPS directly, so nginx is no longer needed. The vault active instance is marked healthy, all other instances are treated as unhealthy by the load balancer. Standby instances return HTTP 429 by default (ref). In a failover, whichever instance becomes active will become healthy. For auto-healing, all nodes return 200 if Vault is running and unsealed (standbyok=true). Backwards compatibility tested by applying this change to a v5.1.0 |
7bca4d0
to
566b877
Compare
@onetwopunch Please ignore this for the time being, I'll update this PR once I get the tests passing. I can see |
b0a4e88
to
9721783
Compare
b740d8b
to
3737f4e
Compare
@onetwopunch Update on this, I've gotten it pretty close to ready, but there's still some additional testing I'd like to do to make sure. I'll pick it back up next week. Here's the high level of what I changed:
At this point I think a timing issue is all that remains, 180 seconds might be a tad too optimistic for everything to converge and be healthy:
|
3737f4e
to
a83b11e
Compare
aeecada
to
0b7a722
Compare
@onetwopunch This is ready for your review now. The integrating tests are stable now, the most recent 3 build have been green. |
EOF | ||
systemctl enable nginx | ||
systemctl restart nginx | ||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, nginx was used to redirect the google_compute_http_health_check, which always made a request for /
.
Today, the google_compute_http_health_check
is able to request a custom path, so nginx is no longer necessary.
variable "hc_initial_delay_secs" { | ||
description = "The number of seconds that the managed instance group waits before it applies autohealing policies to new instances or recently recreated instances." | ||
type = number | ||
default = 60 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typically takes ~45 seconds to hit all of the systemd targets.
uri = URI("https://#{lb_ip}:8200/v1/sys/health") | ||
req = Net::HTTP::Get.new(uri.path) | ||
uri = URI("https://#{lb_ip}:8200/v1/sys/health?uninitcode=200") | ||
req = Net::HTTP::Get.new(uri) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Necessary to preserve the query params.
router = "cloud-nat-${var.subnet_region}-${random_id.name.hex}" | ||
|
||
create_router = true | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Necessary to install vault in the shared vpc internal integration tests.
listener "tcp" { | ||
address = "${lb_ip}:${vault_proxy_port}" | ||
tls_disable = 1 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This HTTP listener replaces nginx, the plain http health check remains for backwards compatibility but could be replaced with an https check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few suggestions for cleanup, but other than that this looks great! Thanks for all the hard work on this.
# Signal this script has run | ||
touch ~/.startup-script-complete | ||
|
||
# Reboot to pick up system-level changes | ||
sudo reboot | ||
# sudo reboot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is a reboot no longer necessary? If not, let's just remove this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer necessary, removed.
https://raw.githubusercontent.com/Stackdriver/stackdriver-agent-service-configs/master/etc/collectd.d/statsd.conf | ||
|
||
# Fix `wg_typed_value_create_from_value_t_inline` log spam | ||
# See https://github.com/openinfrastructure/platform/issues/44 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a dead link (or private repo). Could you please explain the purpose of the below perl script in comments or remove it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, will update with link to https://issuetracker.google.com/issues/161054680
curl -sSfL https://dl.google.com/cloudagents/install-logging-agent.sh | bash | ||
curl -sSfL https://dl.google.com/cloudagents/install-monitoring-agent.sh | bash | ||
# apt-get upgrade -yqq | ||
# apt-get install -yqq jq libcap2-bin logrotate unzip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is logrotate
no longer needed? Seems we're still using it below via the logrotate.d
directory.
Also if a line in the script shouldn't be executed, please just remove as opposed to commenting it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, I'll add that back in.
end | ||
end | ||
|
||
## Example list-instances output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Examples shouldn't live in comments. Please remove.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
This patch adds an auto-healing policy to automatically re-create the vault cluster instance if the vault server stops. One of the nodes in the instance group is active as per [Vault HA][ha]. The other nodes are passive and forward requests to the active node. Two different health checks are used because passive nodes return non-200 status codes by default. In addition, this patch: * Update Vault to 1.6.0 by default * Update image to Debian 10 by default * Defaults to e2-standard-2 instance types, which are less expensive and more performant than n1-standard-1. * Improves startup (and auto-heal recovery) time by starting the vault service as quickly as possible in the startup-script. [ha]: https://www.vaultproject.io/docs/concepts/ha.html
fa88057
to
1dc7163
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Will merge once tests pass
This patch adds an auto-healing policy to automatically re-create the
vault cluster instance if the vault server stops. One of the nodes in
the instance group is active as per Vault HA. The other nodes are
passive and forward requests to the active node. Two different health
checks are used because passive nodes return non-200 status codes by
default.
In addition, this patch:
and more performant than n1-standard-1.
service as quickly as possible in the startup-script.