status: accepted
implementation: done
status_last_reviewed: 2024-03-04

Application Healthchecks

Summary

We currently use application healthchecks for three things:

  • For load balancing, to determine if an instance is ready to serve requests.
  • For continuous deployments, to determine (along with other smoke tests) if a release can be automatically pushed to production.
  • For alerting, so that 2ndline know if an instance has a problem which needs manual intervention to fix.

We implement healthchecks using the GovukHealthcheck module in govuk_app_config.

Currently, a healthcheck response is always served with an HTTP status of 200, and indicates in the body whether the instance's status is "ok", "warning", or "critical". However, AWS load balancers use the HTTP status, not the response body, to determine whether an instance is healthy. So we will continue to send requests to an instance which is in a "warning" or "critical" state.
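
For illustration, this is roughly how an app wires up the current endpoint and the kind of body it returns. The check classes shown are examples from govuk_app_config; the exact set varies per app, and the body shape is indicative rather than exact.

```ruby
# config/routes.rb (illustrative; check classes vary per app)
get "/healthcheck", to: GovukHealthcheck.rack_response(
  GovukHealthcheck::ActiveRecord,
  GovukHealthcheck::SidekiqRedis,
)

# Today the response is always HTTP 200, with the overall state only in the
# body, along the lines of:
#   {"status": "warning", "checks": {"database_connectivity": {"status": "ok"}}}
```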

This RFC proposes that we standardise on our healthchecks returning a 200 or a 500 status code, which has some further ramifications for monitoring and alerting (discussed below).

A full list of healthchecks which will need updating is given at the end of the document.

Problem

We overload the meaning of "healthcheck", covering both application health and instance health with our /healthcheck endpoints. This introduces confusion, and means that we cannot currently use healthchecks in load balancing.

Here are some examples of healthchecks which are suitable for load balancing and for post-deployment checks:

If an app can't talk to a backing service it relies on, then it probably can't be trusted to handle any requests. A failure in one of these healthchecks could indicate a networking or configuration issue.

Here are some examples of healthchecks which are unsuitable for load balancing or for post-deployment checks:

If one of these healthchecks fails, the app instances themselves are probably fine. The problem is elsewhere.

Two of these indicate capacity problems outside of the instance, and the third is an alert that a manual maintenance procedure needs to be performed some time in the next two months. They all share the property that if one instance reports a failure, all will. This makes them unsuitable for load balancing purposes.

Even if we don't make our healthchecks suitable for load balancing, as GOV.UK moves towards continuous deployments, we will end up in situations where an automatic deployment is aborted because of something unrelated to the change being deployed. That's not good.

A healthcheck should check that the instance is running and isn't completely broken. That's all.

Proposal

I propose that we adopt these definitions:

A liveness healthcheck is an HTTP endpoint which MUST return an HTTP status of 200.

A readiness healthcheck is an HTTP endpoint which MUST return an HTTP status of either 200 (if the instance may receive requests) or 500 (if it should not). It MAY also return details in the response body as a JSON object.

I further propose that we commit to deprecating the /healthcheck endpoint and implementing the new healthchecks at /healthcheck/live and /healthcheck/ready.
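
As a sketch of what this could look like in a Rails app (the check classes are illustrative, and this assumes GovukHealthcheck is updated as described later in this RFC):

```ruby
# config/routes.rb (sketch only)

# Liveness: the instance is up and able to answer a trivial request. Always 200.
get "/healthcheck/live", to: proc { [200, {}, ["OK"]] }

# Readiness: 200 if the instance may receive requests, 500 if it should not,
# optionally with details as a JSON object in the body.
get "/healthcheck/ready", to: GovukHealthcheck.rack_response(
  GovukHealthcheck::ActiveRecord,
  GovukHealthcheck::SidekiqRedis,
)
```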

This new approach to readiness healthchecks is more suitable for our three purposes than before:

  • For load balancing, an unhealthy app will be taken out of the pool.
  • For continuous deployments, we can check the HTTP status rather than read the response body.
  • For alerting, we can continue to read the response body.

We get no immediate benefit from the liveness healthcheck, but we will once we have replatformed.

Adopting these definitions gives us some migration work to do. We can't start serving non-"ok" healthchecks with an HTTP status of 500 right now, as some of our healthchecks are unsuitable for load balancing.

We will need to implement the proposal in stages:

  1. List all the healthchecks which need changing, because they don't indicate a critical failure of the app (done, see the appendix)
  2. For each such healthcheck:
    • remove it if it's not adding value, or
    • add a separate alert if it is
  3. For each application:
    • serve a liveness healthcheck on /healthcheck/live
    • serve a readiness healthcheck on /healthcheck/ready
  4. Update govuk-puppet and govuk-aws to use /healthcheck/ready instead of /healthcheck
  5. Remove the /healthcheck endpoint from every app.
  6. Change GovukHealthcheck to serve the appropriate HTTP status code (see the sketch below).

As part of (2), we will remove the "warning" state some of our healthchecks return.
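
To illustrate step (6), here is a minimal sketch, not the real govuk_app_config code: once the "warning" state is gone, the readiness response can derive its HTTP status from the aggregated check result instead of hard-coding 200.

```ruby
# Sketch only: map the overall check status to the HTTP status code.
def http_status(overall_status)
  overall_status == :ok ? 200 : 500
end

http_status(:ok)       # => 200
http_status(:critical) # => 500
```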

Why separate healthchecks?

Separate liveness and readiness healthchecks are a common best practice. They are separate because they serve different purposes:

  • Liveness is used by the container orchestrator (such as Amazon ECS, which we are replatforming to) to determine if an instance has crashed or entered some other unrecoverable state, and must be restarted.

  • Readiness is used by the network load balancers to determine if an instance should be sent traffic.

For example, let's say we have an application which uses a database, and that due to some transient fault (like a network partition) some of the instances cannot reach the database. We do not want to send traffic to those instances, so their readiness healthchecks should fail. But restarting the instances won't resolve the problem, as it's an issue with the underlying infrastructure, so we don't want their liveness healthchecks to fail. We want the instances to keep running, so that when the transient fault recovers, they can quickly begin serving traffic again.

On the other hand, let's say an instance exhausts all its memory, and can't handle any inbound requests at all. The liveness (and readiness) healthcheck will fail, due to timing out, and ECS will restart the instance.
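
To make the distinction concrete, here is a hedged sketch of a check for the database scenario above, following the shape GovukHealthcheck conventionally expects of a custom check (a name and a status method); the class name and returned symbols are assumptions. A check like this belongs on /healthcheck/ready only, not /healthcheck/live, because restarting the instance would not fix a database outage.

```ruby
# Illustrative readiness-only check: fails when the database is unreachable.
class DatabaseConnectivityCheck
  def name
    :database_connectivity
  end

  def status
    # A trivial query: if the database is unreachable this raises, the
    # readiness endpoint returns 500, and the instance leaves the pool.
    ActiveRecord::Base.connection.select_value("SELECT 1")
    :ok
  rescue StandardError
    :critical
  end
end
```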

Remove the "warning" state

There are four possible semantics we could assign the "warning" state:

| # | State | Allows automatic deployments? | Allows requests to be sent to the instance? |
| - | ----- | ----------------------------- | ------------------------------------------- |
|   | ok | yes | yes |
| 1 | warning | yes | yes |
| 2 | warning | yes | no |
| 3 | warning | no | yes |
| 4 | warning | no | no |
|   | critical | no | no |
  • If we go for option 1, then "warning" is the same as "ok".
  • If we go for option 2, then "warning" will let us deploy unusable releases.
  • If we go for option 3, then "warning" will block deployments.
  • If we go for option 4, then "warning" is the same as "critical".

The most sensible option is 3, which is our current behaviour.

But does it really gain us anything over having separate alerts for the specific conditions we need to know about?

By removing the "warning" state, we will remove some spurious alerts which don't add value, and add more specific alerts for ones which do.

Gaps in metrics

Some of the unsuitable healthchecks correspond to conditions we need to alert on. Some can be implemented directly as Icinga checks, drawing on data we already report to Graphite, but others will need new metrics to be made available first.

Prometheus, which we are adopting as part of the replatforming work, is a pull-based metrics-gathering tool which nicely solves the problem of getting application state into an alert. But we're not replatformed yet, so until then we may need to do something like email-alert-api does, and run a worker which pushes metrics to Graphite.
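
For example, a hedged sketch of such a worker, in the style of email-alert-api's metrics export: the ApiToken scope and metric name are made up for illustration, and it assumes the GovukStatsd helper from govuk_app_config. A metric like this would let a normal Icinga threshold check alert on the condition behind signon's ApiTokens healthcheck instead of the healthcheck itself.

```ruby
# Hypothetical worker, scheduled periodically (e.g. via sidekiq-scheduler).
# Pushes a gauge through statsd to Graphite so an Icinga check can alert on it.
class ExpiringApiTokensMetricsWorker
  include Sidekiq::Worker

  def perform
    # ApiToken and its expiry column are illustrative, not signon's real schema.
    expiring_soon = ApiToken.where("expires_at < ?", 2.months.from_now).count
    GovukStatsd.gauge("api_tokens.expiring_within_two_months", expiring_soon)
  end
end
```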

Appendix: Healthchecks which need changing

The following healthchecks need updating before GovukHealthcheck can serve an HTTP status of 500:

| App or Gem | Check | Reason |
| ---------- | ----- | ------ |
| govuk_app_config | GovukHealthcheck::SidekiqQueueCheck | checks for a capacity issue |
| govuk_app_config | GovukHealthcheck::SidekiqQueueLatencyCheck | checks for a capacity issue |
| govuk_app_config | GovukHealthcheck::SidekiqRetrySizeCheck | checks for a capacity issue |
| govuk_app_config | GovukHealthcheck::ThresholdCheck | checks for a capacity issue |
| content-publisher | Healthcheck::GovernmentDataCheck | a failure doesn't seem to totally impair the instance |
| finder-frontend | Healthcheck::RegistriesCache | a failure doesn't seem to totally impair the instance |
| publisher | Healthcheck::ScheduledPublishing | checks for a capacity issue |
| publishing-api | Healthcheck::QueueLatency | checks for a capacity issue |
| search-api | Healthcheck::ElasticsearchIndexDiskspaceCheck | checks for a capacity issue |
| search-api | Healthcheck::RerankerHealthcheck | checks for an error which is gracefully handled |
| search-api | Healthcheck::SidekiqQueueLatenciesCheck | checks for a capacity issue |
| signon | Healthcheck::ApiTokens | checks for an upcoming maintenance task |

Notes:

  • asset-manager isn't using GovukHealthcheck, but it has an equivalent implementation.
  • content-data-admin has a nonstandard healthcheck which reports an error to Sentry.
  • finder-frontend has a /healthcheck and a /healthcheck.json which do different things.
  • Various apps don't have any healthcheck endpoint at all.
  • Various apps have a healthcheck endpoint which is just proc { [200, {}, []] }.