
Proposal - Application-defined "alive probe" #21142

Closed
icecrime opened this issue Mar 11, 2016 · 29 comments · Fixed by #23218

@icecrime (Contributor)

Problem statement

Docker currently doesn't provide any built-in way to determine whether a container is "alive" in the sense that the service it provides is up and running. This would be useful in many scenarios, for example:

  • Sequencing dependent containers (e.g., docker/compose#374)
  • Taking informed load-balancing decisions
  • Restarting the container on an application-specific criteria

This issue covers the support for "alive probes" at the Engine level.

Proposal

Every container would support one optional container-specific probe to determine whether a service is alive. This would translate into the Docker UX as several new command-line options for the run sub-command (naming to be discussed):

Option            Default value  Description
--probe           ""             URI of an HTTP endpoint or in-container script to probe for service liveness
--probe-interval  60             Interval in seconds between probes
--probe-retry     1              Number of successive probe failures before considering the container failing
--probe-timeout   TBD            Number of seconds the probe may run before failure is assumed
--probe-grace     TBD            Number of seconds since container start time before the probe is active

A container is considered alive when the probe returns 0 for a file:// based probe, or a status code in the 200-399 range for an http[s]:// based probe.

Examples:

docker run -d \
           --probe="file:///some/script.sh" \
           --probe-interval=120 myimage
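
For comparison, a minimal sketch of an HTTP-based probe under the same proposed flags (illustrative only: the endpoint, port, and numeric values are placeholders, and these flags are part of this proposal, not shipped options):

docker run -d \
           --probe="http://localhost:8080/health" \
           --probe-interval=30 \
           --probe-timeout=5 \
           --probe-retry=3 myimage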

Open questions

Implementation

  • Should it be a new value for the container state, or should it be a new field in the container definition?
  • Should probe failure be reported in the events API?

Restart policies

  • Should restart policies be backed by the probe result?
  • Can we implicitly alter the behavior of --restart=always and --restart=on-failure when a --probe is specified? That is roughly equivalent to assuming that the restart policy was always backed by a default probe whose behavior is to check that the container process is alive.

References

Ping @crosbymichael @tonistiigi @mgoelzer @aluzzardi @ehazlett

@tianon (Member) commented Mar 12, 2016

(continuing from #21143 (comment) -- worth reiterating that I'm definitely +1 to having "how to probe XYZ container for healthiness" as a bit of image metadata)

I'm a big +1 to the idea here -- it's very common to have a "retry" loop for dependent services (or something complex involving consul and custom code performing essentially this task), so I'd love to see Docker have a core concept of "health checking" embedded by default, especially since that "healthy" status could then be reported via docker inspect and consumed directly by other tools that are currently doing this work (or they can happily disable this feature and continue to do that work themselves 🍤).

As for the implementation, I'm a little bit concerned that we might be pigeonholing ourselves by only accepting URLs for "what to probe" -- would we plan to just use custom schemes like something://data if we ever come up with some other way we'd want to probe a container? (One other type of simple probe that comes to mind is whether or not the application is actually listening on a particular port -- in MySQL's case, we are very careful to make sure that it isn't listening on that externally accessible port until it's fully initialized and ready for external use, for example, so it'd be overkill to include a script that essentially just connects to that port and returns a successful status when Docker itself could verify much more quickly that the port is open without spawning a shell inside the container.) Would there be any value in varying and potentially user-defined levels of "healthiness" too? Have we looked at how other "health checking" systems are handling this type of value to see if there are any good ideas already in the space we can borrow? 😄

I'm also curious about what this probing would/could be used for in the engine itself -- the proposal touches on a few potential use cases (automatic restarting of "unhealthy" containers, for example), but I don't know whether that's being left intentionally vague so that we can discuss "health status" for containers first (ie, what that means and how to calculate/gather said status) and then discuss how it would interact with other features, or if it was just an oversight and there's already a set interaction in mind. 😄 (I'd reiterate that I definitely see value in a "health" status that's separate from anything the engine is doing to the container, so I'd love to see those features be orthogonally defined -- for example, something like a restart policy of unless-stopped,on-health:unhealthy or something.)

@aluzzardi (Member)

Huge +1 on this.

A few things:

--probe-grace
Number of seconds since container start time before the probe is active

I think the probe should be active immediately after container start. The grace period might be one minute for instance, but the container may be "alive" within a few seconds and we don't want to wait that long to report it as alive.

What the grace period could mean is the delay after container start for which we do not increase the failure count.

Example: --probe-grace 30s --probe-retry 3 --probe-interval 10s. We probe the container immediately after start then every 10s. If after 30 seconds it's still failing, then we set it as unhealthy.

--probe-retry defaults to 1

That's perhaps too aggressive; 3 might be a more sensible default.

--probe="file:///some/script.sh"

We use URIs for discovery and I'm not a fan anymore. They're fragile, confusing, and hard to customize (e.g. for HTTP you might want to specify a custom response code, while for a script you'd want an exit code).

I'd suggest having something like --probe-driver=[http,exec], --probe-endpoint=[http://foo or /some/script.sh]. Then maybe --probe-opt exitcode=0 (similar to --log-driver / --log-opt, --storage-driver / --storage-opt).
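
A rough sketch of how that flag style might look (hypothetical flags following the suggestion above, with placeholder endpoint and values):

docker run -d \
           --probe-driver=http \
           --probe-endpoint=http://localhost:8080/health \
           --probe-opt status=200-399 myimage

docker run -d \
           --probe-driver=exec \
           --probe-endpoint=/some/script.sh \
           --probe-opt exitcode=0 myimage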

Should it be a new value for the container state, or should it be a new field in the container definition?

IMO, the probe should be part of the container lifecycle and Up means the probe is successful.

This implies:

  • docker ps would show as Up containers that are alive, as defined by the probe
  • docker ps would show as Failed or whatnot containers that are failing the probe
  • docker run, like today, waits for the container to be Up before returning. This means it would wait for the probe - when run returns, the container is already listening on its TCP port, etc.
  • Like every other container state, it would be reported by docker events
  • If a probe fails, --restart=always will restart the container (since it's not running anymore)

@cpuguy83 (Member)

Channeling @crosbymichael: health checking from the same system that's running the service isn't a great health check.

What if there was cluster-wide knowledge of health checks that any host can perform?

Checks could be backed by a driver interface, and pluggable, with built-ins for simple TCP and HTTP+URL checks.
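
For a sense of what those built-ins would boil down to, here is a rough shell equivalent of a TCP check and an HTTP+URL check (host, port, and URL are placeholders; exit status 0 means healthy):

# TCP check: succeeds if the port accepts a connection
nc -z localhost 5432

# HTTP+URL check: -f makes curl exit non-zero on HTTP errors (status >= 400)
curl -fsS -o /dev/null http://localhost:8080/health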

@duglin (Contributor) commented Mar 13, 2016

@cpuguy83 I had similar thoughts, but then I assumed, perhaps incorrectly, that most of this was to be supported by the Swarm manager and therefore wouldn't be on the same system. It's only in smaller dev/test environments that doing the check on the same system would be OK.

But, overall I do like the idea of adding support at some level. I just would like to see the complete picture first. For example, putting it on the "docker run" command is interesting in limited cases, but my first thought would be that it should be in a compose.yml file. There are probably a few options and it would be good to discuss the long-term vision before we start down any particular path.

@elgalu (Contributor) commented Mar 13, 2016

+1 on this or some sort of HEALTHCHECK instruction #21143 #7445.
Solving this in docker-compose would also help docker/compose#374 (comment).

@dongluochen (Contributor)

@cpuguy83 The health check from the Engine is not meant to be complete, but it provides value. A container failing the Engine health check shows that the container is not functioning as expected. This feedback is useful for a few scenarios as described above. For example, users can stop a rolling update based on this feedback and investigate.

A container passing the health check may still not be reachable from the outside. Orchestration tools can add external monitoring. Combining the result of the Engine health check with external monitoring would help failure diagnosis.

@aluzzardi (Member)

@cpuguy83 I think the name healthcheck is misleading. The goal of this system is to guarantee that the container is actually running, not that it's end-to-end reachable from the outside like a load balancer health check would do - that is out of scope.

@ehazlett (Contributor)

+1 for this as well. I agree it's not a full solution for checking the health of the service but a good initial data point. I also think the name is misleading although I'm not coming up with anything better atm.

@ghost commented Mar 14, 2016

+1 to @aluzzardi 's idea to change to --probe-driver=[http,exec] and --probe-endpoint=[http://foo or /some/script.sh] instead of the URI syntax.

But I don't think --probe-opt exitcode=0 should exist. Too complicated. I think it should be an inherent characteristic of the exec probe that an exit code of 0 means passing. Similarly, the http probe would always interpret 200-399 as passing.
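
As a concrete illustration of that convention, a minimal exec-style probe script could look like this (the URL is a placeholder; the only contract is the exit code):

#!/bin/sh
# Exit 0 if the service answers its health endpoint, non-zero otherwise.
if curl -fsS -o /dev/null http://localhost:8080/health; then
    exit 0
fi
exit 1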

@ghost commented Mar 14, 2016

I think we should consider calling it --health-check. My concern is that calling it a probe obfuscates what it is. With --health-check-*, anyone will instantly understand the purpose and intended use of these flags.

I believe the main objection to "health check" is that this is not a true check of health at the application level. For instance, it doesn't ensure the container is reachable, it relies on the Engine to essentially check itself which is not robust against Engine crashes, etc. I agree with all those points. However, it seems to me that this feature is a true container-level health check: it asserts that the process inside the container is running in a manner consistent with the expectations of the image's author. Because it's being specified in the context of a docker run, I think people would understand the limited scope of this health check.

(Also, I would point out that this container-level health check doesn't preclude the creation of a higher level health check within the Swarm manager.)

Are there other downsides to the name "health check" that I'm missing? What about a more accurately scoped name like --container-health-check?

@crosbymichael @cpuguy83

@mglasgow42

Agree with @dongluochen that running the health check inside the container being monitored is suboptimal in many cases. If the container is wedged, it may not be able to update a file to indicate it is unhealthy.

Checking that a URL returns 200 may be fine if you just want to ensure a webserver is responding, but for more complicated logic there should also be an ability to call a separate process outside the container being monitored. Developers can provide a separate healthcheck container for their service that runs whatever arbitrary test is needed, and we can examine the exit code to see whether it passed.
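
A sketch of that pattern with the existing CLI (the myservice and myservice-healthcheck names are hypothetical; the check container shares the service's network namespace and its exit code is the verdict):

docker run --rm --net=container:myservice myservice-healthcheck
echo $?   # 0 means the check passed, non-zero means it failed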


@gittycat

This could finally allow docker compose to start dependent services in order.
For instance, serviceA depends on Elasticsearch which depends on Postgres.

Also, to me "health check" means a very limited scope test; basically is this service ready to accept requests and replies correctly to a request that doesn't involve any downstream services.

"Smoke tests" can be used for end to end testing.

@dnephin (Member) commented Mar 15, 2016

On the design

I think the term probe is confusing and it should just be called a health check. The system that reads the health checks is responsible for dealing with cases where the container is incapable of performing the check. Any check that fails to run is equivalent to a check that runs and reports a failure.

I think seconds is the wrong unit for some of these. It should be milliseconds, especially for timeout.

On open questions

If it were added as a container state, changes in state should be reflected in the events feed, and a new restart policy on-unhealthy could be added to have it restarted. I don't think the existing restart policies would change.

If it is not a change in container state, it shouldn't be part of the event stream, and shouldn't impact restart policies at all.

I think it would be good to make it a new state, but that it's not absolutely necessary for V1.

@cpuguy83 (Member)

Sequencing dependent containers (e.g., docker/compose#374)

I think it is bad practice to encourage ready checks for sequencing container starts. This should be handled at the application level... e.g. connect-to-db->fail->loop(connect-to-db)
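
A minimal sketch of that application-level loop, e.g. in an entrypoint script (db, 5432, and myapp are placeholders):

#!/bin/sh
# Retry the database connection until it succeeds, then start the app.
until nc -z db 5432; do
    echo "waiting for db..."
    sleep 1
done
exec myapp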

Taking informed load-balancing decisions

Makes sense, like an event broadcasting that the container is ready.

Restarting the container on an application-specific criteria

Kind of ick to put Docker in this category.
A ready check and a constantly running health check are very different things.
I think the latter belongs in a monitoring framework.

@bobrik (Contributor) commented Mar 15, 2016

I think it is bad practice to encourage ready checks for sequencing container starts. This should be handled at the application level... e.g. connect-to-db->fail->loop(connect-to-db)

How is it different from load balancing?

@stevvooe (Contributor)

In general, I am a huge fan of this idea. Providing secondary checks for a process to indicate its liveness will only help to inform the Docker engine.

The great thing about this concept is that it can be complementary to an external health checking system. Coupled with a plugin system, it can operate in concert with a larger system or simply as a local liveness check. Through the event API, it can be joined with remote data to inform service discovery. Remote health checks are still required to check for service access, but this will cover an important gap at the local level.

I do hate the name "probe". A probe implies the measurement of a remote value, such as a voltage or where the rebel scum may be located.

@konobi commented Mar 25, 2016

How about checking the filesystem?

READY_ON /tmp/this_container_is_alive_and_healthy

@vsaraswat

Also agree on the name "health check" over "probe". "Probe" is a bit vague, whereas "health check" (even if just container-level and not app-level) is pretty widely understood by users.

@jdavisclark

Would it make more sense for the health check specification to be defined on the container being checked, and the link/dependency itself be defined on the dependent container, or is that too inflexible? This might not work well with a swarm, or otherwise outside of a local single node environment; I'm naive to the docker internals.

e.g., ignore the method for health-checking, that part isn't super important, but assume --health-check did take multiple schemes via a protocol prefix or something, which, in line with what @stevvooe said, makes for an easily extensible system:

  • dependency: docker run -d --name foo --health-check="port:9200" elasticsearch:latest
  • dependent: docker run -d --name bar --dependency=foo kibana:latest
  1. The dependency is defined as healthy, in this instance, when port 9200 is open/available. I'm thinking "I can telnet to it now", but whatever. Then that state info can be maintained on the dependency's container state, rather than having to poll/retry the entire health check from the dependent side.
  2. If image authors wanted to support it for standard scenarios, that exact same scheme would work in the dependency's Dockerfile as well; in which case the CLI option could maybe act as an override (much like entrypoint and ports behave now).

@tianon (Member) commented Apr 12, 2016

IMO for the first pass it would be valuable to focus on defining how to determine and discover the "health" status (including how to monitor it for changes), and then separately discuss how that state impacts the rest of the system (container dependencies, etc); I fear that if we try to implement both in one shot that we'll end up hamstringing our implementation of "health checking" to solve a narrow use case (or a narrow set of use cases) 😞

@zh99998 commented Apr 22, 2016

Consider providing a TCP/UDP check for non-HTTP(S) apps, with a user-defined request and response pattern.
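
A sketch of what such a user-defined request/response check could reduce to, using a Redis-style PING/PONG exchange as the example (the request string, expected pattern, host, and port are placeholders):

# Send a protocol-specific request and require a matching response.
printf 'PING\r\n' | nc -w 2 localhost 6379 | grep -q '+PONG'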

@fanktom commented May 11, 2016

Instead of a check from the outside, could it make sense to be able to tell the run command when it can automatically detach?

This way, a container could be started with the usual docker run -d ..., but it would block until the started container issues some READY signal. The signal could be implemented in many ways. One way I can think of would be to declare something like DETACHON /var/sock/some.sock. If DETACHON is not declared, the container detaches as soon as possible, causing no breaking change. If DETACHON is declared, Docker can watch the filesystem for the file, e.g. /var/sock/some.sock, until it automatically detaches. This file can then be created individually by each image. With this approach, e.g. a database can be loaded with fixtures and then create the READY file, which is more complicated with a check from the outside in.
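
Roughly, the proposed behavior amounts to the following wait loop on the engine side (the socket path is the example above; DETACHON itself is hypothetical, not an existing Dockerfile instruction):

# Block until the container has created its READY marker, then detach.
until [ -e /var/sock/some.sock ]; do
    sleep 1
done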

@thaJeztah (Member)

There's a pull request open to implement this, so anyone who's interested, PTAL at #22719.

thaJeztah added the kind/feature label on May 13, 2016
@AkihiroSuda (Member)

@thaJeztah Is this issue closable? #23218

thaJeztah added this to the 1.12.0 milestone on Jun 3, 2016
@thaJeztah (Member)

Yup! Same as the other one 👍

Implemented in #23218

@sanmai-NL commented Sep 22, 2016

Where is the doc and/or schema for the output of docker inspect -f '{{json .State.Health}}' ...? I didn't see this discussed in this thread or elsewhere.
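
For reference, in the 1.12 implementation the output of that command has roughly this shape (mycontainer and all values below are illustrative; Status is one of starting, healthy, or unhealthy, and Log holds the most recent check results):

docker inspect -f '{{json .State.Health}}' mycontainer
# {"Status":"healthy","FailingStreak":0,"Log":[
#   {"Start":"2016-09-22T10:00:00Z","End":"2016-09-22T10:00:01Z","ExitCode":0,"Output":"OK"}
# ]}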

@sanmai-NL

@talex5: thanks! Hope this gets documented someday.
