
Proposal - Application-defined "alive probe" #21142

Closed
icecrime opened this issue Mar 11, 2016 · 29 comments · Fixed by #23218

@icecrime (Contributor)

Problem statement

Docker currently doesn't provide any built-in way to determine whether a container is "alive" in the sense that the service it provides is up and running. This would be useful in many scenarios, for example:

  • Sequencing dependent containers (e.g., docker/compose#374)
  • Taking informed load-balancing decisions
  • Restarting the container on an application-specific criteria

This issue covers the support for "alive probes" at the Engine level.

Proposal

Every container would support one optional container-specific probe to determine whether a service is alive. This would translate into the Docker UX as several new command-line options for the run sub-command (naming to be discussed):

Option            Default value  Description
--probe           ""             URI of an HTTP endpoint or in-container script to probe for service liveness
--probe-interval  60             Interval in seconds between probes
--probe-retry     1              Number of successive probe failures before considering the container failing
--probe-timeout   TBD            Number of seconds the probe may run before failure is assumed
--probe-grace     TBD            Number of seconds since container start time before the probe is active

A container is considered alive when the probe returns 0 for a file:// based probe, or a status code in the 200-399 range for an http[s]:// based probe.

Examples:

docker run -d \
           --probe="file:///some/script.sh" \
           --probe-interval=120 myimage
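
For comparison, a minimal sketch of an HTTP-based probe under the same proposed flags (illustrative only: the endpoint, port, and numeric values are placeholders, and these flags are part of this proposal, not shipped options):

docker run -d \
           --probe="http://localhost:8080/health" \
           --probe-interval=30 \
           --probe-timeout=5 \
           --probe-retry=3 myimage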

Open questions

Implementation

  • Should it be a new value for the container state, or should it be a new field in the container definition?
  • Should probe failure be reported in the events API?

Restart policies

  • Should restart policies be backed by the probe result?
  • Can we implicitly alter the behavior of --restart=always and --restart=on-failure when a --probe is specified? That is roughly equivalent to assuming that the restart policy was always backed by a default probe whose behavior is to check that the container process is alive.

References

Ping @crosbymichael @tonistiigi @mgoelzer @aluzzardi @ehazlett

@tianon (Member) commented Mar 12, 2016

(continuing from #21143 (comment) -- worth reiterating that I'm definitely +1 to having "how to probe XYZ container for healthiness" as a bit of image metadata)

I'm a big +1 to the idea here -- it's very common to have a "retry" loop for dependent services (or something complex involving consul and custom code performing essentially this task), so I'd love to see Docker have a core concept of "health checking" embedded by default, especially since that "healthy" status could then be reported via docker inspect and consumed directly by other tools that are currently doing this work (or they can happily disable this feature and continue to do that work themselves 🍤).

As for the implementation, I'm a little bit concerned that we might be pigeonholing ourselves by only accepting URLs for "what to probe" -- would we plan to just use custom schemes like something://data if we ever come up with some other way we'd want to probe a container? (One other type of simple probe that comes to mind is whether or not the application is actually listening on a particular port -- in MySQL's case, we are very careful to make sure that it isn't listening on that externally accessible port until it's fully initialized and ready for external use, for example, so it'd be overkill to include a script that essentially just connects to that port and returns a successful status when Docker itself could verify much more quickly that the port is open without spawning a shell inside the container.) Would there be any value in varying and potentially user-defined levels of "healthiness" too? Have we looked at how other "health checking" systems are handling this type of value to see if there are any good ideas already in the space we can borrow? 😄

I'm also curious about what this probing would/could be used for in the engine itself -- the proposal touches on a few potential use cases (automatic restarting of "unhealthy" containers, for example), but I don't know whether that's being left intentionally vague so that we can discuss "health status" for containers first (ie, what that means and how to calculate/gather said status) and then discuss how it would interact with other features, or if it was just an oversight and there's already a set interaction in mind. 😄 (I'd reiterate that I definitely see value in a "health" status that's separate from anything the engine is doing to the container, so I'd love to see those features be orthogonally defined -- for example, something like a restart policy of unless-stopped,on-health:unhealthy or something.)

@aluzzardi (Member)

Huge +1 on this.

A few things:

--probe-grace
Number of seconds since container start time before the probe is active

I think the probe should be active immediately after container start. The grace period might be one minute for instance, but the container may be "alive" within a few seconds and we don't want to wait that long to report it as alive.

What the grace period could mean is the delay after container start for which we do not increase the failure count.

Example: --probe-grace 30s --probe-retry 3 --probe-interval 10s. We probe the container immediately after start then every 10s. If after 30 seconds it's still failing, then we set it as unhealthy.

--probe-retry defaults to 1

That's perhaps too aggressive; 3 might be a more sensible default.

--probe="file:///some/script.sh"

We use URIs for discovery and I'm not a fan anymore. They're fragile, confusing, and hard to customize (e.g. for HTTP you might want to specify a custom response code, while for a script you'd want an exit code).

I'd suggest having something like --probe-driver=[http,exec], --probe-endpoint=[http://foo or /some/script.sh]. Then maybe --probe-opt exitcode=0 (similar to --log-driver / --log-opt, --storage-driver / --storage-opt).
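
A rough sketch of how that flag style might look (hypothetical flags following the suggestion above, with placeholder endpoint and values):

docker run -d \
           --probe-driver=http \
           --probe-endpoint=http://localhost:8080/health \
           --probe-opt status=200-399 myimage

docker run -d \
           --probe-driver=exec \
           --probe-endpoint=/some/script.sh \
           --probe-opt exitcode=0 myimage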

Should it be a new value for the container state, or should it be a new field in the container definition?

IMO, the probe should be part of the container lifecycle and Up means the probe is successful.

This implies:

  • docker ps would show as Up containers that are alive, as defined by the probe
  • docker ps would show as Failed or whatnot containers that are failing the probe
  • docker run, like today, waits for the container to be Up before returning. This means it would wait for the probe - when run returns, the container is already listening on its TCP port, etc.
  • Like every other container state, it would be reported by docker events
  • If a probe fails, --restart=always will restart the container (since it's not running anymore)

@cpuguy83 (Member)

Channeling @crosbymichael: health checking from the same system that's running the service isn't a great health check.

What if there was cluster-wide knowledge of health checks that any host can perform?

Checks could be backed by a driver interface, and pluggable, with built-ins for simple TCP and HTTP+URL checks.
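
For a sense of what those built-ins would boil down to, here is a rough shell equivalent of a TCP check and an HTTP+URL check (host, port, and URL are placeholders; exit status 0 means healthy):

# TCP check: succeeds if the port accepts a connection
nc -z localhost 5432

# HTTP+URL check: -f makes curl exit non-zero on HTTP errors (status >= 400)
curl -fsS -o /dev/null http://localhost:8080/health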

@duglin (Contributor) commented Mar 13, 2016

@cpuguy83 I had similar thoughts, but then I assumed, perhaps incorrectly, that most of this was to be supported by the Swarm manager and therefore wouldn't be on the same system. It's only in smaller dev/test environments that doing the check on the same system would be OK.

But, overall I do like the idea of adding support at some level. I just would like to see the complete picture first. For example, putting it on the "docker run" command is interesting in limited cases, but my first thought would be that it should be in a compose.yml file. There are probably a few options and it would be good to discuss the long-term vision before we start down any particular path.

@elgalu (Contributor) commented Mar 13, 2016

+1 on this or some sort of HEALTHCHECK instruction #21143 #7445.
Solving this in docker-compose would also help docker/compose#374 (comment).

@dongluochen (Contributor)

@cpuguy83 The health check from the Engine is not meant to be complete, but it provides value. A container failing the Engine health check shows that the container is not functioning as expected. This feedback is useful for a few scenarios as described above. For example, users can stop a rolling update based on this feedback and investigate.

A container passing the health check may still not be reachable from the outside. Orchestration tools can add external monitoring. Combining the result of the Engine health check with external monitoring would help failure diagnosis.

@aluzzardi (Member)

@cpuguy83 I think the name healthcheck is misleading. The goal of this system is to guarantee that the container is actually running, not that it's end-to-end reachable from the outside like a load balancer health check would do - that is out of scope.

@ehazlett (Contributor)

+1 for this as well. I agree it's not a full solution for checking the health of the service but a good initial data point. I also think the name is misleading although I'm not coming up with anything better atm.

@ghost commented Mar 14, 2016

+1 to @aluzzardi 's idea to change to --probe-driver=[http,exec] and --probe-endpoint=[http://foo or /some/script.sh] instead of the URI syntax.

But I don't think --probe-opt exitcode=0 should exist. Too complicated. I think it should be an inherent characteristic of the exec probe that an exit code of 0 means passing. Similarly, the http probe would always interpret 200-399 as passing.
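
As a concrete illustration of that convention, a minimal exec-style probe script could look like this (the URL is a placeholder; the only contract is the exit code):

#!/bin/sh
# Exit 0 if the service answers its health endpoint, non-zero otherwise.
if curl -fsS -o /dev/null http://localhost:8080/health; then
    exit 0
fi
exit 1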

@ghost commented Mar 14, 2016

I think we should consider calling it --health-check. My concern is that calling it a probe obfuscates what it is. With --health-check-*, anyone will instantly understand the purpose and intended use of these flags.

I believe the main objection to "health check" is that this is not a true check of health at the application level. For instance, it doesn't ensure the container is reachable, it relies on the Engine to essentially check itself which is not robust against Engine crashes, etc. I agree with all those points. However, it seems to me that this feature is a true container-level health check: it asserts that the process inside the container is running in a manner consistent with the expectations of the image's author. Because it's being specified in the context of a docker run, I think people would understand the limited scope of this health check.

(Also, I would point out that this container-level health check doesn't preclude the creation of a higher level health check within the Swarm manager.)

Are there other downsides to the name "health check" that I'm missing? What about a more accurately scoped name like --container-health-check?

@crosbymichael @cpuguy83

@mglasgow42

Agree with @dongluochen that running the health check inside the container being monitored is suboptimal in many cases. If the container is wedged, it may not be able to update a file to indicate it is unhealthy.

Checking that a URL returns 200 may be fine if you just want to ensure a webserver is responding, but for more complicated logic there should also be an ability to call a separate process outside the container being monitored. Developers can provide a separate healthcheck container for their service that runs whatever arbitrary test is needed, and we can examine the exit code to see whether it passed.
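
A sketch of that pattern with the existing CLI (the myservice and myservice-healthcheck names are hypothetical; the check container shares the service's network namespace and its exit code is the verdict):

docker run --rm --net=container:myservice myservice-healthcheck
echo $?   # 0 means the check passed, non-zero means it failed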


@gittycat

This could finally allow docker compose to start dependent services in order.
For instance, serviceA depends on Elasticsearch which depends on Postgres.

Also, to me "health check" means a very limited scope test; basically is this service ready to accept requests and replies correctly to a request that doesn't involve any downstream services.

"Smoke tests" can be used for end to end testing.

@dnephin (Member) commented Mar 15, 2016

On the design

I think the term probe is confusing and it should just be called a health check. The system that reads the health checks is responsible for dealing with cases where the container is incapable of performing the check. Any check that fails to run is equivalent to a check that runs and reports a failure.

I think seconds is the wrong unit for some of these. It should be milliseconds, especially for timeout.

On open questions

If it were added as a container state, changes in state should be reflected in the events feed, and a new restart policy on-unhealthy could be added to have it restarted. I don't think the existing restart policies would change.

If it is not a change in container state, it shouldn't be part of the event stream, and shouldn't impact restart policies at all.

I think it would be good to make it a new state, but that it's not absolutely necessary for V1.

@cpuguy83 (Member)

Sequencing dependent containers (e.g., docker/compose#374)

I think it is bad practice to encourage ready checks for sequencing container starts. This should be handled at the application level... e.g. connect-to-db->fail->loop(connect-to-db)
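
A minimal sketch of that application-level loop, e.g. in an entrypoint script (db, 5432, and myapp are placeholders):

#!/bin/sh
# Retry the database connection until it succeeds, then start the app.
until nc -z db 5432; do
    echo "waiting for db..."
    sleep 1
done
exec myapp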

Taking informed load-balancing decisions

Makes sense, like an event broadcasting that the container is ready.

Restarting the container on an application-specific criteria

Kind of ick to put Docker in this category.
A ready check and a constantly running health check are very different things.
I think the latter belongs in a monitoring framework.

@bobrik (Contributor) commented Mar 15, 2016

I think it is bad practice to encourage ready checks for sequencing container starts. This should be handled at the application level... e.g. connect-to-db->fail->loop(connect-to-db)

How is it different from load balancing?

@stevvooe (Contributor)

In general, I am a huge fan of this idea. Providing secondary checks for a process to indicate its liveness will only help to inform the Docker engine.

The great thing about this concept is that it can be complementary to an external health checking system. Coupled with a plugin system, it can operate in concert with a larger system or simply as a local liveness check. Through the event API, it can be joined with remote data to inform service discovery. Remote health checks are still required to check for service access, but this will cover an important gap at the local level.

I do hate the name "probe". A probe implies the measurement of a remote value, such as a voltage or where the rebel scum may be located.

@konobi commented Mar 25, 2016

How about checking the filesystem?

READY_ON /tmp/this_container_is_alive_and_healthy

@vsaraswat

Also agree on the name "health check" over "probe". "Probe" is a bit vague, whereas "health check" (even if just container-level and not app-level) is pretty widely understood by users.

@jdavisclark

Would it make more sense for the health check specification to be defined on the container being checked, and the link/dependency itself be defined on the dependent container, or is that too inflexible? This might not work well with a swarm, or otherwise outside of a local single node environment; I'm naive to the docker internals.

e.g., ignore the method for health-checking, that part isn't super important, but assume --health-check did take multiple schemes via a protocol prefix or something, which, in line with what @stevvooe said, makes for an easily extensible system:

  • dependency: docker run -d --name foo --health-check="port:9200" elasticsearch:latest
  • dependent: docker run -d --name bar --dependency=foo kibana:latest
  1. The dependency is defined as healthy, in this instance, when port 9200 is open/available. I'm thinking "I can telnet to it now", but whatever. Then that state info can be maintained on the dependency's container state, rather than having to poll/retry the entire health check from the dependent side.
  2. If image authors wanted to support it for standard scenarios, that exact same scheme would work in the dependency's Dockerfile as well; in which case the CLI option could maybe act as an override (much like entrypoint and ports behave now).

@tianon (Member) commented Apr 12, 2016

IMO for the first pass it would be valuable to focus on defining how to determine and discover the "health" status (including how to monitor it for changes), and then separately discuss how that state impacts the rest of the system (container dependencies, etc); I fear that if we try to implement both in one shot that we'll end up hamstringing our implementation of "health checking" to solve a narrow use case (or a narrow set of use cases) 😞

@zh99998 commented Apr 22, 2016

Consider providing a TCP/UDP check for non-HTTP(S) apps, with a user-defined request and response pattern.
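
A sketch of what such a user-defined request/response check could reduce to, using a Redis-style PING/PONG exchange as the example (the request string, expected pattern, host, and port are placeholders):

# Send a protocol-specific request and require a matching response.
printf 'PING\r\n' | nc -w 2 localhost 6379 | grep -q '+PONG'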

@fanktom commented May 11, 2016

Instead of a check from the outside, could it make sense to be able to tell the run command when it can automatically detach?

This way, a container could be started with the usual docker run -d ..., but it would block until the started container issues some READY signal. The signal could be implemented in many ways. One way I can think of would be to declare something like DETACHON /var/sock/some.sock. If DETACHON is not declared, the container detaches as soon as possible, causing no breaking change. If DETACHON is declared, Docker can watch the filesystem for the file, e.g. /var/sock/some.sock, until it automatically detaches. This file can then be created individually by each image. With this approach, e.g. a database can be loaded with fixtures and then create the READY file, which is more complicated with a check from the outside in.
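
Roughly, the proposed behavior amounts to the following wait loop on the engine side (the socket path is the example above; DETACHON itself is hypothetical, not an existing Dockerfile instruction):

# Block until the container has created its READY marker, then detach.
until [ -e /var/sock/some.sock ]; do
    sleep 1
done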

@thaJeztah (Member)

There's a pull request open to implement this, so anyone who's interested, PTAL at #22719.

thaJeztah added the kind/feature label on May 13, 2016
@AkihiroSuda (Member)

@thaJeztah Is this issue closable? #23218

thaJeztah added this to the 1.12.0 milestone on Jun 3, 2016
@thaJeztah (Member)

Yup! Same as the other one 👍

Implemented in #23218

@sanmai-NL commented Sep 22, 2016

Where is the doc and/or schema for the output of docker inspect -f '{{json .State.Health}}' ...? I didn't see this discussed in this thread or elsewhere.
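
For reference, in the 1.12 implementation the output of that command has roughly this shape (mycontainer and all values below are illustrative; Status is one of starting, healthy, or unhealthy, and Log holds the most recent check results):

docker inspect -f '{{json .State.Health}}' mycontainer
# {"Status":"healthy","FailingStreak":0,"Log":[
#   {"Start":"2016-09-22T10:00:00Z","End":"2016-09-22T10:00:01Z","ExitCode":0,"Output":"OK"}
# ]}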

@sanmai-NL

@talex5: thanks! Hope this gets documented someday.
