Allow failure threshold for application health check #3435

zuesxiao · 2023-09-13T01:20:53Z

Issue

We observed many application crashes caused by HTTP health check timeouts, although all other HTTP requests were working right before the crashes and all other metrics looked good.

Our health check endpoint is fast and contains no other logic. We also increased the timeout to 20 seconds, but that did not help much.

Expected result

We are not sure why some health checks fail due to timeout; it might be CPU throttling.
But we expect the runtime to give the application another chance by running another health check.

Current result

The application instance is restarted after a single health check failure.

Possible Fix

Add a failure threshold for the health check: only after the health check fails failureThreshold times in a row does the runtime consider the overall check failed and the container not healthy/live.
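
The proposed behavior can be sketched as a consecutive-failure counter. This is a minimal illustration in Python, not Cloud Foundry's implementation; the class name `FailureThresholdCheck` and the parameter `failure_threshold` are hypothetical and chosen only for this example.

```python
class FailureThresholdCheck:
    """Track consecutive health check failures (hypothetical helper, not a CF API)."""

    def __init__(self, failure_threshold: int = 3) -> None:
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, check_passed: bool) -> bool:
        """Record one check result and return True while the instance counts as healthy."""
        if check_passed:
            # Any success resets the streak, so only failures in a row count.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures < self.failure_threshold


check = FailureThresholdCheck(failure_threshold=3)
results = [check.record(ok) for ok in (False, False, True, False, False, False)]
# Only the last entry is False: three failures in a row were needed.
```

With `failure_threshold=1` the sketch matches the current result described above, where a single failed check marks the instance as unhealthy.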

philippthun · 2023-09-13T07:58:45Z

When you write "we increased the timeout to 20 seconds", do you mean the health-check-invocation-timeout?
Quite recently the first step towards another check type was completed, i.e. readiness health checks. Maybe you could have a look at https://github.com/cloudfoundry/community/blob/main/toc/rfc/rfc-0020-readiness-healthchecks.md and Readiness health checks #3351.

zuesxiao · 2023-09-15T09:16:52Z

Hi @philippthun
Thanks for your response.
I'm curious whether the readiness health check is executed during the whole life cycle of the application, or only during the startup phase?
Thanks.

philippthun · 2023-09-15T11:16:11Z

The readiness health checks are always executed. Failing readiness health checks mean that an app instance will not be accessible (i.e. no traffic will be routed). This can happen at any time during the lifecycle. But the app process will not be restarted (that's the difference compared to the other health checks).
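
The difference described here can be pictured as a small decision function. This is only a toy model in Python; `Action` and `actions_for` are hypothetical names, and treating the two checks as independent is an assumption of the sketch, since the comment does not say how they interact.

```python
from enum import Enum
from typing import Set


class Action(Enum):
    STOP_ROUTING = "stop_routing"        # instance receives no traffic
    RESTART_PROCESS = "restart_process"  # app process is restarted


def actions_for(liveness_ok: bool, readiness_ok: bool) -> Set[Action]:
    """Map check results to runtime reactions (illustrative model only)."""
    actions: Set[Action] = set()
    if not readiness_ok:
        # Failing readiness: no traffic is routed, but the process keeps running.
        actions.add(Action.STOP_ROUTING)
    if not liveness_ok:
        # Failing the other (liveness) health check: the process is restarted.
        actions.add(Action.RESTART_PROCESS)
    return actions
```

In this model, `actions_for(liveness_ok=True, readiness_ok=False)` yields only `Action.STOP_ROUTING`, which is the case described above: the instance becomes inaccessible but is not restarted.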

zuesxiao · 2023-09-18T02:13:54Z

Hi @philippthun
The readiness health check is really good. But I still have a question after walking through that RFC.
If the readiness check fails, the app instance is not accessible. Will the CF runtime then stop running the health check against those "not ready" app instances?
I'm asking because, if the CF runtime still executes the health check and it fails, the app instance gets restarted anyway. If a "not ready" app instance is out of scope for the health check, that would be great.

My original requirement is that the CF runtime gives the app instance a chance to recover from a failing health check instead of restarting it, like the failureThreshold that Kubernetes provides for this situation.

cf-gitbot added the unscheduled label Sep 13, 2023
