Allow failure threshold for application health check #3435

Open
zuesxiao opened this issue Sep 13, 2023 · 4 comments

@zuesxiao

zuesxiao commented Sep 13, 2023

Issue

We observed many application crashes caused by HTTP health check timeouts, even though all other HTTP requests were working right up until the crashes and all other metrics looked good.

Our health check endpoint is very fast and contains no additional logic. We also increased the timeout to 20 seconds, but that did not help much.

Expected result

We are not sure why some health checks fail with a timeout; CPU throttling might be the cause.
Either way, we expect the runtime to give the application another chance by running another health check before acting on the failure.

Current result

The application instance is restarted after a single health check failure.

Possible Fix

Add a failure threshold for health checks: only after the health check fails failureThreshold times in a row does the runtime consider the overall check failed and the container not healthy/live.
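
To make the proposal concrete, here is a minimal sketch of the counting logic we have in mind. It is not taken from the Cloud Foundry code base; the class name `FailureThresholdTracker`, its fields, and the default of 3 are hypothetical and only illustrate the "N failures in a row" rule.

```python
from dataclasses import dataclass


@dataclass
class FailureThresholdTracker:
    """Hypothetical helper: tracks consecutive health check failures."""

    failure_threshold: int = 3  # assumed default, for illustration only
    consecutive_failures: int = 0

    def record(self, check_passed: bool) -> bool:
        """Record one check result.

        Returns True only when the instance should be treated as unhealthy,
        i.e. after `failure_threshold` failures in a row.
        """
        if check_passed:
            # Any success resets the streak, giving the app a fresh start.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold
```

With a threshold of 3, a single timed-out check followed by a successful one would not cause a restart; only three failed checks in a row would.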

@philippthun
Member

  1. When you write "we increased the timeout to 20 seconds", do you mean the health-check-invocation-timeout?
  2. Quite recently the first step towards another check type was completed, i.e. readiness health checks. Maybe you could have a look at https://github.com/cloudfoundry/community/blob/main/toc/rfc/rfc-0020-readiness-healthchecks.md and Readiness health checks #3351.

@zuesxiao
Author

Hi @philippthun
Thanks for your response.
I'm curious whether readiness health checks are executed during the whole life cycle of the application, or only during the startup phase?
Thanks.

@philippthun
Member

The readiness health checks are always executed. Failing readiness health checks mean that an app instance will not be accessible (i.e. no traffic will be routed). This can happen at any time during the lifecycle. But the app process will not be restarted (that's the difference compared to the other health checks).
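
As a rough illustration of that difference (a sketch only, not how the runtime is actually implemented), the outcome of a failed check could be modeled like this. The names `Action`, `action_for`, and the `"liveness"` label for the existing restart-triggering checks are hypothetical.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"                  # check passed, nothing to do
    STOP_ROUTING = "stop_routing"  # instance stays up but receives no traffic
    RESTART = "restart"            # app process is restarted


def action_for(check_type: str, passed: bool) -> Action:
    """Map a check result to the consequence described in this thread."""
    if passed:
        return Action.NONE
    if check_type == "readiness":
        # Failing readiness: no traffic is routed, but no restart happens.
        return Action.STOP_ROUTING
    if check_type == "liveness":
        # Failing the other (existing) health check: the process is restarted.
        return Action.RESTART
    raise ValueError(f"unknown check type: {check_type}")
```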

@zuesxiao
Author

zuesxiao commented Sep 18, 2023

Hi @philippthun
The readiness health check looks really good, but I still have a question after reading through the RFC.
If the readiness check fails, the app instance becomes inaccessible. Will the CF runtime then stop running the regular health check against those "not ready" app instances?
I'm asking because, if the CF runtime still executes the health check and it fails, the app instance gets restarted anyway. If "not ready" app instances were out of scope for the health check, that would be great.

My original requirement is that the CF runtime gives an app instance a chance to recover from a failing health check instead of restarting it. Kubernetes, for example, provides failureThreshold for this situation.
