
Ready check does not include current database connectivity #831

Open
3 of 6 tasks
Waidmann opened this issue Feb 7, 2022 · 14 comments · May be fixed by #1382
Assignees
Labels
bug Something is not working. good first issue A good issue to tackle when being a novice to the project. help wanted We are looking for help on this one.

Comments

@Waidmann

Waidmann commented Feb 7, 2022

Preflight checklist

Describe the bug

The health/ready endpoint returns OK even when database connectivity is no longer available. I would expect it to check this, because according to the docs: "This endpoint returns a 200 status code when the HTTP server is up and running and the environment dependencies (e.g. the database) are responsive as well."

Reproducing the bug

  1. Set up a Postgres service in a k8s cluster
  2. Deploy Keto to the cluster with the DSN pointing to the Postgres service
  3. Kill Postgres
  4. Call Keto's 'health/ready' endpoint -> returns OK

However, when I try to insert/query tuples, I am obviously greeted with an error code.

Relevant log output

No response

Relevant configuration

No response

Version

0.6.0-alpha.1

On which operating system are you observing this issue?

No response

In which environment are you deploying?

Kubernetes with Helm

Additional Context

No response

@Waidmann Waidmann added the bug Something is not working. label Feb 7, 2022
@zepatrik
Member

Good point, that should really be the case.

@zepatrik zepatrik added good first issue A good issue to tackle when being a novice to the project. help wanted We are looking for help on this one. labels Feb 17, 2022
@zepatrik
Member

The ready-checkers are registered here:

r.healthH = healthx.NewHandler(r.Writer(), config.Version, healthx.ReadyCheckers{})

Currently none are registered, which means that Keto appears healthy as soon as it runs.

@nickjn92

From a Kubernetes point of view, you don't want to include external dependencies, such as a database, in your readiness checks. Otherwise you might end up in a cascading-failure scenario where all pods are taken down and unable to serve requests, and you are greeted with some generic error that doesn't really tell you what's causing the issue.
I believe the best practice is to rely on monitoring to determine what's causing the errors, and if you need to wait for the database to be up you can use an initContainer or lifecycle hooks.
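The initContainer approach mentioned above can be sketched like this in a pod spec; the service name `postgres` and port are placeholders for whatever your DSN points at:

```yaml
spec:
  initContainers:
    - name: wait-for-postgres
      image: busybox:1.36
      command:
        - sh
        - -c
        # Block pod startup until the database accepts TCP connections.
        - until nc -z postgres 5432; do echo waiting for postgres; sleep 2; done
```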

@zepatrik
Member

Interesting standpoint, maybe @Demonsthere can give his opinion on this? Keto is generally not able to serve any request without a working database connection. Init migration jobs will also not complete, so you will end up in an error loop on helm install anyway.
But yeah, killing a pod just because the database is unavailable is also not helpful 🤔

@Demonsthere
Contributor

Imho, from a deployment perspective:

  • keto should report a ready check once it has started and is running in a stable state. Since we use an init job, which has to connect to the DB and without which the keto main deployment won't even start, we can assume that if keto is running, the connection to the DB must have worked at least for the migration part. This could be improved by verifying in the ready check that we can open a connection to the DB.
  • as for periodic health checks, imho a periodic downtime of the DB can always happen, and as pointed out we should not cascade-restart all pods because of that; but maybe mark the pod as unhealthy with a more specific health check?

@zepatrik
Member

Sounds good, so basically we would ping the database on startup and report as ready once that succeeded. Further ready checks will not ping the database again, but always return true.
Later we can add a check that pings the db periodically.

@mstrYoda

In Kubernetes, we can define the failure threshold to retry before restarting pods. Also, we can define initialDelaySeconds to wait for some operational tasks to be complete before sending health/readiness requests.

IMHO, adding a database health check might be good as well.
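For reference, those knobs sit on the probe definition itself; the path and port below are assumptions about a typical Keto deployment, not verified defaults:

```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 4466
  initialDelaySeconds: 10  # wait for startup tasks (e.g. migrations) first
  periodSeconds: 5
  failureThreshold: 3      # tolerate a few failures before marking unready
```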

@Demonsthere
Contributor

In the helm charts the values for probes are exposed and can be configured to your liking :)

@Demonsthere
Contributor

Edit: we actually ran into a related issue some time ago 😅 which caused us to rethink the setup a bit. We have now exposed the option to change the probes to custom ones, as seen here in kratos, and will work on reworking the health checks in general.

@aeneasr
Member

aeneasr commented Jan 18, 2023

Isn't this solved now? I think one of the probes now checks DB connectivity.

@zepatrik
Member

They would have to be added here, right?

r.healthH = healthx.NewHandler(r.Writer(), config.Version, healthx.ReadyCheckers{})

Maybe that was a different project, and we can transfer the change?

@aeneasr
Member

aeneasr commented Jan 19, 2023

:O Yes, definitely, that needs to be checked! Otherwise we could run into an outage if we encounter one of those SQL connection bugs with CockroachDB that need a pod restart.

https://github.com/ory/kratos/blob/4181fbc381b46df5cd79941f20fc885c7a1e1b47/driver/registry_default.go#L255-L273

aeneasr added a commit that referenced this issue Jul 24, 2023
@aeneasr aeneasr linked a pull request Jul 24, 2023 that will close this issue
7 tasks
@aeneasr aeneasr self-assigned this Jul 24, 2023

@aran

aran commented Dec 14, 2023

I just ran into an issue using Postgresql as backend, with calls to Keto reporting something like:

unable to fetch records...terminating connection due to administrator command (SQLSTATE 57P01) with gRPC code Unknown.

The DB was up and retries didn't work. However, restarting the pod did. I am wondering if there's a chance of this issue making it over the finish line?
