Set matching cache status http code #330

ruial · 2022-08-30T13:25:45Z

This endpoint always returns status 200 even if 1 or more caches are not alive. With this change, if 1 of the caches is dead, it will return status 500, which can then be used by health checking tools to restart the service.

I have previously monkey patched #292 into the service I maintain but I still found issues where 1+ cache was not restarted correctly. This is hard to reproduce and happens from time to time. With this change, I no longer need to restart the Metaflow service manually.

jackie-ob · 2022-08-30T15:40:32Z

It feels a little janky to return a 500 status code just as a function of the response content here. Is it a possibility for the health check tool to use other info on the response? E.g.

Health check tool inspects the body and does the all(is_alive) check itself.
Metadata service can set an all_caches_alive flag either in response body, or in response header.
A new endpoint made to indicate the need to restart. We have /ping for this purpose today. However we can consider another endpoint with more sophisticated internal criteria.

ruial · 2022-08-30T15:56:01Z

Hey @jackie-ob , setting the HTTP status code >= 400 is the standard way to signal that a service is unhealthy (in AWS ELB and k8s):

In the case of ELB I don't think the response body can be evaluated to determine the service health.

You guys are also using this technique correctly in other places, see https://github.com/Netflix/metaflow-service/blob/master/services/metadata_service/api/admin.py#L59

Please let me know if this MR makes sense or if I should change something.

romain-intel · 2022-08-30T15:57:47Z

What issues were you seeing in the cache? I believe they may be addressed in #327 which should keep the caches alive.

jackie-ob · 2022-08-30T16:03:37Z

@ruial we are planning to ship #327 as @romain-intel mentioned. Today. Would you like to see if that makes the need for this PR go away altogether?

If it is still a problem, our preference would be to create a new endpoint (similar to what you referenced)

ruial · 2022-08-30T16:11:51Z

Sure, I can try the new version, I think the fix from https://github.com/Netflix/metaflow-service/pull/292/files#diff-17bee700a8b2eb91cc38dcb3d7e30cd88769966fdb21c724393a5746f664d3cf is not included tho. A new endpoint in the future with the purpose of health checking the cache (and maybe the database) would also work. The current issue that I have is that all the cache workers subprocesses get terminated and don't restart, so everything in the UI gets a CacheServerUnreachable exception. I don't know how to reproduce the issue as it happens rarely.

Everything else with Metaflow works well, the only issues we have currently are the cache. Thank you :)

Set matching cache status http code

75e79e3

Update admin.py

0cc23cb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Set matching cache status http code #330

Set matching cache status http code #330

ruial commented Aug 30, 2022 •

edited

jackie-ob commented Aug 30, 2022 •

edited

ruial commented Aug 30, 2022

romain-intel commented Aug 30, 2022

jackie-ob commented Aug 30, 2022

ruial commented Aug 30, 2022

Set matching cache status http code #330

Are you sure you want to change the base?

Set matching cache status http code #330

Conversation

ruial commented Aug 30, 2022 • edited

jackie-ob commented Aug 30, 2022 • edited

ruial commented Aug 30, 2022

romain-intel commented Aug 30, 2022

jackie-ob commented Aug 30, 2022

ruial commented Aug 30, 2022

ruial commented Aug 30, 2022 •

edited

jackie-ob commented Aug 30, 2022 •

edited