Feature request: Introduce different states to prevent clusters from being turned inactive programmatically #80
Comments
This commit adds a 'Pending' state to keep track of backend's health. Resolves: trinodb#80
Instead of deactivating backends, we could switch to using the |
At first I thought the gateway only checks active backends because when a user disables a backend, e.g. for deployment purposes, it should not go back to enabled status automatically. Manual intervention is needed.
So, if I understood it correctly, health checks run only against those that have |
this is correct. trino-gateway/gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/ActiveClusterMonitor.java Line 62 in e155380
This feature allows an active backend to recover and start receiving requests again on its own if it was unhealthy for any reason.
@willmostly I think it improves the UX, because a backend that is down would then show its health as "unhealthy" on the UI, which lets users know that it is down at the moment. Open to any other discussion.
@willmostly out of curiosity, what would this implementation do if the active node count is 0 for a coordinator? Would it still route because running == queued == 0? My vote (albeit not important) is for the current implementation you've gone with, because it doesn't affect routing and fixes (IMO) broken functionality where backends can't self-heal. The only risky part of this implementation that I can see is that a flapping backend can continually rejoin the routing, accept and fail queries, drop as unhealthy, and repeat. But I believe the benefit of a healthy backend being able to rejoin routing after a false-positive health failure outweighs the cons.
We actually implemented this feature on our internal fork of trino-gateway. I'll see about getting it cleared and uploaded here as a branch if there's any interest. This feature was particularly important for us because we deploy on k8s, and redeployments of our Trino clusters bring them in and out of "active". -edit- just noticed that you had a PR open for this! So cool
Problem: Currently, if a backend does not respond to the healthcheck, trino-gateway deactivates it. Since the healthcheck only runs against active backends, an inactive backend is never turned active again.
Solution: Introduce healthy/pending/unhealthy states that reflect the health of each backend. These states are independent of the active/inactive states, which ensures that unhealthy but active backends are still checked and can be turned healthy again.
Healthy state: the backend is healthy and ready to serve requests.
Pending state: the backend has just been switched from inactive to active. It waits until the healthcheck returns success before the backend is turned to healthy.
Unhealthy state: the backend is returning errors or not responding to the healthcheck. At this point, if the backend is still active, the healthcheck keeps checking it.
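To make the proposal concrete, here is a minimal sketch of how the two independent dimensions (active/inactive and healthy/pending/unhealthy) could be tracked. Everything in it is hypothetical and is not trino-gateway's actual code: the names `HealthState`, `BackendHealthTracker`, `recordHealthCheck` and `isRoutable` are made up for illustration. It also makes three assumptions the issue does not spell out: a failed check while pending moves the backend to unhealthy, deactivating a backend leaves its last health state untouched, and only active and healthy backends receive traffic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical health states from the proposal; not an existing trino-gateway type.
enum HealthState { PENDING, HEALTHY, UNHEALTHY }

// Illustrative tracker that keeps "active" and "health" as separate dimensions.
final class BackendHealthTracker {
    private final Map<String, Boolean> active = new ConcurrentHashMap<>();
    private final Map<String, HealthState> health = new ConcurrentHashMap<>();

    // Switching a backend to active puts it in PENDING until a check succeeds.
    void activate(String backend) {
        active.put(backend, true);
        health.put(backend, HealthState.PENDING);
    }

    // Manual deactivation only flips the active flag.
    // Assumption: the last known health state is left as-is.
    void deactivate(String backend) {
        active.put(backend, false);
    }

    // Health check results never touch the active flag, so an unhealthy
    // backend stays active and keeps being checked.
    void recordHealthCheck(String backend, boolean success) {
        if (!isActive(backend)) {
            return; // inactive backends are not checked
        }
        // Assumption: a failure while PENDING moves the backend to UNHEALTHY.
        health.put(backend, success ? HealthState.HEALTHY : HealthState.UNHEALTHY);
    }

    boolean isActive(String backend) {
        return active.getOrDefault(backend, false);
    }

    HealthState healthOf(String backend) {
        return health.get(backend);
    }

    // Assumption: only active AND healthy backends are eligible for routing.
    boolean isRoutable(String backend) {
        return isActive(backend) && health.get(backend) == HealthState.HEALTHY;
    }
}
```

With this split, a failed health check makes a backend unroutable without deactivating it, so a later successful check brings it back on its own, while a backend that a user deactivated stays out of routing until someone activates it again.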