Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: Introduce different states to prevent clusters from being turned inactive programmatically #80

Open
andythsu opened this issue Oct 26, 2023 · 7 comments

Comments

@andythsu
Copy link
Member

andythsu commented Oct 26, 2023

Problem: Currently, if the backends are not responding to healthcheck, trino-gateway will inactivate the backends. Since healthcheck only checks against active backends, inactive backends will never be turned to active again.

Solution: it is better to introduce healthy/pending/unhealthy states depending on the health of the backends. These states will be independent from the active/inactive states, so it ensures that unhealthy but active backends can still be checked and possibly turned active again.

Healthy state is defined as backends are healthy and ready to be served
Pending state is defined as when the backend is switched from inactive to active. It will wait until healthcheck returns success before turning backend to healthy
Unhealthy state is defined as backends are returning error or not responding to healthcheck. At this point, if the backends are still active, healthcheck will still check on these backends.

@andythsu
Copy link
Member Author

When backend is active and health check passes, health state should be HEALTHY
image

When backend is inactive, health state should be PENDING
image

When backend is turned active from inactive, health state should be PENDING until health check passes
image

andythsu pushed a commit to andythsu/trino-gateway that referenced this issue Nov 13, 2023
This commit adds a 'Pending' state to keep track of backend's health.

Resolves: trinodb#80
andythsu added a commit to andythsu/trino-gateway that referenced this issue Nov 13, 2023
This commit adds a 'Pending' state to keep track of backend's health.

Resolves: trinodb#80
andythsu added a commit to andythsu/trino-gateway that referenced this issue Nov 13, 2023
This commit adds a 'Pending' state to keep track of backend's health.

Resolves: trinodb#80
andythsu added a commit to andythsu/trino-gateway that referenced this issue Nov 13, 2023
This commit adds a 'Pending' state to keep track of backend's health.

Resolves: trinodb#80
@willmostly
Copy link
Contributor

Instead of deactivating backends, we could switch to using the TrinoQueueLengthRoutingTable as default. This routing strategy will automatically ignore any backend that does not respond with the number of running and queued queries.

@Chaho12
Copy link
Member

Chaho12 commented Nov 15, 2023

At first I thought gateway only checks against active backends because when user disables backends, e.g for deployment purpose, it should not go back to enabled status automatically. A manual interaction is needed.

unhealthy but active backends can still be checked and possibly turned active again.

So, if i understood it correctly, healthy status checks only to those who has active status as true right? Those in active as false is not checked? Am just clarifying what you intended :)

@andythsu
Copy link
Member Author

andythsu commented Nov 15, 2023

At first I thought gateway only checks against active backends because when user disables backends, e.g for deployment purpose, it should not go back to enabled status automatically. A manual interaction is needed.

unhealthy but active backends can still be checked and possibly turned active again.

So, if i understood it correctly, healthy status checks only to those who has active status as true right? Those in active as false is not checked? Am just clarifying what you intended :)

this is correct. ActiveClusterMonitor only checks against active backends.

This feature allows active backends to revive and starts getting requests by itself if it was unhealthy for any reason.

@andythsu
Copy link
Member Author

andythsu commented Nov 28, 2023

Instead of deactivating backends, we could switch to using the TrinoQueueLengthRoutingTable as default. This routing strategy will automatically ignore any backend that does not respond with the number of running and queued queries.

@willmostly I think it adds value to UX improvement because deactivating backends also shows backends' health as "unhealthy" on UI. It lets users know that the backends are down at the moment.

Open to any other discussions.

@rdsarvar
Copy link
Contributor

Instead of deactivating backends, we could switch to using the TrinoQueueLengthRoutingTable as default. This routing strategy will automatically ignore any backend that does not respond with the number of running and queued queries.

@willmostly out of curiosity what would this implementation do if the active node count is 0 for a coordinator, would it still route because running == queued == 0?

my vote (albeit not important) is the current implementation you've went with because it doesn't affect routing and fixes (IMO) a broken functionality where backends can't self heal

The only risky part of this implementation that I can see is that a flapping backend can continually rejoin the routing, accept + fail queries, drop as unhealthy, repeat. But I believe the benefit of a healthy backend being able to rejoin routing after a false positive health failure out weights the cons.

@jeffreychucASAPP
Copy link

jeffreychucASAPP commented Dec 14, 2023

We actually implemented this feature on our internal fork of trino-gateway, I'll see about getting it cleared and uploaded here as a branch if there's any interest. This feature was particularly important for our deployment as we have this deployed on k8s and we'll have redeployments of our trino clusters that would bring them in and out of "active".

-edit-

just noticed that you had a pr open for this! so cool

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
5 participants