Feature request: Introduce different states to prevent clusters from being turned inactive programmatically #80

andythsu · 2023-10-26T23:56:13Z

Problem: Currently, if the backends are not responding to healthcheck, trino-gateway will inactivate the backends. Since healthcheck only checks against active backends, inactive backends will never be turned to active again.

Solution: it is better to introduce healthy/pending/unhealthy states depending on the health of the backends. These states will be independent from the active/inactive states, so it ensures that unhealthy but active backends can still be checked and possibly turned active again.

Healthy state is defined as backends are healthy and ready to be served
Pending state is defined as when the backend is switched from inactive to active. It will wait until healthcheck returns success before turning backend to healthy
Unhealthy state is defined as backends are returning error or not responding to healthcheck. At this point, if the backends are still active, healthcheck will still check on these backends.

The text was updated successfully, but these errors were encountered:

andythsu · 2023-11-12T10:44:31Z

When backend is active and health check passes, health state should be HEALTHY

When backend is inactive, health state should be PENDING

When backend is turned active from inactive, health state should be PENDING until health check passes

This commit adds a 'Pending' state to keep track of backend's health. Resolves: trinodb#80

willmostly · 2023-11-14T17:53:34Z

Instead of deactivating backends, we could switch to using the TrinoQueueLengthRoutingTable as default. This routing strategy will automatically ignore any backend that does not respond with the number of running and queued queries.

Chaho12 · 2023-11-15T00:38:24Z

At first I thought gateway only checks against active backends because when user disables backends, e.g for deployment purpose, it should not go back to enabled status automatically. A manual interaction is needed.

unhealthy but active backends can still be checked and possibly turned active again.

So, if i understood it correctly, healthy status checks only to those who has active status as true right? Those in active as false is not checked? Am just clarifying what you intended :)

andythsu · 2023-11-15T14:43:56Z

At first I thought gateway only checks against active backends because when user disables backends, e.g for deployment purpose, it should not go back to enabled status automatically. A manual interaction is needed.

unhealthy but active backends can still be checked and possibly turned active again.

So, if i understood it correctly, healthy status checks only to those who has active status as true right? Those in active as false is not checked? Am just clarifying what you intended :)

this is correct. ActiveClusterMonitor only checks against active backends.

trino-gateway/gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/ActiveClusterMonitor.java

Line 62 in e155380

List<ProxyBackendConfiguration> activeClusters =

This feature allows active backends to revive and starts getting requests by itself if it was unhealthy for any reason.

andythsu · 2023-11-28T16:52:28Z

Instead of deactivating backends, we could switch to using the TrinoQueueLengthRoutingTable as default. This routing strategy will automatically ignore any backend that does not respond with the number of running and queued queries.

@willmostly I think it adds value to UX improvement because deactivating backends also shows backends' health as "unhealthy" on UI. It lets users know that the backends are down at the moment.

Open to any other discussions.

rdsarvar · 2023-11-29T03:32:38Z

Instead of deactivating backends, we could switch to using the TrinoQueueLengthRoutingTable as default. This routing strategy will automatically ignore any backend that does not respond with the number of running and queued queries.

@willmostly out of curiosity what would this implementation do if the active node count is 0 for a coordinator, would it still route because running == queued == 0?

my vote (albeit not important) is the current implementation you've went with because it doesn't affect routing and fixes (IMO) a broken functionality where backends can't self heal

The only risky part of this implementation that I can see is that a flapping backend can continually rejoin the routing, accept + fail queries, drop as unhealthy, repeat. But I believe the benefit of a healthy backend being able to rejoin routing after a false positive health failure out weights the cons.

jeffreychucASAPP · 2023-12-14T22:52:29Z

We actually implemented this feature on our internal fork of trino-gateway, I'll see about getting it cleared and uploaded here as a branch if there's any interest. This feature was particularly important for our deployment as we have this deployed on k8s and we'll have redeployments of our trino clusters that would bring them in and out of "active".

-edit-

just noticed that you had a pr open for this! so cool

andythsu pushed a commit to andythsu/trino-gateway that referenced this issue Nov 13, 2023

Introduced 'Pending' state for backend's health

9df6315

This commit adds a 'Pending' state to keep track of backend's health. Resolves: trinodb#80

andythsu mentioned this issue Nov 13, 2023

Introduced 'Pending' state for backend's health #91

Closed

andythsu added a commit to andythsu/trino-gateway that referenced this issue Nov 13, 2023

Introduced 'Pending' state for backend's health

558415f

This commit adds a 'Pending' state to keep track of backend's health. Resolves: trinodb#80

andythsu added a commit to andythsu/trino-gateway that referenced this issue Nov 13, 2023

Introduced 'Pending' state for backend's health

9f787d4

This commit adds a 'Pending' state to keep track of backend's health. Resolves: trinodb#80

andythsu added a commit to andythsu/trino-gateway that referenced this issue Nov 13, 2023

Introduced 'Pending' state for backend's health

db5c94c

This commit adds a 'Pending' state to keep track of backend's health. Resolves: trinodb#80

This was referenced Jan 17, 2024

introduced 'Pending' state for backend's health #163

Closed

[Feature] New Web UI #116

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature request: Introduce different states to prevent clusters from being turned inactive programmatically #80

Feature request: Introduce different states to prevent clusters from being turned inactive programmatically #80

andythsu commented Oct 26, 2023 •

edited

andythsu commented Nov 12, 2023

willmostly commented Nov 14, 2023

Chaho12 commented Nov 15, 2023

andythsu commented Nov 15, 2023 •

edited

andythsu commented Nov 28, 2023 •

edited

rdsarvar commented Nov 29, 2023

jeffreychucASAPP commented Dec 14, 2023 •

edited

Feature request: Introduce different states to prevent clusters from being turned inactive programmatically #80

Feature request: Introduce different states to prevent clusters from being turned inactive programmatically #80

Comments

andythsu commented Oct 26, 2023 • edited

andythsu commented Nov 12, 2023

willmostly commented Nov 14, 2023

Chaho12 commented Nov 15, 2023

andythsu commented Nov 15, 2023 • edited

andythsu commented Nov 28, 2023 • edited

rdsarvar commented Nov 29, 2023

jeffreychucASAPP commented Dec 14, 2023 • edited

andythsu commented Oct 26, 2023 •

edited

andythsu commented Nov 15, 2023 •

edited

andythsu commented Nov 28, 2023 •

edited

jeffreychucASAPP commented Dec 14, 2023 •

edited