Trino gateway health state design #222
Comments
Spitballing a scenario here: what happens in the case that Kube/CoreDNS (or anything node-specific, tbh) fails to resolve FQDNs for one of the N nodes? Would this result in a total outage, as the single node running Gateway would continuously set all backends to failed in the central store?
No, the gateway will use the API provided by Trino to determine Trino's state. In the case where FQDNs fail to resolve, it should result in a different error.
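For context on what probing Trino directly can look like: Trino exposes basic server info over HTTP at `/v1/info`. The sketch below is a minimal reachability check using the JDK HTTP client; the endpoint is real Trino API, but whether trino-gateway consumes exactly this endpoint and field is an assumption here.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class TrinoProbe
{
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();

    // Returns true only if the coordinator is reachable and reports it has
    // finished starting; DNS failures, timeouts, and refused connections all
    // surface as exceptions and map to false rather than "backend failed".
    static boolean isHealthy(String baseUrl)
    {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/v1/info"))
                    .timeout(Duration.ofSeconds(5))
                    .GET()
                    .build();
            HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            // crude string check to keep the sketch dependency-free
            return response.statusCode() == 200 && response.body().contains("\"starting\":false");
        }
        catch (Exception e) {
            return false;
        }
    }
}
```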
Please see https://lucid.app/lucidchart/1058db28-d6f7-4d35-95db-1231e295b989/edit?viewport_loc=-128%2C270%2C2003%2C1154%2C0_0&invitationId=inv_e2f1449c-560e-458d-81e2-cffc0d2b98e7 for the flow chart of this design.
This part is not included in the flow chart because I'm not sure it is still relevant. We can omit it to keep the logic simple, or include it to make the gateway a bit more intelligent.
One note: right now the logic to fetch cluster stats and the logic to set health state live together, in ClusterStatsJdbcMonitor (lines 85 to 87 in b616d40) and ClusterStatsHttpMonitor (lines 70 to 77 in b616d40). If we are storing health state in the DB, that means we will need to decouple setting health from setting RUNNING/QUEUED queries.
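A minimal sketch of that decoupling; these interface names are invented for illustration, not existing trino-gateway types:

```java
// Today the ClusterStats*Monitor classes derive health while fetching stats.
// Storing health in the DB pushes toward separating the two concerns, e.g.:
interface HealthMonitor
{
    // pure reachability/health probe, persisted to the shared health table
    boolean isHealthy(String backendName);
}

interface QueryStatsMonitor
{
    // RUNNING/QUEUED counts only, kept in memory per gateway instance
    QueryStats fetchQueryStats(String backendName);

    record QueryStats(int runningQueries, int queuedQueries) {}
}
```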
Another note: once the admin deactivates a Trino backend, trino-gateway should clear that backend's state from the shared DB and from its local cache.
Sounds all good except for one thing: I am not so sure about creating a new table.
Yes, we could do that.
I have the same concern as @rdsarvar. The following happened in our production environment.
If we have a global health status in the database, Gateway-3 would continuously set Trino-1 to failed, but the truth is that only the connection from Gateway-3 to Trino-1 failed. It will also be hard to find out which gateway updated the record. I would rather have per-gateway information on the gateway UI, so that I can quickly figure out that the issue is on Gateway-3.
I'm leaning towards @oneonestar's approach for this issue, tbh. On second thought, the results from having the gateway check the DB for Trino states could be misleading. For example, if the gateway cannot reach Trino due to a network issue but still reports Trino as healthy, because it is only checking the DB without actually interacting with Trino, it could route queries to the unreachable Trino. @willmostly
How would this work for different gateway servers when we log in to the web UI?
so that the admin will know which server has the issue, and when routing to the adhoc group, it will not route to Gateway-3.
That won't work well in a dynamic environment such as K8s with auto scaling. We can detect the problem by exposing health status on a monitoring endpoint, then use Datadog / Prometheus to monitor all gateway instances and send an alert if anything goes wrong.
@oneonestar are you proposing the following:
Yes.
IMO, JMX metrics and a JMX endpoint should be used.
Does that mean we need to start another JMX server?
Airlift and Dropwizard support JMX. We only need to export the status as an MBean.
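For example, with jmxutils (the JMX library Airlift builds on), exporting health state can be as small as annotating a getter; the class, map, and attribute names below are hypothetical:

```java
import org.weakref.jmx.Managed;

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class BackendHealthStats
{
    // health of each backend as observed by THIS gateway instance
    private final ConcurrentMap<String, Boolean> healthByBackend = new ConcurrentHashMap<>();

    public void record(String backendName, boolean healthy)
    {
        healthByBackend.put(backendName, healthy);
    }

    @Managed // exposed as a JMX attribute; readable via jconsole, jmx_exporter, etc.
    public long getUnhealthyBackendCount()
    {
        return healthByBackend.values().stream().filter(healthy -> !healthy).count();
    }
}
```

The object still has to be registered with the platform MBean server (in Airlift-based apps typically via jmxutils' Guice export bindings), but no separate JMX server process is required.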
Sorry for commenting and disappearing. Thanks @oneonestar for diagramming the potential issue; it's definitely clearer than my example 😄 To me, the current coupling (not the suggested design) is a blocker for production deployments, which makes me lean towards moving forward with the simplest solution, with the option to extend it in the future.
+1 ^^
mixed with
I +1 the simplicity of the above to keep health checking per pod; IMO the above is good enough for now, as you get:
The only thing I would personally prefer is if we leveraged OpenMetrics, so it could be scraped directly by Prometheus instead of requiring jmx_exporter 😄 The drawback of the above is that, to holistically see the per-pod health for each backend, you wouldn't be able to use the UI; you'd be relying on a separate tool sitting on top of your data points, like Grafana, which seems to somewhat counter the original problem statement. (Unless you pulled the data points back in from your time-series datastore, but that feels a bit overkill. Edit: I guess you could also set up the pods to ping each other for that info... but I feel like that's adding too much complexity for what it gets you. There are many ways to approach this portion, I suppose.)
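A minimal sketch of the OpenMetrics route using the Prometheus Java simpleclient; the library choice, port, and metric name are all assumptions for illustration, not something the project has settled on:

```java
import io.prometheus.client.Gauge;
import io.prometheus.client.exporter.HTTPServer;

import java.io.IOException;

public class HealthMetricsExporter
{
    // one time series per backend: 1 = healthy, 0 = unhealthy, as seen by this pod
    private static final Gauge BACKEND_HEALTHY = Gauge.build()
            .name("trino_gateway_backend_healthy") // hypothetical metric name
            .help("Backend health as observed by this gateway instance")
            .labelNames("backend")
            .register();

    public static void main(String[] args) throws IOException
    {
        // serves /metrics so Prometheus can scrape each pod directly, no jmx_exporter
        HTTPServer server = new HTTPServer(9090);
        BACKEND_HEALTHY.labels("trino-1").set(1);
        BACKEND_HEALTHY.labels("trino-2").set(0);
    }
}
```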
I have finally read all the discussion here. Thank you all for your thoughts. Here are some of mine.

I agree that we should not move the info into the shared DB without a quorum mechanism or similar. That would mean you have to have at least 3 Trino Gateway servers. Ideally we would also track the status of each cluster from each Trino Gateway instance; however, given the current simple requirement, this feels like overkill. The proposal to track the status of each backend cluster in the memory of each Trino Gateway instance locally is sufficient for now.

Ideally we expose this in the UI, and since the UI is included with each instance, as in the default deployment now, you should be able to look at the UI of the individual instances, see their perceived backend status for the Trino clusters, and diagnose with that. This should be sufficient for now. What we probably want to do is have some sort of node ID or instance name, display that in the UI, and pull it in from the configuration somehow; this notion exists in Trino already as well. Over time we can maybe implement storing those states in the DB in addition, and then show all Trino Gateway instances and their perceived backends in the UI, but that can wait.

I also share the wish to use OpenMetrics rather than JMX. The good thing is we are getting OpenMetrics capability soon, since we are moving to Airlift. But we can still use JMX in addition, if desired; Trino itself does both as well.

One question I have for @willmostly and others: I assume the idea with multiple Trino Gateway instances is to have another load balancer in front of them that uses DNS/FQDN for load balancing. This will have to be a sticky config per session, otherwise you could get into weird issues where the different Trino Gateway instances have potentially different Trino cluster statuses, and queries therefore might get routed by one instance but not another. Assuming we do want to support that deployment model (I think we do), we should document this all better.
Problem: currently the health state of each Trino cluster is stored in memory in trino-gateway. This is a problem because there can be multiple instances of trino-gateway, and their Trino cluster states can become inconsistent. This can lead to different information being displayed on the gateway UI.
Solution: create a DB table to store the health state of the Trino clusters for ALL instances of trino-gateway. The DB will have a table named `backend_state`, with the schema `(name: String, is_healthy: boolean, last_update_time: timestamp)`.
Each trino-gateway instance should have a different `taskDelayMin`, as defined in `trino-gateway/gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/ActiveClusterMonitor.java` (line 35 in 9bbf62c). (Note: currently this value is not configurable, but we will create a PR to make it configurable.)
Then, once trino-gateway is up, it will get the health state from the DB before checking with the Trino clusters at intervals of `taskDelayMin`. If the `last_update_time` is older than X amount of time (this will also be configurable) or if `is_healthy=false`, then trino-gateway will get the health status from the Trino backend and store the result in the DB. Otherwise, it will just use the state from the DB.

If trino-gateway sees that the health state is unhealthy, it will maintain in-memory state in which it makes Y tries (again, configurable) against the Trino backend. Trino-gateway will only treat the Trino backend as healthy AFTER it has Z successes (again, configurable) from the Trino backend. After it reaches Z successes, trino-gateway will update that Trino backend's record in the DB.
In the beginning, all Trino backends will start as `PENDING` in the DB.
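A rough sketch of the proposed check loop under the names above. Nothing here exists in trino-gateway yet: `BackendStateDao`, `BackendState`, and `probeTrino()` are placeholders, and the initial `PENDING` state is omitted from this boolean sketch.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the proposal; invoked for each backend every taskDelayMin.
public class ProposedHealthCheck
{
    private final BackendStateDao dao;
    private final Duration maxRecordAge;    // "X": staleness threshold for DB records
    private final int maxTries;             // "Y": probe attempts per unhealthy backend
    private final int requiredSuccesses;    // "Z": successes needed before flipping to healthy
    private final Map<String, Integer> successStreak = new ConcurrentHashMap<>(); // local only

    public ProposedHealthCheck(BackendStateDao dao, Duration maxRecordAge, int maxTries, int requiredSuccesses)
    {
        this.dao = dao;
        this.maxRecordAge = maxRecordAge;
        this.maxTries = maxTries;
        this.requiredSuccesses = requiredSuccesses;
    }

    public void check(String backend)
    {
        BackendState state = dao.load(backend);
        boolean stale = state.lastUpdateTime().isBefore(Instant.now().minus(maxRecordAge));
        if (state.isHealthy()) {
            if (!stale) {
                return; // fresh healthy record in the shared DB: trust it, skip the probe
            }
            dao.save(backend, probeTrino(backend), Instant.now()); // stale record: refresh it
            return;
        }
        // marked unhealthy: require Z consecutive successes before flipping back
        for (int attempt = 0; attempt < maxTries; attempt++) {
            if (!probeTrino(backend)) {
                successStreak.put(backend, 0);
                continue;
            }
            if (successStreak.merge(backend, 1, Integer::sum) >= requiredSuccesses) {
                dao.save(backend, true, Instant.now()); // only now report healthy to the DB
                return;
            }
        }
        dao.save(backend, false, Instant.now()); // still unhealthy after Y tries
    }

    private boolean probeTrino(String backend)
    {
        throw new UnsupportedOperationException("probe the Trino backend's API here");
    }

    public record BackendState(String name, boolean isHealthy, Instant lastUpdateTime) {}

    public interface BackendStateDao
    {
        BackendState load(String name);

        void save(String name, boolean healthy, Instant updateTime);
    }
}
```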