
Trino gateway health state design #222

Open
andythsu opened this issue Jan 30, 2024 · 17 comments
Comments

@andythsu
Member

andythsu commented Jan 30, 2024

Problem: Currently the health state of each Trino cluster is stored in memory in trino-gateway. Because there can be multiple instances of trino-gateway, their views of the cluster states can become inconsistent, which can lead to different information being displayed on the gateway UI.

Solution: Create a DB to store the health state of the Trino clusters for ALL instances of trino-gateway. The DB will have a table named backend_state with the schema (name: String, is_healthy: boolean, last_update_time: timestamp).
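
As a rough illustration (not actual Trino Gateway code), one row of the proposed backend_state table could be modeled like this, with fields mirroring the schema above:

    import java.time.Instant;

    // Hypothetical representation of one backend_state row:
    // (name, is_healthy, last_update_time)
    public record BackendState(String name, boolean isHealthy, Instant lastUpdateTime) {}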

Each trino-gateway instance should have a different taskDelayMin, as defined here:

(Note: currently this line is not configurable, but we will create a PR to make it configurable.)

Then, once trino-gateway is up, it will read the health state from the DB before checking with the Trino clusters at intervals of taskDelayMin. If last_update_time is older than X amount of time (this will also be configurable) or if is_healthy=false, then trino-gateway will fetch the health status from the Trino backend and store the result in the DB. Otherwise, it will just use the state from the DB.
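
A minimal sketch of that read-through decision, assuming the hypothetical BackendState record above and an illustrative staleness threshold standing in for the configurable X:

    import java.time.Duration;
    import java.time.Instant;

    // Illustrative only, not actual Trino Gateway code.
    class HealthStateCheck {
        // Re-probe the Trino backend when the stored state is older than the
        // configurable threshold X, or when it is already marked unhealthy;
        // otherwise reuse the value from the DB.
        static boolean shouldProbeBackend(BackendState state, Duration maxStaleness) {
            boolean stale = state.lastUpdateTime().isBefore(Instant.now().minus(maxStaleness));
            return stale || !state.isHealthy();
        }
    }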

If trino-gateway sees that the health state is unhealthy, it will maintain in-memory state and make up to Y attempts (again, configurable) against the Trino backend. Trino-gateway will only treat the Trino backend as healthy AFTER it has Z successes (again, configurable) from that backend. After it reaches Z successes, trino-gateway will update the Trino backend's record in the DB.
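
A hedged sketch of that recovery gate, with Y and Z as the configurable attempt and success counts (the class and method names are made up for illustration):

    // Illustrative only: tracks probe results for one unhealthy backend.
    class RecoveryTracker {
        private final int maxAttempts;        // Y, configurable
        private final int requiredSuccesses;  // Z, configurable
        private int attempts;
        private int consecutiveSuccesses;

        RecoveryTracker(int maxAttempts, int requiredSuccesses) {
            this.maxAttempts = maxAttempts;
            this.requiredSuccesses = requiredSuccesses;
        }

        // Record one probe result; returns true once the backend has enough
        // consecutive successes to be marked healthy and updated in the DB.
        boolean recordProbe(boolean success) {
            attempts++;
            consecutiveSuccesses = success ? consecutiveSuccesses + 1 : 0;
            return consecutiveSuccesses >= requiredSuccesses;
        }

        // True once the configured number of attempts has been used up.
        boolean attemptsExhausted() {
            return attempts >= maxAttempts;
        }
    }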

Initially, all Trino backends will start in the PENDING state in the DB.

@andythsu andythsu changed the title Trino gateway health states design Trino gateway health state design Jan 30, 2024
@rdsarvar
Contributor

rdsarvar commented Feb 22, 2024

Spitballing a scenario here -

What happens in the case that Kube/Core DNS (or anything node specific tbh) fails to resolve FQDNs for one of the N nodes? Would this result in total outage as the single node running Gateway would continuously set all backends to failed in the central store?

@andythsu
Member Author

andythsu commented May 1, 2024

Spitballing a scenario here -

What happens in the case that Kube/Core DNS (or anything node specific tbh) fails to resolve FQDNs for one of the N nodes? Would this result in total outage as the single node running Gateway would continuously set all backends to failed in the central store?

No, the gateway will use the API provided by Trino to determine Trino's state. In the case where FQDN resolution fails, it should result in a different error.

@andythsu
Member Author

andythsu commented May 1, 2024

Please see https://lucid.app/lucidchart/1058db28-d6f7-4d35-95db-1231e295b989/edit?viewport_loc=-128%2C270%2C2003%2C1154%2C0_0&invitationId=inv_e2f1449c-560e-458d-81e2-cffc0d2b98e7 for the flow chart of this design.

If trino-gateway sees that the health state is unhealthy, it will maintain in-memory state and make up to Y attempts (again, configurable) against the Trino backend. Trino-gateway will only treat the Trino backend as healthy AFTER it has Z successes (again, configurable) from that backend. After it reaches Z successes, trino-gateway will update the Trino backend's record in the DB.

This part is not included in the flow chart because I'm not sure it is still relevant. We can omit it to keep the logic simple, or include it to make the gateway a bit more intelligent.

@andythsu
Member Author

andythsu commented May 1, 2024

One note: right now the logic that fetches RUNNING and QUEUED query counts is coupled with setting the healthy state. See

.healthy(true)
.queuedQueryCount(partialState.getOrDefault("QUEUED", 0))
.runningQueryCount(partialState.getOrDefault("RUNNING", 0))
for ClusterStatsJdbcMonitor and
clusterStats
.numWorkerNodes(activeWorkers)
.queuedQueryCount((int) result.get("queuedQueries"))
.runningQueryCount((int) result.get("runningQueries"))
.healthy(activeWorkers > 0)
.proxyTo(backend.getProxyTo())
.externalUrl(backend.getExternalUrl())
.routingGroup(backend.getRoutingGroup());
for ClusterStatsHttpMonitor

If we are storing health state in the DB, we will need to decouple setting health from setting the RUNNING/QUEUED query counts.
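
One hypothetical way to decouple them (the interface names below are illustrative, not the existing monitor classes) would be to split health determination from query-count collection:

    // Illustrative only: health could then be persisted or shared on its own,
    // while RUNNING/QUEUED counts stay local to each gateway instance.
    interface ClusterHealthMonitor {
        boolean isHealthy(String backendName);
    }

    interface ClusterQueryStatsMonitor {
        int runningQueryCount(String backendName);
        int queuedQueryCount(String backendName);
    }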

@andythsu
Member Author

andythsu commented May 1, 2024

Another note: once the admin deactivates a Trino backend, trino-gateway should clear that backend's state from the shared DB and from its local cache.

@Chaho12
Member

Chaho12 commented May 1, 2024

Sounds all good except one thing. I am not so sure about creating a new table.
What would be the reasons for not adding 2 fields to gateway_backend table?

@andythsu
Member Author

andythsu commented May 6, 2024

Sounds all good except one thing. I am not so sure about creating a new table. What would be the reasons for not adding 2 fields to gateway_backend table?

Yes, we could do that.

@oneonestar
Member

I have the same concern as @rdsarvar.
Health state should not be global unless we have some kind of quorum mechanism to agree on the status of Trino.

The following happened in our production environment.

                 /-----> Gateway-1 -----> Trino-1
Client -> LoadBalancer-> Gateway-2 -----> Trino-1
                 \-----> Gateway-3 -(x)-> Trino-1 (due to network issue)

If we have a global health status in the database, Gateway-3 would continuously set Trino-1 to failed, but in reality only the connection from Gateway-3 to Trino-1 failed. It would also be hard to find out which gateway updated the record.

I would rather have different information on each gateway UI, so that I can figure out the issue on Gateway-3 quickly.

@andythsu
Member Author

andythsu commented May 9, 2024

I'm leaning towards @oneonestar's approach for this issue, tbh. On second thought, the results from having the gateway check the DB for Trino states could be misleading. For example, if a gateway cannot reach Trino due to a network issue but still reports Trino as healthy because it only checks the DB without actually interacting with Trino, it could route queries to the unreachable Trino. @willmostly

@Chaho12
Member

Chaho12 commented May 9, 2024

I would rather have different information on each gateway UI, so that I can figure out the issue on Gateway-3 quickly.

How would this work for different gateway servers when we log in to the web UI?
How about saving each gateway server's status to the DB and showing green/red buttons for each server, like below?

cluster (group) | Active
adhoc           | gateway 1 (green) | gateway 2 (green) | gateway 3 (red)

so that the admin will know which server has an issue, and when routing to the adhoc group, it will not route through gateway 3

@oneonestar
Member

oneonestar commented May 9, 2024

How about saving each gateway server's status to the DB and showing green/red buttons for each server, like below?

That won't work well in a dynamic environment such as Kubernetes with autoscaling.

We can detect the problem by exposing health status through a monitoring endpoint. Use Datadog/Prometheus to monitor all gateway instances and send alerts if anything goes wrong.

@andythsu
Member Author

andythsu commented May 9, 2024

@oneonestar are you proposing the following:

  • Each gateway instance will check trino clusters' health individually, then store the health state of each cluster locally
  • When routing, filter out unhealthy clusters by checking local health status (see the sketch after this list)
  • Expose an endpoint to a monitoring software that alerts if anything goes wrong
    • I'm thinking we can have /api/public/backends?state={state}
    • for example, we can use /api/public/backends?state=UNHEALTHY to report unhealthy backends to monitoring software
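
A minimal sketch of the second point, filtering by locally tracked health at routing time (the types and names are assumptions, not actual Trino Gateway code):

    import java.util.List;
    import java.util.Map;

    // Illustrative only: each gateway instance keeps its own view of backend
    // health and routes only to backends it currently considers healthy.
    class LocalHealthRouter {
        private final Map<String, Boolean> localHealth; // backend name -> last probed health

        LocalHealthRouter(Map<String, Boolean> localHealth) {
            this.localHealth = localHealth;
        }

        List<String> healthyCandidates(List<String> backendsInRoutingGroup) {
            return backendsInRoutingGroup.stream()
                    .filter(name -> localHealth.getOrDefault(name, false))
                    .toList();
        }
    }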

@oneonestar
Member

Each gateway instance will check trino clusters' health individually, then store the health state of each cluster locally
When routing, filter out unhealthy clusters by checking local health status
Expose an endpoint to a monitoring software that alerts if anything goes wrong

Yes.

I'm thinking we can have /api/public/backends?state={state}
for example, we can use /api/public/backends?state=UNHEALTHY to report unhealthy backends to monitoring software

IMO, JMX metrics and JMX endpoint should be used.

@andythsu
Member Author

andythsu commented May 10, 2024

IMO, JMX metrics and JMX endpoint should be used.

Does that mean we need to start another JMX server?

@oneonestar
Member

Airlift and Dropwizard support JMX. We only need to export the status as an MBean.
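
For illustration only, a plain-JMX version of exporting a backend health flag as a standard MBean could look like the sketch below; with Airlift or Dropwizard the registration would normally go through their own JMX support rather than the platform MBeanServer directly.

    // In BackendHealthMBean.java -- standard MBean naming: interface = implementation class name + "MBean".
    public interface BackendHealthMBean {
        boolean isHealthy();
    }

    // In BackendHealth.java -- hypothetical class, not part of Trino Gateway.
    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public class BackendHealth implements BackendHealthMBean {
        private volatile boolean healthy;

        public void setHealthy(boolean healthy) { this.healthy = healthy; }

        @Override
        public boolean isHealthy() { return healthy; }

        // Registers this bean with the platform MBean server under an illustrative name.
        public static void register(String backendName, BackendHealth bean) throws Exception {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            server.registerMBean(bean,
                    new ObjectName("trino.gateway:type=BackendHealth,name=" + backendName));
        }
    }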

@rdsarvar
Contributor

rdsarvar commented May 21, 2024

Sorry for commenting and disappearing. Thanks @oneonestar for diagramming the potential issue, it's definitely clearer than my example 😄

To me, the current coupling (not the suggested design) is a blocker for production deployments, which makes me lean towards moving forward with the simplest solution, with the option to extend it in the future.

Health state should not be global unless we have some kind of quorum mechanism to agree on the status of Trino.

+1 ^^

@oneonestar are you proposing the following:

  • Each gateway instance will check trino clusters' health individually, then store the health state of each cluster locally

  • When routing, filter out unhealthy clusters by checking local health status

  • Expose an endpoint to a monitoring software that alerts if anything goes wrong

    • I'm thinking we can have /api/public/backends?state={state}
    • for example, we can use /api/public/backends?state=UNHEALTHY to report unhealthy backends to monitoring software

mixed with

Airlift and Dropwizard support JMX. We only need to export the status to MBean.

I +1 the simplicity of the above approach, keeping health checking per pod; IMO it is good enough for now, as you get:

  • self-healing Trino backends
  • alerting on a per gateway<->backend basis (assumption: JMX metrics being ingested by something like Prometheus)
  • time series historical datapoints on backend health (assumption: JMX metrics being ingested by something like Prometheus)

The only thing I would prefer personally is if we leveraged OpenMetrics so it could be scraped directly by Prometheus instead of requiring jmx_exporter 😄

The drawback of the above is that to holistically see the "per-pod health for each backend" you wouldn't be able to use the UI; you'd be relying on a separate tool sitting on top of your datapoints, like Grafana, which seems to somewhat counter the original problem statement. (Unless you pulled the datapoints back in from your time-series datastore, but that feels a bit overkill. Edit: I guess you could also set up the pods to ping each other for that info, but I feel like that's adding too much complexity for what it gets you. There are many ways to approach this portion, I suppose.)

@mosabua
Member

mosabua commented May 22, 2024

I have finally read all the discussion here. Thank you all for your thoughts. Here are some of mine.

I agree that we should not move the info into the shared DB without a quorum mechanism or similar. That would mean you would have to run at least three Trino Gateway servers. Ideally we would also track the status of each cluster from each Trino Gateway instance. However, given the current simple requirements, this feels like overkill.

The proposal to track the status of each backend cluster locally, in the memory of each Trino Gateway instance, is sufficient for now. Ideally we expose this in the UI, and since the UI is included with each instance in the default deployment, you should be able to look at the UI of the individual instances, see their perceived backend status for the Trino clusters, and diagnose with that. This should be sufficient for now. What we probably want to do is have some sort of node ID or instance name, display that in the UI, and pull it in from the configuration somehow; this notion already exists in Trino as well. Over time we can maybe implement storing those states in the DB in addition, and then show all Trino Gateway instances and their perceived backends in the UI, but that can wait.

I also share the wish to use OpenMetrics rather than JMX. The good thing is we are getting OpenMetrics capability soon, since we are moving to Airlift. But we can still use JMX in addition if desired; Trino itself does both as well.

One question I have for @willmostly and others: I assume the idea with multiple Trino Gateway instances is to have another load balancer in front of them that uses DNS/FQDN for load balancing. This will have to be sticky per session, otherwise you could get into weird situations where the different Trino Gateway instances have potentially different Trino cluster statuses, and queries therefore might get routed by one instance but not another. Assuming we do want to support that deployment model (I think we do), we should document this all better.
