Trino gateway health state design #222
Comments
Spitballing a scenario here: what happens in the case that Kube/CoreDNS (or anything node-specific, tbh) fails to resolve FQDNs for one of the N nodes? Would this result in a total outage, as the single node running Gateway would continuously set all backends to failed in the central store?
No, the gateway will use the API provided by Trino to determine Trino's state. In the case where FQDNs fail to resolve, it should result in a different error.
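For context on what probing Trino directly can look like: Trino exposes basic server info over HTTP at `/v1/info`. The sketch below is a minimal reachability check using the JDK HTTP client; the endpoint is real Trino API, but whether trino-gateway consumes exactly this endpoint and field is an assumption here.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class TrinoProbe
{
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();

    // Returns true only if the coordinator is reachable and reports it has
    // finished starting; DNS failures, timeouts, and refused connections all
    // surface as exceptions and map to false rather than "backend failed".
    static boolean isHealthy(String baseUrl)
    {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/v1/info"))
                    .timeout(Duration.ofSeconds(5))
                    .GET()
                    .build();
            HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            // crude string check to keep the sketch dependency-free
            return response.statusCode() == 200 && response.body().contains("\"starting\":false");
        }
        catch (Exception e) {
            return false;
        }
    }
}
```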
Please see https://lucid.app/lucidchart/1058db28-d6f7-4d35-95db-1231e295b989/edit?viewport_loc=-128%2C270%2C2003%2C1154%2C0_0&invitationId=inv_e2f1449c-560e-458d-81e2-cffc0d2b98e7 for the flow chart of this design.
This part is not included in the flow chart because I'm not sure it is still relevant. We can omit it to keep the logic simple, or include it to make the gateway a bit more intelligent.
One note: right now the logic to fetch cluster stats and the logic to set health state live together, in ClusterStatsJdbcMonitor (lines 85 to 87 in b616d40) and ClusterStatsHttpMonitor (lines 70 to 77 in b616d40). If we are storing health state in the DB, that means we will need to decouple setting health from setting RUNNING/QUEUED queries.
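A minimal sketch of that decoupling; these interface names are invented for illustration, not existing trino-gateway types:

```java
// Today the ClusterStats*Monitor classes derive health while fetching stats.
// Storing health in the DB pushes toward separating the two concerns, e.g.:
interface HealthMonitor
{
    // pure reachability/health probe, persisted to the shared health table
    boolean isHealthy(String backendName);
}

interface QueryStatsMonitor
{
    // RUNNING/QUEUED counts only, kept in memory per gateway instance
    QueryStats fetchQueryStats(String backendName);

    record QueryStats(int runningQueries, int queuedQueries) {}
}
```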
Another note: once the admin deactivates a Trino backend, trino-gateway should clear that backend's state from the shared DB and from its local cache.
Sounds all good except for one thing: I am not so sure about creating a new table.
Yes, we could do that.
I have the same concern as @rdsarvar. The following happened in our production environment.
If we have a global health status in the database, Gateway-3 would continuously set Trino-1 to failed, but the truth is that only the connection from Gateway-3 to Trino-1 failed. It will also be hard to find out which gateway updated the record. I would rather have per-gateway information on the gateway UI, so that I can quickly figure out that the issue is on Gateway-3.
I'm leaning towards @oneonestar's approach for this issue, tbh. On second thought, the results from having the gateway check the DB for Trino states could be misleading. For example, if the gateway cannot reach Trino due to a network issue but still reports Trino as healthy, because it is only checking the DB without actually interacting with Trino, it could route queries to the unreachable Trino. @willmostly
How would this work for different gateway servers when we log in to the web UI?
so that the admin will know which server has the issue, and when routing to the adhoc group, it will not route to Gateway-3.
That won't work well in a dynamic environment such as K8s with auto scaling. We can detect the problem by exposing health status on a monitoring endpoint, then use Datadog / Prometheus to monitor all gateway instances and send an alert if anything goes wrong.
@oneonestar are you proposing the following:
Yes.
IMO, JMX metrics and a JMX endpoint should be used.
Does that mean we need to start another JMX server?
Airlift and Dropwizard support JMX. We only need to export the status as an MBean.
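For example, with jmxutils (the JMX library Airlift builds on), exporting health state can be as small as annotating a getter; the class, map, and attribute names below are hypothetical:

```java
import org.weakref.jmx.Managed;

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class BackendHealthStats
{
    // health of each backend as observed by THIS gateway instance
    private final ConcurrentMap<String, Boolean> healthByBackend = new ConcurrentHashMap<>();

    public void record(String backendName, boolean healthy)
    {
        healthByBackend.put(backendName, healthy);
    }

    @Managed // exposed as a JMX attribute; readable via jconsole, jmx_exporter, etc.
    public long getUnhealthyBackendCount()
    {
        return healthByBackend.values().stream().filter(healthy -> !healthy).count();
    }
}
```

The object still has to be registered with the platform MBean server (in Airlift-based apps typically via jmxutils' Guice export bindings), but no separate JMX server process is required.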
Sorry for commenting and disappearing. Thanks @oneonestar for diagramming the potential issue; it's definitely clearer than my example 😄 To me, the current coupling (not the suggested design) is a blocker for production deployments, which makes me lean towards moving forward with the simplest solution, with the option to extend it in the future.
+1 ^^
mixed with
I +1 the simplicity of the above to keep health checking per pod; IMO the above is good enough for now, as you get:
The only thing I would personally prefer is if we leveraged OpenMetrics, so it could be scraped directly by Prometheus instead of requiring jmx_exporter 😄 The drawback of the above is that, to holistically see the per-pod health for each backend, you wouldn't be able to use the UI; you'd be relying on a separate tool sitting on top of your data points, like Grafana, which seems to somewhat counter the original problem statement. (Unless you pulled the data points back in from your time-series datastore, but that feels a bit overkill. Edit: I guess you could also set up the pods to ping each other for that info... but I feel like that's adding too much complexity for what it gets you. There are many ways to approach this portion, I suppose.)
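A minimal sketch of the OpenMetrics route using the Prometheus Java simpleclient; the library choice, port, and metric name are all assumptions for illustration, not something the project has settled on:

```java
import io.prometheus.client.Gauge;
import io.prometheus.client.exporter.HTTPServer;

import java.io.IOException;

public class HealthMetricsExporter
{
    // one time series per backend: 1 = healthy, 0 = unhealthy, as seen by this pod
    private static final Gauge BACKEND_HEALTHY = Gauge.build()
            .name("trino_gateway_backend_healthy") // hypothetical metric name
            .help("Backend health as observed by this gateway instance")
            .labelNames("backend")
            .register();

    public static void main(String[] args) throws IOException
    {
        // serves /metrics so Prometheus can scrape each pod directly, no jmx_exporter
        HTTPServer server = new HTTPServer(9090);
        BACKEND_HEALTHY.labels("trino-1").set(1);
        BACKEND_HEALTHY.labels("trino-2").set(0);
    }
}
```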
I have finally read all the discussion here. Thank you all for your thoughts. Here are some of mine.

I agree that we should not move the info into the shared DB without a quorum mechanism or similar. That would mean you have to have at least 3 Trino Gateway servers. Ideally we would also track the status of each cluster from each Trino Gateway instance; however, given the current simple requirement, this feels like overkill. The proposal to track the status of each backend cluster in the memory of each Trino Gateway instance locally is sufficient for now.

Ideally we expose this in the UI, and since the UI is included with each instance, as in the default deployment now, you should be able to look at the UI of the individual instances, see their perceived backend status for the Trino clusters, and diagnose with that. This should be sufficient for now. What we probably want to do is have some sort of node ID or instance name, display that in the UI, and pull it in from the configuration somehow; this notion exists in Trino already as well. Over time we can maybe implement storing those states in the DB in addition, and then show all Trino Gateway instances and their perceived backends in the UI, but that can wait.

I also share the wish to use OpenMetrics rather than JMX. The good thing is we are getting OpenMetrics capability soon, since we are moving to Airlift. But we can still use JMX in addition, if desired; Trino itself does both as well.

One question I have for @willmostly and others: I assume the idea with multiple Trino Gateway instances is to have another load balancer in front of them that uses DNS/FQDN for load balancing. This will have to be a sticky config per session, otherwise you could get into weird issues where the different Trino Gateway instances have potentially different Trino cluster statuses, and queries therefore might get routed by one instance but not another. Assuming we do want to support that deployment model (I think we do), we should document this all better.
Problem: currently the health state of each Trino cluster is stored in memory in trino-gateway. This is a problem because there can be multiple instances of trino-gateway, and their Trino cluster states can become inconsistent. This can lead to different information being displayed on the gateway UI.
Solution: create a DB table to store the health state of the Trino clusters for ALL instances of trino-gateway. The DB will have a table named `backend_state`, with the schema `(name: String, is_healthy: boolean, last_update_time: timestamp)`.
Each trino-gateway instance should have a different `taskDelayMin`, as defined in `trino-gateway/gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/ActiveClusterMonitor.java` (line 35 in 9bbf62c). (Note: currently this value is not configurable, but we will create a PR to make it configurable.)
Then, once trino-gateway is up, it will get the health state from the DB before checking with the Trino clusters at intervals of `taskDelayMin`. If the `last_update_time` is older than X amount of time (this will also be configurable) or if `is_healthy=false`, then trino-gateway will get the health status from the Trino backend and store the result in the DB. Otherwise, it will just use the state from the DB.

If trino-gateway sees that the health state is unhealthy, it will maintain in-memory state in which it makes Y tries (again, configurable) against the Trino backend. Trino-gateway will only treat the Trino backend as healthy AFTER it has Z successes (again, configurable) from the Trino backend. After it reaches Z successes, trino-gateway will update that Trino backend's record in the DB.
In the beginning, all Trino backends will start as `PENDING` in the DB.
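A rough sketch of the proposed check loop under the names above. Nothing here exists in trino-gateway yet: `BackendStateDao`, `BackendState`, and `probeTrino()` are placeholders, and the initial `PENDING` state is omitted from this boolean sketch.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the proposal; invoked for each backend every taskDelayMin.
public class ProposedHealthCheck
{
    private final BackendStateDao dao;
    private final Duration maxRecordAge;    // "X": staleness threshold for DB records
    private final int maxTries;             // "Y": probe attempts per unhealthy backend
    private final int requiredSuccesses;    // "Z": successes needed before flipping to healthy
    private final Map<String, Integer> successStreak = new ConcurrentHashMap<>(); // local only

    public ProposedHealthCheck(BackendStateDao dao, Duration maxRecordAge, int maxTries, int requiredSuccesses)
    {
        this.dao = dao;
        this.maxRecordAge = maxRecordAge;
        this.maxTries = maxTries;
        this.requiredSuccesses = requiredSuccesses;
    }

    public void check(String backend)
    {
        BackendState state = dao.load(backend);
        boolean stale = state.lastUpdateTime().isBefore(Instant.now().minus(maxRecordAge));
        if (state.isHealthy()) {
            if (!stale) {
                return; // fresh healthy record in the shared DB: trust it, skip the probe
            }
            dao.save(backend, probeTrino(backend), Instant.now()); // stale record: refresh it
            return;
        }
        // marked unhealthy: require Z consecutive successes before flipping back
        for (int attempt = 0; attempt < maxTries; attempt++) {
            if (!probeTrino(backend)) {
                successStreak.put(backend, 0);
                continue;
            }
            if (successStreak.merge(backend, 1, Integer::sum) >= requiredSuccesses) {
                dao.save(backend, true, Instant.now()); // only now report healthy to the DB
                return;
            }
        }
        dao.save(backend, false, Instant.now()); // still unhealthy after Y tries
    }

    private boolean probeTrino(String backend)
    {
        throw new UnsupportedOperationException("probe the Trino backend's API here");
    }

    public record BackendState(String name, boolean isHealthy, Instant lastUpdateTime) {}

    public interface BackendStateDao
    {
        BackendState load(String name);

        void save(String name, boolean healthy, Instant updateTime);
    }
}
```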