Failed worker doesn't handle requests properly on recovery #21735

Closed
sajjoseph opened this issue Apr 28, 2024 · 13 comments · Fixed by #21921 · May be fixed by #21744

Comments

@sajjoseph
Contributor

sajjoseph commented Apr 28, 2024

Environment: AWS EC2 using Graviton
Version: Trino 444
JDK: jdk-21.0.1+12
Jetty: 12.0.8
HTTP mode: HTTP/2

One of the worker processes in our environment failed (due to memory pressure, I believe) and the Trino process on that worker was restarted. Everything seemed normal and no issues were detected during the restart. Once the worker rejoined the cluster, we ran the following query.

SELECT TABLE_CATALOG as TABLE_CAT, TABLE_SCHEMA AS TABLE_SCHEM, TABLE_NAME, TABLE_TYPE FROM <catalog>.information_schema.tables WHERE TABLE_CATALOG = '<catalog_name>' AND TABLE_SCHEMA LIKE '<schema_name>' ESCAPE '\' AND TABLE_NAME LIKE '<table_name>' ESCAPE '\' AND TABLE_TYPE IN ('TABLE', 'VIEW', 'BASE TABLE') ORDER BY TABLE_TYPE, TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME;

This is when we noticed the following:
[Screenshot: bad worker node not doing work properly after restart]

Even though it is a metadata query, all workers join this query in Stage 1, while Stage 2 runs on the coordinator. The coordinator returns the metadata to one of the workers, and the other workers in Stage 1 finish, except the bad worker, which remains in the "PLANNED" state.

At the same time, the worker node logged a series of NullPointerExceptions in its server.log file.

2024-04-28T17:20:37.673Z        INFO    task-management-4       io.trino.execution.SqlTaskManager       Failing abandoned task 20240428_171834_85245_ffcjv.0.2.0
2024-04-28T17:22:16.883Z        INFO    task-management-1       io.trino.execution.SqlTaskManager       Failing abandoned task 20240428_171537_84670_ffcjv.2.9.0
2024-04-28T17:23:10.112Z        WARN    async-http-response-1   org.glassfish.jersey.server.ServerRuntime$Responder     An exception mapping did not successfully produce and processed a response. Logging the exception propagated to the default exception mapper.
java.lang.NullPointerException: Cannot invoke "org.eclipse.jetty.server.Response.getHeaders()" because the return value of "org.eclipse.jetty.ee10.servlet.ServletApiResponse.getResponse()" is null
        at org.eclipse.jetty.ee10.servlet.ServletApiResponse.containsHeader(ServletApiResponse.java:117)
        at jakarta.servlet.http.HttpServletResponseWrapper.containsHeader(HttpServletResponseWrapper.java:68)
        at org.glassfish.jersey.servlet.internal.ResponseWriter.writeResponseStatusAndHeaders(ResponseWriter.java:135)
        at org.glassfish.jersey.server.ServerRuntime$Responder.writeResponse(ServerRuntime.java:650)
        at org.glassfish.jersey.server.ServerRuntime$Responder.processResponse(ServerRuntime.java:387)
        at org.glassfish.jersey.server.ServerRuntime$Responder.process(ServerRuntime.java:377)
        at org.glassfish.jersey.server.ServerRuntime$AsyncResponder$3.run(ServerRuntime.java:913)
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
        at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
        at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
        at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
        at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:266)
        at org.glassfish.jersey.server.ServerRuntime$AsyncResponder.resume(ServerRuntime.java:950)
        at org.glassfish.jersey.server.ServerRuntime$AsyncResponder.resume(ServerRuntime.java:901)
        at io.airlift.jaxrs.AsyncResponseHandler$1.onSuccess(AsyncResponseHandler.java:99)
        at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1137)
        at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:79)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)


2024-04-28T17:23:10.114Z        ERROR   async-http-response-1   org.glassfish.jersey.server.ServerRuntime$Responder     Error while closing the output stream in order to commit response.
java.lang.NullPointerException: Cannot invoke "org.eclipse.jetty.ee10.servlet.ServletContextResponse.isCommitted()" because the return value of "org.eclipse.jetty.ee10.servlet.ServletChannel.getServletContextResponse()" is null
        at org.eclipse.jetty.ee10.servlet.ServletChannelState.isResponseCommitted(ServletChannelState.java:315)
        at org.eclipse.jetty.ee10.servlet.ServletChannel.isCommitted(ServletChannel.java:767)
        at org.eclipse.jetty.ee10.servlet.ServletApiResponse.isCommitted(ServletApiResponse.java:423)
        at org.eclipse.jetty.ee10.servlet.ServletApiResponse.setContentLengthLong(ServletApiResponse.java:348)
        at org.eclipse.jetty.ee10.servlet.ServletApiResponse.setContentLength(ServletApiResponse.java:339)
        at jakarta.servlet.ServletResponseWrapper.setContentLength(ServletResponseWrapper.java:115)
        at org.glassfish.jersey.servlet.internal.ResponseWriter.writeResponseStatusAndHeaders(ResponseWriter.java:122)
        at org.glassfish.jersey.server.ServerRuntime$Responder$1.getOutputStream(ServerRuntime.java:668)
        at org.glassfish.jersey.message.internal.CommittingOutputStream.commitStream(CommittingOutputStream.java:171)
        at org.glassfish.jersey.message.internal.CommittingOutputStream.flushBuffer(CommittingOutputStream.java:276)
        at org.glassfish.jersey.message.internal.CommittingOutputStream.commit(CommittingOutputStream.java:232)
        at org.glassfish.jersey.message.internal.CommittingOutputStream.close(CommittingOutputStream.java:247)
        at org.glassfish.jersey.message.internal.OutboundMessageContext.close(OutboundMessageContext.java:565)
        at org.glassfish.jersey.server.ContainerResponse.close(ContainerResponse.java:403)
        at org.glassfish.jersey.server.ServerRuntime$Responder.writeResponse(ServerRuntime.java:750)
        at org.glassfish.jersey.server.ServerRuntime$Responder.processResponse(ServerRuntime.java:387)
        at org.glassfish.jersey.server.ServerRuntime$Responder.processResponseWithDefaultExceptionMapper(ServerRuntime.java:634)
        at org.glassfish.jersey.server.ServerRuntime$Responder.process(ServerRuntime.java:460)
        at org.glassfish.jersey.server.ServerRuntime$AsyncResponder$3.run(ServerRuntime.java:915)
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
        at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
        at org.glassfish.jersey.internal.Errors.process(Errors.java:274)

Finally, the above query fails with a PAGE_TRANSPORT_TIMEOUT error against the bad worker.
Stack trace:

	at io.trino.operator.HttpPageBufferClient$1.onFailure(HttpPageBufferClient.java:505)
	at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1130)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: java.lang.RuntimeException: java.util.concurrent.TimeoutException: Total timeout 10000 ms elapsed
	at io.airlift.http.client.ResponseHandlerUtils.propagate(ResponseHandlerUtils.java:25)
	at io.trino.operator.HttpPageBufferClient$PageResponseHandler.handleException(HttpPageBufferClient.java:665)
	at io.trino.operator.HttpPageBufferClient$PageResponseHandler.handleException(HttpPageBufferClient.java:652)
	at io.airlift.http.client.jetty.JettyResponseFuture.failed(JettyResponseFuture.java:152)
	at io.airlift.http.client.jetty.BufferingResponseListener.onComplete(BufferingResponseListener.java:84)
	at org.eclipse.jetty.client.transport.ResponseListeners.notifyComplete(ResponseListeners.java:350)
	at org.eclipse.jetty.client.transport.ResponseListeners.lambda$addCompleteListener$7(ResponseListeners.java:335)
	at org.eclipse.jetty.client.transport.ResponseListeners.notifyComplete(ResponseListeners.java:350)
	at org.eclipse.jetty.client.transport.ResponseListeners.notifyComplete(ResponseListeners.java:342)
	at org.eclipse.jetty.client.transport.HttpExchange.notifyFailureComplete(HttpExchange.java:307)
	at org.eclipse.jetty.client.transport.HttpExchange.abort(HttpExchange.java:278)
	at org.eclipse.jetty.client.transport.HttpConversation.abort(HttpConversation.java:162)
	at org.eclipse.jetty.client.transport.HttpRequest.abort(HttpRequest.java:795)
	at org.eclipse.jetty.client.transport.HttpDestination$RequestTimeouts.onExpired(HttpDestination.java:596)
	at org.eclipse.jetty.client.transport.HttpDestination$RequestTimeouts.onExpired(HttpDestination.java:579)
	at org.eclipse.jetty.io.CyclicTimeouts.onTimeoutExpired(CyclicTimeouts.java:113)
	at org.eclipse.jetty.io.CyclicTimeouts$Timeouts.onTimeoutExpired(CyclicTimeouts.java:206)
	at org.eclipse.jetty.io.CyclicTimeout$Wakeup.run(CyclicTimeout.java:294)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	... 3 more
Caused by: java.util.concurrent.TimeoutException: Total timeout 10000 ms elapsed
	... 11 more

I cannot restart the coordinator to see if that would let the restarted worker node stay healthy. The NPE in the log file is a little suspicious. If I perform a graceful shutdown on a healthy worker and restart it, it doesn't suffer from the above behavior. All workers are configured identically. I even rebooted the bad worker node to see if it would become healthy, and it didn't.

I wonder whether the coordinator or other workers retain some reference to past failed queries/tasks and request an update as soon as the bad worker rejoins the cluster, since it keeps its original IP address.

@wendigo - please see if the above scenario makes sense to you. I suspect that some stale state about this worker node is still lurking in the cluster, so any query that uses it fails.

@sopel39, @findepi - Have you seen similar issues?

@wendigo
Contributor

wendigo commented Apr 28, 2024

HTTP/2 is not really tested or supported (at least for now). We've been reporting similar issues to the Jetty team and they are steadily being fixed. An NPE usually means "use after free", which suggests that the request/response was already recycled but some part of the code still tries to access or write to it. This can happen if the Jetty client aborts the request due to a timeout, for example. There are a bunch of fixes for these scenarios that will be part of the next Jetty release, which we will update to shortly after it is out.
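
To illustrate that failure mode, here is a minimal, hypothetical sketch (assumed names; this is not Trino, Airlift, or Jetty code) of the pattern: an asynchronous completion callback touches a servlet response after the exchange has already been aborted and its underlying Jetty objects recycled, which is exactly where calls such as ServletApiResponse.containsHeader() surface as NPEs.

import jakarta.servlet.AsyncContext;
import jakarta.servlet.http.HttpServlet;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;

import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of the "use after free" pattern: the async callback runs after
// the request may already have been aborted and recycled by the container.
public class UseAfterRecycleSketch
        extends HttpServlet
{
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
    {
        AsyncContext async = request.startAsync();
        CompletableFuture
                .supplyAsync(() -> "result") // stands in for waiting on long-running task results
                .whenComplete((result, failure) -> {
                    // If the client (or a timeout) aborted the exchange in the meantime, Jetty may
                    // have recycled the request/response pair. Writing to the response here is the
                    // kind of access that shows up as the NPEs in ServletApiResponse above.
                    response.setContentLength(result == null ? 0 : result.length());
                    async.complete();
                });
    }
}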

@sajjoseph
Contributor Author

Thanks! Are you considering Jetty 12.0.9 or higher?

Jetty 12.0.9 is out and Airlift 247-SNAPSHOT is still configured to use 12.0.8. I am using 12.0.8 and still seeing the above issue. Are we planning to upgrade to 12.0.9 or its next release?

@wendigo
Contributor

wendigo commented Apr 29, 2024

@sajjoseph next release

@wendigo
Contributor

wendigo commented Apr 29, 2024

@sajjoseph actually Jetty 12.0.9 is not out yet. It's tagged, but the artifacts are not published yet.

@findepi
Member

findepi commented Apr 29, 2024

cc @losipiuk

@sajjoseph
Contributor Author

@wendigo - I tried the 12.0.9 snapshot version and it looks like the Jetty problems are fixed now. I don't see the above exceptions anymore.

But I still see that the restarted worker node is not doing any work: the task assigned to that worker stays in the "PLANNED" state. All other workers finished their work.

Eventually, the query fails with the following PAGE_TRANSPORT_TIMEOUT exception:

io.trino.operator.PageTransportTimeoutException: Encountered too many errors talking to a worker node. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes. (https://<worker_ip>:8443/v1/task/20240505_025223_07108_qk3nn.26.10.0/results/0/0 - 30 failures, failure duration 302.86s, total failed request time 312.86s)
	at io.trino.operator.HttpPageBufferClient$1.onFailure(HttpPageBufferClient.java:505)
	at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1130)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: java.lang.RuntimeException: java.util.concurrent.TimeoutException: Total timeout 10000 ms elapsed
	at io.airlift.http.client.ResponseHandlerUtils.propagate(ResponseHandlerUtils.java:25)
	at io.trino.operator.HttpPageBufferClient$PageResponseHandler.handleException(HttpPageBufferClient.java:665)
	at io.trino.operator.HttpPageBufferClient$PageResponseHandler.handleException(HttpPageBufferClient.java:652)
	at io.airlift.http.client.jetty.JettyResponseFuture.failed(JettyResponseFuture.java:152)
	at io.airlift.http.client.jetty.BufferingResponseListener.onComplete(BufferingResponseListener.java:84)
	at org.eclipse.jetty.client.transport.ResponseListeners.notifyComplete(ResponseListeners.java:350)
	at org.eclipse.jetty.client.transport.ResponseListeners.lambda$addCompleteListener$7(ResponseListeners.java:335)
	at org.eclipse.jetty.client.transport.ResponseListeners.notifyComplete(ResponseListeners.java:350)
	at org.eclipse.jetty.client.transport.ResponseListeners.notifyComplete(ResponseListeners.java:342)
	at org.eclipse.jetty.client.transport.HttpExchange.notifyFailureComplete(HttpExchange.java:307)
	at org.eclipse.jetty.client.transport.HttpExchange.abort(HttpExchange.java:278)
	at org.eclipse.jetty.client.transport.HttpConversation.abort(HttpConversation.java:162)
	at org.eclipse.jetty.client.transport.HttpRequest.abort(HttpRequest.java:795)
	at org.eclipse.jetty.client.transport.HttpDestination$RequestTimeouts.onExpired(HttpDestination.java:596)
	at org.eclipse.jetty.client.transport.HttpDestination$RequestTimeouts.onExpired(HttpDestination.java:579)
	at org.eclipse.jetty.io.CyclicTimeouts.onTimeoutExpired(CyclicTimeouts.java:113)
	at org.eclipse.jetty.io.CyclicTimeouts$Timeouts.onTimeoutExpired(CyclicTimeouts.java:206)
	at org.eclipse.jetty.io.CyclicTimeout$Wakeup.run(CyclicTimeout.java:294)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	... 3 more
Caused by: java.util.concurrent.TimeoutException: Total timeout 10000 ms elapsed
	... 11 more
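
For context on the timing in that message: each individual request fails after its own 10-second timeout ("Total timeout 10000 ms elapsed"), and the coordinator keeps retrying until an accumulated error budget of roughly five minutes is exhausted ("30 failures, failure duration 302.86s"), which is why the query only dies after about 300 seconds. A rough sketch of that failure-budget pattern, with assumed names rather than the real HttpPageBufferClient internals:

import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of a failure budget: individual request failures are retried
// until the accumulated failure duration exceeds a fixed budget, then the query fails.
public class FailureBudgetSketch
{
    private static final Duration MAX_ERROR_DURATION = Duration.ofMinutes(5);

    interface PageSource
    {
        // returns true once all pages have been fetched; throws when a request fails
        boolean fetchNextPage() throws Exception;
    }

    public static void drain(PageSource source)
    {
        Instant firstFailure = null;
        while (true) {
            try {
                if (source.fetchNextPage()) {
                    return;
                }
                firstFailure = null; // any successful request resets the budget
            }
            catch (Exception e) {
                if (firstFailure == null) {
                    firstFailure = Instant.now();
                }
                if (Duration.between(firstFailure, Instant.now()).compareTo(MAX_ERROR_DURATION) > 0) {
                    throw new RuntimeException("Encountered too many errors talking to a worker node", e);
                }
                // otherwise back off and retry
            }
        }
    }
}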

@findepi / @losipiuk / @wendigo
If you can guide me to the code area where I can debug this further (to troubleshoot why the worker is stuck in the PLANNED state), I would really appreciate it. Is this a coordinator issue or a worker issue? Since the worker was restarted, it has no recollection of past failures. I suspect that the coordinator is not triggering something, which blocks the worker from starting the task.
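
In the meantime, one way I can think of to narrow down coordinator vs. worker is to ask the stuck worker directly what it knows about the task the coordinator shows as PLANNED; the /v1/task/{taskId} resource is the one the coordinator itself polls (the same path appears in the failing results URL above). A rough probe sketch; host, port, and task id are placeholders, and TLS/internal authentication on the cluster may require extra configuration or block external calls entirely:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical probe: query the worker's own task status resource for the task id the
// coordinator reports as PLANNED. Host, port, and task id are placeholders.
public class TaskStatusProbe
{
    public static void main(String[] args)
            throws Exception
    {
        URI uri = URI.create("https://worker-host:8443/v1/task/20240505_025223_07108_qk3nn.26.10.0/status");
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // A 404 would suggest the worker never received the task at all (coordinator/scheduler
        // side); if the worker does know the task, its reported state shows whether the problem
        // is on the worker itself.
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}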

Looking forward to your assistance.
Thanks!

@findepi
Member

findepi commented May 6, 2024

@sajjoseph are we talking about FTE or pipelined execution?
Can #21744 be related?

@sajjoseph
Contributor Author

We are not using FTE.
But the issue you addressed in the PR could be the one we are struggling with.
I will try it out in our build and see how the cluster behaves.

Thanks!

@sajjoseph
Contributor Author

sajjoseph commented May 9, 2024

Quick update.
Good news! I built 446 with PR #21744 and I am not able to reproduce the issue mentioned above. The worker was restarted multiple times and so far has had no trouble joining the cluster and taking on fresh queries. The coordinator is no longer treating that worker as a stepchild.
Testing is continuing, but so far everything looks good. Thanks a bunch @findepi. This is going to address a ton of issues we were facing in our prod environment.

I will be upgrading the cluster to the latest (447) tonight. I am not anticipating any issues though.

@findepi
Member

findepi commented May 9, 2024

@sajjoseph glad to hear positive feedback on the #21744 PR!

The worker was restarted multiple times and so far has had no trouble joining the cluster and taking on fresh queries

I am interested in this scenario. I was not able to reproduce this particular problem on current master (before #21744), and it wasn't the motivation for that PR.
Quite the contrary (#21744 (comment)): on current master, if I restart a worker, the coordinator can use it immediately (once it is fully initialized). After #21744, the coordinator can use it only after it is fully initialized and fully announced.

Now that master is updated to Jetty 12.0.9, I will try to reproduce #21735 (comment) again with a different test query.

@findepi
Member

findepi commented May 9, 2024

Now that master is updated to Jetty 12.0.9, I will try to reproduce #21735 (comment) again with a different test query.

I wasn't able to reproduce it with a query involving ORDER BY (from the issue description), with and without HTTP/2 for internal communication.

@sajjoseph
Contributor Author

I used a heavily modified version of a query replay tool to re-run production traffic in a separate cluster. I purposely kept just one worker in that cluster so that I could make the worker node fail under load (I was able to do that consistently with version 444). After the restart, the worker node was getting into the state I described above.

After #21744, the worker is not failing even after seeing hundreds of concurrent queries (many queries were failing due to memory pressure, as the worker can't handle the data volume, but the worker itself didn't go down this time!!! Kudos to Trino 447, JDK 22, Jetty 12.0.9, Airlift 248, HTTP/2, etc.). I am testing it with even more queries to see if I can make it fail.

Even though the worker didn't fail, I manually restarted it multiple times (in the middle of validations) and it just joined the cluster and started processing fresh queries as they came in. This is why I feel that #21744 is helping with the above issue.

I will keep you posted.

@sopel39
Member

sopel39 commented May 13, 2024

Potentially related #18329.
