
Socket-mode application becomes a zombie when an unhandled exception occurs in a single-threaded thread pool #833

Open
ddovgal opened this issue Sep 1, 2021 · 11 comments
Assignees
Labels
auto-triage-skip needs info question M-T: User needs support to use the project

Comments

@ddovgal

ddovgal commented Sep 1, 2021

In our application, we use the Bolt Socket Mode SDK for our bot, and we've found some strange behavior. Sometimes (it doesn't seem to depend on anything) the app becomes a "zombie": it continues to run, but the bot is no longer in a working state. For example, say the bot responds to a message with the text ping with a pong message; in that "non-working" state, the bot won't react to the ping message. At the same time, the app isn't "dead" or crashed; we can still successfully "stop" the SocketModeApp later. According to our observations, this always occurs after a javax.websocket.DeploymentException

(we use SocketModeClientTyrusImpl)

I suspect (but could still be wrong 😅) this is due to the following:

  • we have single-threaded thread pools for com.slack.api.socket_mode.SocketModeClient#initializeSessionMonitorExecutor and com.slack.api.socket_mode.SocketModeClient#initializeMessageProcessorExecutor; both pools are created by Executors.newSingleThreadScheduledExecutor
  • here are two exception logs that show the unhandled exceptions occurring:
    • [2021-08-27T16:43:47,600Z](socket-mode-session-monitor-worker-36)([]) trace_id= ERROR - SocketModeClient - Failed to reconnect to Socket Mode server: javax.websocket.DeploymentException: Handshake error.
      ! org.glassfish.tyrus.core.HandshakeException: Response code was not 101: 408.
      ! at org.glassfish.tyrus.client.TyrusClientEngine.processResponse(TyrusClientEngine.java:299)
      ! at org.glassfish.tyrus.container.grizzly.client.GrizzlyClientFilter.handleHandshake(GrizzlyClientFilter.java:322)
      ! at org.glassfish.tyrus.container.grizzly.client.GrizzlyClientFilter.handleRead(GrizzlyClientFilter.java:291)
      ! at org.glassfish.grizzly.filterchain.ExecutorResolver$9.execute(ExecutorResolver.java:119)
      ! at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeFilter(DefaultFilterChain.java:283)
      ! at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeChainPart(DefaultFilterChain.java:200)
      ! at org.glassfish.grizzly.filterchain.DefaultFilterChain.execute(DefaultFilterChain.java:132)
      ! at org.glassfish.grizzly.filterchain.DefaultFilterChain.process(DefaultFilterChain.java:111)
      ! at org.glassfish.grizzly.ProcessorExecutor.execute(ProcessorExecutor.java:77)
      ! at org.glassfish.grizzly.nio.transport.TCPNIOTransport.fireIOEvent(TCPNIOTransport.java:536)
      ! at org.glassfish.grizzly.strategies.AbstractIOStrategy.fireIOEvent(AbstractIOStrategy.java:112)
      ! at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.run0(WorkerThreadIOStrategy.java:117)
      ! at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.access$100(WorkerThreadIOStrategy.java:56)
      ! at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy$WorkerThreadRunnable.run(WorkerThreadIOStrategy.java:137)
      ! at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:591)
      ! at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:571)
      ! ... 1 common frames omitted
      ! Causing: javax.websocket.DeploymentException: Handshake error.
      ! at org.glassfish.tyrus.client.ClientManager$3$1.run(ClientManager.java:658)
      ! at org.glassfish.tyrus.client.ClientManager$3.run(ClientManager.java:696)
      ! at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      ! at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      ! at org.glassfish.tyrus.client.ClientManager$SameThreadExecutorService.execute(ClientManager.java:849)
      ! at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:118)
      ! at org.glassfish.tyrus.client.ClientManager.connectToServer(ClientManager.java:493)
      ! at org.glassfish.tyrus.client.ClientManager.connectToServer(ClientManager.java:355)
      ! at com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl.connect(SocketModeClientTyrusImpl.java:170)
      ! ... 9 common frames omitted
      ! Causing: java.io.IOException: javax.websocket.DeploymentException: Handshake error.
      ! at com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl.connect(SocketModeClientTyrusImpl.java:173)
      ! at com.slack.api.socket_mode.SocketModeClient.connectToNewEndpoint(SocketModeClient.java:67)
      ! at com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl.maintainCurrentSession(SocketModeClientTyrusImpl.java:137)
      ! at com.slack.api.socket_mode.SocketModeClient.lambda$initializeSessionMonitorExecutor$1(SocketModeClient.java:180)
      ! at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      ! at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
      ! at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
      ! at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      ! at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      ! at java.base/java.lang.Thread.run(Thread.java:834)
      
    • [2021-08-27T17:06:35,818Z](socket-mode-message-processor-worker-37)([]) trace_id= ERROR - SocketModeClient - Failed to reconnect to Socket Mode server: javax.websocket.DeploymentException: Handshake response not received.
      ! javax.websocket.DeploymentException: Handshake response not received.
      ! at org.glassfish.tyrus.client.ClientManager$3$1.run(ClientManager.java:678)
      ! at org.glassfish.tyrus.client.ClientManager$3.run(ClientManager.java:696)
      ! at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      ! at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      ! at org.glassfish.tyrus.client.ClientManager$SameThreadExecutorService.execute(ClientManager.java:849)
      ! at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:118)
      ! at org.glassfish.tyrus.client.ClientManager.connectToServer(ClientManager.java:493)
      ! at org.glassfish.tyrus.client.ClientManager.connectToServer(ClientManager.java:355)
      ! at com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl.connect(SocketModeClientTyrusImpl.java:170)
      ! ... 9 common frames omitted
      ! Causing: java.io.IOException: javax.websocket.DeploymentException: Handshake response not received.
      ! at com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl.connect(SocketModeClientTyrusImpl.java:173)
      ! at com.slack.api.socket_mode.SocketModeClient.connectToNewEndpoint(SocketModeClient.java:67)
      ! at com.slack.api.socket_mode.SocketModeClient.processMessage(SocketModeClient.java:350)
      ! at com.slack.api.socket_mode.SocketModeClient.lambda$initializeMessageProcessorExecutor$0(SocketModeClient.java:157)
      ! at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      ! at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
      ! at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
      ! at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      ! at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      ! at java.base/java.lang.Thread.run(Thread.java:834)
      
  • after an unhandled exception occurs, the single thread in the pool is killed and we lose an important piece of functionality:
    • recreating the session, in the first log case
    • consuming/processing messages, in the second log case
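The suspected mechanism in the last bullet can be reproduced with the JDK alone. Below is a minimal sketch (the class name, counter, and timings are illustrative, not from the SDK): once a task submitted via scheduleAtFixedRate throws, ScheduledThreadPoolExecutor silently suppresses every subsequent execution of that task, even though the pool itself stays alive — exactly the "zombie" symptom.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SuppressionDemo {
    static final AtomicInteger runs = new AtomicInteger();

    /** Returns how often the periodic task actually ran within ~400 ms. */
    public static int demo() throws InterruptedException {
        runs.set(0);
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        // Periodic task that blows up on its first execution, loosely
        // simulating an exception escaping a reconnect attempt.
        pool.scheduleAtFixedRate(() -> {
            runs.incrementAndGet();
            throw new RuntimeException("simulated handshake failure");
        }, 0, 50, TimeUnit.MILLISECONDS);
        Thread.sleep(400); // enough time for roughly 8 periods
        pool.shutdownNow();
        // Only 1: all later executions were silently suppressed.
        return runs.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("task ran " + demo() + " time(s)");
    }
}
```

No error surfaces anywhere unless someone inspects the returned ScheduledFuture, which is why the app appears healthy while the schedule is dead.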

The Slack SDK version

[INFO] +- com.slack.api:bolt:jar:1.8.1:compile
[INFO] |  +- com.slack.api:slack-api-model:jar:1.8.1:compile
[INFO] |  +- com.slack.api:slack-api-client:jar:1.8.1:compile
[INFO] |  \- com.slack.api:slack-app-backend:jar:1.8.1:compile
[INFO] +- com.slack.api:bolt-socket-mode:jar:1.8.1:compile
[INFO] |  +- com.slack.api:bolt:jar:1.8.1:runtime
[INFO] |  |  +- com.slack.api:slack-api-model:jar:1.8.1:runtime
[INFO] |  |  +- com.slack.api:slack-api-client:jar:1.8.1:runtime
[INFO] |  |  \- com.slack.api:slack-app-backend:jar:1.8.1:runtime
[INFO] |  +- com.slack.api:bolt-socket-mode:jar:1.8.1:runtime

OS info

Inside the openjdk11:jdk-11.0.7_10-alpine-slim Docker image

Steps to reproduce:

  • can't say for sure; it just happens after javax.websocket.DeploymentExceptions. Please see my assumptions and the logs above.

Expected result:

The app is "usable", able to react to events in a common way after occurred javax.websocket.DeploymentExceptions

Actual result:

The app is "unusable". Like a zombie, unable to react to events in a common way after occurred javax.websocket.DeploymentExceptions, but isn't totally crashed.

@ddovgal
Author

ddovgal commented Sep 1, 2021

In addition to the second log (the message-processor-worker thread):

  • about 20 seconds before that exception, there was another exception that I think is related to the message-processor-worker thread's failure:
[2021-08-27T17:06:13,929Z](socket-mode-session-cleaner-worker-800)([]) trace_id= ERROR - SocketModeClient - Failed to reconnect to Socket Mode server: javax.websocket.DeploymentException: Handshake error.
! org.glassfish.tyrus.core.HandshakeException: Response code was not 101: 408.
! at org.glassfish.tyrus.client.TyrusClientEngine.processResponse(TyrusClientEngine.java:299)
! at org.glassfish.tyrus.container.grizzly.client.GrizzlyClientFilter.handleHandshake(GrizzlyClientFilter.java:322)
! at org.glassfish.tyrus.container.grizzly.client.GrizzlyClientFilter.handleRead(GrizzlyClientFilter.java:291)
! at org.glassfish.grizzly.filterchain.ExecutorResolver$9.execute(ExecutorResolver.java:119)
! at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeFilter(DefaultFilterChain.java:283)
! at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeChainPart(DefaultFilterChain.java:200)
! at org.glassfish.grizzly.filterchain.DefaultFilterChain.execute(DefaultFilterChain.java:132)
! at org.glassfish.grizzly.filterchain.DefaultFilterChain.process(DefaultFilterChain.java:111)
! at org.glassfish.grizzly.ProcessorExecutor.execute(ProcessorExecutor.java:77)
! at org.glassfish.grizzly.nio.transport.TCPNIOTransport.fireIOEvent(TCPNIOTransport.java:536)
! at org.glassfish.grizzly.strategies.AbstractIOStrategy.fireIOEvent(AbstractIOStrategy.java:112)
! at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.run0(WorkerThreadIOStrategy.java:117)
! at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.access$100(WorkerThreadIOStrategy.java:56)
! at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy$WorkerThreadRunnable.run(WorkerThreadIOStrategy.java:137)
! at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:591)
! at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:571)
! ... 1 common frames omitted
! Causing: javax.websocket.DeploymentException: Handshake error.
! at org.glassfish.tyrus.client.ClientManager$3$1.run(ClientManager.java:658)
! at org.glassfish.tyrus.client.ClientManager$3.run(ClientManager.java:696)
! at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
! at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
! at org.glassfish.tyrus.client.ClientManager$SameThreadExecutorService.execute(ClientManager.java:849)
! at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:118)
! at org.glassfish.tyrus.client.ClientManager.connectToServer(ClientManager.java:493)
! at org.glassfish.tyrus.client.ClientManager.connectToServer(ClientManager.java:355)
! at com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl.connect(SocketModeClientTyrusImpl.java:170)
! ... 22 common frames omitted
! Causing: java.io.IOException: javax.websocket.DeploymentException: Handshake error.
! at com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl.connect(SocketModeClientTyrusImpl.java:173)
! at com.slack.api.socket_mode.SocketModeClient.connectToNewEndpoint(SocketModeClient.java:67)
! at com.slack.api.socket_mode.SocketModeClient.runCloseListenersAndAutoReconnectAsNecessary(SocketModeClient.java:257)
! at com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl.onClose(SocketModeClientTyrusImpl.java:242)
! at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
! at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
! at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
! at java.base/java.lang.reflect.Method.invoke(Method.java:566)
! at org.glassfish.tyrus.core.AnnotatedEndpoint.callMethod(AnnotatedEndpoint.java:552)
! at org.glassfish.tyrus.core.AnnotatedEndpoint.onClose(AnnotatedEndpoint.java:567)
! at org.glassfish.tyrus.core.AnnotatedEndpoint.onClose(AnnotatedEndpoint.java:576)
! at org.glassfish.tyrus.core.TyrusEndpointWrapper.onClose(TyrusEndpointWrapper.java:1235)
! at org.glassfish.tyrus.core.TyrusWebSocket.onClose(TyrusWebSocket.java:110)
! at org.glassfish.tyrus.core.ProtocolHandler.close(ProtocolHandler.java:481)
! at org.glassfish.tyrus.core.TyrusWebSocket.close(TyrusWebSocket.java:244)
! at org.glassfish.tyrus.core.TyrusWebSocket.close(TyrusWebSocket.java:254)
! at org.glassfish.tyrus.core.TyrusRemoteEndpoint.close(TyrusRemoteEndpoint.java:480)
! at org.glassfish.tyrus.core.TyrusSession.close(TyrusSession.java:210)
! at com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl.closeSession(SocketModeClientTyrusImpl.java:296)
! at com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl.lambda$setCurrentSession$1(SocketModeClientTyrusImpl.java:278)
! at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
! at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
! at java.base/java.lang.Thread.run(Thread.java:834)

@filmaj
Contributor

filmaj commented Sep 1, 2021

Hello @ddovgal, thanks for writing this in. This is an interesting scenario. I will summarize my observations, and perhaps that can help in identifying a) what is happening, b) what, if anything, we can do to work around this issue in the short term, and c) how to mitigate the problem in the long term. Identifying what is happening is key to addressing the issue properly.

In all three scenarios / stack traces you posted, we think the exception arises from this java-slack-sdk code area.

Quick clarifying question: is the third log case scenario, presented in your last comment, connected to the second log case scenario from your original post? I want to make sure I understand the timeline of the stack traces properly. It seems as if the third log case is what precedes/causes the second log case, and because the stack traces for the third and first cases are identical, there may be a single underlying cause for the issue.

As you mentioned in the first log case, based on the end of the stack trace, it seems a reconnection is in progress (based on the execution of maintainCurrentSession). By the way, do you see any log output matching the logging from maintainCurrentSession? During the reconnection in this first log case, the underlying Tyrus implementation expects an HTTP 101 response code as part of the WebSocket protocol upgrade negotiation, but in this scenario the Slack API (I believe the java-slack-sdk is attempting to connect to the URL returned by the Slack apps.connections.open API) returns an HTTP 408 Request Timeout. I wonder why this may be happening?

In general, I am curious about your reasoning for why you think this has to do with threading. Could you elaborate on this point?

As for a workaround or mitigation, it seems Tyrus allows retry behaviour to be defined via a custom reconnect handler. Perhaps that is one avenue we can explore for this scenario?
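For reference, the Tyrus retry behaviour mentioned here is configured roughly like this (a sketch against the Tyrus 1.x client API; the class and property names come from Tyrus, but wiring this ClientManager into the SDK's SocketModeClientTyrusImpl is an open question, not something this snippet does):

```java
import javax.websocket.CloseReason;
import org.glassfish.tyrus.client.ClientManager;
import org.glassfish.tyrus.client.ClientProperties;

public class TyrusRetrySketch {
    // Builds a ClientManager that keeps retrying failed connections itself.
    public static ClientManager clientWithRetry() {
        ClientManager client = ClientManager.createClient();
        client.getProperties().put(ClientProperties.RECONNECT_HANDLER,
                new ClientManager.ReconnectHandler() {
                    @Override
                    public boolean onDisconnect(CloseReason closeReason) {
                        return true; // retry after the connection drops
                    }

                    @Override
                    public boolean onConnectFailure(Exception exception) {
                        return true; // retry after e.g. a failed handshake (408)
                    }

                    @Override
                    public long getDelay() {
                        return 5; // seconds to wait between attempts
                    }
                });
        return client;
    }
}
```

Returning true unconditionally retries forever; a real handler would likely cap the attempt count.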

@filmaj filmaj self-assigned this Sep 1, 2021
@filmaj filmaj added needs info question M-T: User needs support to use the project and removed untriaged labels Sep 1, 2021
@seratch
Member

seratch commented Sep 2, 2021

@ddovgal Thanks for sharing the details (really helpful!)

but in this scenario, the Slack API (I believe in this scenario the java-slack-sdk is attempting to connect to the response returned by the Slack apps.connections.open API) returns an HTTP 408 Request Timed Out. I wonder why this may be happening?

I'm also curious about this part. As far as I know, we may sometimes see "HTTP 408 Request Timed Out" when the app is behind a proxy/corporate firewall.

Even if an app is not behind a proxy, a connectivity issue can arise between the Slack server side and your app's hosting infrastructure. According to the information you've shared in the description, the app is running in a Docker container, so another possibility is a connectivity issue between the app container and the outside world.

As @filmaj suggested, Tyrus's retry option may help. With that being said, I'm wondering why the reconnection stops working in this scenario. Once we figure out the room for improvement on the SDK side, we will be happy to quickly release a fix for it. Your continued help with the investigation would be greatly appreciated! 🙇
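One conceivable mitigation on the scheduling side (purely a sketch, not the SDK's actual code) is to wrap the periodic task so no exception can escape it. The javadoc for scheduleAtFixedRate suppresses subsequent executions only when an execution "encounters an exception", so a catch-all keeps the schedule alive:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class GuardedTaskDemo {
    static final AtomicInteger attempts = new AtomicInteger();

    /** Returns how often the guarded periodic task fired within ~400 ms. */
    public static int demo() throws InterruptedException {
        attempts.set(0);
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        pool.scheduleAtFixedRate(() -> {
            try {
                attempts.incrementAndGet();
                throw new RuntimeException("simulated handshake failure");
            } catch (RuntimeException e) {
                // Swallow (log in real code) so the schedule survives
                // and the next period still fires.
            }
        }, 0, 50, TimeUnit.MILLISECONDS);
        Thread.sleep(400);
        pool.shutdownNow();
        // Several attempts: the schedule kept firing despite each failure.
        return attempts.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("attempts: " + demo());
    }
}
```

Unlike an unguarded task, every period gets another chance to reconnect, which matches the "keep retrying" behaviour users expect here.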

@ddovgal
Author

ddovgal commented Sep 8, 2021

Sorry for the delayed response. I'm currently on vacation, so I will only be able to give more details on this case next week.

Hopefully it occurs again this week so I can gather more information from the logs.

@Havelock-Vetinari

Hi! I have a similar issue with a Socket Mode app written in Kotlin on top of the Spring Framework. It has occurred rather often, right from the very beginning of the Socket Mode introduction.
The collected logs were not helpful, so my app is now running with trace-level logging for "org.glassfish.tyrus." and debug-level logging for "com.slack.api.socket_mode.".
Hopefully that will reveal some new details.

Is there any way to intercept such a connection-drop event? Such an event could be used to mark the app as unhealthy.

@seratch
Member

seratch commented Nov 25, 2021

Hi @Havelock-Vetinari,

ERROR - SocketModeClient - Failed to reconnect to Socket Mode server: javax.websocket.DeploymentException: Handshake error.
! org.glassfish.tyrus.core.HandshakeException: Response code was not 101: 408.

Are you seeing this error log too? If so, the handshake error should be happening inside the Tyrus library. As I mentioned in my last comment, this can be a kind of connectivity issue between your app and the Slack server-side.

Is there any way to intercept event of such connection drop? Such event could be used to mark app as unhealthy.

If the close listeners do not work for you, there is nothing else as far as I know.

I've never reproduced the situation described here on my end. My understanding of the issue is still that, due to the inactive WebSocket connection, the SocketModeApp is no longer able to receive any data from Slack. If we can make the app reconnect to the Slack server side in some way, the app should work again even in this scenario (because other parts, like the thread pool for message handling, are still working, awaiting incoming messages).

As @filmaj suggested, Tyrus's retry option may help.

If Tyrus's retry option is useful for you, we may consider enabling the retry option by default. Can anyone try this option?

@Havelock-Vetinari

Hi @seratch!

Are you seeing this error log too? If so, the handshake error should be happening inside the Tyrus library. As I mentioned in my last comment, this can be a kind of connectivity issue between your app and the Slack server-side.

My application was running with log level DEBUG for com.slack.api.socket_mode.SocketModeClient, and there was simply silence in the logs when the issue occurred. Even the entries about session maintenance were gone.

Now I have enabled more detailed logs for Tyrus, so I'm waiting for the next occurrence of this issue.

I will also check the close listener and the retry option in Tyrus.

@github-actions

👋 It looks like this issue has been open for 30 days with no activity. We'll mark this as stale for now, and wait 10 days for an update or for further comment before closing this issue out. If you think this issue needs to be prioritized, please comment to get the thread going again! Maintainers also review issues marked as stale on a regular basis and comment or adjust status if the issue needs to be reprioritized.

@Havelock-Vetinari

Hi! The issue with losing the Socket Mode connection still exists. Enabling TRACE on the Tyrus and SDK packages didn't help. The last message in the logs was a single "ping" from my application, with no "pong" received and no new attempt to maintain the WebSocket session.

There is similar issue in python Slack SDK: slackapi/python-slack-sdk#1110

@mwbrooks
Member

Hey @Havelock-Vetinari,

Thanks for the response! I've updated this issue to be ignored by our auto-triage bot.

Aside from enabling more logging, did you have an opportunity to try:

  1. Using the close listener?
  2. Tyrus retry option?

Since we’ve been unable to replicate this on our end, it's challenging to resolve the problem. This certainly looks like a legitimate issue, and I believe @seratch is eager to mitigate it, so he may follow up with more questions. In the meantime, any additional information you can provide would be very helpful!

@icelik

icelik commented Aug 17, 2022

I faced a similar issue, and upgrading Tyrus and Bolt did not help. When I inspected the thread dump, I realized that a bug in the JDK's SSLEngine causes this behaviour. Upgrading from OpenJDK 11.0.2 to 17.0.2 immediately fixed the problem.

see https://bugs.openjdk.org/browse/JDK-8241054
