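ZooKeeper connection occasionally fails and the client automatically switches to some default address, making the whole service unavailable #2247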

Open
kevinsir opened this issue Jul 30, 2023 · 7 comments

kevinsir commented Jul 30, 2023

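This is critical. It appeared suddenly one day: a single failed connection to 172.16.96.194 caused the client to switch over to 172.19.0.102.
But no such machine exists on our internal network.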

version

elasticjob-lite-core==3.0.3
elasticjob-lite-spring-boot-starter==3.0.3
org.springframework.boot=2.5.5

Bug Report

2023-07-30 01:52:28 WARN [main-SendThread(172.16.96.194:2182)] org.apache.zookeeper.ClientCnxn:1229 - Client session timed out, have not heard from server in 34224ms for session id 0x1076b6f21a50011
2023-07-30 01:52:28 WARN [main-SendThread(172.16.96.194:2182)] org.apache.zookeeper.ClientCnxn:1272 - Session 0x1076b6f21a50011 for sever 172.16.96.194/172.16.96.194:2182, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 34224ms for session id 0x1076b6f21a50011
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1230)
2023-07-30 01:52:55 WARN [main-SendThread(172.19.0.102:2181)] org.apache.zookeeper.ClientCnxn:1229 - Client session timed out, have not heard from server in 13675ms for session id 0x1076b6f21a50011
2023-07-30 01:53:10 WARN [main-SendThread(172.19.0.102:2181)] org.apache.zookeeper.ClientCnxn:1272 - Session 0x1076b6f21a50011 for sever 172.19.0.102/172.19.0.102:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 13675ms for session id 0x1076b6f21a50011

@songxiaosheng
Member

Which ZooKeeper server version are you running? Older ZooKeeper versions have a known "session timed out" problem.

@kevinsir
Author

kevinsir commented Aug 8, 2023

Oh, thank you very much.
My ZooKeeper version is 3.7.0.

@songxiaosheng
Member

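Then the session most likely expired because the client did not send heartbeats to ZooKeeper. Check whether there was a network problem or whether the heartbeat thread crashed.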

@kevinsir
Author

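Hmm, network problems probably do happen now and then. They are rare, maybe once every few months. But can't this recover automatically? Also, is there any mechanism for alerting when the session expires? It feels like there are quite a few problems here.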

@songxiaosheng
Member

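For server-side ZooKeeper monitoring you can use the four-letter-word commands. As for this session-expiry log, I have not seen anything ready-made in ZooKeeper; you can filter the logs for keywords and raise an alert whenever one matches.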
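The keyword-based alerting suggested above could be sketched roughly as follows. This is only an illustration, not part of ElasticJob or ZooKeeper: the class name `SessionLogAlert`, the keyword list, and the idea of printing instead of calling a real alerting system are all assumptions, and the keywords are simply copied from the log lines quoted in this issue.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper: flags ZooKeeper client log lines that may deserve an alert. */
final class SessionLogAlert {

    // Keywords copied from the ClientCnxn warnings quoted in this issue (an assumption,
    // not an official list of ZooKeeper log messages).
    private static final List<String> KEYWORDS = Arrays.asList(
            "Client session timed out",
            "Closing socket connection");

    /** Returns the first keyword found in a ClientCnxn log line, or empty if none matches. */
    static Optional<String> match(String logLine) {
        if (logLine == null || !logLine.contains("org.apache.zookeeper.ClientCnxn")) {
            return Optional.empty();
        }
        return KEYWORDS.stream().filter(logLine::contains).findFirst();
    }

    public static void main(String[] args) {
        String line = "2023-07-30 01:52:28 WARN [main-SendThread(172.16.96.194:2182)] "
                + "org.apache.zookeeper.ClientCnxn:1229 - Client session timed out, "
                + "have not heard from server in 34224ms for session id 0x1076b6f21a50011";
        // A real deployment would forward the match to its alerting system instead of printing.
        match(line).ifPresent(keyword -> System.out.println("ALERT: " + keyword));
    }
}
```

In practice the same matching is often done by the log collector itself; the Java version is shown only because the application in question is a Java service.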

@kevinsir
Author

kevinsir commented Dec 25, 2023

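ZooKeeper itself has not gone down, yet this now happens every 2-3 days, and only a restart fixes it. This failover behavior is rather odd.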

2023-12-25 09:06:26 WARN [main-SendThread(172.19.0.101:2181)] org.apache.zookeeper.ClientCnxn:1229 - Client session timed out, have not heard from server in 13734ms for session id 0x20ae3531fbf001e
2023-12-25 09:06:26 WARN [main-SendThread(172.19.0.101:2181)] org.apache.zookeeper.ClientCnxn:1272 - Session 0x20ae3531fbf001e for sever 172.19.0.101/172.19.0.101:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 13734ms for session id 0x20ae3531fbf001e
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1230)
2023-12-25 09:06:27 WARN [main-SendThread(0.0.0.0:2181)] org.apache.zookeeper.ClientCnxn:1272 - Session 0x20ae3531fbf001e for sever 0.0.0.0/0.0.0.0:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:342)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1262)
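Since the symptom in these logs is the client thread targeting an address such as 0.0.0.0:2181 that was never configured, one way to catch it early might be to compare the address in the `SendThread(...)` thread name against the configured server list. The sketch below is an assumption-laden illustration: the class `UnexpectedServerCheck` is hypothetical, the thread-name format is taken only from the logs quoted here, and the configured list `172.16.96.194:2182` is inferred from the first report rather than from the reporter's actual configuration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: detects log lines whose SendThread targets an unconfigured server. */
final class UnexpectedServerCheck {

    // Thread name format as it appears in the quoted logs: main-SendThread(host:port).
    private static final Pattern SEND_THREAD = Pattern.compile("SendThread\\(([^)]+)\\)");

    /** Extracts the host:port the client thread was talking to, if the line has one. */
    static Optional<String> serverOf(String logLine) {
        Matcher m = SEND_THREAD.matcher(logLine);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** True when the line names a server that is not in the configured list. */
    static boolean isUnexpected(String logLine, Set<String> configuredServers) {
        return serverOf(logLine)
                .map(server -> !configuredServers.contains(server))
                .orElse(false);
    }

    public static void main(String[] args) {
        // Assumed configuration: only the server seen in the first report.
        Set<String> configured = new HashSet<>(Arrays.asList("172.16.96.194:2182"));
        String line = "2023-12-25 09:06:27 WARN [main-SendThread(0.0.0.0:2181)] "
                + "org.apache.zookeeper.ClientCnxn:1272 - Session 0x20ae3531fbf001e "
                + "for sever 0.0.0.0/0.0.0.0:2181, Closing socket connection.";
        if (isUnexpected(line, configured)) {
            System.out.println("Client is connecting to an unconfigured server: "
                    + serverOf(line).orElse("?"));
        }
    }
}
```

A check like this only reports the symptom; it does not explain why the client picked up the extra address.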

@linghengqian
Member

  • Hmm, this looks like a Curator issue, to be honest. Have you tried ElasticJob 3.0.4? ElasticJob 3.0.4 uses newer versions of the ZooKeeper client and the Curator client.

  • It would be nice to have a unit test using curator-test or testcontainers; I can't find a reproducible unit test.
