
Service is shutting down after reconnecting node to cluster #3689

Open
Zyklop opened this issue Jan 3, 2023 · 4 comments
Labels
kind/bug: Issues which are a software defect
quality of life: Issues that are fairly minor, but which improve the usage of EventStoreDB

Comments

@Zyklop

Zyklop commented Jan 3, 2023

Describe the bug
I have a cluster with three nodes, configured using IP gossip. When I disconnect the network of the active node, another node takes over as leader. But when I reconnect the disconnected node to the network, it shuts down instead of rejoining the cluster. This happens frequently, but not every time; sometimes everything works as expected.

To Reproduce
Steps to reproduce the behavior:

  1. Start cluster
  2. Disconnect the active node from the network
  3. Reconnect the disconnected node back to the network

Expected behavior
The node should reconnect to the cluster. (This works sometimes)

Actual behavior
The node sends a gossip to the other nodes and then shuts down. (See the log below.)
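One way to watch whether the node actually rejoins during this repro is to poll each node's /gossip endpoint. A minimal sketch, assuming the cluster is reachable over plain HTTP on the default port 2113 (add TLS handling if the cluster is secured); the field names are based on the 21.10 gossip response and may need adjusting:

```python
# Poll the gossip endpoint of each node and print the member states it reports.
# Node addresses are taken from the log excerpt below; adjust as needed.
import json
import time
import urllib.request

NODES = ["10.10.0.2", "10.10.0.3", "10.11.0.4"]

while True:
    for ip in NODES:
        try:
            with urllib.request.urlopen(f"http://{ip}:2113/gossip", timeout=2) as resp:
                members = json.load(resp)["members"]
            summary = ", ".join(
                f"{m.get('state')}({'alive' if m.get('isAlive') else 'dead'})" for m in members
            )
            print(f"{ip} sees: {summary}")
        except Exception as exc:
            print(f"{ip} unreachable: {exc}")
    time.sleep(5)
```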

Config/Logs/Screenshots
[ 7116,18,10:42:34.874,INF] ES "TcpConnection" closed [09:42:34.874: N"10.11.0.4:1112", L"10.10.0.3:54707", {b05c5910-e6b5-44c4-b2a3-b3a714f96da8}]:Received bytes: 5769, Sent bytes: 659
[ 7116,18,10:42:34.874,INF] ES "TcpConnection" closed [09:42:34.874: N"10.11.0.4:1112", L"10.10.0.3:54707", {b05c5910-e6b5-44c4-b2a3-b3a714f96da8}]:Send calls: 1, callbacks: 1
[ 7116,18,10:42:34.874,INF] ES "TcpConnection" closed [09:42:34.874: N"10.11.0.4:1112", L"10.10.0.3:54707", {b05c5910-e6b5-44c4-b2a3-b3a714f96da8}]:Receive calls: 14, callbacks: 14
[ 7116,18,10:42:34.874,INF] ES "TcpConnection" closed [09:42:34.874: N"10.11.0.4:1112", L"10.10.0.3:54707", {b05c5910-e6b5-44c4-b2a3-b3a714f96da8}]:Close reason: [Success] "Node state changed to ShuttingDown. Closing replication connection."
[ 7116,18,10:42:34.874,INF] Connection '"leader-normal"""' ["10.11.0.4:1112", {b05c5910-e6b5-44c4-b2a3-b3a714f96da8}] closed: Success.
[ 7116,69,10:42:34.874,DBG] Persistent Subscriptions have been stopped.
[ 7116,69,10:42:34.874,DBG] Persistent subscriptions received state change to ShuttingDown. Stopping listening
[ 7116,69,10:42:34.874,DBG] Persistent Subscriptions have been stopped.
[ 7116,18,10:42:34.875,DBG] SLOW QUEUE MSG ["MainQueue"]: "RequestShutdown" - 488ms. Q: 0/25.
[ 7116,69,10:42:34.876,INF] Live subscription 79fc7be1-8260-4d92-b3f1-2dc87e668cb8 to "GlobalCommands" disposed.
[ 7116,18,10:42:34.876,INF] ========== ["10.10.0.3:2113"] Service '"ReplicationTrackingService"' has shut down.
[ 7116,18,10:42:34.876,INF] Looks like node ["Unspecified/10.10.0.2:2113"] is DEAD (Gossip send failed).
[ 7116,44,10:42:34.876,DBG] Live subscription b8d39015-9c95-43a1-a875-2b228a174717 to "WdpConfig" dropped: Unsubscribed
[ 7116,44,10:42:34.876,DBG] Live subscription 2042b5a9-a024-4281-b91a-fc77fed67e89 to "WdpInfo" dropped: Unsubscribed
[ 7116,44,10:42:34.876,DBG] Live subscription f64780d9-aed5-4be7-996b-5b4e2abbcf64 to "WdpConfig" dropped: Unsubscribed
[ 7116,44,10:42:34.876,DBG] Live subscription 5fce7bfb-a1e8-4ab2-88d5-a581f190783d to "WdpInfo" dropped: Unsubscribed
[ 7116,18,10:42:34.876,INF] CLUSTER HAS CHANGED "gossip send failed to [Unspecified/10.10.0.2:2113]"
Old:
["Priority: 0 VND {5e565c72-fb3d-4d46-9e55-3fe2c8a0e3e0} [Leader, Unspecified/10.11.0.4:1112, n/a, Unspecified/10.11.0.4:1113, n/a, Unspecified/10.11.0.4:2113, (ADVERTISED: HTTP::0, TCP::0)] 266888975/266889129/266889129/E390@266887833:{1d36aa56-c23f-4d93-8b2f-ec5bff2a3dff} | 2023-01-03 09:42:34.371", "Priority: 0 VND {a21af6f5-ba4f-4320-ae79-c0307b65dd19} [ShuttingDown, Unspecified/10.10.0.3:1112, n/a, Unspecified/10.10.0.3:1113, n/a, Unspecified/10.10.0.3:2113, (ADVERTISED: HTTP::0, TCP::0)] 266885324/266885790/266885790/E387@266817565:{6ae6a7df-297d-4490-8e81-c92e1c08ed12} | 2023-01-03 09:42:34.874", "Priority: 0 VND {2806dd5c-33cf-4eb0-bf73-4684118e1a8f} [Unknown, Unspecified/10.10.0.2:1112, n/a, Unspecified/10.10.0.2:1113, n/a, Unspecified/10.10.0.2:2113, (ADVERTISED: HTTP::0, TCP::0)] 266888667/266888821/266888821/E390@266887833:{1d36aa56-c23f-4d93-8b2f-ec5bff2a3dff} | 2023-01-03 09:42:34.370"]
New:
["Priority: 0 VND {5e565c72-fb3d-4d46-9e55-3fe2c8a0e3e0} [Leader, Unspecified/10.11.0.4:1112, n/a, Unspecified/10.11.0.4:1113, n/a, Unspecified/10.11.0.4:2113, (ADVERTISED: HTTP::0, TCP::0)] 266888975/266889129/266889129/E390@266887833:{1d36aa56-c23f-4d93-8b2f-ec5bff2a3dff} | 2023-01-03 09:42:34.371", "Priority: 0 VND {a21af6f5-ba4f-4320-ae79-c0307b65dd19} [ShuttingDown, Unspecified/10.10.0.3:1112, n/a, Unspecified/10.10.0.3:1113, n/a, Unspecified/10.10.0.3:2113, (ADVERTISED: HTTP::0, TCP::0)] 266885324/266885790/266885790/E387@266817565:{6ae6a7df-297d-4490-8e81-c92e1c08ed12} | 2023-01-03 09:42:34.874", "Priority: 0 VND {2806dd5c-33cf-4eb0-bf73-4684118e1a8f} [Unknown, Unspecified/10.10.0.2:1112, n/a, Unspecified/10.10.0.2:1113, n/a, Unspecified/10.10.0.2:2113, (ADVERTISED: HTTP::0, TCP::0)] 266888667/266888821/266888821/E390@266887833:{1d36aa56-c23f-4d93-8b2f-ec5bff2a3dff} | 2023-01-03 09:42:34.876"]
[ 7116,18,10:42:34.876,INF] ========== ["10.10.0.3:2113"] Service '"StorageWriter"' has shut down.
[ 7116,18,10:42:34.876,INF] ========== ["10.10.0.3:2113"] Service '"StorageReader"' has shut down.
[ 7116,18,10:42:34.876,INF] ========== ["10.10.0.3:2113"] Service '"HttpServer [10.10.0.3:2113]"' has shut down.
[ 7116,18,10:42:34.877,INF] Looks like node ["Unspecified/10.11.0.4:1112"] is DEAD (TCP connection lost). Issuing a gossip to confirm.
[ 7116,18,10:42:34.883,INF] Gossip Received, The node ["Unspecified/10.11.0.4:2113"] is not DEAD.
[ 7116,18,10:42:34.885,INF] ========== ["10.10.0.3:2113"] Service '"Storage Chaser"' has shut down.
[ 7116,18,10:42:34.947,INF] ========== ["10.10.0.3:2113"] Service '"Index Committer"' has shut down.
[ 7116,18,10:42:34.947,INF] ========== ["10.10.0.3:2113"] All Services Shutdown.
[ 7116,18,10:42:34.982,INF] ========== ["10.10.0.3:2113"] IS SHUT DOWN.
[ 7116,18,10:42:35.017,INF] Exiting with exit code: 0.
Exit reason: "Shutdown and exit from process was requested."
[ 7116,18,10:42:35.017,DBG] SLOW QUEUE MSG ["MainQueue"]: "ServiceShutdown" - 70ms. Q: 0/3.
[ 7116,59,10:42:40.026,INF] Live subscription 2042b5a9-a024-4281-b91a-fc77fed67e89 to "WdpInfo" disposed.
[ 7116,38,10:42:40.026,INF] Live subscription b8d39015-9c95-43a1-a875-2b228a174717 to "WdpConfig" disposed.
[ 7116,69,10:42:40.026,INF] Live subscription f64780d9-aed5-4be7-996b-5b4e2abbcf64 to "WdpConfig" disposed.
[ 7116,71,10:42:40.026,INF] Live subscription 5fce7bfb-a1e8-4ab2-88d5-a581f190783d to "WdpInfo" disposed.

EventStore details

  • EventStore server version: 21.10.8

  • Operating system: Windows Server 2022

  • EventStore client version (if applicable):

Additional context

Zyklop added the kind/bug label Jan 3, 2023
@hayley-jean
Member

Hi @Zyklop, could you please check the logs of the node that shut down for the following line:

Leader [{leaderEndPoint},{leaderId:B}] subscribed us at {subscriptionPosition} (0x{subscriptionPosition:X}), which is less than our writer checkpoint {writerCheckpoint} (0x{writerCheckpoint:X}). TRUNCATION IS NEEDED.

If you find this line then it means that the node went offline for truncation, which is expected behaviour.
This usually happens when a Leader node is disconnected from the cluster, and then rejoins the cluster after a new Leader has been elected.
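If it helps, here is a quick way to scan a node's logs for that line; the log directory below is a placeholder, so point it at the node's actual log path:

```python
# Scan the node's log files for the truncation marker described above.
from pathlib import Path

LOG_DIR = Path(r"C:\EventStore\logs")  # placeholder; use the node's configured log path

for log_file in sorted(p for p in LOG_DIR.glob("*") if p.is_file()):
    for line_no, line in enumerate(log_file.read_text(errors="replace").splitlines(), 1):
        if "TRUNCATION IS NEEDED" in line:
            print(f"{log_file.name}:{line_no}: {line.strip()}")
```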

If you are running this on Windows, we recommend that you run EventStore as a service so that the process is automatically restarted when something like this occurs.

@Zyklop
Author

Zyklop commented Jan 5, 2023

Thanks @hayley-jean
I found this line in the logs.
The problem is that the process exited with code 0, so our service manager didn't restart it automatically. Maybe exiting with an error code would be more appropriate in this case?
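As a workaround until distinct exit codes exist, a small watchdog can restart the node whenever the process exits, regardless of exit code. A rough sketch, with placeholder paths for this setup:

```python
# Restart the node whenever the process exits, regardless of exit code.
import subprocess
import time

# Placeholder executable/config paths; adjust for the actual installation.
CMD = [
    r"C:\EventStore\EventStore.ClusterNode.exe",
    "--config", r"C:\EventStore\eventstore.conf",
]

while True:
    result = subprocess.run(CMD)
    print(f"EventStore exited with code {result.returncode}; restarting in 5 seconds")
    time.sleep(5)
```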

@ylorph
Contributor

ylorph commented Jan 23, 2023

ylorph added the quality of life label Jan 23, 2023
@hayley-jean
Member

While I don't think offline truncation should exit with an error code, having a different exit code for offline truncation vs. user-requested shutdown would be helpful.

We have an issue about having unique exit codes here: #3160
