
Service is shutting down after reconnecting node to cluster #3689

Open
Zyklop opened this issue Jan 3, 2023 · 4 comments
Labels
kind/bug: Issues which are a software defect
quality of life: Issues that are fairly minor, but which improve the usage of EventStoreDB

Comments

@Zyklop

Zyklop commented Jan 3, 2023

Describe the bug
I have a cluster with three nodes, configured using IP gossip. When I disconnect the network of the active node, another node takes over as leader. But when I reconnect the disconnected node to the network, it shuts down instead of rejoining the cluster. This happens frequently, but not every time; sometimes everything works as expected.

To Reproduce
Steps to reproduce the behavior:

  1. Start cluster
  2. Disconnect the active node from the network
  3. Reconnect the disconnected node back to the network

Expected behavior
The node should reconnect to the cluster. (This works sometimes)

Actual behavior
The node sends a gossip to the other nodes and then shuts down. (See the log below.)
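One way to watch whether the node actually rejoins during this repro is to poll each node's /gossip endpoint. A minimal sketch, assuming the cluster is reachable over plain HTTP on the default port 2113 (add TLS handling if the cluster is secured); the field names are based on the 21.10 gossip response and may need adjusting:

```python
# Poll the gossip endpoint of each node and print the member states it reports.
# Node addresses are taken from the log excerpt below; adjust as needed.
import json
import time
import urllib.request

NODES = ["10.10.0.2", "10.10.0.3", "10.11.0.4"]

while True:
    for ip in NODES:
        try:
            with urllib.request.urlopen(f"http://{ip}:2113/gossip", timeout=2) as resp:
                members = json.load(resp)["members"]
            summary = ", ".join(
                f"{m.get('state')}({'alive' if m.get('isAlive') else 'dead'})" for m in members
            )
            print(f"{ip} sees: {summary}")
        except Exception as exc:
            print(f"{ip} unreachable: {exc}")
    time.sleep(5)
```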

Config/Logs/Screenshots
[ 7116,18,10:42:34.874,INF] ES "TcpConnection" closed [09:42:34.874: N"10.11.0.4:1112", L"10.10.0.3:54707", {b05c5910-e6b5-44c4-b2a3-b3a714f96da8}]:Received bytes: 5769, Sent bytes: 659
[ 7116,18,10:42:34.874,INF] ES "TcpConnection" closed [09:42:34.874: N"10.11.0.4:1112", L"10.10.0.3:54707", {b05c5910-e6b5-44c4-b2a3-b3a714f96da8}]:Send calls: 1, callbacks: 1
[ 7116,18,10:42:34.874,INF] ES "TcpConnection" closed [09:42:34.874: N"10.11.0.4:1112", L"10.10.0.3:54707", {b05c5910-e6b5-44c4-b2a3-b3a714f96da8}]:Receive calls: 14, callbacks: 14
[ 7116,18,10:42:34.874,INF] ES "TcpConnection" closed [09:42:34.874: N"10.11.0.4:1112", L"10.10.0.3:54707", {b05c5910-e6b5-44c4-b2a3-b3a714f96da8}]:Close reason: [Success] "Node state changed to ShuttingDown. Closing replication connection."
[ 7116,18,10:42:34.874,INF] Connection '"leader-normal"""' ["10.11.0.4:1112", {b05c5910-e6b5-44c4-b2a3-b3a714f96da8}] closed: Success.
[ 7116,69,10:42:34.874,DBG] Persistent Subscriptions have been stopped.
[ 7116,69,10:42:34.874,DBG] Persistent subscriptions received state change to ShuttingDown. Stopping listening
[ 7116,69,10:42:34.874,DBG] Persistent Subscriptions have been stopped.
[ 7116,18,10:42:34.875,DBG] SLOW QUEUE MSG ["MainQueue"]: "RequestShutdown" - 488ms. Q: 0/25.
[ 7116,69,10:42:34.876,INF] Live subscription 79fc7be1-8260-4d92-b3f1-2dc87e668cb8 to "GlobalCommands" disposed.
[ 7116,18,10:42:34.876,INF] ========== ["10.10.0.3:2113"] Service '"ReplicationTrackingService"' has shut down.
[ 7116,18,10:42:34.876,INF] Looks like node ["Unspecified/10.10.0.2:2113"] is DEAD (Gossip send failed).
[ 7116,44,10:42:34.876,DBG] Live subscription b8d39015-9c95-43a1-a875-2b228a174717 to "WdpConfig" dropped: Unsubscribed
[ 7116,44,10:42:34.876,DBG] Live subscription 2042b5a9-a024-4281-b91a-fc77fed67e89 to "WdpInfo" dropped: Unsubscribed
[ 7116,44,10:42:34.876,DBG] Live subscription f64780d9-aed5-4be7-996b-5b4e2abbcf64 to "WdpConfig" dropped: Unsubscribed
[ 7116,44,10:42:34.876,DBG] Live subscription 5fce7bfb-a1e8-4ab2-88d5-a581f190783d to "WdpInfo" dropped: Unsubscribed
[ 7116,18,10:42:34.876,INF] CLUSTER HAS CHANGED "gossip send failed to [Unspecified/10.10.0.2:2113]"
Old:
["Priority: 0 VND {5e565c72-fb3d-4d46-9e55-3fe2c8a0e3e0} [Leader, Unspecified/10.11.0.4:1112, n/a, Unspecified/10.11.0.4:1113, n/a, Unspecified/10.11.0.4:2113, (ADVERTISED: HTTP::0, TCP::0)] 266888975/266889129/266889129/E390@266887833:{1d36aa56-c23f-4d93-8b2f-ec5bff2a3dff} | 2023-01-03 09:42:34.371", "Priority: 0 VND {a21af6f5-ba4f-4320-ae79-c0307b65dd19} [ShuttingDown, Unspecified/10.10.0.3:1112, n/a, Unspecified/10.10.0.3:1113, n/a, Unspecified/10.10.0.3:2113, (ADVERTISED: HTTP::0, TCP::0)] 266885324/266885790/266885790/E387@266817565:{6ae6a7df-297d-4490-8e81-c92e1c08ed12} | 2023-01-03 09:42:34.874", "Priority: 0 VND {2806dd5c-33cf-4eb0-bf73-4684118e1a8f} [Unknown, Unspecified/10.10.0.2:1112, n/a, Unspecified/10.10.0.2:1113, n/a, Unspecified/10.10.0.2:2113, (ADVERTISED: HTTP::0, TCP::0)] 266888667/266888821/266888821/E390@266887833:{1d36aa56-c23f-4d93-8b2f-ec5bff2a3dff} | 2023-01-03 09:42:34.370"]
New:
["Priority: 0 VND {5e565c72-fb3d-4d46-9e55-3fe2c8a0e3e0} [Leader, Unspecified/10.11.0.4:1112, n/a, Unspecified/10.11.0.4:1113, n/a, Unspecified/10.11.0.4:2113, (ADVERTISED: HTTP::0, TCP::0)] 266888975/266889129/266889129/E390@266887833:{1d36aa56-c23f-4d93-8b2f-ec5bff2a3dff} | 2023-01-03 09:42:34.371", "Priority: 0 VND {a21af6f5-ba4f-4320-ae79-c0307b65dd19} [ShuttingDown, Unspecified/10.10.0.3:1112, n/a, Unspecified/10.10.0.3:1113, n/a, Unspecified/10.10.0.3:2113, (ADVERTISED: HTTP::0, TCP::0)] 266885324/266885790/266885790/E387@266817565:{6ae6a7df-297d-4490-8e81-c92e1c08ed12} | 2023-01-03 09:42:34.874", "Priority: 0 VND {2806dd5c-33cf-4eb0-bf73-4684118e1a8f} [Unknown, Unspecified/10.10.0.2:1112, n/a, Unspecified/10.10.0.2:1113, n/a, Unspecified/10.10.0.2:2113, (ADVERTISED: HTTP::0, TCP::0)] 266888667/266888821/266888821/E390@266887833:{1d36aa56-c23f-4d93-8b2f-ec5bff2a3dff} | 2023-01-03 09:42:34.876"]
[ 7116,18,10:42:34.876,INF] ========== ["10.10.0.3:2113"] Service '"StorageWriter"' has shut down.
[ 7116,18,10:42:34.876,INF] ========== ["10.10.0.3:2113"] Service '"StorageReader"' has shut down.
[ 7116,18,10:42:34.876,INF] ========== ["10.10.0.3:2113"] Service '"HttpServer [10.10.0.3:2113]"' has shut down.
[ 7116,18,10:42:34.877,INF] Looks like node ["Unspecified/10.11.0.4:1112"] is DEAD (TCP connection lost). Issuing a gossip to confirm.
[ 7116,18,10:42:34.883,INF] Gossip Received, The node ["Unspecified/10.11.0.4:2113"] is not DEAD.
[ 7116,18,10:42:34.885,INF] ========== ["10.10.0.3:2113"] Service '"Storage Chaser"' has shut down.
[ 7116,18,10:42:34.947,INF] ========== ["10.10.0.3:2113"] Service '"Index Committer"' has shut down.
[ 7116,18,10:42:34.947,INF] ========== ["10.10.0.3:2113"] All Services Shutdown.
[ 7116,18,10:42:34.982,INF] ========== ["10.10.0.3:2113"] IS SHUT DOWN.
[ 7116,18,10:42:35.017,INF] Exiting with exit code: 0.
Exit reason: "Shutdown and exit from process was requested."
[ 7116,18,10:42:35.017,DBG] SLOW QUEUE MSG ["MainQueue"]: "ServiceShutdown" - 70ms. Q: 0/3.
[ 7116,59,10:42:40.026,INF] Live subscription 2042b5a9-a024-4281-b91a-fc77fed67e89 to "WdpInfo" disposed.
[ 7116,38,10:42:40.026,INF] Live subscription b8d39015-9c95-43a1-a875-2b228a174717 to "WdpConfig" disposed.
[ 7116,69,10:42:40.026,INF] Live subscription f64780d9-aed5-4be7-996b-5b4e2abbcf64 to "WdpConfig" disposed.
[ 7116,71,10:42:40.026,INF] Live subscription 5fce7bfb-a1e8-4ab2-88d5-a581f190783d to "WdpInfo" disposed.

EventStore details

  • EventStore server version: 21.10.8

  • Operating system: Windows Server 2022

  • EventStore client version (if applicable):

Additional context

Zyklop added the kind/bug label Jan 3, 2023
@hayley-jean
Member

Hi @Zyklop, could you please check the logs of the node that shut down for the following line:

Leader [{leaderEndPoint},{leaderId:B}] subscribed us at {subscriptionPosition} (0x{subscriptionPosition:X}), which is less than our writer checkpoint {writerCheckpoint} (0x{writerCheckpoint:X}). TRUNCATION IS NEEDED.

If you find this line then it means that the node went offline for truncation, which is expected behaviour.
This usually happens when a Leader node is disconnected from the cluster, and then rejoins the cluster after a new Leader has been elected.
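If it helps, here is a quick way to scan a node's logs for that line; the log directory below is a placeholder, so point it at the node's actual log path:

```python
# Scan the node's log files for the truncation marker described above.
from pathlib import Path

LOG_DIR = Path(r"C:\EventStore\logs")  # placeholder; use the node's configured log path

for log_file in sorted(p for p in LOG_DIR.glob("*") if p.is_file()):
    for line_no, line in enumerate(log_file.read_text(errors="replace").splitlines(), 1):
        if "TRUNCATION IS NEEDED" in line:
            print(f"{log_file.name}:{line_no}: {line.strip()}")
```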

If you are running this on Windows, we recommend that you run EventStore as a service so that the process is automatically restarted when something like this occurs.

@Zyklop
Author

Zyklop commented Jan 5, 2023

Thanks @hayley-jean
I found this line in the logs.
The problem is that the process exited with code 0, so our service manager didn't restart it automatically. Maybe exiting with an error code would be more appropriate in this case?
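As a workaround until distinct exit codes exist, a small watchdog can restart the node whenever the process exits, regardless of exit code. A rough sketch, with placeholder paths for this setup:

```python
# Restart the node whenever the process exits, regardless of exit code.
import subprocess
import time

# Placeholder executable/config paths; adjust for the actual installation.
CMD = [
    r"C:\EventStore\EventStore.ClusterNode.exe",
    "--config", r"C:\EventStore\eventstore.conf",
]

while True:
    result = subprocess.run(CMD)
    print(f"EventStore exited with code {result.returncode}; restarting in 5 seconds")
    time.sleep(5)
```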

@ylorph
Contributor

ylorph commented Jan 23, 2023

ylorph added the quality of life label Jan 23, 2023
@hayley-jean
Member

While I don't think offline truncation should exit with an error code, having a different exit code for offline truncation vs. user-requested shutdown would be helpful.

We have an issue about having unique exit codes here: #3160
