Can serf recover a single cluster following a "long" network partition? #605

Open
davidMcneil opened this issue May 29, 2020 · 2 comments

@davidMcneil

If there is a network partition for a long period of time, can serf automatically recover? To define some terms:

  • long - longer than the suspicion timeouts (i.e., all nodes across the partition are confirmed dead)
  • recover - combine the resulting two clusters into a single cluster

For example, there are four nodes A, B, C, and D. A network partition causes A and B to be isolated into their own cluster, and C and D are isolated into a separate cluster. Now the network is fixed. Will the two split clusters be able to recover and form a single cluster with all four nodes?

One could manually heal the split by adding a node with peers from both clusters.
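
To make the manual heal concrete, here is a minimal toy sketch in Go. It does not use the serf library; `Cluster`, `NewCluster`, `Join`, and the extra node `E` are hypothetical names for illustration. It only assumes that a node joining peers on both sides ends up with the union of the two member lists.

```go
package main

import (
	"fmt"
	"sort"
)

// Cluster is a toy stand-in for one side's member list (a set of node names).
// It is not serf's actual data structure.
type Cluster map[string]bool

// NewCluster builds a member list from node names.
func NewCluster(names ...string) Cluster {
	c := Cluster{}
	for _, n := range names {
		c[n] = true
	}
	return c
}

// Join models a new node contacting a peer in each side of the split.
// Assumption: the state exchange leaves everyone with the union of all lists.
func Join(newNode string, sides ...Cluster) Cluster {
	merged := NewCluster(newNode)
	for _, s := range sides {
		for n := range s {
			merged[n] = true
		}
	}
	return merged
}

// Members returns the member names in sorted order.
func (c Cluster) Members() []string {
	out := make([]string, 0, len(c))
	for n := range c {
		out = append(out, n)
	}
	sort.Strings(out)
	return out
}

func main() {
	left := NewCluster("A", "B")  // one side of the partition
	right := NewCluster("C", "D") // the other side
	healed := Join("E", left, right)
	fmt.Println(healed.Members()) // prints [A B C D E]
}
```

In this model the bridging node `E` is what ties the two halves together; an existing node joining a peer on the other side would have the same effect.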

Thanks for the awesome project!

@max19931

max19931 commented Jan 28, 2022

When a "Suspicion" timeout expires, it turns into a "Dead" event. Normally the node in question could refute this and mark itself "Alive" again.
A network partition makes that impossible, so the node stays marked as dead.

After a serf node is marked as dead, the member is only kept around until the member list is cleaned up, which removes dead nodes. After that, a manual rejoin is required to reconnect the "Cluster".

https://www.serf.io/docs/internals/gossip.html describes the failure handling, but a look into the source can also help.

As always, most of the settings, especially the timeouts around the member list, can be set to very high values to keep all nodes around for longer.

@max19931

max19931 commented Jan 28, 2022

When a member rejoins, any nodes it knows about that are not already connected to the rest will also be added to the member list.

  • With snapshots: sure, the recovery timeout just needs to be set to a high value
  • Without snapshots: as the member is removed from the lists, only a manual rejoin can bring it and its peers back into the "Cluster"
