This repository has been archived by the owner on Nov 30, 2019. It is now read-only.
Upon joining a cluster, I noticed that the actual announcement is significantly delayed. In particular, the debug output shows that a node has joined a full 20 seconds before the OnNodeJoin or OnNewLeaves events are fired.
I tracked this down to line 646 of cluster.go. My understanding, from reading through the code and the debug output, is that the join process goes like this (please slap me if I've gone awry):
1. send join message to cluster
2. cluster accepts join message and node into cluster (debug output says "Node ... joined!")
3. cluster sends state tables back to the new node
4. the new node doesn't know it has successfully joined yet, waits for 2 * NETWORK_TIMEOUT and announces presence
5. after announcement is sent, the node proclaims itself joined
6. the OnNodeJoin and OnNewLeaves events are fired for all nodes in the cluster
I'm guessing that there is a reason why there is a delay set to 2 * NETWORK_TIMEOUT, but I'm not sure what it is. (Truthfully, my networking skills are pretty poor, so I dare not hazard a guess.)
I would be very happy to work on a fix for this problem; I'm just not sure what the fix would look like yet. Therefore, I am seeking guidance. :-)
My inclination is to try to announce the node's presence immediately and, if it fails, try again after a longer timeout. I just don't know what "if it fails" means in this context.
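To make the idea concrete, here is a rough sketch of "announce immediately, retry after longer waits". Everything in it is hypothetical: announceWithRetry, errRejected, defaultDelays and noSleep are names made up for illustration, and the sketch simply assumes that a failed announcement surfaces as a returned error, which is exactly the part that is unclear.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errRejected is a made-up error standing in for "the announcement failed".
var errRejected = errors.New("announcement rejected")

// defaultDelays are illustrative waits between attempts, each longer than the last.
var defaultDelays = []time.Duration{time.Second, 2 * time.Second, 4 * time.Second}

// noSleep lets the example run instantly instead of really waiting.
var noSleep = func(time.Duration) {}

// announceWithRetry calls announce right away. If it returns an error, it
// waits for the next delay and tries again until the delays run out.
// It returns the number of attempts made and the last error, if any.
func announceWithRetry(announce func() error, delays []time.Duration, sleep func(time.Duration)) (int, error) {
	err := announce()
	attempts := 1
	for _, d := range delays {
		if err == nil {
			break
		}
		sleep(d)
		err = announce()
		attempts++
	}
	return attempts, err
}

func main() {
	calls := 0
	// flaky fails twice, then succeeds.
	flaky := func() error {
		calls++
		if calls < 3 {
			return errRejected
		}
		return nil
	}
	attempts, err := announceWithRetry(flaky, defaultDelays, noSleep)
	fmt.Println(attempts, err)
}
```

In a real node, time.Sleep would be passed in place of noSleep; the open question remains what should count as a failure.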
Thanks!
This is not a bug, but is intended behaviour. It's not the optimal behaviour, I'll agree.
The reason the timeout is set to 2 * NETWORK_TIMEOUT is to ensure that Nodes have the chance to respond with a race condition. Waiting NETWORK_TIMEOUT allows us to be sure that if a message is going to be sent to a Node, it has been sent. Waiting another NETWORK_TIMEOUT allows us to be sure that if a race condition is going to be detected, it has been detected. So waiting 2 * NETWORK_TIMEOUT allows us to make sure that nobody is going to throw a race condition warning after we say we've joined the cluster.
This matters for the following reason. Suppose we have a cluster with Nodes whose IDs are 1, 4, and 5, and suppose Nodes 2 and 3 join at roughly the same time (i.e., they are both doing the join dance at the same time). Should 5 attempt to route a message with ID 3 while this is happening, it's possible that a race condition could lead Node 2 to believe it is the closest Node, when in fact 3 is. To counteract this, we wait until a Node has a full representation of its state tables before it announces its presence, which is a signal that it's ready to begin handling messages.
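A toy illustration of that scenario, not the library's routing code: closest and dist are hypothetical helpers that use plain absolute distance on small integers and break ties toward the lower ID. Both choices are assumptions made only so the example is small enough to follow.

```go
package main

import "fmt"

// dist is the absolute difference between two IDs (a simplification).
func dist(a, b int) int {
	if a > b {
		return a - b
	}
	return b - a
}

// closest returns the ID in view nearest to key. Ties go to the lower ID,
// which is an assumption for this example. view must not be empty.
func closest(view []int, key int) int {
	best := view[0]
	for _, id := range view[1:] {
		d, bd := dist(id, key), dist(best, key)
		if d < bd || (d == bd && id < best) {
			best = id
		}
	}
	return best
}

func main() {
	// Node 2's view while Node 3 is still joining: it has not heard of 3 yet.
	partial := []int{1, 2, 4, 5}
	// The view once both joins have completed.
	full := []int{1, 2, 3, 4, 5}

	fmt.Println(closest(partial, 3)) // Node 2 picks itself
	fmt.Println(closest(full, 3))    // the correct owner is Node 3
}
```

With the partial view, Node 2 would treat the message as delivered; with the full view, it would forward it to Node 3.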
For some clusters, this isn't that big an issue. Messages being mis-delivered for a few seconds could be a non-issue entirely. For some clusters, that consistency guarantee--that a Node will never consider a message delivered unless it's sure that it is the closest Node--is really important. To accommodate both situations, the timeout should probably be controllable via a standalone configuration value that just defaults to the stronger consistency guarantee.
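As a sketch of what such a setting might look like (none of these names exist in the codebase; JoinConfig, AnnounceDelay, withDelay and the 10-second networkTimeout are assumptions for illustration, the last chosen only because it matches the 20-second delay reported above):

```go
package main

import (
	"fmt"
	"time"
)

// networkTimeout is an assumed value, picked so that 2 * networkTimeout
// equals the 20-second delay described in this issue.
const networkTimeout = 10 * time.Second

// JoinConfig is a hypothetical configuration holder. A nil AnnounceDelay
// means "use the default", so an explicit zero delay stays expressible.
type JoinConfig struct {
	AnnounceDelay *time.Duration
}

// withDelay builds a JoinConfig with an explicit announcement delay.
func withDelay(d time.Duration) JoinConfig {
	return JoinConfig{AnnounceDelay: &d}
}

// announceDelay returns the configured delay, defaulting to the stronger
// consistency guarantee of 2 * networkTimeout when nothing is set.
func (c JoinConfig) announceDelay() time.Duration {
	if c.AnnounceDelay == nil {
		return 2 * networkTimeout
	}
	return *c.AnnounceDelay
}

func main() {
	strict := JoinConfig{}                   // default: wait out both timeouts
	relaxed := withDelay(0)                  // announce immediately
	tuned := withDelay(3 * time.Second)      // something in between

	fmt.Println(strict.announceDelay())
	fmt.Println(relaxed.announceDelay())
	fmt.Println(tuned.announceDelay())
}
```

Using a pointer for the field is one way to tell "unset" apart from "zero"; a separate boolean flag would work just as well.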
Any thoughts on this? I appreciate the feedback.