Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should Ra return an error while multiple nodes try to leave the cluster? #211

Open
RoadRunnr opened this issue Feb 2, 2021 · 5 comments

Comments

@RoadRunnr
Copy link

  • tested on 1.1.8

When multiple nodes try to leave the cluster at once (e.g. when trying to stop all nodes in a unit test), it can happen that the master leaves and another node also tries to leave at almost the same time. When the old master process has already terminated, but no new master has been elected yet, a call to ra:leave_and_delete_server on the local node can return {error, noproc}.

IMHO it should be {error,cluster_change_not_permitted} as long as there is at least one node (e.g. the local node) left in the cluster.

@michaelklishin
Copy link
Member

Sounds good. Ra intentionally limits membership changes to one at a time due to how complex reasoning about multi-member changes are. Some other Raft implementations have chosen to adopt the same limitations.

This is open source software, so please submit a PR.

@michaelklishin michaelklishin changed the title error while multiple nodes try to leave the cluster Should Ra return an error while multiple nodes try to leave the cluster? Feb 2, 2021
@RoadRunnr
Copy link
Author

I think this issue should not be a question, maybe I didn't describe the problem sufficiently enough.

When multiple ra nodes try to leave the cluster at about the same time, the result of calling ra:leave_and_delete_server can have the following values:

  • ok
  • {error,cluster_change_not_permitted}
  • {error,noproc}
  • a crash {'EXIT',{normal,{gen_statem,call,[Leader,{leader_call,{command,normal,{'$ra_leave',Node,await_consensus}}},5000]}}}

The {error,noproc} and the crash a clearly abnormal returns.

I would like to submit a PR to fix this bug. But the time required to understand the internals of ra, then fix the bug and possibly write a test case, means that this will happen sometime far, far in the future.

@kjnilsson
Copy link
Contributor

Some of these come from natural race conditions where a leader is removed and a request is issued to the old leader pid. I suspect retries is the only reasonable way forward here

@RoadRunnr
Copy link
Author

Some of these come from natural race conditions where a leader is removed and a request is issued to the old leader pid. I suspect retries is the only reasonable way forward here

That's what I'm doing in my test suite for all the error cases.
However, I would expect that in the noproc and in the crash case, the right thing for the API would be to return {error,cluster_change_not_permitted}. noproc would only make sense in the case where the targeted ServerRef is not running.

@kjnilsson
Copy link
Contributor

noproc is probably returned when the leader has gone away and the call is redirected to the now gone leader. If this is the case (I haven't validated) then I think noproc is appropriate

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants