If leader can't load snapshot cluster won't recover #522

Open
benbuzbee opened this issue Aug 31, 2022 · 3 comments

@benbuzbee
Contributor

Hello folks! I have a pretty lazy bug report here, so apologies for not going deeper, but I wanted to float a stance by you and see if I can get away with it.

We had a cluster of Nomad servers that lost quorum and would not elect a new leader.

Looking at the logs, the leader at the time was logging this:

2022-08-17T03:11:20.634Z    2022-08-17T03:11:20.634Z [ERROR] snapshot: failed to get snapshots: error="open /run/nomad-server/server/raft/snapshots: no such file or directory"
2022-08-17T03:11:20.634Z    2022-08-17T03:11:20.634Z [ERROR] snapshot: failed to scan snapshot directory: error="open /run/nomad-server/server/raft/snapshots: no such file or directory"
2022-08-17T03:11:20.634Z    failed to send snapshot to
2022-08-17T03:11:20.634Z    failed to list snapshots
2022-08-17T03:11:20.634Z    failed to get log
2022-08-17T03:11:20.608Z    failed to list snapshots
2022-08-17T03:11:20.608Z    failed to send snapshot to

And the other servers were logging this:

2022-08-17T03:07:55.677Z    error waiting for Raft index error=timed out after 5s waiting for index=1525203

So here is my stance:
If the leader is broken because it cannot load the snapshots (I have no idea how we got into this situation, but let's ignore that for now), the other server should realize the leader is useless and usurp him, perhaps by invoking the Praetorian Guard.

Or, more down to earth: this state should cause a heartbeat failure in some way so that we can move past it and elect a new leader.

What do you think?

@ncabatoff
Contributor

Hi @benbuzbee,

I'm not persuaded this is something that ought to be handled in the raft library itself. Moreover, the log you cite ("error waiting for Raft index") doesn't look like something from the library, but from Nomad, so it may be that what you're experiencing isn't purely a raft issue. I suggest you file this proposal as an issue on the https://github.com/hashicorp/nomad repo, and the maintainers of that project can decide whether it's better addressed in Nomad or here in the raft library.

@benbuzbee
Contributor Author

If that is where you think this best lives, sure. I raised it here mainly because this library is where leadership heartbeating lives.

The failure to load the snapshots comes from raft's file_snapshot.go. Offhand I am not sure where the retry loop lives, but I suspect it is in raft too.

Does Nomad actually have what it needs to detect raft failing to load snapshots, abort the retries, and modify the cluster?
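To make the question concrete, here is a rough sketch of what an application-level check might look like, assuming the application can learn the snapshot directory path and has some way to give up leadership. `guardLeadership`, `isLeader`, and `giveUp` are hypothetical names invented for this illustration; they are not part of Nomad or the raft library, and how `giveUp` would be wired to the library is left open.

```go
package main

import (
	"fmt"
	"os"
)

// checkSnapshotDir returns an error if dir cannot be listed. A missing
// directory produces the same "no such file or directory" condition
// seen in the leader's logs.
func checkSnapshotDir(dir string) error {
	_, err := os.ReadDir(dir)
	return err
}

// guardLeadership is a hypothetical application-side check. If this node
// is the leader and its snapshot directory is unusable, it calls giveUp.
// It reports whether leadership was given up.
func guardLeadership(dir string, isLeader func() bool, giveUp func() error) (bool, error) {
	if !isLeader() {
		return false, nil // followers have nothing to give up
	}
	if err := checkSnapshotDir(dir); err != nil {
		if gerr := giveUp(); gerr != nil {
			return false, fmt.Errorf("snapshot dir unusable (%v), giving up leadership failed: %w", err, gerr)
		}
		return true, nil
	}
	return false, nil
}

func main() {
	// Simulate the failure: create a directory, then remove it.
	dir, err := os.MkdirTemp("", "snapshots")
	if err != nil {
		panic(err)
	}
	os.RemoveAll(dir)

	gaveUp, err := guardLeadership(dir,
		func() bool { return true }, // pretend this node is the leader
		func() error { return nil }, // pretend giving up succeeds
	)
	fmt.Println(gaveUp, err)
}
```

Even if something like this is possible, it only polls from the outside. It cannot see the replication retries themselves, which is part of why the fix seems to belong closer to raft.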

@ncabatoff reopened this Jun 19, 2023
@ncabatoff
Contributor

Hi @benbuzbee,

I retract what I said earlier. I agree with your original statement:

If the leader is broken because it cannot load the snapshots [...] the other server should realize the leader is useless and usurp him

Possible fix: in replicateTo, if we can't load a snapshot, we should step down as leader. The current code specifically doesn't stop replication for this error; it probably should, but there are likely other details we need to consider here.
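As a minimal sketch of that idea, the toy model below steps down when the snapshot cannot be loaded instead of continuing to retry. This is not the library's actual `replicateTo`: the `node` type, the `loadSnapshot` hook, and `errSnapshotUnavailable` are all hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errSnapshotUnavailable is a hypothetical sentinel error for this sketch;
// it is not defined by the raft library.
var errSnapshotUnavailable = errors.New("snapshot unavailable")

// role is a toy stand-in for a node's Raft state.
type role int

const (
	follower role = iota
	leader
)

// node is a toy model of a server, not the library's Raft struct.
type node struct {
	state        role
	loadSnapshot func() error // hypothetical hook that opens the latest snapshot
}

// replicateOnce models one replication pass toward a follower that is far
// enough behind to need a snapshot. It returns stop=true when the
// replication loop should exit.
func (n *node) replicateOnce() (stop bool, err error) {
	if n.state != leader {
		return true, nil // only a leader replicates
	}
	if err := n.loadSnapshot(); err != nil {
		// Proposed behavior: a leader that cannot serve its own snapshot
		// steps down so the other servers can elect someone else.
		n.state = follower
		return true, fmt.Errorf("%w: %v", errSnapshotUnavailable, err)
	}
	return false, nil // snapshot is available, keep replicating
}

func main() {
	n := &node{
		state: leader,
		loadSnapshot: func() error {
			return errors.New("no such file or directory")
		},
	}
	stop, err := n.replicateOnce()
	fmt.Println(stop, n.state == follower, err)
}
```

The sketch leaves out the "other details" mentioned above, for example whether every snapshot error should trigger a step-down or only errors that are unlikely to clear on retry.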

@ncabatoff added the bug label Jan 29, 2024