If leader can't load snapshot cluster won't recover #522

benbuzbee · 2022-08-31T21:48:21Z

Hello folks! I have a pretty lazy bug report here, so apologies for not going deeper, but I wanted to float a stance by you and see if I can get away with it.

We had a cluster of Nomad servers that lost quorum and would not elect a new leader.

Looking at the logs, the leader at the time was logging this:

2022-08-17T03:11:20.634Z    2022-08-17T03:11:20.634Z [ERROR] snapshot: failed to get snapshots: error="open /run/nomad-server/server/raft/snapshots: no such file or directory"
2022-08-17T03:11:20.634Z    2022-08-17T03:11:20.634Z [ERROR] snapshot: failed to scan snapshot directory: error="open /run/nomad-server/server/raft/snapshots: no such file or directory"
2022-08-17T03:11:20.634Z    failed to send snapshot to
2022-08-17T03:11:20.634Z    failed to list snapshots
2022-08-17T03:11:20.634Z    failed to get log
2022-08-17T03:11:20.608Z    failed to list snapshots
2022-08-17T03:11:20.608Z    failed to send snapshot to

And the other servers were logging this:

2022-08-17T03:07:55.677Z    error waiting for Raft index error=timed out after 5s waiting for index=1525203

So here is my stance:
If the leader is broken because it cannot load the snapshots (I have no idea how we got into this situation, but let's ignore that for now), the other server should realize the leader is useless and usurp him, perhaps by invoking the Praetorian Guard.

Or, more down to earth: this state should cause a heartbeat failure in some way, so that we can move past it and elect a new leader.

What do you think?


ncabatoff · 2023-06-05T19:27:44Z

Hi @benbuzbee,

I'm not persuaded this is something that ought to be handled in the raft library itself. Moreover, the log you cite ("error waiting for Raft index") doesn't look like something from the library, but from Nomad, so it may be that what you're experiencing isn't purely a raft issue. I suggest you file this proposal as an issue on the https://github.com/hashicorp/nomad repo, and the maintainers of that project can decide whether it's better addressed in Nomad or here in the raft library.

benbuzbee · 2023-06-05T19:35:08Z

If that is where you think this best lives, fine. I think I suggested it here largely because this is where the heartbeating that tracks healthy leadership lives.

The failure to load the snapshots occurs in raft's file_snapshot.go. Offhand I am not sure where the retry loop lives, but I suspect it is in raft as well.

Does Nomad actually have what it needs to detect raft failing to load snapshots, abort the retries, and modify the cluster?

ncabatoff · 2023-06-19T19:25:48Z

Hi @benbuzbee,

I retract what I said earlier. I agree with your original statement:

If the leader is broken because it cannot load the snapshots [...] the other server should realize the leader is useless and usurp him

Possible fix: in replicateTo, if we can't load a snapshot, we should step down as leader. The current code specifically doesn't stop replication for this error; it probably should, but there are likely other details we need to consider here.

ncabatoff closed this as completed Jun 5, 2023

ncabatoff reopened this Jun 19, 2023

ncabatoff added the bug label Jan 29, 2024
