Enhancement: persist commit index in LogStore to accelerate recovery #549

Open
banks opened this issue Mar 15, 2023 · 4 comments

@banks
Member

banks commented Mar 15, 2023

Today I realised my mental model of how our Raft library works is subtly wrong.

I assumed that when a node restarts, it loads the most recent snapshot into the FSM and replays committed logs from disk into the FSM before rejoining the cluster.

In fact, we only load from the snapshot, since we currently don't persist any information about which logs are known to be committed, which means that replaying them all might apply a few uncommitted ones from the end of the log.

In practice this works OK in most cases: if the node is elected leader, it writes a new log entry which, as soon as it's committed, triggers applying all the other committed logs into the FSM too. If it's a follower, it doesn't catch up its FSM until the first AppendEntries RPC from the leader tells it the current commit index in the cluster.

As we consider reducing snapshot frequency now that the WAL LogStore can store many more logs performantly, this will become more of an issue: followers will be much more "stale" when they start serving requests.

To fix this correctly, we'd need to persist the last known commit index so that we know which prefix of the logs is safe to apply on startup. We could periodically persist it to the StableStore, but that increases the number of disk transactions and trades write frequency against how recent the recovered index is. Currently StableStore performance is not an issue because it's only updated infrequently, during elections; this would make it much more performance-critical.
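For illustration, persisting and recovering the index via the existing StableStore interface could look roughly like the sketch below. The "CommitIndex" key name and helper functions are made up; the library doesn't store such a key today.

```go
package raftutil

import "github.com/hashicorp/raft"

// commitIndexKey is an illustrative key name; hashicorp/raft does not
// persist such a key today.
var commitIndexKey = []byte("CommitIndex")

// saveCommitIndex writes the latest known commit index to the StableStore.
// Each call is an extra disk transaction.
func saveCommitIndex(s raft.StableStore, idx uint64) error {
	return s.SetUint64(commitIndexKey, idx)
}

// loadCommitIndex reads back the last persisted commit index on startup.
func loadCommitIndex(s raft.StableStore) (uint64, error) {
	return s.GetUint64(commitIndexKey)
}
```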

An alternative that would be more efficient but a little more complicated would be:

  1. Introduce a new optional interface called something like:
    type CommitTrackingLogStore interface {
        SetCommitIndex(idx uint64) error
        ReadCommitIndex() (uint64, error)
    }
  2. Have Raft detect whether the LogStore also implements this just before calling StoreLogs during replication or the leader's dispatchLogs; if it does, call SetCommitIndex first with the most recent commitIndex we know (see the sketch after this list).
  3. Implementations may store that index in memory until the next StoreLogs call and then write it out on disk along with the log data.
  4. When the LogStore recovers its state on open, it would also recover the last persisted commit index and load it into memory.
  5. When Raft starts up it can restore snapshot like now then...
  6. Check if the LogStore implements CommitTrackingLogStore. If it does, load the commit index from the store and replay all logs from lastSnapshotIndex up to and including the commit index that was persisted, before completing startup.
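A rough sketch of steps 1–2, assuming the interface and helper names above (none of this exists in the library today, and embedding raft.LogStore is just for convenience):

```go
package raftutil

import "github.com/hashicorp/raft"

// CommitTrackingLogStore is the optional interface proposed above; it is
// hypothetical and not part of the current hashicorp/raft API.
type CommitTrackingLogStore interface {
	raft.LogStore
	// SetCommitIndex records the latest known commit index; implementations
	// may buffer it and persist it alongside the next StoreLogs call.
	SetCommitIndex(idx uint64) error
	// ReadCommitIndex returns the commit index recovered when the store opened.
	ReadCommitIndex() (uint64, error)
}

// storeLogsWithCommitIndex shows how Raft could detect the optional interface
// just before StoreLogs (during replication or the leader's dispatchLogs) and
// hand it the most recent commit index it knows about.
func storeLogsWithCommitIndex(s raft.LogStore, commitIdx uint64, logs []*raft.Log) error {
	if cts, ok := s.(CommitTrackingLogStore); ok {
		if err := cts.SetCommitIndex(commitIdx); err != nil {
			return err
		}
	}
	return s.StoreLogs(logs)
}
```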
@otoolep
Contributor

otoolep commented Jun 22, 2023

This would be very useful, as rqlite (which uses this library) can also encounter long start-up times. Knowing which of the logs not contained in the latest snapshot are actually committed would solve this problem; I could optimize replay at start-up.

Is there any way to detect which are committed at this time? I may modify my own fork of Raft if needed, until this repo makes your suggested improvement (or something like it). It doesn't have to be perfect; even knowing, say, 75% of the logs that were definitely committed would help. So calling SetCommitIndex only every so often would still help.

@otoolep
Contributor

otoolep commented Jun 22, 2023

Though I might just add a goroutine to my code which polls raft.AppliedIndex() every few seconds and writes it to disk. It doesn't have to be up-to-date, just enough to make a difference to start-up times, and primarily to make sure I don't go too far in the Raft log during replay and apply uncommitted entries.
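A minimal sketch of that polling approach (the file path, interval, and fsync-free write are illustrative, not taken from rqlite):

```go
package main

import (
	"encoding/binary"
	"os"
	"time"

	"github.com/hashicorp/raft"
)

// persistAppliedIndex periodically writes raft.AppliedIndex() to a file so
// that, on restart, log replay can safely stop at an index known to have been
// applied. Staleness is bounded by the polling interval.
func persistAppliedIndex(r *raft.Raft, path string, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			buf := make([]byte, 8)
			binary.BigEndian.PutUint64(buf, r.AppliedIndex())
			// Best-effort write; a real implementation would also fsync
			// and handle the error rather than ignore it.
			_ = os.WriteFile(path, buf, 0o644)
		case <-stop:
			return
		}
	}
}
```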

@banks
Member Author

banks commented Nov 20, 2023

@otoolep sorry I didn't see your comments here until now.

Yeah, what you proposed would work too, either inside the library or from the outside. It would be easy enough to trade off how stale things might be on startup against how frequently you write to disk; even writing to disk once per second would have minimal impact on modern SSDs while still bounding the staleness after start-up to just one second.

If you do try this, please update here so we can hear about your experience. I'd still like the library to handle this eventually, and it's not that difficult to do, but I have no idea when we'd be able to prioritize that.

@otoolep
Contributor

otoolep commented Nov 21, 2023

Yes, that's what I did for a while -- added a goroutine that wrote the applied index to BoltDB every few seconds. Then, at start-up, rqlite put its SQLite database in a "fast-update" mode while applying logs I knew for a fact were applied (using the index retrieved from BoltDB at startup). Once that was done, rqlite would change its SQLite instance back to a more conservative update strategy (which results in slower writes, however).

However, I've since backed out the change, and rqlite now just runs SQLite in the "fast update" mode (PRAGMA synchronous=off) all the time. I did this because the SQLite docs state that the worst that could happen if rqlite were to crash, with SQLite in this mode, would be a corrupt SQLite database. But since rqlite deletes and rebuilds the entire DB from scratch on restart, this potential for corruption is a non-issue. So I've got fast start-up times now in all circumstances.

TBH I should have run SQLite with PRAGMA synchronous=off from the very start of my project's development, but I wasn't aware of it at the time.
