raft: advance commit index safely #139

pav-kv · 2024-01-26T02:29:26Z

This change makes the commit index advancement in handleHeartbeat safe. Previously, a follower would attempt to update the commit index to whichever was sent in the MsgHeartbeat message. Out-of-bound indices would crash the node.

It is always safe to advance a commit index if the follower's log is "in sync" with the leader, i.e. when its log is guaranteed to be a prefix of the leader's log. This is always true if the term of last entry in the log matches the leader team, otherwise this guarantee is established when the first MsgApp append message from the leader succeeds.

At the moment, the leader will never send a commit index that exceeds the follower's log size. However, this may change in future. This change is a defence-in-depth.

The newly added raftLog.leaderTerm field will be used for other safety checks in the future, for example to establish that overriding a suffix of entries in raftLog is safe.

Informs #138
Related to #144

pav-kv · 2024-01-26T02:45:03Z

@ahrtr @nvanbenschoten PTAL

raft.go

pav-kv · 2024-01-26T04:44:39Z

raft_test.go

 	}

 	for i, tt := range tests {
 		storage := newTestMemoryStorage(withPeers(1, 2))
-		storage.Append([]pb.Entry{{Index: 1, Term: 1}, {Index: 2, Term: 2}, {Index: 3, Term: 3}})


This test was incorrect previously. The third entry term is 3, while the leader Term is 2. It was incorrect to update the commit index to 3 in this case, because the leader 2 would never commit this entry.

This basically demonstrates that we fixed the unsafety. Now it's impossible for the follower to advance the commit index to 3 in this case.

This change makes the commit index advancement in handleHeartbeat safe. Previously, a follower would attempt to update the commit index to whichever was sent in the MsgHeartbeat message. Out-of-bound indices would crash the node. It is always safe to advance a commit index if the follower's log is "in sync" with the leader, i.e. when its log is guaranteed to be a prefix of the leader's log. This is always true if the term of last entry in the log matches the leader team, otherwise this guarantee is established when the first MsgApp append message from the leader succeeds. At the moment, the leader will never send a commit index that exceeds the follower's log size. However, this may change in future. This change is a defence-in-depth. The newly added raftLog.leaderTerm field will be used for other safety checks in the future, for example to establish that overriding a suffix of entries in raftLog is safe. Signed-off-by: Pavel Kalinnikov <pavel@cockroachlabs.com>

nvanbenschoten · 2024-01-29T14:53:38Z

log.go

+	// does for the leader. For followers and candidates, when we first learn or
+	// bump to a new term, we don't have a proof that our log is consistent with
+	// the new term's leader (current or prospective). The new leader may override
+	// any suffix of the log after the committed index. Only when the first append


"the log", "the committed index"

Are you referring to the replicated log here, or this replica's local log? In other words, should these "the"s be replaced by "our"?

Both, but mostly the "replicated". The leader can override only a suffix of the "replicated" log after the "replicated" commit index.

"Our" log is lagging the leader's log, such as our committed index. For "our" log, this implies that the leader can override a suffix after our committed index (but we have no way of checking by how much it can go back).

nvanbenschoten · 2024-01-29T14:56:10Z

log.go

+	// log is guaranteed to be a prefix of this term's leader log.
+	//
+	// The leaderTerm can be safely updated to `t` if:
+	//	1. the last entry in the log has term `t`, or, more generally,


Are there cases where (2) is insufficient and (1) is needed? I'm curious why we can't make this more specific. You say "We use (1) to initialize leaderTerm" below. Is this important on startup?

Only (2) is necessary. (1) is a cheap "stateful" version of (2).

On a server restart, we forget who did the last append. (1) gives the best guess, and allows recovering to up-to-date state for idle raft groups that haven't seen recent appends except the current leader's empty entry. For non-idle groups, or followers whose log is significantly behind the leader's and doesn't end with entries at its term, recovering to raftLog.lastTerm() gives no value, and is equivalent to setting leaderTerm = 0. To recover, these will then have to "wait" for the next append message from this leader.

It would be more ideal if this field was stored in some HardState.LeaderTerm - then we would always recover to up-to-date state. Note that we can't reuse HardState.Term because HardState.Term can be > leaderTerm (for the same reason why r.Term can be > leaderTerm).

To bring analogy with Paxos, the local raftLog is an acceptor, and raftLog.leaderTerm is the id of the highest accepted proposal. The election term is sort of orthogonal to this - the election term (r.Term / HardState.Term) can briefly jump in front of the accepted proposal term until there is an accepted proposal at this new term.

If we ever want to bring this to the next level: MsgApp messages should not be rejected based on r.Term / HardState.Term. For correctness, it is only necessary to reject MsgApp if the message term is < raftLog.leaderTerm / HardState.LeaderTerm. I think this would reduce some unnecessary rejects during the leader election flux time. caveat: #139 (comment)

I should note the invariant here, and maybe check it in a few places or tests:

RawNode.raft.Term >= raftLog.leaderTerm >= raftLog.lastTerm()

(1) initializes leaderTerm to lastTerm(), which is safe because raft.Term >= raftLog.lastTerm()
(2) maintains it

allows recovering to up-to-date state for idle raft groups that haven't seen recent appends except the current leader's empty entry. For non-idle groups, or followers whose log is significantly behind the leader's and doesn't end with entries at its term, recovering to raftLog.lastTerm() gives no value, and is equivalent to setting leaderTerm = 0. To recover, these will then have to "wait" for the next append message from this leader.

This is what I was hoping to clarify. Initializing to lastTerm is an opportunistic way to allow a restarted follower in idle raft groups to immediately advance its commit index on startup without needing to first accept a MsgApp.

Is it anything more than that? For an idle raft group, a follower with an up-to-date log but a lagging commit index may restart and never receive any new entries. If we didn't have (1) and we started discarding commit indexes in heartbeats with terms > raftLog.leaderTerm, would the commit index on the follower get stuck?

That's a good point. I think we indeed need (1). Also, (1) plays nicely with the invariant that I put above.

If we ever want to bring this to the next level: MsgApp messages should not be rejected based on r.Term / HardState.Term. For correctness, it is only necessary to reject MsgApp if the message term is < raftLog.leaderTerm / HardState.LeaderTerm. I think this would reduce some unnecessary rejects during the leader election flux time.

Does this mean MsgApp from old leader will be accepted by voters that voted to a new leader but yet received any log from the new leader? In that case, there may be committed entry in old leader but not in the new leader.

@joshuazh-x That's a good point. We don't want a quorum of nodes to accept entries unless we are sure these entries are consistent with the new leader's log.

So it's safest to accept only MsgApp.Term >= raft.Term. But we could sometimes accept MsgApp.Term < raft.Term if:

The MsgApp contains the (index, term) entry for which we voted when electing the raft.Term leader. If that election wins, we know the new leader will append right after this entry.

We would truncate MsgApp.Entries at the aforementioned index we voted for, and append it.

This guarantees that this append is consistent with the new leader's log (i.e. the new leader would send us the exact same entries).

A vote is a promise to the leader not to accept any entries that are not in the leader's log. If we can deduce that an entry is in the leader's log (before / other than by getting a MsgApp directly from this leader), we can always safely accept it.

It's unclear if such an optimization would give any value (like reduce replication latency in some cases; probably it does avoid a duplicate MsgApp from the new leader when the election races with the old leader appends), so I will leave it as an exercise for later :) Looks like a complication.

Filed #150 with a more general technique that will bring more benefits.

log.go

nvanbenschoten · 2024-01-29T14:59:22Z

raft.go

@@ -935,6 +935,8 @@ func (r *raft) becomeLeader() {
 	// so the preceding log append does not count against the uncommitted log
 	// quota of the new leader. In other words, after the call to appendEntry,
 	// r.uncommittedSize is still 0.
+
+	r.raftLog.leaderTerm = r.Term // the leader's log is consistent with itself


nit: is is common to manipulate fields in raftLog directly? Should we tuck this behind a method? Same question below.

Unfortunately, the direct manipulation on raftLog and Progress fields is scattered across the raft.go too. This goes into the TODO above: all the correctness handling should be done at the level below.

In this particular case/PR though, we can move the leaderTerm update into the r.appendEntry call a few lines above, which in turn will delegate this to r.raftLog.append method. This will be more correct, I agree.

Done this (see second commit).

Unfortunately, too many things have incorrect interfaces: the raftLog append and commit methods don't take the leader term into consideration. This lead to some tests testing scenarios which can never happen.

Maybe it would be good to reverse the order of commits here: first clean up the raftLog interfaces, and then this PR will be a small change.

nvanbenschoten · 2024-01-29T15:02:57Z

raft.go

+	// TODO(pav-kv): move this logic to r.raftLog, which is more appropriate for
+	// handling safety. The raftLog can use leaderTerm for other safety checks.
+	// For example, unstable.truncateAndAppend currently may override a suffix of
+	// the log unconditionally, but it can only be done if m.Term > leaderTerm.


I'm +1 on these additional safety checks / assertions, but they pose potential availability risk if we get them wrong and start rejecting valid state transitions. Should we make a habit of adding logging in the cases where we drop messages, so that any bugs are observable? For example, logging on an else branch here about how we're ignoring the commit index.

Otherwise, entries at this index may mismatch.

Did you ever see a real issue or can you create a test to reproduce the issue (mismatch)?

@ahrtr Currently, this will not occur because the leader cuts the commit index at the follower's Progress.Match. This is explained in #138. The solution to #138 is exactly to start sending commit indices that may be out-of-bound (but handling them correctly here on the follower).

I can confirm, however, that in a test environment in CRDB I've seen out-of-bound commit indices in this line, triggering panic in r.raftLog.commitTo. This happened in a particular kind of tests in which writes on the leader or other followers are sometimes not durable. It's bad to crash this follower only because writes on other participants are not durable - we should handle it more gracefully. Esp. in CRDB case when multiple raft groups are hosted by the same process.

This PR strengthens this code so that it handles out-of-bound indices correctly, and proceeds only if it's safe.

Re @nvanbenschoten comment about logging: I don't know if we should log dropped messages. I think we should only be reporting safety properties violation - that means there is a bug in raft or some state corruption. For cases (like this) when dropped messages can be a direct result of asynchrony / races / distributed nature of the system - it should be ok to silently disregard them (moreover, raft is built with the assumption that messages can be dropped arbitrarily).

I don't expect we would be dropping any messages here currently (except in the test environment case that I described above).

On a second thought, we might opt to distinguish legit message drops vs safety violations. Safety violations should crash offending nodes, or have other means of appearing on radars (e.g. a raft group can be "bricked", and this can surface in monitoring). See #18.
Legit messages drops, OTOH, don't have to be loud, but we can still log them for informational purposes (and, indeed, in cases of liveness bugs logs can help with debugging).

pav-kv · 2024-01-29T16:01:09Z

log.go

+	// log is guaranteed to be a prefix of this term's leader log.
+	//
+	// The leaderTerm can be safely updated to `t` if:
+	//	1. the last entry in the log has term `t`, or, more generally,


Only (2) is necessary. (1) is a cheap "stateful" version of (2).

On a server restart, we forget who did the last append. (1) gives the best guess, and allows recovering to up-to-date state for idle raft groups that haven't seen recent appends except the current leader's empty entry. For non-idle groups, or followers whose log is significantly behind the leader's and doesn't end with entries at its term, recovering to raftLog.lastTerm() gives no value, and is equivalent to setting leaderTerm = 0. To recover, these will then have to "wait" for the next append message from this leader.

It would be more ideal if this field was stored in some HardState.LeaderTerm - then we would always recover to up-to-date state. Note that we can't reuse HardState.Term because HardState.Term can be > leaderTerm (for the same reason why r.Term can be > leaderTerm).

To bring analogy with Paxos, the local raftLog is an acceptor, and raftLog.leaderTerm is the id of the highest accepted proposal. The election term is sort of orthogonal to this - the election term (r.Term / HardState.Term) can briefly jump in front of the accepted proposal term until there is an accepted proposal at this new term.

If we ever want to bring this to the next level: MsgApp messages should not be rejected based on r.Term / HardState.Term. For correctness, it is only necessary to reject MsgApp if the message term is < raftLog.leaderTerm / HardState.LeaderTerm. I think this would reduce some unnecessary rejects during the leader election flux time. caveat: #139 (comment)

pav-kv · 2024-01-29T16:09:46Z

log.go

+	// does for the leader. For followers and candidates, when we first learn or
+	// bump to a new term, we don't have a proof that our log is consistent with
+	// the new term's leader (current or prospective). The new leader may override
+	// any suffix of the log after the committed index. Only when the first append


Both, but mostly the "replicated". The leader can override only a suffix of the "replicated" log after the "replicated" commit index.

"Our" log is lagging the leader's log, such as our committed index. For "our" log, this implies that the leader can override a suffix after our committed index (but we have no way of checking by how much it can go back).

pav-kv · 2024-01-29T16:13:37Z

raft.go

@@ -935,6 +935,8 @@ func (r *raft) becomeLeader() {
 	// so the preceding log append does not count against the uncommitted log
 	// quota of the new leader. In other words, after the call to appendEntry,
 	// r.uncommittedSize is still 0.
+
+	r.raftLog.leaderTerm = r.Term // the leader's log is consistent with itself


Unfortunately, the direct manipulation on raftLog and Progress fields is scattered across the raft.go too. This goes into the TODO above: all the correctness handling should be done at the level below.

In this particular case/PR though, we can move the leaderTerm update into the r.appendEntry call a few lines above, which in turn will delegate this to r.raftLog.append method. This will be more correct, I agree.

pav-kv · 2024-01-29T16:15:52Z

raft.go

@@ -1735,6 +1737,7 @@ func (r *raft) handleAppendEntries(m pb.Message) {
 		return
 	}
 	if mlastIndex, ok := r.raftLog.maybeAppend(m.Index, m.LogTerm, m.Commit, m.Entries...); ok {
+		r.raftLog.leaderTerm = m.Term // the log is now consistent with the leader


@nvanbenschoten This should move to r.raftLog.maybeAppend, relatedly to your other comment.

Also done in the second commit.

ahrtr · 2024-01-29T17:21:52Z

log.go

+	// During normal operation, leaderTerm matches the node term though. During a
+	// leader change, it briefly lags behind, and matches again when the first
+	// append message succeeds.
+	leaderTerm uint64


My immediate feeling is that it's a little weird to add a leaderTerm into raftLog, which shouldn't care about the info (leader's Term). It should be part of the raft instead of raftLog.

See comment https://github.com/etcd-io/raft/pull/139/files#r1469818174, specifically the Paxos analogy.

Semantically, raftLog is layered under raft.go. The raftLog is agnostic to election, and essentially only implements the "Paxos acceptor" role. So it makes quite a good sense to me to have a "last accepted term" notion in raftLog. Moving this logic up in raft.go would be a layering violation of sorts.

which shouldn't care about the info (leader's Term)

raftLog should care about the leader term with which it is consistent (i.e. what I called the leaderTerm here). It should not care about the RawNode.Term / HardState.Term, because the latter can be bumped arbitrarily when a new election happens, and this bump is not coordinated with raftLog.

There is an invariant at raft.go level that is relevant here: RawNode.Term >= raftLog.leaderTerm. We use RawNode.Term instead of raftLog.leaderTerm in a few places (notably, to reject append messages), and have only been safe because of this implicit invariant. I think we should make it more explicit.

This change improves safety of the append operations on raftLog. This helped fixing some tests which made incorrect assumptions about the log. Signed-off-by: Pavel Kalinnikov <pavel@cockroachlabs.com>

pav-kv · 2024-02-04T03:31:05Z

This PR is currently blocked on a clean-up started in #145. Once the logSlice type is propagated to unstable, this PR will become nice and simple. Degrading it to Draft for now.

pav-kv requested a review from ahrtr January 26, 2024 02:30

pav-kv force-pushed the commit-index-safety branch from 2584c41 to eb6bfc6 Compare January 26, 2024 02:42

This was referenced Jan 26, 2024

Plan to release raft 3.6.0 #89

Open

Lagging follower commit after message drops #138

Open

pav-kv commented Jan 26, 2024

View reviewed changes

raft.go Outdated Show resolved Hide resolved

pav-kv force-pushed the commit-index-safety branch 4 times, most recently from f727406 to f64d156 Compare January 26, 2024 04:38

pav-kv commented Jan 26, 2024

View reviewed changes

pav-kv force-pushed the commit-index-safety branch 4 times, most recently from 30f9d89 to 73c85a0 Compare January 27, 2024 17:17

pav-kv force-pushed the commit-index-safety branch from 73c85a0 to 87ac09e Compare January 27, 2024 17:37

nvanbenschoten reviewed Jan 29, 2024

View reviewed changes

pav-kv commented Jan 29, 2024

View reviewed changes

ahrtr reviewed Jan 29, 2024

View reviewed changes

raft: pass the leader term to append/commit methods

4e08f52

This change improves safety of the append operations on raftLog. This helped fixing some tests which made incorrect assumptions about the log. Signed-off-by: Pavel Kalinnikov <pavel@cockroachlabs.com>

This was referenced Jan 30, 2024

raft: add entryID and logSlice types #145

Merged

log: fixup raftLog initialization #151

Merged

pav-kv marked this pull request as draft February 4, 2024 03:30

k8s-ci-robot added the do-not-merge/work-in-progress label Feb 4, 2024

pav-kv mentioned this pull request Feb 13, 2024

raft: consolidate all append message sending #134

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

raft: advance commit index safely #139

raft: advance commit index safely #139

pav-kv commented Jan 26, 2024 •

edited

pav-kv commented Jan 26, 2024

pav-kv Jan 26, 2024 •

edited

nvanbenschoten Jan 29, 2024

pav-kv Jan 29, 2024 •

edited

nvanbenschoten Jan 29, 2024

pav-kv Jan 29, 2024 •

edited

pav-kv Jan 30, 2024 •

edited

nvanbenschoten Jan 30, 2024

pav-kv Jan 30, 2024

joshuazh-x Feb 1, 2024

pav-kv Feb 1, 2024 •

edited

nvanbenschoten Jan 29, 2024

pav-kv Jan 29, 2024

pav-kv Jan 29, 2024

nvanbenschoten Jan 29, 2024

ahrtr Jan 29, 2024

pav-kv Jan 29, 2024 •

edited

pav-kv Jan 29, 2024

pav-kv Jan 29, 2024 •

edited

pav-kv Jan 29, 2024 •

edited

pav-kv Jan 29, 2024

pav-kv Jan 29, 2024

pav-kv Jan 29, 2024

ahrtr Jan 29, 2024

pav-kv Jan 29, 2024 •

edited

pav-kv commented Feb 4, 2024

raft: advance commit index safely #139

Are you sure you want to change the base?

raft: advance commit index safely #139

Conversation

pav-kv commented Jan 26, 2024 • edited

pav-kv commented Jan 26, 2024

pav-kv Jan 26, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pav-kv Jan 29, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pav-kv Jan 29, 2024 • edited

Choose a reason for hiding this comment

pav-kv Jan 30, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pav-kv Feb 1, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pav-kv Jan 29, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pav-kv Jan 29, 2024 • edited

Choose a reason for hiding this comment

pav-kv Jan 29, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pav-kv Jan 29, 2024 • edited

Choose a reason for hiding this comment

pav-kv commented Feb 4, 2024

pav-kv commented Jan 26, 2024 •

edited

pav-kv Jan 26, 2024 •

edited

pav-kv Jan 29, 2024 •

edited

pav-kv Jan 29, 2024 •

edited

pav-kv Jan 30, 2024 •

edited

pav-kv Feb 1, 2024 •

edited

pav-kv Jan 29, 2024 •

edited

pav-kv Jan 29, 2024 •

edited

pav-kv Jan 29, 2024 •

edited

pav-kv Jan 29, 2024 •

edited