Configuration change validation has false positives #80

tbg · 2023-07-10T14:03:07Z

We currently validate configuration changes at proposal time

Lines 1224 to 1260 in 4abd9e9

    
           for i := range m.Entries { 
        
           	e := &m.Entries[i] 
        
           	var cc pb.ConfChangeI 
        
           	if e.Type == pb.EntryConfChange { 
        
           		var ccc pb.ConfChange 
        
           		if err := ccc.Unmarshal(e.Data); err != nil { 
        
           			panic(err) 
        
           		} 
        
           		cc = ccc 
        
           	} else if e.Type == pb.EntryConfChangeV2 { 
        
           		var ccc pb.ConfChangeV2 
        
           		if err := ccc.Unmarshal(e.Data); err != nil { 
        
           			panic(err) 
        
           		} 
        
           		cc = ccc 
        
           	} 
        
           	if cc != nil { 
        
           		alreadyPending := r.pendingConfIndex > r.raftLog.applied 
        
           		alreadyJoint := len(r.prs.Config.Voters[1]) > 0 
        
           		wantsLeaveJoint := len(cc.AsV2().Changes) == 0 
        
           		var refused string 
        
           		if alreadyPending { 
        
           			refused = fmt.Sprintf("possible unapplied conf change at index %d (applied to %d)", r.pendingConfIndex, r.raftLog.applied) 
        
           		} else if alreadyJoint && !wantsLeaveJoint { 
        
           			refused = "must transition out of joint config first" 
        
           		} else if !alreadyJoint && wantsLeaveJoint { 
        
           			refused = "not in joint state; refusing empty conf change" 
        
           		} 
        
           		if refused != "" { 
        
           			r.logger.Infof("%x ignoring conf change %v at config %s: %s", r.id, cc, r.prs.Config, refused) 
        
           			m.Entries[i] = pb.Entry{Type: pb.EntryNormal} 
        
           		} else { 
        
           			r.pendingConfIndex = r.raftLog.lastIndex() + uint64(i) + 1 
        
           		} 
        
           	}

This is not very helpful because it has false positives (refusing a config change that is actually allowed) though at least it currently doesn't have false negatives (because the set of false positives is sufficiently large 😄)

It could be more effective to either compare against the actual most recent config change in the (including the unstable) log, or to move the verification to apply time (i.e. erroring out conf changes that are illegal).

See #81.

Relies on etcd-io/raft#81. We don't want it to, since that causes issues due to false positives. We are taking responsibility for carrying out only valid conf changes, as we always have. See also etcd-io/raft#80. Fixes cockroachdb#105797. Epic: CRDB-25287 Release note (bug fix): under rare circumstances, a replication change could get stuck when proposed near lease/leadership changes (and likely under overload), and the replica circuit breakers could trip. This problem has been addressed. Note to editors: this time it's really addressed (fingers crossed); a previous attempt with an identical release note had to be reverted.

tbg · 2023-07-12T06:35:06Z

in #81 (which adds a DisableConfChangeValidation option), @pavelkalinnikov says:

I think we should fix this lag in raft more fundamentally, in such a way that the DisableConfChangeValidation option is not needed. This mechanism should just not be best-effort, and should not have false positives/negatives. Config changes need some reconsideration.

Relying solely on above-raft config change mechanisms (such as in CRDB) also smells like a footgun. Config changes seem fundamentally belonging to raft layer, at least w.r.t. correctness of quorum rules etc.

Currently we do a collection of tricks which makes correctness reasoning hard:

scan the log to see if we have an unapplied config
guess if we have a pending config change
silently nullify incorrect config changes, or silently accept them (with the
DisableConfChangeValidation option on).

Instead, I think we should just precisely track and know the current and prospective config from the explicit state, and explicitly error out on incorrect transitions. I.e., when we commit a pending config change, it becomes part of the state. To do so, we would need to decouple config changes from logs (otherwise, again, we would have to have a "scan the log" step to find the next config, e.g. when entries are appended or the commit index is advanced).

More generally, it would be nice to have separate state machines: raft itself, and the application-level state machine that it supports.

There are more arguments for this decoupling:

Config changes are the only entries that raft needs to see to operate correctly. All other log entries (which often include user data etc) don't need to go through raft package, it's enough to communicate metadata such as term/index of stored entries.
The application-level commands thus don't have to be embedded into and carried with the protobufs format that raft package imposes.
The application level thus doesn't have to demultiplex raft config changes and its own logs.
Performance reason. Say, there is a leader and 2 slow followers who are close to stall. It is nice to be able to relocate the two followers to other nodes without waiting for them to catch up. Currently we have to commit a config change through this slow quorum's log, but with decoupling it should be possible to do it bypassing the log, with a "higher priority".
raft doesn't look at application-level commands (which is most of them), but plays disproportionate role in managing resources and driving the flow of application-level entries: Externalize log entry flow control #64. The control should be at the application end. Having app-level commands interleaved with config changes complicates this control though, but decoupling should help.

DisableConfChangeValidation turns off propose-time verification of configuration changes against the currently active configuration of the raft instance. These checks are generally sensible (cannot leave a joint config unless in a joint config, et cetera) but they have false positives because the active configuration may not be the most recent configuration. This is because configurations are activated during log application, and even the leader can trail log application by an unbounded number of entries. Symmetrically, the mechanism has false negatives - because the check may not run against the "actual" config that will be the predecessor of the newly proposed one, the check may pass but the new config may be invalid when it is being applied. In other words, the checks are best-effort. Users should *not* use this option unless they have a reliable mechanism (above raft) that serializes and verifies configuration changes. In CockroachDB, we doubly hit the false positive case. In CockroachDB, each proposal (including configuration changes) has a UUID which the leader tracks while the command is inflight. To detect that a command is no longer inflight, the leader has to see it apply (or needs some proof that it will never apply in the future). When a CockroachDB Replica loses leadership in the middle of a configuration change, a new leader may carry out additional changes that now take effect, while the old Replica is still trying to finalize its old, now incompatible, configuration change. What CockroachDB needs is to see the old proposal in the log so that it can be rejected at apply time, thus terminating the inflight status of the old configuration change. But since raft will forward the proposal to the leader where it is rejected (since it's incompatible with the active, much newer, config), this would never happen, thus leaking an inflight[^1]. The other case became obvious when we tried to work around the above by terminating the config change when we noticed that raft rejected it. Because the rejection is not made against a stable basis (it's against whatever config is active at the given time), legitimate configuration changes would be rejected frequently. In particular, this could occur in a way that would still result in the respective entries becoming applied later! This caused corruption in CockroachDB[^2] by leading to a divergence between what we thought the config was and what it actually was. CockroachDB does have a higher-level protection mechanism (configuration changes are serialized through transactions reading from and writing to the same key). The intention is for CockroachDB to use this setting, however it is likely that other systems are susceptible to similar issues (and perhaps unknowingly so). This setting is thus of general interest. Touches #80. [^1]: cockroachdb/cockroach#105797 (comment) [^2]: cockroachdb/cockroach#106172 (comment) Signed-off-by: Tobias Grieger <tobias.b.grieger@gmail.com>

Relies on etcd-io/raft#81. We don't want it to, since that causes issues due to false positives. We are taking responsibility for carrying out only valid conf changes, as we always have. See also etcd-io/raft#80. Fixes cockroachdb#105797. Epic: CRDB-25287 Release note (bug fix): under rare circumstances, a replication change could get stuck when proposed near lease/leadership changes (and likely under overload), and the replica circuit breakers could trip. This problem has been addressed. Note to editors: this time it's really addressed (fingers crossed); a previous attempt with an identical release note had to be reverted.

105368: backupccl: add unit tests for FileSSTSink r=rhu713 a=rhu713 Backfill unit tests for the basic functionality of FileSSTSink with additional test cases involving inputs of keys with many entries in its revision history. Epic: CRDB-27758 Release note: None 105624: jobsprofiler: dump trace recording on job completion r=dt a=adityamaru This change teaches the job resumer to fetch and write its trace recording before finishing its tracing span. These traces will be consumed by the job profiler bundle that is being introduced in #105384. These traces will be valuable in understanding a job's execution characteristics during each resumption, even if the job has reached a terminal state. Currently, this behaviour is opt-in and has been enabled for backups, restore, import and physical replication jobs. Informs: #102794 Release note: None 106515: DEPS: bump across etcd-io/raft#81 and disable conf change validation r=erikgrinaker a=tbg We don't want raft to validate conf changes, since that causes issues due to false positives (the check is above raft, but needs to be below raft to always work correctly). We are taking responsibility for carrying out only valid conf changes, as we always have. See also etcd-io/raft#80. Fixes #105797. Epic: CRDB-25287 Release note (bug fix): under rare circumstances, a replication change could get stuck when proposed near lease/leadership changes (and likely under overload), and the replica circuit breakers could trip. This problem has been addressed. Note to editors: this time it's really addressed (fingers crossed); a previous attempt with an identical release note had to be reverted. 106939: changefeedccl: fix flake in TestParquetRows r=miretskiy a=jayshrivastava changefeedccl: fix flake in TestParquetRows Previously, this test would flake when rows were not emitted in the exact order they were inserted/modified. This change makes the test resilient to different ordering. Epic: None Fixes: #106911 Release note: None --- util/parquet: make metadata transparent in tests Previously, users of the library would need to explicitly call `NewWriterWithReaderMetadata()` to configure the parquet writer to add metadata required to use reader utils in `pkg/util/parquet/testutils.go`. This led to a lot of code uncessary code duplication. This moves the logic to decide if metadata should be written to `NewWriter()` so callers do not need to do the extra work. Epic: None Release note: None Co-authored-by: Rui Hu <rui@cockroachlabs.com> Co-authored-by: adityamaru <adityamaru@gmail.com> Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com> Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com>

DisableConfChangeValidation turns off propose-time verification of configuration changes against the currently active configuration of the raft instance. These checks are generally sensible (cannot leave a joint config unless in a joint config, et cetera) but they have false positives because the active configuration may not be the most recent configuration. This is because configurations are activated during log application, and even the leader can trail log application by an unbounded number of entries. Symmetrically, the mechanism has false negatives - because the check may not run against the "actual" config that will be the predecessor of the newly proposed one, the check may pass but the new config may be invalid when it is being applied. In other words, the checks are best-effort. Users should *not* use this option unless they have a reliable mechanism (above raft) that serializes and verifies configuration changes. In CockroachDB, we doubly hit the false positive case. In CockroachDB, each proposal (including configuration changes) has a UUID which the leader tracks while the command is inflight. To detect that a command is no longer inflight, the leader has to see it apply (or needs some proof that it will never apply in the future). When a CockroachDB Replica loses leadership in the middle of a configuration change, a new leader may carry out additional changes that now take effect, while the old Replica is still trying to finalize its old, now incompatible, configuration change. What CockroachDB needs is to see the old proposal in the log so that it can be rejected at apply time, thus terminating the inflight status of the old configuration change. But since raft will forward the proposal to the leader where it is rejected (since it's incompatible with the active, much newer, config), this would never happen, thus leaking an inflight[^1]. The other case became obvious when we tried to work around the above by terminating the config change when we noticed that raft rejected it. Because the rejection is not made against a stable basis (it's against whatever config is active at the given time), legitimate configuration changes would be rejected frequently. In particular, this could occur in a way that would still result in the respective entries becoming applied later! This caused corruption in CockroachDB[^2] by leading to a divergence between what we thought the config was and what it actually was. CockroachDB does have a higher-level protection mechanism (configuration changes are serialized through transactions reading from and writing to the same key). The intention is for CockroachDB to use this setting, however it is likely that other systems are susceptible to similar issues (and perhaps unknowingly so). This setting is thus of general interest. Touches etcd-io#80. [^1]: cockroachdb/cockroach#105797 (comment) [^2]: cockroachdb/cockroach#106172 (comment) Signed-off-by: Tobias Grieger <tobias.b.grieger@gmail.com>

Relies on etcd-io/raft#81. We don't want it to, since that causes issues due to false positives. We are taking responsibility for carrying out only valid conf changes, as we always have. See also etcd-io/raft#80. Fixes cockroachdb#105797. Epic: CRDB-25287 Release note (bug fix): under rare circumstances, a replication change could get stuck when proposed near lease/leadership changes (and likely under overload), and the replica circuit breakers could trip. This problem has been addressed. Note to editors: this time it's really addressed (fingers crossed); a previous attempt with an identical release note had to be reverted.

DisableConfChangeValidation turns off propose-time verification of configuration changes against the currently active configuration of the raft instance. These checks are generally sensible (cannot leave a joint config unless in a joint config, et cetera) but they have false positives because the active configuration may not be the most recent configuration. This is because configurations are activated during log application, and even the leader can trail log application by an unbounded number of entries. Symmetrically, the mechanism has false negatives - because the check may not run against the "actual" config that will be the predecessor of the newly proposed one, the check may pass but the new config may be invalid when it is being applied. In other words, the checks are best-effort. Users should *not* use this option unless they have a reliable mechanism (above raft) that serializes and verifies configuration changes. In CockroachDB, we doubly hit the false positive case. In CockroachDB, each proposal (including configuration changes) has a UUID which the leader tracks while the command is inflight. To detect that a command is no longer inflight, the leader has to see it apply (or needs some proof that it will never apply in the future). When a CockroachDB Replica loses leadership in the middle of a configuration change, a new leader may carry out additional changes that now take effect, while the old Replica is still trying to finalize its old, now incompatible, configuration change. What CockroachDB needs is to see the old proposal in the log so that it can be rejected at apply time, thus terminating the inflight status of the old configuration change. But since raft will forward the proposal to the leader where it is rejected (since it's incompatible with the active, much newer, config), this would never happen, thus leaking an inflight[^1]. The other case became obvious when we tried to work around the above by terminating the config change when we noticed that raft rejected it. Because the rejection is not made against a stable basis (it's against whatever config is active at the given time), legitimate configuration changes would be rejected frequently. In particular, this could occur in a way that would still result in the respective entries becoming applied later! This caused corruption in CockroachDB[^2] by leading to a divergence between what we thought the config was and what it actually was. CockroachDB does have a higher-level protection mechanism (configuration changes are serialized through transactions reading from and writing to the same key). The intention is for CockroachDB to use this setting, however it is likely that other systems are susceptible to similar issues (and perhaps unknowingly so). This setting is thus of general interest. Touches etcd-io/raft#80. [^1]: cockroachdb/cockroach#105797 (comment) [^2]: cockroachdb/cockroach#106172 (comment) Signed-off-by: Tobias Grieger <tobias.b.grieger@gmail.com>

Relies on etcd-io/raft#81. We don't want it to, since that causes issues due to false positives. We are taking responsibility for carrying out only valid conf changes, as we always have. See also etcd-io/raft#80. Fixes cockroachdb#105797. Epic: CRDB-25287 Release note (bug fix): under rare circumstances, a replication change could get stuck when proposed near lease/leadership changes (and likely under overload), and the replica circuit breakers could trip. This problem has been addressed. Note to editors: this time it's really addressed (fingers crossed); a previous attempt with an identical release note had to be reverted.

tbg changed the title ~~Configuration change validation is unsound~~ Configuration change validation has false positives Jul 10, 2023

tbg mentioned this issue Jul 12, 2023

Add Config.DisableConfChangeValidation #81

Merged

tbg mentioned this issue Jul 17, 2023

DEPS: bump across etcd-io/raft#81 and disable conf change validation cockroachdb/cockroach#106515

Merged

tbg mentioned this issue Jul 18, 2023

release-23.1: DEPS: bump across etcd-io/raft#81 and disable conf change validation cockroachdb/cockroach#107059

Merged

tbg mentioned this issue Jul 18, 2023

release-22.2: DEPS: bump across etcd-io/raft#81 and disable conf change validation cockroachdb/cockroach#107063

Merged

ahrtr added this to the v4.0.0 milestone Jul 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Configuration change validation has false positives #80

Configuration change validation has false positives #80

tbg commented Jul 10, 2023 •

edited

tbg commented Jul 12, 2023

Configuration change validation has false positives #80

Configuration change validation has false positives #80

Comments

tbg commented Jul 10, 2023 • edited

tbg commented Jul 12, 2023

tbg commented Jul 10, 2023 •

edited