Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Configuration change validation has false positives #80

Open
tbg opened this issue Jul 10, 2023 · 1 comment
Open

Configuration change validation has false positives #80

tbg opened this issue Jul 10, 2023 · 1 comment
Milestone

Comments

@tbg
Copy link
Collaborator

tbg commented Jul 10, 2023

We currently validate configuration changes at proposal time

raft/raft.go

Lines 1224 to 1260 in 4abd9e9

for i := range m.Entries {
e := &m.Entries[i]
var cc pb.ConfChangeI
if e.Type == pb.EntryConfChange {
var ccc pb.ConfChange
if err := ccc.Unmarshal(e.Data); err != nil {
panic(err)
}
cc = ccc
} else if e.Type == pb.EntryConfChangeV2 {
var ccc pb.ConfChangeV2
if err := ccc.Unmarshal(e.Data); err != nil {
panic(err)
}
cc = ccc
}
if cc != nil {
alreadyPending := r.pendingConfIndex > r.raftLog.applied
alreadyJoint := len(r.prs.Config.Voters[1]) > 0
wantsLeaveJoint := len(cc.AsV2().Changes) == 0
var refused string
if alreadyPending {
refused = fmt.Sprintf("possible unapplied conf change at index %d (applied to %d)", r.pendingConfIndex, r.raftLog.applied)
} else if alreadyJoint && !wantsLeaveJoint {
refused = "must transition out of joint config first"
} else if !alreadyJoint && wantsLeaveJoint {
refused = "not in joint state; refusing empty conf change"
}
if refused != "" {
r.logger.Infof("%x ignoring conf change %v at config %s: %s", r.id, cc, r.prs.Config, refused)
m.Entries[i] = pb.Entry{Type: pb.EntryNormal}
} else {
r.pendingConfIndex = r.raftLog.lastIndex() + uint64(i) + 1
}
}

This is not very helpful because it has false positives (refusing a config change that is actually allowed) though at least it currently doesn't have false negatives (because the set of false positives is sufficiently large 😄)

It could be more effective to either compare against the actual most recent config change in the (including the unstable) log, or to move the verification to apply time (i.e. erroring out conf changes that are illegal).

See #81.

@tbg tbg changed the title Configuration change validation is unsound Configuration change validation has false positives Jul 10, 2023
tbg added a commit to tbg/cockroach that referenced this issue Jul 11, 2023
Relies on etcd-io/raft#81.

We don't want it to, since that causes issues due to false positives. We are
taking responsibility for carrying out only valid conf changes, as we always
have.

See also etcd-io/raft#80.

Fixes cockroachdb#105797.

Epic: CRDB-25287
Release note (bug fix): under rare circumstances, a replication change could get
stuck when proposed near lease/leadership changes (and likely under overload),
and the replica circuit breakers could trip. This problem has been addressed.
Note to editors: this time it's really addressed (fingers crossed); a previous
attempt with an identical release note had to be reverted.
@tbg
Copy link
Collaborator Author

tbg commented Jul 12, 2023

in #81 (which adds a DisableConfChangeValidation option), @pavelkalinnikov says:


I think we should fix this lag in raft more fundamentally, in such a way that the DisableConfChangeValidation option is not needed. This mechanism should just not be best-effort, and should not have false positives/negatives. Config changes need some reconsideration.

Relying solely on above-raft config change mechanisms (such as in CRDB) also smells like a footgun. Config changes seem fundamentally belonging to raft layer, at least w.r.t. correctness of quorum rules etc.

Currently we do a collection of tricks which makes correctness reasoning hard:

  • scan the log to see if we have an unapplied config
  • guess if we have a pending config change
  • silently nullify incorrect config changes, or silently accept them (with the
    DisableConfChangeValidation option on).

Instead, I think we should just precisely track and know the current and prospective config from the explicit state, and explicitly error out on incorrect transitions. I.e., when we commit a pending config change, it becomes part of the state. To do so, we would need to decouple config changes from logs (otherwise, again, we would have to have a "scan the log" step to find the next config, e.g. when entries are appended or the commit index is advanced).

More generally, it would be nice to have separate state machines: raft itself, and the application-level state machine that it supports.

There are more arguments for this decoupling:

  • Config changes are the only entries that raft needs to see to operate correctly. All other log entries (which often include user data etc) don't need to go through raft package, it's enough to communicate metadata such as term/index of stored entries.
  • The application-level commands thus don't have to be embedded into and carried with the protobufs format that raft package imposes.
  • The application level thus doesn't have to demultiplex raft config changes and its own logs.
  • Performance reason. Say, there is a leader and 2 slow followers who are close to stall. It is nice to be able to relocate the two followers to other nodes without waiting for them to catch up. Currently we have to commit a config change through this slow quorum's log, but with decoupling it should be possible to do it bypassing the log, with a "higher priority".
  • raft doesn't look at application-level commands (which is most of them), but plays disproportionate role in managing resources and driving the flow of application-level entries: Externalize log entry flow control #64. The control should be at the application end. Having app-level commands interleaved with config changes complicates this control though, but decoupling should help.

tbg added a commit that referenced this issue Jul 17, 2023
DisableConfChangeValidation turns off propose-time verification of
configuration changes against the currently active configuration of the
raft instance. These checks are generally sensible (cannot leave a joint
config unless in a joint config, et cetera) but they have false positives
because the active configuration may not be the most recent
configuration. This is because configurations are activated during log
application, and even the leader can trail log application by an
unbounded number of entries.
Symmetrically, the mechanism has false negatives - because the check may
not run against the "actual" config that will be the predecessor of the
newly proposed one, the check may pass but the new config may be invalid
when it is being applied. In other words, the checks are best-effort.

Users should *not* use this option unless they have a reliable mechanism
(above raft) that serializes and verifies configuration changes.

In CockroachDB, we doubly hit the false positive case.

In CockroachDB, each proposal (including configuration changes) has a UUID
which the leader tracks while the command is inflight. To detect that a command
is no longer inflight, the leader has to see it apply (or needs some proof that
it will never apply in the future). When a CockroachDB Replica loses leadership
in the middle of a configuration change, a new leader may carry out additional
changes that now take effect, while the old Replica is still trying to finalize
its old, now incompatible, configuration change. What CockroachDB needs is to
see the old proposal in the log so that it can be rejected at apply time, thus
terminating the inflight status of the old configuration change. But since raft
will forward the proposal to the leader where it is rejected (since it's
incompatible with the active, much newer, config), this would never happen,
thus leaking an inflight[^1].

The other case became obvious when we tried to work around the above by
terminating the config change when we noticed that raft rejected it. Because
the rejection is not made against a stable basis (it's against whatever config
is active at the given time), legitimate configuration changes would be
rejected frequently. In particular, this could occur in a way that would still
result in the respective entries becoming applied later! This caused corruption
in CockroachDB[^2] by leading to a divergence between what we thought the config
was and what it actually was.

CockroachDB does have a higher-level protection mechanism (configuration
changes are serialized through transactions reading from and writing to the
same key). The intention is for CockroachDB to use this setting, however it is
likely that other systems are susceptible to similar issues (and perhaps
unknowingly so). This setting is thus of general interest.

Touches #80.

[^1]: cockroachdb/cockroach#105797 (comment)
[^2]: cockroachdb/cockroach#106172 (comment)

Signed-off-by: Tobias Grieger <tobias.b.grieger@gmail.com>
tbg added a commit to tbg/cockroach that referenced this issue Jul 17, 2023
Relies on etcd-io/raft#81.

We don't want it to, since that causes issues due to false positives. We are
taking responsibility for carrying out only valid conf changes, as we always
have.

See also etcd-io/raft#80.

Fixes cockroachdb#105797.
Epic: CRDB-25287
Release note (bug fix): under rare circumstances, a replication change could get
stuck when proposed near lease/leadership changes (and likely under overload),
and the replica circuit breakers could trip. This problem has been addressed.
Note to editors: this time it's really addressed (fingers crossed); a previous
attempt with an identical release note had to be reverted.
craig bot pushed a commit to cockroachdb/cockroach that referenced this issue Jul 17, 2023
105368: backupccl: add unit tests for FileSSTSink r=rhu713 a=rhu713

Backfill unit tests for the basic functionality of FileSSTSink with additional test cases involving inputs of keys with many entries in its revision history.

Epic: CRDB-27758

Release note: None

105624: jobsprofiler: dump trace recording on job completion r=dt a=adityamaru

This change teaches the job resumer to fetch and write its trace recording before finishing its tracing span. These traces will be consumed by the job profiler bundle that is being introduced in #105384. These traces will be valuable in understanding a job's execution characteristics during each resumption, even if the job has reached a terminal state.

Currently, this behaviour is opt-in and has been enabled for backups, restore, import and physical replication jobs.

Informs: #102794
Release note: None

106515: DEPS: bump across etcd-io/raft#81 and disable conf change validation r=erikgrinaker a=tbg

We don't want raft to validate conf changes, since that causes issues due to false positives (the check is above raft, but needs to be below raft to always work correctly). We are
taking responsibility for carrying out only valid conf changes, as we always
have.

See also etcd-io/raft#80.

Fixes #105797.
Epic: CRDB-25287
Release note (bug fix): under rare circumstances, a replication change could get
stuck when proposed near lease/leadership changes (and likely under overload),
and the replica circuit breakers could trip. This problem has been addressed.
Note to editors: this time it's really addressed (fingers crossed); a previous
attempt with an identical release note had to be reverted.


106939: changefeedccl: fix flake in TestParquetRows r=miretskiy a=jayshrivastava

changefeedccl: fix flake in TestParquetRows
Previously, this test would flake when rows were not emitted
in the exact order they were inserted/modified. This change
makes the test resilient to different ordering.

Epic: None
Fixes: #106911
Release note: None

---

util/parquet: make metadata transparent in tests
Previously, users of the library would need to explicitly
call `NewWriterWithReaderMetadata()` to configure the parquet
writer to add metadata required to use reader utils in
`pkg/util/parquet/testutils.go`. This led to a lot of code
uncessary code duplication. This moves the logic to decide if
metadata should be written to `NewWriter()` so callers do not
need to do the extra work.

Epic: None
Release note: None

Co-authored-by: Rui Hu <rui@cockroachlabs.com>
Co-authored-by: adityamaru <adityamaru@gmail.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com>
tbg added a commit to cockroachdb/raft that referenced this issue Jul 18, 2023
DisableConfChangeValidation turns off propose-time verification of
configuration changes against the currently active configuration of the
raft instance. These checks are generally sensible (cannot leave a joint
config unless in a joint config, et cetera) but they have false positives
because the active configuration may not be the most recent
configuration. This is because configurations are activated during log
application, and even the leader can trail log application by an
unbounded number of entries.
Symmetrically, the mechanism has false negatives - because the check may
not run against the "actual" config that will be the predecessor of the
newly proposed one, the check may pass but the new config may be invalid
when it is being applied. In other words, the checks are best-effort.

Users should *not* use this option unless they have a reliable mechanism
(above raft) that serializes and verifies configuration changes.

In CockroachDB, we doubly hit the false positive case.

In CockroachDB, each proposal (including configuration changes) has a UUID
which the leader tracks while the command is inflight. To detect that a command
is no longer inflight, the leader has to see it apply (or needs some proof that
it will never apply in the future). When a CockroachDB Replica loses leadership
in the middle of a configuration change, a new leader may carry out additional
changes that now take effect, while the old Replica is still trying to finalize
its old, now incompatible, configuration change. What CockroachDB needs is to
see the old proposal in the log so that it can be rejected at apply time, thus
terminating the inflight status of the old configuration change. But since raft
will forward the proposal to the leader where it is rejected (since it's
incompatible with the active, much newer, config), this would never happen,
thus leaking an inflight[^1].

The other case became obvious when we tried to work around the above by
terminating the config change when we noticed that raft rejected it. Because
the rejection is not made against a stable basis (it's against whatever config
is active at the given time), legitimate configuration changes would be
rejected frequently. In particular, this could occur in a way that would still
result in the respective entries becoming applied later! This caused corruption
in CockroachDB[^2] by leading to a divergence between what we thought the config
was and what it actually was.

CockroachDB does have a higher-level protection mechanism (configuration
changes are serialized through transactions reading from and writing to the
same key). The intention is for CockroachDB to use this setting, however it is
likely that other systems are susceptible to similar issues (and perhaps
unknowingly so). This setting is thus of general interest.

Touches etcd-io#80.

[^1]: cockroachdb/cockroach#105797 (comment)
[^2]: cockroachdb/cockroach#106172 (comment)

Signed-off-by: Tobias Grieger <tobias.b.grieger@gmail.com>
tbg added a commit to tbg/cockroach that referenced this issue Jul 18, 2023
Relies on etcd-io/raft#81.

We don't want it to, since that causes issues due to false positives. We are
taking responsibility for carrying out only valid conf changes, as we always
have.

See also etcd-io/raft#80.

Fixes cockroachdb#105797.
Epic: CRDB-25287
Release note (bug fix): under rare circumstances, a replication change could get
stuck when proposed near lease/leadership changes (and likely under overload),
and the replica circuit breakers could trip. This problem has been addressed.
Note to editors: this time it's really addressed (fingers crossed); a previous
attempt with an identical release note had to be reverted.
tbg added a commit to cockroachdb/etcd that referenced this issue Jul 18, 2023
DisableConfChangeValidation turns off propose-time verification of
configuration changes against the currently active configuration of the
raft instance. These checks are generally sensible (cannot leave a joint
config unless in a joint config, et cetera) but they have false positives
because the active configuration may not be the most recent
configuration. This is because configurations are activated during log
application, and even the leader can trail log application by an
unbounded number of entries.
Symmetrically, the mechanism has false negatives - because the check may
not run against the "actual" config that will be the predecessor of the
newly proposed one, the check may pass but the new config may be invalid
when it is being applied. In other words, the checks are best-effort.

Users should *not* use this option unless they have a reliable mechanism
(above raft) that serializes and verifies configuration changes.

In CockroachDB, we doubly hit the false positive case.

In CockroachDB, each proposal (including configuration changes) has a UUID
which the leader tracks while the command is inflight. To detect that a command
is no longer inflight, the leader has to see it apply (or needs some proof that
it will never apply in the future). When a CockroachDB Replica loses leadership
in the middle of a configuration change, a new leader may carry out additional
changes that now take effect, while the old Replica is still trying to finalize
its old, now incompatible, configuration change. What CockroachDB needs is to
see the old proposal in the log so that it can be rejected at apply time, thus
terminating the inflight status of the old configuration change. But since raft
will forward the proposal to the leader where it is rejected (since it's
incompatible with the active, much newer, config), this would never happen,
thus leaking an inflight[^1].

The other case became obvious when we tried to work around the above by
terminating the config change when we noticed that raft rejected it. Because
the rejection is not made against a stable basis (it's against whatever config
is active at the given time), legitimate configuration changes would be
rejected frequently. In particular, this could occur in a way that would still
result in the respective entries becoming applied later! This caused corruption
in CockroachDB[^2] by leading to a divergence between what we thought the config
was and what it actually was.

CockroachDB does have a higher-level protection mechanism (configuration
changes are serialized through transactions reading from and writing to the
same key). The intention is for CockroachDB to use this setting, however it is
likely that other systems are susceptible to similar issues (and perhaps
unknowingly so). This setting is thus of general interest.

Touches etcd-io/raft#80.

[^1]: cockroachdb/cockroach#105797 (comment)
[^2]: cockroachdb/cockroach#106172 (comment)

Signed-off-by: Tobias Grieger <tobias.b.grieger@gmail.com>
tbg added a commit to tbg/cockroach that referenced this issue Jul 18, 2023
Relies on etcd-io/raft#81.

We don't want it to, since that causes issues due to false positives. We are
taking responsibility for carrying out only valid conf changes, as we always
have.

See also etcd-io/raft#80.

Fixes cockroachdb#105797.
Epic: CRDB-25287
Release note (bug fix): under rare circumstances, a replication change could get
stuck when proposed near lease/leadership changes (and likely under overload),
and the replica circuit breakers could trip. This problem has been addressed.
Note to editors: this time it's really addressed (fingers crossed); a previous
attempt with an identical release note had to be reverted.
@ahrtr ahrtr added this to the v4.0.0 milestone Jul 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants