
TLA+ spec and trace validation for raft consensus algorithm in etcd implementation #111

Open
joshuazh-x opened this issue Nov 23, 2023 · 24 comments

Comments

@joshuazh-x
Contributor

There have been multiple formal specifications of the raft consensus algorithm in TLA+, following Diego Ongaro's Ph.D. dissertation. However, I have not seen a version that aligns with the raft library implemented in etcd-io/raft, which we know differs from the original raft algorithm in some behaviors, e.g., reconfiguration.

etcd and many other applications based on this raft library have been running well for a long time, but I feel it would still be worthwhile to write a TLA+ spec for this specific implementation. It is not just a way to verify the correctness of the model, but also a foundation for follow-up work on model-based trace validation.

The spec is based on George Pîrlea's raft TLA+ spec and inspired by the innovative work in Microsoft CCF's TLA+ spec.

@joshuazh-x
Contributor Author

PR ref #112

@serathius
Member

Thanks for proposing this @joshuazh-x, this looks very useful to validate correctness.

A couple of questions from me:

  • How do we ensure that the provided TLA+ definition is aligned with the raft implementation? Not all contributors to raft are TLA+ experts, so the Go code and the TLA+ spec could drift apart over time. Do you have any plans for integrating it into the raft tests?
  • TLA+ should allow us to do continuous formal verification of the raft algorithm, but it also means we need to maintain two implementations (Go and TLA+). Do you have any thoughts on how we could reduce the maintenance cost?
  • Providing the model is great for people familiar with the TLA+ tooling; for those who are not, could you prepare a readme explaining how it can be used? Have you considered automated testing so we don't break it by mistake?

@serathius
Member

cc @tbg @Elbehery @pavelkalinnikov @erikgrinaker

@Elbehery

Thanks @joshuazh-x

+1 @serathius. If we plan to add/merge #112 and have it replace the current raft implementation, we need to exhaustively test it first. Also, I do not think maintaining two separate implementations is a good idea.

IIUC, many projects are based on https://github.com/etcd-io/raft, not only etcd.

@pav-kv
Contributor

pav-kv commented Nov 23, 2023

How do we ensure that the provided TLA+ definition is aligned with the raft implementation? Not all contributors to raft are TLA+ experts, so the Go code and the TLA+ spec could drift apart over time. Do you have any plans for integrating it into the raft tests?

One idea: generate the TLA+ spec from a standalone Go package. The spec would be broken down into a bunch of exported constants. We can then reference these constants from the implementation, to complement the documentation.

For example, suppose the spec has a HandleAppendEntriesRequest action like the one below. Our etcd-io/raft/tlaplus package would have a constant like:

const HandleAppendEntriesRequest = `
\* Server i receives an AppendEntries request from server j with
\* m.mterm <= currentTerm[i]. This just handles m.entries of length 0 or 1, but
\* implementations could safely accept more by treating them the same as
\* multiple independent requests of 1 entry.
\* @type: (Int, Int, AEREQT) => Bool;
HandleAppendEntriesRequest(i, j, m) ==
    LET logOk == \/ m.mprevLogIndex = 0
                 \/ /\ m.mprevLogIndex > 0
                    /\ m.mprevLogIndex <= Len(log[i])
                    /\ m.mprevLogTerm = log[i][m.mprevLogIndex].term
    IN 
       /\ m.mterm <= currentTerm[i]
       /\ \/ RejectAppendEntriesRequest(i, j, m, logOk)
          \/ ReturnToFollowerState(i, m)
          \/ AcceptAppendEntriesRequest(i, j, logOk, m)
       /\ UNCHANGED <<candidateVars, leaderVars>>
`

// Spec is the complete TLA+ spec that can be exported and
// run through the TLA+ model checker / tooling.
const Spec = Thing1 +
	HandleAppendEntriesRequest +
	Thing2 + ...

The implementation of the AppendEntries handler would then reference this constant like this:

... somewhere in the weeds of MsgApp handler ...
_ = tlaplus.HandleAppendEntriesRequest

This is convenient to work with in IDEs, where you can jump to the definition of this constant at a keystroke.
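
To keep the exported constants and a checked-in spec file from drifting apart, one could also add a small sync test. The following is only a hypothetical sketch (the Spec constant is the one assembled above; the etcdraft.tla path and test name are assumptions, not existing code):

package tlaplus

import (
	"os"
	"testing"
)

// TestSpecFileInSync fails when the checked-in .tla file no longer
// matches the spec assembled from this package's Go constants.
func TestSpecFileInSync(t *testing.T) {
	got, err := os.ReadFile("etcdraft.tla") // assumed spec location
	if err != nil {
		t.Fatalf("reading spec file: %v", err)
	}
	if string(got) != Spec {
		t.Fatal("etcdraft.tla is out of sync with the Go constants; regenerate it")
	}
}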

@pav-kv
Contributor

pav-kv commented Nov 23, 2023

TLA+ should allow us to do continuous formal verification of the raft algorithm, but it also means we need to maintain two implementations (Go and TLA+). Do you have any thoughts on how we could reduce the maintenance cost?

IMO, the current implementation has a monolithic design that makes maintenance hard either way. To reduce the maintenance cost, we need to separate the core of the algorithm from the implementation details that don't impact correctness. The core algorithm won't change much, and would be paired with the TLA+ spec for assurance. Implementation details (like buffer sizes, various quota and retry aspects) don't need a spec, and are also more prone to change.
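
As an illustration only (none of these names exist in the codebase), the split could pair a pure, spec-shaped core with a shell that owns the non-correctness details:

package raftcore

// Hypothetical types, much simpler than the real ones.
type State struct{ Term uint64 }
type Message struct{ Term uint64 }

// StepCore is the pure transition: the part a TLA+ action such as
// HandleAppendEntriesRequest would be paired with, and the only part
// the spec needs to model.
func StepCore(st State, m Message) State {
	if m.Term > st.Term { // algorithm-level rule only
		st.Term = m.Term
	}
	return st
}

Buffering, quotas, and retries would live in a wrapper around StepCore and would stay out of the spec.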

@joshuazh-x
Contributor Author

Thank you very much for your feedback and suggestions.

@serathius I believe your first two questions can be addressed by the trace validation in another PR, #113. PR #112 is actually a subset of PR #113: it is purely about the algorithm itself, while PR #113 is about how to guarantee that the implementation matches the TLA+ spec. I thought it might be good to separate them into different issues, as I have only tested it in etcd for now. But since you raise this good question, I think we can discuss them together.

IMO, the correctness of a service includes two parts: the correctness of the algorithm, and the alignment of implementation and algorithm. In PR #113 we inject multiple trace log points at the places in the code where state transitions are made (e.g. SendAppendEntries, BecomeLeader, etc.). The trace validation spec uses these traces as a state-space constraint so that it walks the state machine in the same way the service did. If a trace exhibits any state or transition that cannot happen in the state machine, it implies a mismatch between the model and the implementation. If the model has already been verified by the TLC model checker, then most likely it is the implementation that has the issue.
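
For illustration, a trace log point could be a single NDJSON record per transition; the function and field names below are hypothetical, not the exact schema used in PR #113:

package raft

import (
	"encoding/json"
	"io"
)

// traceEvent appends one NDJSON line per state transition. The trace
// validation spec later replays these lines as a constraint on the
// TLA+ state space.
func traceEvent(w io.Writer, name string, nodeID, term uint64) {
	rec := map[string]any{"event": name, "node": nodeID, "term": term}
	b, _ := json.Marshal(rec)
	w.Write(append(b, '\n'))
}

// e.g. at the point where a node becomes leader:
//   traceEvent(traceLog, "BecomeLeader", r.id, r.Term)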

And regarding the third ask, yes, I'll draft a readme to outline how to use this tool.

@joshuazh-x
Contributor Author

@Elbehery the TLA+ spec is not a new implementation, nor is it meant to replace the existing raft library. You can consider it the design (in the form of a formal spec) of the current raft, and we can use the TLC tool to verify its correctness.
It provides another path to correctness validation and can theoretically reach all possible scenarios that normal test cases usually won't be able to cover.

@joshuazh-x
Contributor Author


@pavelkalinnikov yes, we don't need to put all implementation details into the spec; just the states and transitions of the basic algorithm are enough. This part is not subject to frequent change.

But the question from @serathius is also important: alignment is something we eventually need to assure for the correctness of services running on top of raft. That's why we propose TLA+ trace validation in PR #113.

@joshuazh-x joshuazh-x changed the title TLA+ spec for raft consensus algorithm in etcd implementation TLA+ spec and trace validation for raft consensus algorithm in etcd implementation Nov 24, 2023
@pav-kv
Contributor

pav-kv commented Nov 24, 2023

I like the idea of tracing and integrating with TLA+ for validation. Thank you for doing this.

Although maintaining two implementations is challenging (only at times when code changes conflict with the spec, though, which should be rare; moreover, that's a good thing: there should be some resistance to risky changes), TLA+ validation would add an extra safety net around the current implementation. The spec will not rot if it's integrated with the code this way.

One component that would be great to have is randomized testing. We have a datadriven test harness that makes testing deterministic and reproducible, but we would benefit greatly if these tests were also randomized (yet still reproducible). If we then feed the random traces into the TLA+ checks and run the whole thing for hours, it will massively boost coverage.
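
One common shape for "randomized but still reproducible" (a sketch, not the harness's actual API; TEST_SEED is an assumed variable) is to derive all randomness from a single logged seed:

package rafttest

import (
	"math/rand"
	"os"
	"strconv"
	"testing"
	"time"
)

// newTestRand picks a fresh seed by default, logs it, and lets a failing
// run be replayed exactly by setting TEST_SEED to the logged value.
func newTestRand(t *testing.T) *rand.Rand {
	seed := time.Now().UnixNano()
	if s := os.Getenv("TEST_SEED"); s != "" {
		seed, _ = strconv.ParseInt(s, 10, 64)
	}
	t.Logf("random seed: %d (set TEST_SEED=%d to reproduce)", seed, seed)
	return rand.New(rand.NewSource(seed))
}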

One request for this TLA+ integration: it should be "free" w.r.t. the production code. Please hide it under cheap checks, and/or build tags like:

if tla.Enabled { // this compiles to "if false" and optimizes away in prod
	traceEvent(...)
}
// +build with_tla

package tla

const Enabled = true

// all helpers here
// +build !with_tla

package tla

const Enabled = false

// all helpers are no-ops

Consider also that we already have a similar form of tracing, which is similarly fed into tests, and used for deterministic validation. E.g. see an example test output. Consider integrating the two mechanisms.

@xiang90
Contributor

xiang90 commented Nov 27, 2023

I chatted with both Lamport and Yuan in 2015 to discuss this issue (https://www.microsoft.com/en-us/research/publication/specifying-and-verifying-systems-with-tla/)

Here is the "conclusion" from them:

There were multiple efforts in the past to compile TLA+ specs to actual code, but as far as I can tell none of them were very successful. One promising approach is to write a mapping (i.e., a refinement function) that maps implementation states in, for example, C++ to specification states. This would allow you to check the specification as a property of the implementation. It also allows you to use the specification to drive the testing of the implementation.

I guess we can probably define this mapping from Go to TLA+ as well, considering the current Go raft implementation is relatively small and only uses a very constrained subset of Go libraries. But I am not sure if it is worth the effort. Or maybe we just need to manually maintain the two different implementations and use TLA+ to guide the thought process.
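
The refinement-function idea could be prototyped as a plain Go projection from implementation state to the spec's variables; everything below is a hypothetical shape, not existing code:

package tlamap

// SpecState mirrors a fragment of the TLA+ spec's variables.
type SpecState struct {
	CurrentTerm uint64
	VotedFor    uint64
	LogTerms    []uint64 // terms of log entries, like the spec's log[i]
}

// Abstract is the refinement function: it maps a concrete node's state
// down to specification state, discarding implementation-only detail.
// Spec properties checked on Abstract(impl state) then apply to impl.
func Abstract(term, vote uint64, logTerms []uint64) SpecState {
	return SpecState{CurrentTerm: term, VotedFor: vote, LogTerms: logTerms}
}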

@lemmy

lemmy commented Nov 28, 2023

Hello, I'm the author of trace validation in CCF and a member of the TLA+ project.

One promising approach is to write a mapping (i.e., a refinement function) that maps implementation states in, for example, C++ to specification states. This would allow you to check the specification as a property of the implementation.

Trace Validation (TV) essentially adopts this approach by mapping recorded implementation states to TLA+ specification states. This method is practical as it derives implementation states directly from the execution of the implementation, avoiding the complexities of symbolic interpretation of the source code. However, TV is not exhaustive. It verifies that only a certain subset of all potential system executions aligns with the behaviors defined in the specification. Nevertheless, this is not necessarily a limitation if the chosen subset of system executions is sufficiently diverse. For example, in CCF, among other discrepancies, we were able to find two significant data-loss issues using TV with just about a dozen system executions.

It also allows you to use the specification to drive the testing of the implementation.

In CCF, this feature is still on our roadmap. However, manually translating TLA+ counterexamples (violations of the spec's correctness properties) into inputs for the implementation was straightforward.

@joshuazh-x
Contributor Author

Updated PR #113 to add readme and build tag. @serathius @pavelkalinnikov

@serathius
Member

Things I got from the discussion, as a list of followups (please confirm I haven't missed anything):

  • Integrate TLA+ trace validation with raft datadriven tests
  • Add a workflow that executes continuous validation
  • Propose a method of keeping references between code and TLA+. Consider validation to prevent breakage.

Overall LGTM from me. Would also request official LGTMs from people active in the project to ensure we reach consensus: @ahrtr @Elbehery @pavelkalinnikov @erikgrinaker

@serathius
Member

ping @ahrtr @Elbehery @pavelkalinnikov @erikgrinaker

@pav-kv
Contributor

pav-kv commented Dec 14, 2023

LGTM

@ahrtr
Member

ahrtr commented Dec 18, 2023

Thanks for raising this issue.

  • It looks like a nice-to-have feature, and overall I agree with the direction. But I am not convinced it's the priority for now, at least not in the 3.6 timeframe (I don't think we are ready to spend huge effort on it in that timeframe).
    • If it were low-hanging fruit, it would be fine to do right away. Obviously it isn't; instead, it needs huge effort and also increases the maintenance burden, based on the discussion above.
  • I do not see a feasible and detailed doc/plan either that guarantees:
    • How to do it in a controlled way
    • How to qualify the work

@pav-kv
Contributor

pav-kv commented Dec 18, 2023

I think this is very useful work; can we find some low-risk path forward and not stall it / lose the momentum entirely?

Do I understand correctly that the actual code change should be minimal? My understanding is that it boils down to a dozen or so "tracing" calls in key places, and the whole thing is active only in tests. From this point of view, this work can be done with low risk (does this answer the "controlled" part?).

Some part of this work is "dev infra" to enable checks in CI. I think this part is skippable at the beginning while the author is working on the spec and periodically runs the validation. Providing instructions for running it manually should be fine.

Most of this work seems to be the actual spec. Qualifying it obviously requires a reviewer who understands TLA+. Is there anyone in this group? For myself, I'm interested in learning this topic and started doing so a while ago. I can potentially devote a low-priority / hobby stream of effort to learning and reviewing this.

@joshuazh-x can you think of a way to make incremental progress / split #113 into steps so that the spec is introduced piece by piece? For example, is it possible to do the spec just for election, and then add appends, etc.? The reviewer needs to "follow" some train of thought to not get lost in this PR.

@pav-kv
Contributor

pav-kv commented Dec 18, 2023

@joshuazh-x do you have an end-to-end prototype of this that you've been running for a while and that passed the checks? Or has it found some spec violations / bugs? The value of this work would be more apparent if it finds a real bug as a sneak peek :) But maybe the code is bug-free, in which case it's harder to sell this work.

But no spec violations is a good signal too. In that case, for demo purposes, you could inject some subtle bugs and demonstrate how the spec checker finds them.

@serathius
Member

It looks like a nice-to-have feature, and overall I agree with the direction. But I am not convinced it's the priority for now, at least not in the 3.6 timeframe (I don't think we are ready to spend huge effort on it in that timeframe).

Please note that this work doesn't directly block work on v3.6, nor does it slow it down. As for priority, this effort contributes to correctness testing, which I'm always open to. I also see potential to use TLA+ validation to improve etcd membership testing, which is hard to test properly using robustness tests and has been causing us issues with the v2 deprecation.

@lemmy

lemmy commented Dec 20, 2023

Most of this work seems to be the actual spec. Qualifying it obviously requires a reviewer who understands TLA+. In this group of people, is there anyone? For myself, I'm interested in learning this topic and started doing so a while ago. I can potentially devote a low-priority / hobby stream of effort to learn and review this.

The actual spec in PR #113 is directly derived from Ongaro's original raft spec. We could even go as far as model-checking or proving that etcdraft formally refines Ongaro's spec, to minimize the effort of reviewing the spec. However, what this really means is that the review can be narrowed down to the specification of etcd's reconfiguration logic, which boils down to a handful of TLA+ actions (etcdraft.tla, lines 418 to 487).

@ahrtr
Member

ahrtr commented Dec 20, 2023

The actual spec in PR #113 is directly derived from Ongaro's original raft spec.

Thanks for the info, which gives us much more confidence. I will take a closer look at both the issue and the PR later.

@joshuazh-x
Contributor Author

@joshuazh-x do you have an end-to-end prototype of this that you've been running for a while and that passed the checks? Or has it found some spec violations / bugs? The value of this work would be more apparent if it finds a real bug as a sneak peek :) But maybe the code is bug-free, in which case it's harder to sell this work.

But no spec violations is a good signal too. In that case, for demo purposes, you could inject some subtle bugs and demonstrate how the spec checker finds them.

I have a PoC that enables trace validation in etcd's test suites. Basically, it collects protocol traces and validates them using the TLA+ spec. I now realize it might be better to make a randomized test with trace validation here in the raft repo. I'm working on this and will update the PR when it is ready.

I personally see trace validation as a good tool to validate the correctness of an implementation against a proven model. For now, trace validation only covers the core raft algorithm (including reconfiguration), so I don't expect a bug to be found here. But trace validation itself is extensible: any application based on raft can build its own trace validation on top of the raft trace validation, so that its own specific features are covered. One example off the top of my head is etcd's lease manager, which I heard would need a redesign to address a consensus issue.

@joshuazh-x
Contributor Author

I added a test case that feeds random faults to a cluster and generates trace logs for trace validation. I also updated the readme to include instructions for running trace validation (last part of the readme file).

To generate test trace logs, run TestRunClustersWithRandomFaults with the with_tla build tag and the environment variable TEST_TRACE_LOG set to a folder in which to save trace logs.
TEST_TRACE_LOG=/tmp/ramdisk go test -timeout 30m -tags with_tla -run ^TestRunClustersWithRandomFaults$ go.etcd.io/raft/v3/rafttest
Then run trace validation as below. Note that you can set the environment variable MAX_TRACE to a smaller number N so that trace validation only checks the first N trace logs.
./validate.sh -s ./Traceetcdraft.tla -c ./Traceetcdraft.cfg /tmp/ramdisk/*.ndjson
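For example, assuming the MAX_TRACE semantics described above, a quick bounded run over ten trace logs would be:
MAX_TRACE=10 ./validate.sh -s ./Traceetcdraft.tla -c ./Traceetcdraft.cfg /tmp/ramdisk/*.ndjson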
So far I have not found any counterexample, which is expected. Manual changes to the raft code can be caught by trace validation.
Let me know if there is any issue.
