Amazing job! It has been mentioned on the Raft mailing list that a proof was in progress, but to be honest I did not expect anything to come up before a year or so. Forgive my lack of faith! :-)
Well done!
edit: to provide some context
- Raft is a distributed consensus algorithm that is seen by many as a viable alternative to *-Paxos because of its relative simplicity. It was created by D. Ongaro and J. Ousterhout (Tcl/Tk!)
- D. Ongaro's dissertation includes a TLA+ specification of Raft; TLA+ is a model checker (see https://stackoverflow.com/questions/22418448/can-coq-be-used... for a more detailed explanation of a model checker/theorem prover's job)
- Verdi is a Coq framework for writing formal proofs about distributed systems.
- Doug Woos and James Wilcox made a proof of Raft using Verdi; Verdi helps you figure out whether your implementation of X (in this case X=Raft) meets its specifications (Leader Safety, strong consistency, etc.)
edit: as noted by ahelwer, TLA+ is not a model checker but a language to describe a distributed system's specifications. I was referring to TLC which can work with TLA+.
Hey, I'm Doug Woos--thanks for this excellent summary! It's worth noting that the Raft proof was completed by a team of people, including me, my research partner James Wilcox (http://homes.cs.washington.edu/~jrw12/), and the other folks listed on our web page at http://verdi.uwplse.org/
Apologies to your research partner for leaving him out; it's edited! Congratulations on the proof, by the way. I am looking forward to this weekend so I can have some time to appreciate it in more depth! :-)
Thanks! Let us know if you have any questions about the proof itself--unfortunately, it's not particularly well-documented, and Coq isn't super easy to read in the first place.
At this point, the standard intro text is Software Foundations [1]. I highly recommend it; it will teach you Coq and also probably make you a better programmer. After SF, Certified Programming with Dependent Types [2] gets more into the practice of proving serious programs correct. These books are both available online in the form of literate Coq files.
As far as online lectures, OPLSS [3] often has Coq lectures which are quite good.
One could imagine doing some kind of differential testing of another key-value store against vard to ensure that for a random but large number of requests, they return the same value. I'm not sure that doing that would gain you anything over using Jepsen [1], though. One could also imagine implementing another key-value store in Verdi and proving it equivalent to vard; reasons to do this might include using better data structures or serialization mechanisms, or a different consensus protocol. I'm not sure that this would require significantly less effort than proving this new key-value store (usrd? whittend? :)) correct from scratch, but I'd love to see someone do this!
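To make the differential-testing idea concrete, here is a minimal Python sketch. Everything here is invented for illustration: the two stores are just dicts behind a uniform get/put interface, standing in for clients of vard and some other key-value store.

```python
import random

class DictStore:
    """Trivial in-memory store, standing in for a real KV-store client."""
    def __init__(self):
        self.data = {}
    def put(self, key, value):
        self.data[key] = value
    def get(self, key):
        return self.data.get(key)

def differential_test(store_a, store_b, num_requests=1000, seed=0):
    """Drive two stores with the same random workload and check that
    every read returns the same value from both."""
    rng = random.Random(seed)
    for _ in range(num_requests):
        key = rng.choice("abcde")
        if rng.random() < 0.5:
            value = rng.randint(0, 99)
            store_a.put(key, value)
            store_b.put(key, value)
        else:
            # The two stores must agree on every read.
            assert store_a.get(key) == store_b.get(key)

differential_test(DictStore(), DictStore())
```

Of course, a single-threaded harness like this misses exactly the concurrent and faulty executions that make consensus hard, which is why Jepsen-style testing (and proofs) exist.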
TLA+ is a formal specification language, not a model checker. There exists a model checker called TLC which works with TLA+. There also exists an associated proof system called TLAPS; TLAPS has been used to formally prove correctness of Byzantine Paxos.
What isn't yet clear to me about these kinds of proofs: how can you verify that this proof is correct? How can you know there isn't a wrong assumption or wrong bit of 'code' in those 5500 lines? Would it have been impossible for Voevodsky to make his original mistake in such a formalization? Obviously the point of his entire homotopy type theory project is that he thinks he couldn't have. But why are mistakes (near) impossible here?
This is a great question that permeates all verification work. In the lingo, we call the set of assumptions a proof makes the "trusted computing base" (see https://en.wikipedia.org/wiki/Trusted_computing_base ).
The question you ask was believed for a long time to be the death knell of formal verification. It was persuasively argued in "Social Processes and Proofs of Theorems and Programs" https://www.cs.umd.edu/~gasarch/BLOGPAPERS/social.pdf that any proof of a complex system is necessarily at least as complex. Thus there would be no reason to trust the proof any more than the original code.
The breakthrough in verification came when we started using machine-checked proofs. In this approach, you write a short, simple, and trusted proof checker. You can then verify complex systems using complex proofs that are checked by the simple checker. Then the only possible errors in the proof can come from the proof checker being wrong. Because the checker is simple and general (ie, it can check all kinds of proofs, not just proofs about a particular program), it is more trustworthy.
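To make the "small trusted kernel" idea concrete, here is a toy Python proof checker that knows only one inference rule, modus ponens. Real kernels like Coq's are far richer, but the shape is the same: a short, general piece of code that every proof, however large, must pass through.

```python
def check_proof(assumptions, steps, goal):
    """Toy trusted kernel. Formulas are strings or ('->', p, q)
    tuples. Each step must be an assumption, or must follow by
    modus ponens from two earlier steps. The checker is tiny and
    general, so it is easier to trust than any particular proof."""
    proved = []
    for step in steps:
        ok = step in assumptions or any(
            # modus ponens: from p and p -> step, conclude step
            ('->', p, step) in proved
            for p in proved
        )
        if not ok:
            return False
        proved.append(step)
    return goal in proved

# A two-step proof of q from the assumptions p and p -> q:
assert check_proof(
    assumptions=['p', ('->', 'p', 'q')],
    steps=['p', ('->', 'p', 'q'), 'q'],
    goal='q',
)
```

Note that the checker never needs to be clever: finding the proof is hard, but checking it is a mechanical scan, which is precisely why it can stay small and trustworthy.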
Just to reiterate (including some content from my other comment): we can be wrong only if we have written the wrong definition of linearizability, misstated the correctness theorem, or if there is a bug in the proof checker or in Coq's logic itself.
To add on to wilcoxjay's excellent response, this sort of proof development is exactly what Voevodsky would like to (eventually) see happen in mathematics. The linked proof of Raft uses Coq, which is a small proof checker which makes sure that no mistakes were made in the proof. Simpler, "paper" proofs had already been done for Raft.
I included the paper proof for Raft in my PhD dissertation, but there's a good chance it contains errors. Here's a quote from the intro:
"The proof shows that the specification preserves the State Machine Safety property. The main idea of the proof is summarized in Section 3.6.3, but the detailed proof is much more precise. We found the proof useful in understanding Raft's safety at a deeper level, and others may find value in this as well. However, the proof is fairly long and difficult for humans to verify and maintain; we believe it to be basically correct, but it might include errors or omissions. At this scale, only a machine-checked proof could definitively be error-free."
From Appendix B of my dissertation: https://github.com/ongardie/dissertation/#readme
This Coq proof is an enormous step forward. It's that machine-checked proof I was hoping someone would do when I wrote the above paragraph.
1) Assuming the TCB is correct, we know the Verdi implementation satisfies linearizability.
2) Assuming the TCB is correct and the Verdi implementation and the Raft paper/dissertation/spec are equivalent, then the Raft algorithm satisfies linearizability too.
I'm more willing to believe that the Verdi implementation implements Raft faithfully than I am willing to believe that my hand proof is correct. If you're really worried about (2), you have the option of using the Verdi generated implementation. If you're less worried about (2), you can now have more confidence in the safety of the Raft algorithm in general.
We are reading! This is a great question. We currently don't prove anything about liveness. We'd love to work on this.
As you probably know, Raft and other consensus algorithms are not guaranteed to be live in all situations. But subject to some assumptions about the frequency of failure, they are guaranteed to make progress.
In Verdi, systems are verified with respect to semantics for the network they are running on. Our semantics currently don't include any restrictions about how often failures can happen; a failure "step" can occur at any time. We're not sure what the best way is to introduce this kind of restriction, but we've got a couple ideas. One would be to guarantee that the total number of failures has some finite bound which is unknown to the system itself but which is available during verification. Another would be to model failure probabilistically. We will probably end up doing at least one of these things in the next year or so :).
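A rough Python sketch of the first idea: executions take arbitrary delivery and failure steps, but the total number of failures has a finite bound that the harness (standing in for the verifier) knows and the nodes do not. The `ToySystem` and its methods are invented for illustration; this is not Verdi's actual network semantics.

```python
import random

class ToySystem:
    """Stand-in for a verified system: nodes plus a delivery counter."""
    def __init__(self, n):
        self.nodes = set(range(n))
        self.delivered = 0
    def live_nodes(self):
        return sorted(self.nodes)
    def crash(self, node):
        self.nodes.discard(node)
    def deliver(self):
        self.delivered += 1  # ordinary message-delivery step

def run(system, steps, max_failures, seed=0):
    """Interleave delivery steps with failure steps. A failure step
    may occur at any time, but at most max_failures times overall;
    the bound is visible here, not inside the system itself."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(steps):
        if failures < max_failures and rng.random() < 0.1:
            system.crash(rng.choice(system.live_nodes()))
            failures += 1
        else:
            system.deliver()
    return failures

failures = run(ToySystem(5), steps=100, max_failures=2)
assert failures <= 2
```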
Thanks. That sounds like extremely interesting research. Being able to say things about liveness relative to probability of failure (or the distribution of probability of failure) would be very interesting.
This is interesting work. I'm curious how confident we can be that the TLA+ proof from Diego Ongaro was correctly represented in Verdi/Coq. This still seems like a manual, hard-to-verify process.
Hey, I'm James Wilcox, another member of the Verdi team.
This is a good question. More broadly, what do you have to trust in order to believe our claims?
In addition to all the usual things (like the soundness of Coq's logic and the correctness of its proof checker), the most important thing you need to trust is the top-level specification. In our case, this is linearizability (have a look at https://github.com/uwplse/verdi/blob/master/raft/Linearizabi... for the key definition; the key theorem statement is at https://github.com/uwplse/verdi/blob/master/raft-proofs/EndT... ). If you can convince yourself that these correspond to your intuitive understanding of linearizability, then you don't need to trust any other theorem statement or definition in the development.
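For intuition about what a linearizability specification demands, here is a naive brute-force check for a single register in Python. This is purely an illustration of the concept (exponential in history length, and not the Verdi definition linked above): an execution is linearizable if some total order of the operations respects real-time precedence and makes every read return the latest preceding write.

```python
from itertools import permutations

def linearizable(history, initial=None):
    """Brute-force linearizability check for one register. Each op is
    (invoke_time, return_time, kind, value), kind 'w' or 'r'."""
    for order in permutations(history):
        n = len(order)
        # Real-time constraint: an op placed later in the
        # linearization must not have returned before an earlier
        # one was invoked.
        if any(order[j][1] < order[i][0]
               for i in range(n) for j in range(i + 1, n)):
            continue
        # Sequential semantics: each read sees the latest write.
        last = initial
        ok = True
        for _inv, _ret, kind, value in order:
            if kind == 'w':
                last = value
            elif value != last:
                ok = False
                break
        if ok:
            return True
    return False

# write(1) during [0,2]; a read during [3,4] must return 1:
assert linearizable([(0, 2, 'w', 1), (3, 4, 'r', 1)])
assert not linearizable([(0, 2, 'w', 1), (3, 4, 'r', 0)])
```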
If you actually want to run our code, then you need to make several other assumptions. Our system runs via extraction to OCaml, so you must trust the extractor and the OCaml compiler and runtime. In addition we have a few hundred lines of OCaml to hook up the Coq code to the real world (eg, writing to disk and putting bits on the network).
To respond more directly to your question about Diego's proof, I can tell you we referred to it often to get the high-level idea. But the TLA model differs in several respects from our implementation of Raft in Verdi. Most importantly, our code is an implementation in the sense that you can run it. This means that it resolves all nondeterminism in the specification. Furthermore, there is no need to manually check that what we implemented matches the TLA model, unless that is your preferred means for convincing yourself that we really did implement Raft.
Distributed consensus is about having a group of processes agree on a single data value. For example, imagine you have a cluster of servers, where each node has a replicated database.
(1) You want all your nodes to have the exact same replica of the database, i.e., consistency across your cluster.
You would need to reach consensus before any node actually adds anything to its local database to make sure that property (1) is fulfilled.
Linearizability is just a consistency model, i.e., a variant of property (1) with stronger or weaker constraints.
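The quorum intuition behind property (1) can be sketched in a few lines of Python (names are illustrative): any two majorities of the same cluster overlap in at least one node, so two different values can never both be chosen.

```python
def quorum(n):
    """Smallest majority of an n-node cluster."""
    return n // 2 + 1

def chosen(acks, n):
    """A value counts as chosen once a majority of the n nodes
    accept it. Because any two majorities intersect, two conflicting
    values cannot both be chosen -- this intersection argument is the
    core of why protocols like Raft can preserve property (1)."""
    return len(acks) >= quorum(n)

assert quorum(5) == 3
assert chosen({'n1', 'n2', 'n3'}, 5)
assert not chosen({'n1', 'n2'}, 5)
```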