Amazing job! It has been mentioned on the Raft mailing list that a proof was in progress, but to be honest I did not expect anything to come up before a year or so. Forgive my lack of faith! :-)
Well done!
edit: to provide some context
- Raft is a distributed consensus algorithm that is seen by many as a viable alternative to *-Paxos because of its relative simplicity. It was created by D. Ongaro and J. Ousterhout (Tcl/Tk!)
- D. Ongaro's dissertation includes a TLA+ specification of Raft; TLA+ is a model checker (see https://stackoverflow.com/questions/22418448/can-coq-be-used... for a more detailed explanation of a model checker/theorem prover's job)
- Verdi is a Coq framework for writing formal proofs about distributed systems.
- Doug Woos and James Wilcox made a proof of Raft using Verdi; Verdi helps you figure out whether your implementation of X (in this case X=Raft) meets its specifications (Leader Safety, strong consistency, etc.)
edit: as noted by ahelwer, TLA+ is not a model checker but a language to describe a distributed system's specifications. I was referring to TLC which can work with TLA+.
Hey, I'm Doug Woos--thanks for this excellent summary! It's worth noting that the Raft proof was completed by a team of people, including me, my research partner James Wilcox (http://homes.cs.washington.edu/~jrw12/), and the other folks listed on our web page at http://verdi.uwplse.org/
Apologies to your research partner for leaving him out; it's edited! Congratulations on the proof, by the way. I am looking forward to this weekend so I can have some time to appreciate it in more depth! :-)
Thanks! Let us know if you have any questions about the proof itself--unfortunately, it's not particularly well-documented, and Coq isn't super easy to read in the first place.
At this point, the standard intro text is Software Foundations [1]. I highly recommend it; it will teach you Coq and also probably make you a better programmer. After SF, Certified Programming with Dependent Types [2] gets more into the practice of proving serious programs correct. These books are both available online in the form of literate Coq files.
As far as online lectures, OPLSS [3] often has Coq lectures which are quite good.
One could imagine doing some kind of differential testing of another key-value store against vard to ensure that for a random but large number of requests, they return the same value. I'm not sure that doing that would gain you anything over using Jepsen [1], though. One could also imagine implementing another key-value store in Verdi and proving it equivalent to vard; reasons to do this might include using better data structures or serialization mechanisms, or a different consensus protocol. I'm not sure that this would require significantly less effort than proving this new key-value store (usrd? whittend? :)) correct from scratch, but I'd love to see someone do this!
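To make the differential-testing idea concrete, here is a minimal Python sketch. Everything here is invented for illustration: the two stores are just dicts behind a uniform get/put interface, standing in for clients of vard and some other key-value store.

```python
import random

class DictStore:
    """Trivial in-memory store, standing in for a real KV-store client."""
    def __init__(self):
        self.data = {}
    def put(self, key, value):
        self.data[key] = value
    def get(self, key):
        return self.data.get(key)

def differential_test(store_a, store_b, num_requests=1000, seed=0):
    """Drive two stores with the same random workload and check that
    every read returns the same value from both."""
    rng = random.Random(seed)
    for _ in range(num_requests):
        key = rng.choice("abcde")
        if rng.random() < 0.5:
            value = rng.randint(0, 99)
            store_a.put(key, value)
            store_b.put(key, value)
        else:
            # The two stores must agree on every read.
            assert store_a.get(key) == store_b.get(key)

differential_test(DictStore(), DictStore())
```

Of course, a single-threaded harness like this misses exactly the concurrent and faulty executions that make consensus hard, which is why Jepsen-style testing (and proofs) exist.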
TLA+ is a formal specification language, not a model checker. There exists a model checker called TLC which works with TLA+. There also exists an associated proof system called TLAPS; TLAPS has been used to formally prove correctness of Byzantine Paxos.
What isn't yet clear to me about these kinds of proofs: how can you verify that this proof is correct? How can you know there isn't a wrong assumption or wrong bit of 'code' in those 5500 lines? Would it have been impossible for Voevodsky to make his original mistake in such a formalization? Obviously the point of his entire homotopy type theory project is that he thinks he couldn't have. But why are mistakes (near) impossible here?
This is a great question that permeates all verification work. In the lingo, we call the set of assumptions a proof makes the "trusted computing base" (see https://en.wikipedia.org/wiki/Trusted_computing_base ).
The question you ask was believed for a long time to be the death knell of formal verification. It was persuasively argued in "Social Processes and Proofs of Theorems and Programs" https://www.cs.umd.edu/~gasarch/BLOGPAPERS/social.pdf that any proof of a complex system is necessarily at least as complex. Thus there would be no reason to trust the proof any more than the original code.
The breakthrough in verification came when we started using machine-checked proofs. In this approach, you write a short, simple, and trusted proof checker. You can then verify complex systems using complex proofs that are checked by the simple checker. Then the only possible errors in the proof can come from the proof checker being wrong. Because the checker is simple and general (ie, it can check all kinds of proofs, not just proofs about a particular program), it is more trustworthy.
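To make the "small trusted kernel" idea concrete, here is a toy Python proof checker that knows only one inference rule, modus ponens. Real kernels like Coq's are far richer, but the shape is the same: a short, general piece of code that every proof, however large, must pass through.

```python
def check_proof(assumptions, steps, goal):
    """Toy trusted kernel. Formulas are strings or ('->', p, q)
    tuples. Each step must be an assumption, or must follow by
    modus ponens from two earlier steps. The checker is tiny and
    general, so it is easier to trust than any particular proof."""
    proved = []
    for step in steps:
        ok = step in assumptions or any(
            # modus ponens: from p and p -> step, conclude step
            ('->', p, step) in proved
            for p in proved
        )
        if not ok:
            return False
        proved.append(step)
    return goal in proved

# A two-step proof of q from the assumptions p and p -> q:
assert check_proof(
    assumptions=['p', ('->', 'p', 'q')],
    steps=['p', ('->', 'p', 'q'), 'q'],
    goal='q',
)
```

Note that the checker never needs to be clever: finding the proof is hard, but checking it is a mechanical scan, which is precisely why it can stay small and trustworthy.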
Just to reiterate (including some content from my other comment): we can be wrong only if we have written the wrong definition of linearizability, misstated the correctness theorem, or if there is a bug in the proof checker or in Coq's logic itself.
To add on to wilcoxjay's excellent response, this sort of proof development is exactly what Voevodsky would like to (eventually) see happen in mathematics. The linked proof of Raft uses Coq, which is a small proof checker which makes sure that no mistakes were made in the proof. Simpler, "paper" proofs had already been done for Raft.
I included the paper proof for Raft in my PhD dissertation, but there's a good chance it contains errors. Here's a quote from the intro:
"The proof shows that the specification preserves the State Machine Safety property. The main idea of the proof is summarized in Section 3.6.3, but the detailed proof is much more precise. We found the proof useful in understanding Raft's safety at a deeper level, and others may find value in this as well. However, the proof is fairly long and difficult for humans to verify and maintain; we believe it to be basically correct, but it might include errors or omissions. At this scale, only a machine-checked proof could definitively be error-free."
From Appendix B of my dissertation: https://github.com/ongardie/dissertation/#readme
This Coq proof is an enormous step forward. It's that machine-checked proof I was hoping someone would do when I wrote the above paragraph.
1) Assuming the TCB is correct, we know the Verdi implementation satisfies linearizability.
2) Assuming the TCB is correct and the Verdi implementation and the Raft paper/dissertation/spec are equivalent, then the Raft algorithm satisfies linearizability too.
I'm more willing to believe that the Verdi implementation implements Raft faithfully than I am willing to believe that my hand proof is correct. If you're really worried about (2), you have the option of using the Verdi generated implementation. If you're less worried about (2), you can now have more confidence in the safety of the Raft algorithm in general.
We are reading! This is a great question. We currently don't prove anything about liveness. We'd love to work on this.
As you probably know, Raft and other consensus algorithms are not guaranteed to be live in all situations. But subject to some assumptions about the frequency of failure, they are guaranteed to make progress.
In Verdi, systems are verified with respect to semantics for the network they are running on. Our semantics currently don't include any restrictions about how often failures can happen; a failure "step" can occur at any time. We're not sure what the best way is to introduce this kind of restriction, but we've got a couple ideas. One would be to guarantee that the total number of failures has some finite bound which is unknown to the system itself but which is available during verification. Another would be to model failure probabilistically. We will probably end up doing at least one of these things in the next year or so :).
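A rough Python sketch of the first idea: executions take arbitrary delivery and failure steps, but the total number of failures has a finite bound that the harness (standing in for the verifier) knows and the nodes do not. The `ToySystem` and its methods are invented for illustration; this is not Verdi's actual network semantics.

```python
import random

class ToySystem:
    """Stand-in for a verified system: nodes plus a delivery counter."""
    def __init__(self, n):
        self.nodes = set(range(n))
        self.delivered = 0
    def live_nodes(self):
        return sorted(self.nodes)
    def crash(self, node):
        self.nodes.discard(node)
    def deliver(self):
        self.delivered += 1  # ordinary message-delivery step

def run(system, steps, max_failures, seed=0):
    """Interleave delivery steps with failure steps. A failure step
    may occur at any time, but at most max_failures times overall;
    the bound is visible here, not inside the system itself."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(steps):
        if failures < max_failures and rng.random() < 0.1:
            system.crash(rng.choice(system.live_nodes()))
            failures += 1
        else:
            system.deliver()
    return failures

failures = run(ToySystem(5), steps=100, max_failures=2)
assert failures <= 2
```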
Thanks. That sounds like extremely interesting research. Being able to say things about liveness relative to probability of failure (or the distribution of probability of failure) would be very interesting.
This is interesting work. I'm curious how confident we can be that the TLA+ proof from Diego Ongaro was correctly represented in Verdi/Coq. This still seems like a manual, hard-to-verify process.
Hey, I'm James Wilcox, another member of the Verdi team.
This is a good question. More broadly, what do you have to trust in order to believe our claims?
In addition to all the usual things (like the soundness of Coq's logic and the correctness of its proof checker), the most important thing you need to trust is the top-level specification. In our case, this is linearizability (have a look at https://github.com/uwplse/verdi/blob/master/raft/Linearizabi... for the key definition; the key theorem statement is at https://github.com/uwplse/verdi/blob/master/raft-proofs/EndT... ). If you can convince yourself that these correspond to your intuitive understanding of linearizability, then you don't need to trust any other theorem statement or definition in the development.
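For intuition about what a linearizability specification demands, here is a naive brute-force check for a single register in Python. This is purely an illustration of the concept (exponential in history length, and not the Verdi definition linked above): an execution is linearizable if some total order of the operations respects real-time precedence and makes every read return the latest preceding write.

```python
from itertools import permutations

def linearizable(history, initial=None):
    """Brute-force linearizability check for one register. Each op is
    (invoke_time, return_time, kind, value), kind 'w' or 'r'."""
    for order in permutations(history):
        n = len(order)
        # Real-time constraint: an op placed later in the
        # linearization must not have returned before an earlier
        # one was invoked.
        if any(order[j][1] < order[i][0]
               for i in range(n) for j in range(i + 1, n)):
            continue
        # Sequential semantics: each read sees the latest write.
        last = initial
        ok = True
        for _inv, _ret, kind, value in order:
            if kind == 'w':
                last = value
            elif value != last:
                ok = False
                break
        if ok:
            return True
    return False

# write(1) during [0,2]; a read during [3,4] must return 1:
assert linearizable([(0, 2, 'w', 1), (3, 4, 'r', 1)])
assert not linearizable([(0, 2, 'w', 1), (3, 4, 'r', 0)])
```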
If you actually want to run our code, then you need to make several other assumptions. Our system runs via extraction to OCaml, so you must trust the extractor and the OCaml compiler and runtime. In addition we have a few hundred lines of OCaml to hook up the Coq code to the real world (eg, writing to disk and putting bits on the network).
To respond more directly to your question about Diego's proof, I can tell you we referred to it often to get the high-level idea. But the TLA model differs in several respects from our implementation of Raft in Verdi. Most importantly, our code is an implementation in the sense that you can run it. This means that it resolves all nondeterminism in the specification. Furthermore, there is no need to manually check that what we implemented matches the TLA model, unless that is your preferred means for convincing yourself that we really did implement Raft.
Distributed consensus is about having a group of processes agree on a single data value. For example, imagine you have a cluster of servers, where each node has a replicated database.
(1) You want all your nodes to have the exact same replica of the database, i.e., consistency across your cluster.
You would need to reach consensus before any node actually adds anything to its local database to make sure that property (1) is fulfilled.
Linearizability is just a consistency model, i.e., a variant of property (1) with stronger or weaker constraints.
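The quorum intuition behind property (1) can be sketched in a few lines of Python (names are illustrative): any two majorities of the same cluster overlap in at least one node, so two different values can never both be chosen.

```python
def quorum(n):
    """Smallest majority of an n-node cluster."""
    return n // 2 + 1

def chosen(acks, n):
    """A value counts as chosen once a majority of the n nodes
    accept it. Because any two majorities intersect, two conflicting
    values cannot both be chosen -- this intersection argument is the
    core of why protocols like Raft can preserve property (1)."""
    return len(acks) >= quorum(n)

assert quorum(5) == 3
assert chosen({'n1', 'n2', 'n3'}, 5)
assert not chosen({'n1', 'n2'}, 5)
```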