This page is a relatively informal discussion of distributed consensus and Paxos, what it does, how it works, and some tricks and variants.
Distributed consensus algorithms are a method of making consistent systems more reliable (or, depending on your perspective, of making reliable systems more consistent).
A system, such as a database, receives external requests and changes from one state to another. If it’s consistent, state changes should be permanent: Once it changes from state A to B, it should never forget that, or decide that it changed from A to C instead. (It might change back to state A, but that would be due to a future B -> A request.)
Let’s consider just a single state change, from a starting state to some other state. Different clients have proposals for the new state, and they send requests of the form “switch from the starting state to state X” – only one of these can succeed. If we can make a single consistent state change, we can do many of them to make a more complicated system.
This is easy to implement in a non-distributed system: Just take the requests in order, and accept the first one. But in a distributed system, this is tricky: We want the system as a whole to change state at some point, from “starting state” to “state X”, and never be able to forget that, even if individual parts of the system fail.
The API we’ll implement, concretely, is a “write-once cell” (also called a “sticky cell”): It starts out empty, and clients can try to write values to it, but only one write will succeed. That is, our API is get() and set(x), where every get() will either fail (don’t know the value yet) or succeed with a value. The important properties – called “safety” properties – are that every successful get() must return the same value, and that that value must have been an argument to set().1
A bit about the model: We’re assuming a completely asynchronous system, meaning that there are no synchronized clocks or bounds on how long a message can take to arrive. Computers communicate with each other by sending messages. Messages may be dropped or reordered.
Rather than explicitly model computer failure or downtime, we can model it by dropping all messages being sent to and from the failed computer. This assumes that each computer has reliable persistent storage, so if it crashes and comes back up, it goes back to the state it was in before.
In practice, we’ll use some number of storage machines, and the rest of the system won’t need persistent storage, but the exact setup can be varied in many ways, some of which are discussed below.
We assume non-Byzantine failures, meaning that computers will only fail by dropping packets, and not by sending invalid or malicious data.
What makes this tricky?
The single-computer system is very easy to make, but we want to make it more reliable2, which means being able to tolerate individual computers failing while the entire system continues to operate and accept requests without losing data. Of course the system can’t keep working if every computer goes down, but we want to, say, tolerate the failure of up to one computer at a time, and keep operating. Generalizing to f failed computers at a time uses the same ideas, so we’ll stick to tolerating one failure for now.
A first approach you might try is to have two computers, a primary and a backup. At first, all clients send requests to the primary, but if it becomes unreachable, they use the backup instead. Unfortunately, this approach is unworkable. Imagine there’s a network partition between the primary and the backup, so packets from one side can’t reach the other (as far as either computer is concerned, the other one went down). If both the primary and the backup keep accepting requests, they could get into an inconsistent state. But if either of them stops accepting requests, that means that (from their perspective) the system stopped functioning entirely, despite losing only one computer.
So any approach with only two computers is doomed to fail. But the above argument doesn’t apply with three computers: If one computer becomes partitioned from the other two, it can’t keep accepting requests, but the other two can still talk to each other, which breaks symmetry and allows one side of the partition to keep running. As we’ll see, three computers are enough (to tolerate the failure of one computer at a time).
Another important general constraint: No matter how many computers we use to implement this, to set the value, we’re always going to have to interact (directly or indirectly) with at least two of them. If our implementation of “set” only interacts with one computer, which processes its request without interacting with any of the others, it might crash immediately after that. At that point the rest of the system is in trouble: Two computers are up (so it should continue to function), but none of them know about the existing value. If the rest of the system can implement a set request for a different value, we have a conflict; if it can’t, the system is permanently stuck, and can never serve another set or get request successfully.
So every successful set(x) request is going to involve interacting with at least two computers. Moreover, we need to handle one computer being down, so it can’t require writing data to all three computers. Two is both necessary and sufficient.
A single cell
If you’re implementing the write-once cell API on a single thread on a single computer, it’s very easy: Just have a regular memory cell. To get(), just read the cell. To set(x), first read the cell. If it’s empty, set it to x; if it’s not empty, just fail (do nothing). We’re going to extend this approach to multiple computers.
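As a reference point, here’s that single-threaded version as a sketch in Python (class and method names are just for illustration):

```python
class WriteOnceCell:
    def __init__(self):
        self._value = None  # empty to start

    def get(self):
        """Return the value, or None if it hasn't been set yet."""
        return self._value

    def set(self, x):
        """Try to set the value; only the first set ever succeeds."""
        if self._value is None:  # read first...
            self._value = x      # ...then write if empty
            return True
        return False             # already set: fail, do nothing
```

Everything that follows is about preserving exactly these semantics once the cell is spread across multiple computers.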
As discussed above, we need at least three computers, so let’s use three. We also need every successful write to write some data to two computers. A set of two computers is what’s called a “quorum”; the important property is that any two quorums intersect. That means that if you write a value to two computers, and then someone else reads the values at two computers, they’re guaranteed to see your value. (More options for quorums are discussed later.)
So here’s a proposal: Each computer stores a value candidate. To set a value successfully, we need to write it to at least two computers; to get the value, we send read requests to all three computers, and if at least two of them report the same value, that value must have been successfully set. (You can imagine doing fancy erasure coding to store data more efficiently when you write it to multiple places, especially as the number of failures we want to tolerate increases, but we’ll simply write multiple copies.)
It’s easy to see that this satisfies the safety properties: At most one value can be written to two computers. Unfortunately, it’s easy to get stuck if multiple people try to set different values at the same time:
[A] [B] [C]
Also, even if one client succeeds at first:
[A] [A] [B]
We can get stuck if one computer goes down:
[?] [A] [B]
Here, A was successfully set, but we can’t distinguish A from B, and we have no way to recover.
On the other hand, if we just let anyone overwrite the values stored in individual computers, then the system can’t get stuck, but the safety properties are easily violated. We need something better.
Here’s a proposal for avoiding getting stuck: Before writing to any computers, require clients to “lock” the system. In this case locking means sending a “lock” request to every computer, which will grant the lock if it hasn’t already granted it to another client (it has to write the lock to disk persistently, so it won’t accidentally regrant it if it crashes). If a client gets two or more lock grants from individual computers, that means no one else could have gotten two or more locks, so it’s locked the entire system. At that point it’s allowed to write to the computers, and it’s guaranteed not to have conflicts.
This works for handling failures of the storage machines. But it raises new issues:
- What if the locks themselves conflict, e.g. if there are three clients and each one gets one lock? You can imagine addressing this by having clients release their locks if they don’t manage to get two, and retry later.
- Bigger problem: What if a client crashes while holding a lock?
It turns out we can solve both of these problems by introducing “preemptible” locks, where even if one client is holding the lock – even if it crashes while holding it – another client can take it. In a synchronous system, we could use timestamps to make locks expire after not being used for a while, but it’s possible to make this work even without clocks, as follows:
Give each lock a “lock ID”. Lock IDs are totally ordered (e.g. they can be integers). Higher lock IDs beat lower lock IDs: If a computer has been locked with ID L, it can still be locked with any higher ID; at that point lock L is invalidated. Every request that uses the lock includes its ID; if the receiver of the request has granted a higher ID since then, it ignores or rejects the request (since it knows the lock is obsolete).
Preemptible locks let us fix the flaws in the above protocol. Each storage machine stores a lock ID, as well as a provisional value and which lock ID was used to set it. To try to set the cell, first “take a lock”, by picking a lock ID L, and sending a lock request to each of the three computers. If two of them grant the lock, that means you have lock L (and no one else does). Once you have the lock, you can send each computer a read request, to ask it what its current value is (and what lock was used to set it). This corresponds to the “reading” part of the simple single-threaded implementation.
In practice, you’d combine these two requests – “lock” and “read” – into a single “lock_and_read” request that tries to lock a storage machine and asks it to return its current state if the lock is successful.
Then look at the values you’ve received:
- If you don’t see any values, that means you can set the cell to whatever you want.
- If you see at least two equal values set with the same lock, that means a previous client successfully set the cell for the entire system.
- If you see only one value, or values (whether equal or different) set at different locks, that means the situation is uncertain. Someone else might have succeeded at setting the system state, or maybe they were interrupted after only writing one value.
The third case is the interesting one. If any previous set succeeded, you can’t set the value you were trying to set. You might not know whether one succeeded before, but to be on the safe side, switch to writing copies of one of the existing values. Specifically, if there are conflicts, it’s always safe to pick the value with the highest lock ID (each lock ID can only have one value, since it was set by the same client), and write that instead of the value you originally wanted to write.
(Every write you do must include the ID of the lock you hold. Usually “write” means a combined lock_if_necessary_and_write request, which the receiver rejects if they’ve granted a higher lock; this lets you write even to receivers that you haven’t locked yet in the read phase.)
Some details about the above:
Why do you have to switch your value even if you don’t see two copies of the same value? Imagine this situation:
[A@1] [A@1] [   ]
 L:1   L:1
(A@1 means the value A written with lock #1; L:1 means the latest lock granted by this computer is lock #1.)
As a client, you come along and use lock ID 2 to try to set the values to B. Can you do it? If the first computer is down and you successfully lock the second and third computers, you’ll see this:
[ ? ] [A@1] [   ]
  ?    L:2   L:2
Even though you only see one copy of A, a previous client might still have written two copies (and did, in this case). Therefore, to be on the safe side, you should set A instead of B.
Why pick the value with the highest lock ID? The answer is “by induction”: If you see conflicting values with different lock IDs, that means the holder of the higher lock didn’t see the value with the lower lock (because if it had, it would have switched to writing copies of that value), or itself saw a value with a higher ID. That means the value with the lower lock couldn’t have been written successfully, so it’s safe to ignore it.
Why aren’t two equal values with different lock IDs good enough to accept a value? Imagine this situation:
[A@1] [A@3] [B@2]
 L:3   L:3   L:2
Can you just accept the value A here, since there are two copies of it? No, because another client might come along with lock ID 4 and see this:
[A@1] [ ? ] [B@2]
 L:4    ?    L:4
And not know that A won. Indeed, A definitely didn’t win here, because B was set with a higher lock.
Now we know how to implement set. Given this, get is easy: Just try to read all three computers, and if you see the same value set at the same lock ID on two of them, you know that’s definitely the value that was set. If not, the value is uncertain.
This algorithm is called Paxos34 (specifically, “single-decree Paxos” – because it only decides a single value – or “the Synod protocol”). A few more notes:
A lock ID may include the client ID. The original Paxos protocol does this, and although it isn’t necessary for the correctness of the above, it’s required for some variants (see “Flexible Paxos” below).
Of course, three computers tolerating the failure of up to one isn’t the only option. The critical property is that, when we read/lock computers, we must read enough to see at least one copy that was previously written successfully. Paxos operates on a set of computers, and requires reads and writes to succeed on a “quorum” of them, where quorums are defined such that any two quorums must intersect. This can be generalized in useful ways, discussed below.
Paxos usually refers to “lock IDs” as “voting rounds” or “ballots”. I think locks are a more intuitive way of thinking of them, especially since a majority isn’t required to write successfully.
Here’s the algorithm described in more detail:
- Pick a lock ID L.
- Send “lock_and_read(L)” requests to all storage machines or at least a quorum.
- If a quorum of machines hasn’t successfully granted you the lock, abort or retry. (When retrying a set, you may want to pick a lock ID higher than the highest one you’ve seen.)
- Otherwise, you now have lock L.
- Check whether there may be an existing value: Among all the successful reads, if any of them contains a value, it may have already been chosen. Switch to writing the value with the highest lock ID you saw, instead of x.
- Send “write(L, chosen_value)” requests to storage machines.
- If a quorum of writes was successful (responded with a write_accepted message), the set has succeeded.
- Otherwise, abort. (If your write messages timed out, you may have been successful, but you’re not sure. If they got rejected, you can figure out what the situation is from rejection messages, and possibly retry.)
The get() algorithm:

- Send a read request to all storage computers or at least a quorum.
- If a quorum has the same value written with the same lock ID, accept it.
- Otherwise, abort. The value is uncertain. (You may try to write down a nop or one of the existing values if you want it to be certain.)
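Here’s a sketch of both phases in Python, with in-memory Machine objects standing in for networked storage. A rejection (or an unreachable machine) is modeled as an exception, and names like paxos_set are mine, not standard:

```python
class Machine:
    def __init__(self):
        self.granted = 0                       # highest lock ID granted so far
        self.value, self.value_lock = None, 0  # accepted value, if any

    def lock_and_read(self, L):
        if L <= self.granted:
            raise IOError("lock rejected")     # an equal or higher lock exists
        self.granted = L
        return self.value, self.value_lock

    def write(self, L, x):
        if L < self.granted:
            raise IOError("write rejected")    # a higher lock preempted L
        self.granted, self.value, self.value_lock = L, x, L


def paxos_set(machines, L, x, quorum=2):
    # Phase 1: take the lock and read current state from a quorum.
    reads = []
    for m in machines:
        try:
            reads.append(m.lock_and_read(L))
        except IOError:
            pass
    if len(reads) < quorum:
        return None  # didn't get the lock; retry with a higher lock ID
    # If any machine already holds a value, switch to the one with the
    # highest lock ID instead of x (it may already have been chosen).
    seen = [(lock, v) for v, lock in reads if v is not None]
    value = max(seen)[1] if seen else x
    # Phase 2: write the chosen value to a quorum.
    acks = 0
    for m in machines:
        try:
            m.write(L, value)
            acks += 1
        except IOError:
            pass
    return value if acks >= quorum else None


def paxos_get(machines, quorum=2):
    states = [(m.value, m.value_lock) for m in machines if m.value is not None]
    for s in set(states):
        if states.count(s) >= quorum:
            return s[0]  # a quorum agrees on value and lock: it's certain
    return None          # uncertain
```

Note how a later set with a higher lock ID is forced to adopt the already-written value rather than its own.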
Storage computer (acceptor) algorithm:

Each storage computer persistently stores:

- Granted lock ID (if any)
- Accepted value (if any) and what lock ID was used to write it
To handle a lock_and_read(L) message:
- If L is higher than your granted lock ID (if any), grant the lock. Send a “granted” response containing the value you’re storing and the lock ID it was written at (if any).
- Otherwise, reject it. (You may want to include your granted lock ID to be helpful.)
To handle a write(L, x) message:
- If L is lower than your currently granted lock ID, reject the write with a “write_rejected” message. (You may want to include your state to be helpful.)
- Otherwise, accept the write. Write down “x@L” as your latest accepted value. If L is higher than your previously granted lock ID, implicitly grant L, by setting your granted lock ID to L. Then send a “write_accepted” message.
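The acceptor’s two handlers, sketched in Python (persistence is elided; a real acceptor must write its state to disk before replying):

```python
class Acceptor:
    def __init__(self):
        self.granted_lock = None    # highest lock ID granted, if any
        self.value = None           # accepted value, if any
        self.value_lock = None      # lock ID the value was written at

    def on_lock_and_read(self, L):
        # Grant only strictly higher lock IDs than any granted before.
        if self.granted_lock is None or L > self.granted_lock:
            self.granted_lock = L   # persist before replying, in real code
            return ("granted", self.value, self.value_lock)
        return ("rejected", self.granted_lock)   # include state, to be helpful

    def on_write(self, L, x):
        # Reject writes made with an obsolete lock.
        if self.granted_lock is not None and L < self.granted_lock:
            return ("write_rejected", self.granted_lock)
        self.value, self.value_lock = x, L       # write down x@L
        if self.granted_lock is None or L > self.granted_lock:
            self.granted_lock = L                # implicitly grant L
        return ("write_accepted",)
```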
Logs and multi-Paxos
The above lets us implement a single write-once cell on a system of computers where any of them may fail. This is already enough to implement useful systems like databases. The standard approach is along the lines of: Write a log of commands that can be executed deterministically, and use Paxos to decide on each entry of the log (so have many independent instances of the write-once cell protocol, one for each integer index of the log, starting at 0). Then you can always figure out the state of the system: Just start at a known initial state and apply all the commands. (In practice, of course, you can keep a snapshot of “the state at position N”, so you don’t have to apply every command back to the beginning of history.)
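As a toy sketch of the log idea – a counter state machine whose command names are made up – replaying from a snapshot looks like:

```python
def apply_command(state, cmd):
    # A deterministic command: ("add", n) adds n to an integer state.
    op, n = cmd
    return state + n if op == "add" else state

def replay(log, snapshot=0, start=0):
    """State after applying log[start:] on top of a snapshot taken at `start`."""
    state = snapshot
    for cmd in log[start:]:
        if cmd is None:      # undecided entry: can't interpret any further
            break
        state = apply_command(state, cmd)
    return state
```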
Unfortunately, running Paxos for each entry of the log is very expensive. Even in the best case, it requires two roundtrips (one to get the lock, a second to write the value), and if there’s contention, it can take arbitrarily long (as different clients take each other’s locks over and over).
A common approach to deal with this is the following: Instead of running independent instances of Paxos – where each one has its own lock IDs – have shared lock IDs among all the instances in the entire log. Then, someone can just keep the lock after writing to one cell and use it to write to another cell. Now they can write in only one roundtrip!
Usually a lock holder is called a “leader”. They might also be one of the storage machines, though that’s not mandatory. When someone wants to write to the database, they send their request to the leader (they might first ask around to see who the leader is, if they don’t know). If the leader doesn’t seem responsive, someone else can try to take over as the leader by picking a higher lock ID. Note that, when they do, they need to do the “reading” part of the one-cell protocol for every uncertain log entry.
(In practice you’ll have engineering decisions for when to try to take over as the leader, e.g. pinging the existing leader for health checks, and using randomized timeouts for when to become the new leader if the old one failed. Any policy for this is correct as far as safety goes.)
When a node tries to become a new leader, it may know about noncontiguous log entries. For example, it might see:
  0     1     2     3     4
[ ? ] [ ? ] [ ? ] [ ? ] [ ? ]
[A@1] [B@2] [C@2] [   ] [D@3]
[A@1] [B@2] [   ] [   ] [D@3]
Since the log must be interpreted in order, it can’t use entry 4 until it knows what entries 2 and 3 are. It can address this by trying to set every earlier log entry to a “nop” command. In this case, C@2 might have been written successfully, so the leader can’t set it to a nop (it has to set it to C), so the log would end up in a state like this:
  0     1     2     3     4
[ ? ] [ ? ] [ ? ] [ ? ] [ ? ]
[A@1] [B@2] [C@4] [🗙@4] [D@3]
[A@1] [B@2] [C@4] [🗙@4] [D@3]
Bag of tricks
Paxos may be best thought of not as a single protocol, but as a framework for making consensus protocols that fit a particular situation. There’s a big “bag of tricks” that fit into this framework that have been suggested over the years. I’ll discuss a few of them; there are many others.
A general principle to keep in mind for distributed systems: Computers fail all the time, especially if you have a lot of them, but any individual computer fails pretty infrequently, so if you pick out any one computer (say, a leader), it usually makes sense to assume failure is rare, and design for the fast path. That means you should approximate a single-computer system as much as possible, and do only as much extra work (like sending extra copies of data to other computers) as is absolutely necessary to be able to recover in case the leader crashes. A lot of Paxos optimizations implicitly use this principle.
Read and write quorums
Paxos has a single notion of a “quorum” of storage machines, where the critical property is that any two quorums intersect. This is stronger than necessary. The actual property we want is that, for any writer and reader, if the writer succeeded, then the reader is guaranteed to see at least one copy that that writer wrote. The notion of “quorum” can be split into separate “read quorums” and “write quorums”, used for the reading/recovery phase and the writing phase, respectively. Since in steady state you often issue a lot more writes than reads, it can be worthwhile making write quorums smaller at the expense of making read quorums larger.
Flexible Paxos5 discusses some things you can do here. Some examples: You could have 10 computers and allow up to 2 failures. Then declare that a “write quorum” is any set of 3 computers, and a “read quorum” is any set of 8 computers. This gives you a lot more flexibility when writing.
The definition of quorums can be more sophisticated, rather than just cardinality. For example, you can arrange, say, 15 storage machines into a grid:
     W1  W2  W3  W4  W5
R1  [ ] [ ] [ ] [ ] [ ]
R2  [ ] [ ] [ ] [ ] [ ]
R3  [ ] [ ] [ ] [ ] [ ]
And define a read quorum as any row, and a write quorum as any column. Of course any row intersects with any column. Now you only need one row + one column (7 storage machines) to be able to both write and recover. That’s not even a majority! Of course if you lose a row + column, you can neither read nor write, even if you do have a majority. But constraining your reads/writes in this way gives you more options. You can also design more complicated read/write quorum systems for specific applications6.
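Here’s the grid written out as index sets in Python, checking the row/column intersection property (a machine is identified by its (row, column) position):

```python
# Read quorums are rows, write quorums are columns, over a 3x5 grid.
ROWS, COLS = 3, 5

read_quorums = [frozenset((r, c) for c in range(COLS)) for r in range(ROWS)]
write_quorums = [frozenset((r, c) for r in range(ROWS)) for c in range(COLS)]

# The property consensus needs: every read quorum intersects every write quorum.
assert all(rq & wq for rq in read_quorums for wq in write_quorums)

# One row plus one column is 7 of the 15 machines -- not even a majority --
# yet it's enough to both recover (read) and write.
assert len(read_quorums[0] | write_quorums[0]) == ROWS + COLS - 1 == 7
```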
Note that, if you’re not careful, when using read/write quorum systems, lock IDs must include the name of the lock holder, rather than just being arbitrary integers. This is because it’s possible for two read quorums not to intersect, such that two people think they’ve acquired the same lock. In the grid example above, one client could get a read lock (id L) from all of R1, and another client could get a read lock (also id L) from all of R2. Then if they use the same lock to write different values to W1 and W2, both writes would succeed, leading to an inconsistency.
It’s possible to avoid this problem if you’re careful. The trick (which I haven’t seen described anywhere) is to use two kinds of write requests:
- write_using_lock(L, x), which the receiver accepts if L is equal to their currently granted lock number.
- lock_and_write(L, x), which the receiver accepts if L is greater than their currently granted lock number, in which case they also implicitly grant L.
Using these, a client can keep track of which computers it’s locked, and – if a write is the first time it interacts with a computer – use a lock_and_write request instead of a write_using_lock request.
This solves the above problem, because even if two people lock different rows with the same choice of L – so they both think they own L – neither of them will be able to use it to write to any column successfully. (They may successfully write to individual computers, so multiple conflicting values may be written in the same round, but that’s OK since none of them can actually write a full column – and therefore acknowledge the write – in that case, so a future reader is free to pick any of the values if it sees a conflict.)
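A sketch of the two handlers in Python (the strict inequality in lock_and_write is what makes two holders of the same L harmless):

```python
class GridAcceptor:
    def __init__(self):
        self.granted = 0            # 0 = no lock granted yet
        self.value = None
        self.value_lock = None

    def write_using_lock(self, L, x):
        if L != self.granted:       # only the holder of exactly L may write
            return False
        self.value, self.value_lock = x, L
        return True

    def lock_and_write(self, L, x):
        if L <= self.granted:       # strictly higher: a same-L "owner" loses
            return False
        self.granted = L            # implicitly grant L
        self.value, self.value_lock = x, L
        return True
```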
On the other hand, including the lock holder in the lock ID can also help in the case of contention, even for regular Paxos: If several potential leaders all request lock 1, they might fight each other for it so no one gets it. But if A,B,C request locks (1,A) (1,B) (1,C) respectively, then (1,C) will always win.
Committed log entries
When you’re first setting a value, copies are “provisional” because you can’t atomically send messages to multiple computers at once. But once you’ve successfully set it (with enough copies at the same lock ID), its value can never change. You can write that down somewhere for convenience – “log entry N is permanently X” – and if anyone sees that, they don’t need to read a quorum to figure out the value anymore.
If you’re processing the log into snapshots of the current state, then once a snapshot is replicated enough, you don’t need to keep the log entries at all. If you know that there won’t be any requests for log entries below N, you can delete them. This helps space usage stay small, assuming the snapshots are small, rather than grow indefinitely as the log grows.
Roles

There are many different ways you can set up a consensusy system. For example, you can have one type of computer that acts as a leader, stores provisional log entries, interprets the log and stores snapshots of it, and so on. Or you can split it up: Have one set of computers that just act as leaders (normally called “proposers”), one set of computers to store provisional log entries (normally called “acceptors”), and another set of computers to interpret the log entries once they’re certain (normally called “learners” or sometimes “replicas”).
You can also split things up even more, as discussed in Compartmentalized Paxos7. For example, if the load on the leader (which serializes all requests) is too high, it can farm out the work of communicating with acceptors to a bunch of stateless “proxy leaders”, which send out the request to all the acceptors.
(Another thing a leader can do, which I don’t know if anyone does, is farm out the work to clients themselves. When a client wants to write a log entry, it requests a token from the leader; the token includes the next log entry index, and the ID of the lock the leader holds. At this point, in the fast path, the leader becomes an “increment an integer” service – about as little work as you can hope for from a leader-based protocol!)
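In that fast path, the leader’s whole job might look like this (purely illustrative names):

```python
class TokenLeader:
    """Hands out (log index, lock ID) tokens; clients do the writes themselves."""

    def __init__(self, lock_id):
        self.lock_id = lock_id   # the lock this leader currently holds
        self.next_index = 0      # next unassigned log entry

    def next_token(self):
        # The fast path really is just "increment an integer".
        token = (self.next_index, self.lock_id)
        self.next_index += 1
        return token
```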
Sometimes, depending on how you split things up, you can save message hops by thinking carefully about who sends messages to whom. For example, if a leader sends proposed commands to acceptors, they can immediately inform replicas that they’ve accepted the commands. As soon as a replica sees enough copies, it can process the command, without having to wait for the acceptors to respond to the leader. Various optimizations like this are possible.
Linearizability is the following property of a system: If you send a write request, and you get a successful acknowledgement, then every future read request anyone makes will see the results of that write (“future” means in your future light cone). In the context of a log system like the one we’ve described, this means that all future reads must be executed at a log entry after the one assigned to the write. This is easy if you can include a log entry index with the read request, but without that, it becomes trickier. You might think that you can just send a read request to the leader, but it’s possible that someone else became the leader without the old leader knowing (e.g. if it was isolated from the rest of the system). To be sure, the leader needs to talk to a quorum of storage machines after it receives the read request, to make sure it’s still the leader (or at least to find a log index high enough that reading at it preserves linearizability).
Alternatively – to take some load off the leader, and/or to save a roundtrip – a client can talk to the storage machines directly, and ask them what the latest log entry they know of is (and possibly the state of the read executed at that entry). For any (read) quorum of them, take the highest log entry seen in that quorum, and it’s valid to execute the read as of that entry, without talking to the leader. This is discussed in Paxos Quorum Reads8.
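The index-selection part of that can be sketched as follows (a simplification of the Paxos Quorum Reads idea; names are mine):

```python
def read_point(latest_entries, read_quorum_size):
    """Pick a log index to execute a read at, without talking to the leader.

    latest_entries: the latest log index reported by each machine that has
    replied so far. Once a read quorum has replied, the max over that quorum
    is a safe point to read at.
    """
    if len(latest_entries) < read_quorum_size:
        return None   # not a quorum yet; keep waiting for replies
    return max(latest_entries)
```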
Consecutive lock IDs
As discussed above, values written at distinct lock IDs generally have to be considered distinct, even if the values themselves are equal.
For example, A can’t be accepted in this situation (allowing up to 2 failures):
[A@2] [A@4] [A@4] [ ? ] [ ? ]
 L:4   L:4   L:4
because there might be a value written at lock ID 3, which beats A@2. So we need to rewrite all the values of A with the same lock.
However, there’s an exception to this rule, described in Consecutive Ballots 9. If all the values are at consecutive lock IDs, the above argument doesn’t apply, and it’s actually OK to accept the values directly. For example:
[A@2] [A@3] [A@4] [ ? ] [ ? ]
 L:4   L:4   L:4
It’s impossible for the unknown values to preempt any of the values we see, and so it’s fine to accept A directly.
(A similar trick is also used in “coordinated recovery” in Fast Paxos 10.)
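Here’s a rough sketch of the shape of that check (my formulation, not the full Consecutive Ballots protocol; reads are (value, lock ID) pairs seen during the read phase):

```python
def consecutive_accept(reads, write_quorum_size):
    """Accept a value directly if enough equal copies sit at consecutive locks.

    Equal values whose distinct lock IDs form a consecutive range can't be
    preempted by any unseen lock ID in between. A single repeated lock ID
    (the regular "same lock" rule) is the degenerate case of this.
    """
    if len(reads) < write_quorum_size:
        return None
    values = {v for v, _ in reads}
    locks = sorted({lk for _, lk in reads})
    consecutive = locks == list(range(locks[0], locks[-1] + 1))
    return reads[0][0] if len(values) == 1 and consecutive else None
```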
Tricky lock IDs
You can use more complicated lock IDs. We discussed (n, proposer_id) locks, but you can also have (n, proposer_id, m) locks, such that the proposer can advance the lock ID (to invalidate writes issued with earlier locks) without doing a full read phase. This trick is used in Matchmaker Paxos11, but may also be useful in situations where the leader has delegated writing a particular entry to someone else (as discussed in the Roles section above) and wants to make sure that write doesn’t happen when the delegatee seems to time out.
Another trick is the following: If you’re uniquely assigned the very first lock ID (because your proposer ID is the lowest, for instance), you can skip the read phase entirely, and go straight to writing a value, because you know no one could have written a value with an earlier lock ID. Effectively, the storage machines can start out “pre-locked”. This isn’t very important in a leader election situation like Multi-Paxos, because the cost of acquiring the lock is amortized over many log entries, but it can be useful if you only want consensus for a single value. One example of this is the Paxos Commit protocol, which is a generalization of two-phase commit that uses one instance of Paxos for each transaction participant’s decision12 (round 0, pre-assigned to the transaction participant, may be either accept or abort; later rounds, used in recovery in case the participant fails, are chosen to be abort unless a participant’s previous choice of accept is visible).
Raft trick: No holes in the log
Raft is a consensus protocol very similar to Paxos (most people would probably call it a Paxos variant) that makes a few different decisions. One thing Raft does is implicitly forbid holes in the log. It uses the same idea of lock IDs, but always uses “append” requests at the leader, rather than anything addressing the log directly. This means that each computer keeps a contiguous log, and an index past which log entries are provisional:
                 ?   ?   ?
[A] [B] [C] [D] [E] [F] [G]
Once enough computers have acknowledged some prefix of the leader’s log, that prefix becomes committed. But if another computer becomes the leader and writes something else, the provisional prefix may be overwritten.
This can simplify parts of the protocol, since dealing with holes is annoying, but may reduce throughput in some cases (I think) since commands can’t be written to future log entries concurrently.
The leader is a clear bottleneck. We discussed a few ways to reduce the bottleneck above, but here’s another one: Run two entirely different instances of a consensus protocol, one for the even entries of the log and one for the odd entries. (More generally, run k different instances for each residue mod k.)
When a client inserts, it can randomly pick an instance to insert into. If one instance falls behind –
 0   1   0   1   0   1   0   1   0
[A] [B] [C] [ ] [D] [ ] [E] [ ] [F]
– someone can write nops into the empty entries to catch it up, so the later commands can be interpreted.
This type of approach is used by Mencius13.
We described a write-once cell that can change from the starting (“empty”) state to one state, and then never change again. Then we implemented a “mutable” state by applying a log of state transitions.
If your state is small, though, it’s possible to mutate the state directly, with a CAS (compare-and-swap) operation, rather than append to a log. Simply use the “current” state (the one written at the highest lock ID out of a read quorum), apply some function to it to get a new state, and write that state out to a write quorum.
(Note: You don’t need to first write out a write quorum of the existing value! Even if, in a read quorum, you only see a single copy x at the highest lock ID, you can go straight to writing f(x).)
Regular single-cell Paxos can be seen as a special case of this, where the transition function is always of the form “f(empty) = A; f(x) = x”. The details are described in CASPaxos14.
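Here’s a rough sketch of that read-apply-write cycle (hypothetical names; lock acquisition, retries, and competing proposers are all elided):

```python
class Replica:
    def __init__(self):
        self.lock_id, self.state = 0, None   # None stands for the "empty" state
    def load(self):
        return (self.lock_id, self.state)
    def store(self, lock_id, state):
        # Accept writes tagged with a lock ID at least as high as ours.
        if lock_id >= self.lock_id:
            self.lock_id, self.state = lock_id, state

def cas_update(read_quorum, write_quorum, lock_id, f):
    # 1. Read a quorum and take the state written at the highest lock ID.
    #    Even if only one replica holds it, we don't re-write the old
    #    value first -- we go straight to writing f(x).
    _, current = max((r.load() for r in read_quorum), key=lambda t: t[0])
    # 2. Apply the transition function locally.
    new_state = f(current)
    # 3. Write the new state, tagged with our lock ID, to a write quorum.
    for r in write_quorum:
        r.store(lock_id, new_state)
    return new_state

replicas = [Replica() for _ in range(3)]
# Single-cell Paxos as the special case f(empty) = A; f(x) = x:
write_once = lambda x: "A" if x is None else x
cas_update(replicas, replicas, lock_id=1, f=write_once)
cas_update(replicas, replicas, lock_id=2, f=write_once)  # still "A"
```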
There’s an important distinction between consensus and regular replication. In many ways they’re similar – consensus requires replication for safety – but if there’s no concern about disagreement about your data, your life can be much easier. For example, if you’re making a content-addressed storage system – where keys are cryptographic hashes of their values, so for a given key, only one value can ever exist – and you have a bunch of storage servers, you can just write your data to a few servers and be done, with no risk of conflict. Similarly, if you use a long (256-bit) random string as a key, there’s no risk of conflict, so you can just write some copies of it without consensus.
If you want to write a large command, it may not be a good idea to pass it through your consensus system, which may be relatively inefficient and hard to scale. Instead, you can first write it to a consensus-free blob store (which might just mean writing copies to the nodes you’re planning to send the commands to) – using a hash or random ID as a key – and, once it’s durable, just write the ID to your consensus system. This way your consensus commands can stay small.
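A sketch of that split (hypothetical names; `blob_store` stands in for copies written to a few storage nodes, and `consensus_log` for the expensive consensus-backed log):

```python
import hashlib

blob_store = {}     # stands in for copies on several storage servers
consensus_log = []  # stands in for a consensus-backed log

def put_blob(data: bytes) -> str:
    # Content-addressed: the key is the hash, so only one value can
    # ever exist per key and no consensus is needed for this write.
    key = hashlib.sha256(data).hexdigest()
    blob_store[key] = data   # in reality: write copies to a few servers
    return key

def append_command(data: bytes):
    key = put_blob(data)        # 1. make the blob durable, consensus-free
    consensus_log.append(key)   # 2. pass only the small key through consensus

def read_command(index: int) -> bytes:
    return blob_store[consensus_log[index]]

append_command(b"a very large command body")
```

The consensus system only ever sees fixed-size 64-character hashes, no matter how large the command bodies get.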
A surprisingly tricky part of this sort of protocol is reconfiguration, or membership changes. If you have a cluster of some machines, and one of them permanently fails, you probably want to replace it. You may also want to add or remove machines for other reasons, or change other parameters. This is a surprisingly subtle issue which is often glossed over; many “obvious” approaches have flaws when you think about them carefully.
I’ll probably cover some approaches in the future.
Avoiding leader contention
As discussed above, the leader is a bottleneck in making a high-performance distributed system. More generally, the idea of a single log that needs to be interpreted in order fundamentally restricts performance to no better than single-threaded. Using log entries mod k as discussed above allows for multiple leaders, but the system is ultimately still single-threaded, since the replicas interpreting the log have to apply commands in order.
A common solution for increasing performance involves sharding data (e.g. by key range or by hash). Then each shard can still have a single log, but the system as a whole can run on many computers independently. This introduces other complexities, like not necessarily having a single snapshot of the entire system’s state, and not being able to easily operate atomically on multiple shards. This is usually addressed with distributed transaction systems, which are a very interesting topic beyond the scope of this post.
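A sketch of hash-based sharding (hypothetical names; commands here are simple (key, value) pairs, and each shard’s list stands in for its own independent consensus-backed log):

```python
import hashlib

NUM_SHARDS = 4
shard_logs = [[] for _ in range(NUM_SHARDS)]  # one independent log per shard

def shard_for(key: str) -> int:
    # Stable hash, so every client agrees on each key's placement.
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def set_key(key: str, value):
    # Commands for different shards go through independent logs,
    # so they can be ordered and applied in parallel.
    shard_logs[shard_for(key)].append((key, value))

set_key("alice", 1)
set_key("bob", 2)
# Atomically touching both "alice" and "bob" may span two shards --
# which is exactly where distributed transactions come in.
```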
There are also “leaderless” protocols, often relying on the observation that totally ordering the log is too strong – for example, if each log entry sets an individual key, entries setting different keys may be reordered, but entries setting the same key must be kept in order. These protocols tend to be more complex than leader-based protocols. I may cover some of these approaches in the future.
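As a tiny illustration of that observation (commands here are hypothetical (key, value) pairs):

```python
# Two commands conflict only if they touch the same key, so replicas
# only need to agree on the relative order of conflicting commands.
def conflict(a, b):
    return a[0] == b[0]

def may_reorder(a, b):
    return not conflict(a, b)

assert may_reorder(("x", 1), ("y", 2))      # different keys: any order is fine
assert not may_reorder(("x", 1), ("x", 2))  # same key: order must be agreed on
```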
If these properties are violated, we get an inconsistent system. Note that, as stated here, “get always fails” is a valid implementation. We’d like to rule it out in some situations, by saying things like “the system never gets into a state where it’s impossible for set/get to ever succeed again, assuming enough computers are up” or “at least one set call will definitely succeed eventually”, but it’s a bit tricky to say exactly what guarantee we want. Since this is a pretty informal treatment, I’ll leave it as stated here. These are called “liveness” properties, and they’re also important, but we’re treating inconsistency as much worse than downtime.
Also, liveness considerations are in practice more of an engineering decision – we tend to care a lot about the specific paths that lead to a successful outcome, and that they’re fast, not just that it’s reachable – whereas we’d like safety to be absolute. ↩︎
Distributed systems are harder to make than non-distributed systems, so there are three main reasons you’d want one:
- Your system is fundamentally distributed (e.g. a multiplayer game, or chat system)
- You want to increase reliability
- You want to increase performance (beyond what one computer can do)
Consensus is, for the most part, entirely about reliability; most of what’s discussed here is about using multiple computers to simulate a single computer, usually with worse performance, but we gain the ability to tolerate individual computer failures. ↩︎
The Part-Time Parliament – Leslie Lamport ↩︎
Paxos Made Simple – Leslie Lamport ↩︎
Flexible Paxos: Quorum intersection revisited – Heidi Howard; Dahlia Malkhi; Alexander Spiegelman ↩︎
Read-Write Quorum Systems Made Practical – Michael Whittaker; Aleksey Charapko; Joseph M. Hellerstein; Heidi Howard; Ion Stoica ↩︎
Scaling Replicated State Machines with Compartmentalization – Michael Whittaker; Ailidani Ailijiang; Aleksey Charapko; Murat Demirbas; Neil Giridharan; Joseph M. Hellerstein; Heidi Howard; Ion Stoica; Adriana Szekeres ↩︎
Linearizable Quorum Reads in Paxos – Aleksey Charapko; Ailidani Ailijiang; Murat Demirbas ↩︎
Brief Announcement: On the Significance of Consecutive Ballots in Paxos – Eli Goldweber; Nuda Zhang; Manos Kapritsos ↩︎
Fast Paxos – Leslie Lamport ↩︎
Matchmaker Paxos: A Reconfigurable Consensus Protocol – Michael Whittaker; Neil Giridharan; Adriana Szekeres; Joseph M. Hellerstein; Heidi Howard; Faisal Nawab; Ion Stoica ↩︎
Consensus on Transaction Commit – Jim Gray; Leslie Lamport ↩︎
Mencius: Building Efficient Replicated State Machines for WANs – Yanhua Mao; Flavio P. Junqueira; Keith Marzullo ↩︎
CASPaxos: Replicated State Machines without logs – Denis Rystsov ↩︎
How to Build a Highly Available System Using Consensus – Butler W. Lampson ↩︎