Constraining writers in distributed systems

This is a note describing a pattern for avoiding failures in distributed systems, variants of which have been written about in multiple ways under multiple names (“copysets”, “read/write quorum systems”).


Say we have a distributed storage system with N nodes that can store data, but some subset of them might fail (though the failure probability for any individual node is low). We want to use the nodes to store a large number of files. For each file, we can choose how to write it across the nodes.

In order to tolerate node failures, we use redundancy – writing extra data to be able to recover files even when some nodes fail.

By choosing a strategy for how to write data, we implicitly define the failure behavior of the system. Let’s explore some situations and strategies that we can use.

For simplicity, our failure model is that nodes fail independently and uniformly. (This is not a very good model for real systems, and avoiding correlated failures is very important! Some of the ideas here can also help with that.)

The most basic way to add redundancy to the system is simple replication: When we want to write a file, we write it to a randomly chosen set of k nodes. Then losing at least k nodes is necessary to lose data, and (assuming a large number of files) losing any k nodes is sufficient.

Special cases: k=1 means we just write a single copy, and losing any node is enough to lose data; k=N means we write each file to all nodes, and we have to lose all of them to lose data.

Say we want to design the system to handle up to 2 simultaneous node failures. Then we can use k=3 simple replication.

The problem with this strategy is that, as N grows, our probability of data loss goes up (since more nodes can fail). We want to have a lot of nodes to increase capacity and throughput, but we don’t want to lose a lot of reliability for it.
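To make this concrete, here’s a rough calculation (the 1% per-node failure probability is just an illustrative assumption). With enough files that essentially every k-node subset holds some file, data is lost whenever at least k nodes fail at once, and that probability grows quickly with N:

```python
from math import comb

def prob_at_least(n, k, p):
    """P(at least k of n nodes fail), with independent failure probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# With random 3-way replication and many files, any 3 simultaneous
# failures lose some data, so this is roughly the data-loss probability.
for n in (10, 100, 1000):
    print(n, prob_at_least(n, 3, 0.01))
# roughly 1e-4, 0.08, and 0.997
```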

This is kind of weird! It means that, with the above strategy, one large data center has a much higher failure probability than several small data centers. The reason is that any particular file never spans multiple data centers, so losing data requires k failures within a single data center – which is much less likely than k failures anywhere in one big pool of nodes. But of course we don’t need physical data centers to use this trick – we just need to restrict the way we write files.

A simple way to address this is copyset replication [1]: Instead of writing to any k nodes, decide on some permitted subsets (called “copysets”) upfront, and randomly choose one of those subsets. Then, if k nodes fail, they must form exactly one of these sets in order to actually lose any data.

The fewer sets we have, the lower the chance of loss, so the failure probability is lowest when all the sets are disjoint. In that case, if k nodes fail, the probability that we lose data is (N/k)/(N choose k).

For example: Say we have N=60 nodes and we write k=3 copies. Then, if 3 nodes fail, there are (60 choose 3) = 34,220 possible ways they might fail, but only 20 of them will lead to data loss, so we’ve reduced the failure probability to 20/34,220 = ~0.06% of what it was. Big improvement!
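Here’s the same arithmetic as a small sketch, with the node numbering and copyset layout chosen arbitrarily for illustration:

```python
from math import comb
import random

N, k = 60, 3
nodes = list(range(N))
copysets = [tuple(nodes[i:i + k]) for i in range(0, N, k)]  # 20 disjoint sets

def place_file():
    # Writers are constrained to one of the permitted sets.
    return random.choice(copysets)

# Data is lost only if the k failed nodes are exactly one of the copysets.
print(len(copysets), comb(N, k), len(copysets) / comb(N, k))
# -> 20 34220 0.000584...
```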

(An important note about these strategies: They reduce the failure probability, but not the expected amount of data loss. That is, instead of a large probability of a relatively small amount of data loss (only the relatively few files that happened to span the lost nodes), we have a small probability of a catastrophic loss (in the above case, every single file on the lost nodes).

For a given number of copies, no choice of strategy can affect the expected number of files lost – this is easy to see by linearity of expectation, since the behavior of any single file written to k nodes is independent of the strategy used across multiple files.)
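To spell out that step: if there are F files, and each node fails independently with probability p (per the failure model above), then

$$
\mathbb{E}[\text{files lost}] = \sum_{\text{files}} \Pr[\text{all } k \text{ nodes holding that file fail}] = F \cdot p^k,
$$

which depends only on k and p, not on which particular k nodes each file was written to.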

These sets don’t have to be disjoint: For example, if k doesn’t divide N, then we can’t make exactly N/k disjoint sets. There are also practical reasons to want more than the minimum number of sets. One is that, when a particular node fails – so we only have k-1 copies of its data – we want to quickly copy all the failed data to new nodes, so that these files are fully stored on a copyset again (if we wait too long, we might have more failures, and lose every copy of these files). If the lost node belongs to only a single set, that means there are only k-1 nodes that we can read all the data from, and we may be limited by their read throughput. If a node belongs to multiple sets, then we can read lost files from all the nodes in those sets.

The copysets paper [1] calls the number of other nodes that any particular node shares its data with “scatter width”. Increasing scatter width is at odds with reducing failure probability, and the paper proposes a simple algorithm for generating copysets that lets you choose a point along this tradeoff.
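Here’s a rough sketch of a permutation-based way to generate such copysets. To be clear, this is not the paper’s exact algorithm (which also handles practical constraints); it’s just meant to show the shape of the tradeoff – each extra permutation buys more scatter width at the cost of more copysets:

```python
import random
from math import ceil

def generate_copysets(nodes, k, scatter_width, seed=0):
    # Each permutation, chopped into groups of k, gives every node
    # roughly k - 1 new neighbors, so use about S / (k - 1) permutations.
    rng = random.Random(seed)
    copysets = set()
    for _ in range(ceil(scatter_width / (k - 1))):
        perm = nodes[:]
        rng.shuffle(perm)
        for i in range(0, len(perm) - len(perm) % k, k):
            copysets.add(frozenset(perm[i:i + k]))
    return copysets

sets = generate_copysets(list(range(60)), k=3, scatter_width=4)
print(len(sets))  # about 40 sets: faster recovery, but more ways to lose data
```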


I’m going to discuss a few more variants of this idea – that by restricting how we write data, we can get reliability benefits.

The first is almost the same idea, but often framed differently, in the context of quorums.

There’s an implicit assumption in reasoning like the above: That we already know which files we have, and what nodes they’re written to – we just don’t know their contents. Therefore, when we lose some nodes, we know what files were on them, and we can figure out whether we lost any data, and how to recover.

This makes sense if, for instance, you write a file in two steps: First write some copies of the file itself to some nodes. When that’s done, write a small record to a database, saying where to find the copies.
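A minimal in-memory sketch of this two-step write (all the names here are hypothetical, not from any particular system):

```python
import random

# "Nodes" are just dicts; the metadata DB maps file id -> the nodes holding it.
nodes = {n: {} for n in range(9)}
copysets = [(0, 1, 2), (3, 4, 5), (6, 7, 8)]
metadata_db = {}

def write_file(file_id, data):
    copyset = random.choice(copysets)
    for n in copyset:                # step 1: write every copy
        nodes[n][file_id] = data
    metadata_db[file_id] = copyset   # step 2: record where the copies live

write_file("report.txt", b"...")
print(metadata_db["report.txt"])
```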

But what if you don’t know where the files are? This is important, for instance, if you want to implement the database itself.

The standard approach for doing this involves quorums. Rather than having a database that tells you where each file is, you have to look at the nodes themselves to figure out which files have been written.

This can be used for consensus systems like Paxos (as described below), but let’s use a simpler example: We want to store a bunch of files in a content-addressed database keyed by a cryptographic hash of their contents.

We want to write some files and be able to guarantee that we can read them later. When we read, some nodes may be slow or unavailable. How do we make sure that we find files that have been written?

We can use copysets as described above. To write a file, we just write it to every node of some copyset, and we don’t tell the user that the write is complete until all of those writes have finished (but note that we may be interrupted in the middle – we’ll handle this below).

We need to make the following promise: If you’re ever told that a file was written successfully (either by an acknowledgment of a write, or by reading it), then all future reads will also return that file successfully.

In order to implement that, as a reader, we need to make sure that if a file has been fully written to any copyset, we see it. We do that by reading at least one node from each copyset. If none of these nodes have the file we’re looking for, we can be certain that it’s never been written successfully.

Important note: If one of these nodes does have the file we’re looking for, we can’t just return it immediately. It’s possible that it was written to one node but not to an entire copyset; a different reader in the future might not look at this node, which would violate our safety requirement that once you see a file it never disappears. Therefore you need to make sure that the file is fully written to some copyset (not necessarily the one you saw it in, since a node in that copyset might be unavailable, which would make it impossible to tell – in that case you’re free to write it to another copyset).
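Putting the last few paragraphs together, here’s a minimal in-memory sketch of the write and read paths (hypothetical names, disjoint copysets, no failure handling or concurrency):

```python
import hashlib, random

# Each "node" is just a dict from content hash -> bytes.
nodes = {n: {} for n in range(9)}
copysets = [(0, 1, 2), (3, 4, 5), (6, 7, 8)]

def write(data):
    key = hashlib.sha256(data).hexdigest()
    for n in random.choice(copysets):  # only acknowledged once every node
        nodes[n][key] = data           # of one copyset has the data
    return key

def read(key):
    # Contact one node from each copyset; if none has the key, it was
    # never fully written to any copyset, so "not found" is safe.
    probed = [random.choice(cs) for cs in copysets]
    holders = [n for n in probed if key in nodes[n]]
    if not holders:
        return None
    data = nodes[holders[0]][key]
    # Before returning, make sure the data is fully written to *some*
    # copyset (here we simply rewrite it to one), so later readers see it too.
    for n in random.choice(copysets):
        nodes[n][key] = data
    return data

key = write(b"hello")
assert read(key) == b"hello"
assert read("0" * 64) is None
```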

Flexible Paxos [2] describes this in slightly different terms, using more symmetric “read/write quorums”. Copysets as described above correspond to “write quorums”. The “disjoint copysets” approach makes a quorum system similar to the “grid quorums” in that paper, but note that what we have here is more flexible – instead of reading a specific row of the grid, we just have to read one node from each column.
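As a quick sanity check of that correspondence, with the disjoint copysets from the sketch above as write quorums: a read quorum is any choice of one node per copyset, and every such choice intersects every write quorum.

```python
from itertools import product

copysets = [(0, 1, 2), (3, 4, 5), (6, 7, 8)]          # write quorums

# Read quorums: one node chosen from each copyset ("one row per column").
read_quorums = [set(choice) for choice in product(*copysets)]
assert all(rq & set(wq) for rq in read_quorums for wq in copysets)
print(len(read_quorums), "read quorums, each intersecting every write quorum")
```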

(Side note: You can also describe Fast Flexible Paxos [3] in these terms, which I think is much more natural than the read/write quorum system description. I may write more about Fast Paxos in the future, but if you’re already familiar with it, the short version is:

Since fast rounds can complete in a single message, if there are any two competing fast proposers, they can’t both succeed; in order to ensure that, their quorums have to overlap. So every pair of fast quorums must overlap in at least one node. As a reader, you have to check these overlapped nodes to figure out who won. So, for each pair of fast quorums, you have to read at least one node from their intersection.)


Another variant of this idea involves erasure coding. Most practical distributed storage systems don’t want to write k copies of their data, because this leads to low space efficiency: To handle a 2-node failure you need to write 3 copies of each file, which means 33% space efficiency.

Simple linear block erasure codes give you the following (space-optimal) behavior: You split your file into D data chunks, and generate P additional “parity” chunks from them. You write one chunk to each of D+P different nodes. Then you can use any D of the D+P chunks to reconstruct the entire file.

We’ll call this a D+P code (note that 3-way replication is a 1+2 code).

Erasure codes can be a huge improvement in space efficiency: For example, instead of 33% space efficiency, a 4+2 code gives you 66% space efficiency – effectively doubling the amount of disk space you have while still allowing any 2 nodes to fail without losing data. There are diminishing returns as you use more chunks, of course – 8+2 gives you 80% space efficiency, 18+2 gives you 90%, and so on.
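The space-efficiency arithmetic, for reference (a D+P code stores D chunks of data in D+P chunks of space):

```python
# Space efficiency of a D+P code is D / (D + P).
for d, p in [(1, 2), (4, 2), (8, 2), (18, 2)]:
    print(f"{d}+{p}: {d / (d + p):.1%}")
# 1+2: 33.3%   4+2: 66.7%   8+2: 80.0%   18+2: 90.0%
```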

In the context of reducing failure probability, we can see some disadvantages compared to replication. Instead of just discussing “copysets”, we now need a slightly more refined notion, which I’ll call “failure sets”:

A failure set is a set of nodes such that, if every node in it is lost, data can be lost.

Note that each D+P set of chunks produces (D+P choose P+1) failure sets. (Other kinds of erasure codes, such as non-MDS “locally repairable codes”, can produce a more complicated pattern of failure sets.)
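For example, counting the minimal failure sets per write for a few MDS codes (with 3-way replication written as 1+2):

```python
from math import comb

# An MDS D+P code loses data once any P+1 of its D+P chunks are gone,
# so each write contributes (D+P choose P+1) failure sets.
for d, p in [(1, 2), (4, 2), (8, 2)]:
    print(f"{d}+{p}: {comb(d + p, p + 1)} failure sets per write")
# 1+2: 1   4+2: 20   8+2: 120
```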

This is a much larger number of failure sets than we got with replication, which is kind of unavoidable. A saving grace is that, if two erasure-coded sets of nodes overlap, we only pay for their shared failure sets once. This gives us an incentive to have a lot of overlap.

There are several schemes that you can use here. Hydra [4] describes “CodingSets”, which is a fairly natural extension of copysets, designed to have a lot of overlap: Instead of using sets of size D+P, use larger sets of size D+P+L for some L, where each node is in exactly one of these sets. Then, for each write, choose a random D+P subset within the larger set to actually write to.
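A rough sketch of that placement rule, with arbitrary parameters – this is just the shape of the idea, not Hydra’s actual implementation:

```python
import random

D, P, L, N = 4, 2, 2, 64
group_size = D + P + L                  # each node is in exactly one group of 8
groups = [list(range(i, i + group_size)) for i in range(0, N, group_size)]

def place_write():
    group = random.choice(groups)
    return random.sample(group, D + P)  # the D+P chunks go to these nodes

print(sorted(place_write()))
```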


The approach here is, in the end, a sort of protocol design: Constraining the behavior of one system participant in order to give more flexibility to another participant.

I’m curious about the following:


  1. Copysets: Reducing the Frequency of Data Loss in Cloud Storage – Asaf Cidon; Stephen M. Rumble; Ryan Stutsman; Sachin Katti; John Ousterhout; Mendel Rosenblum

  2. Flexible Paxos: Quorum intersection revisited – Heidi Howard; Dahlia Malkhi; Alexander Spiegelman

  3. Fast Flexible Paxos: Relaxing Quorum Intersection for Fast Paxos – Heidi Howard; Aleksey Charapko; Richard Mortier

  4. Hydra: Resilient and Highly Available Remote Memory – Youngmoon Lee; Hasan Al Maruf; Mosharaf Chowdhury; Asaf Cidon; Kang G. Shin

  5. Mutual exclusion in partitioned distributed systems – Daniel Barbara; Hector Garcia-Molina