Nadav Har'El edited this page Jul 12, 2015 · 46 revisions

Introduction

Cassandra keeps several replicas (the "replication factor") of each key range, on several different machines. These replicas are in danger of drifting slightly out of sync with each other, e.g., when one replica is briefly unavailable while writes are applied to the other replicas, or when a transient network problem causes a write not to reach one replica even though that replica was up and accepting requests.

Cassandra employs three mechanisms to solve this problem:

  1. Hinted Handoff: Another node remembers, on behalf of an unavailable node, the writes it missed, and replays them when that node becomes available again.
  2. Read Repair: For a percentage of the queries, Cassandra sends the query to several replicas, and if they send different responses, we determine the most up-to-date data (using the timestamps of data and tombstones) and repair the others. The actual implementation is more elaborate - only one replica sends a full response and the others (initially) just send digests - see http://wiki.apache.org/cassandra/ReadRepair for more information.
  3. Repair (a.k.a. Anti-Entropy Repair, or Node Repair) - the replicas compare their full data (or parts of it), and the most up-to-date version of each piece of data is reconciled and saved at each replica. While the previous approaches can fix most problems, this third mechanism can solve problems the first two can't. This repair mechanism is the topic of this document.
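The timestamp-based reconciliation that read repair performs can be sketched as follows. This is a simplified model, not Cassandra's actual code: the Cell type and field names are illustrative, and the digest optimization mentioned above is omitted.

```python
from collections import namedtuple

# Illustrative model of a column value carrying a write timestamp and a
# tombstone flag (as left by a delete). Not Cassandra's actual classes.
Cell = namedtuple("Cell", ["value", "timestamp", "is_tombstone"])

def reconcile(replies):
    """Pick the most up-to-date cell among replica responses: the one
    with the highest write timestamp wins, whether it is live data or
    a tombstone."""
    return max(replies, key=lambda c: c.timestamp)

def read_repair(replies):
    """Return the winning cell, plus the indexes of replicas that held
    stale data and must be repaired (i.e., sent the newer cell)."""
    winner = reconcile(replies)
    stale = [i for i, c in enumerate(replies) if c != winner]
    return winner, stale
```

For example, if two replicas return an old value and one returns a newer tombstone, the tombstone wins and the two stale replicas get repaired with the deletion.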

When to repair

In Cassandra, repairs are not done automatically - they need to be run manually (or scripted) using nodetool. However, the Cassandra best-practice documentation recommends running a repair at regular intervals.

One of the reasons why routine repair is important is an undesirable side-effect of infrequent repair: "resurrection" of deleted data. Cassandra keeps tombstones for deleted data, but drops those tombstones after a grace period, gc_grace (e.g., one week). If repairs are run less often than gc_grace, it is possible that some replicas deleted the data and already expunged the tombstone, while one replica missed the deletion and still holds the live data. In that case, the repair process (not seeing the tombstones which have already been expunged) will "resurrect" the already-deleted data on all replicas. For this reason it is important to run repairs more often than gc_grace.
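The resurrection scenario above can be modeled with a toy simulation (three replicas of one key; the data structures are illustrative, not Cassandra's):

```python
# A toy model of tombstone "resurrection": one replica misses a delete,
# and repair runs only after gc_grace has purged the tombstone on the
# replicas that saw it.

TOMBSTONE = object()  # marker for "deleted, tombstone still present"

def purge_expired_tombstones(replica):
    """After gc_grace elapses, tombstones are dropped entirely."""
    return {k: v for k, v in replica.items() if v is not TOMBSTONE}

def repair(replicas):
    """Merge replicas: any key still live on one replica is copied to
    all. With the tombstone gone, nothing records that the data was
    deleted, so the old value comes back everywhere."""
    merged = {}
    for r in replicas:
        merged.update(r)
    return [dict(merged) for _ in replicas]

r1 = {"key": TOMBSTONE}    # saw the delete
r2 = {"key": TOMBSTONE}    # saw the delete
r3 = {"key": "old-value"}  # missed the delete

# gc_grace passes before any repair runs: tombstones get purged.
r1, r2 = purge_expired_tombstones(r1), purge_expired_tombstones(r2)

# Now repair "resurrects" the deleted value on every replica.
r1, r2, r3 = repair([r1, r2, r3])
assert r1["key"] == "old-value"
```

Had the repair run before gc_grace expired, the tombstone (being newer than the value) would have won the reconciliation instead, and the delete would have propagated to the third replica.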

Another case where routine node repair is recommended is for write-mostly data: Read repair can often fix inconsistencies, but as its name suggests, it only happens for data which is read frequently.

Routine repair is not urgent, and can be scheduled for low-usage hours.

Node repair is also recommended on a node that comes up after having been down for a longer period than Hinted Handoff normally keeps hints for.

How repair works

Cassandra's repair mechanism is based on the one described in the Amazon Dynamo paper.

When repairing a column family (table), Cassandra builds a Merkle tree for the data in each replica.

Building the trees is done only at repair time, and is a fairly lengthy and I/O-intensive process similar to compaction - the so-called "validation compaction" is essentially the read half of a "major compaction" (in which all the sstables are read and merged). Each repair involves two major-compaction-like passes: the first to build the tree (the "validation compaction") and the second to read the disagreeing ranges. [TODO: this doesn't sound right - if there is little disagreement, the second pass shouldn't need to read the entire sstables]

In Cassandra, different nodes do not replicate exactly the same set of data; rather, each node holds a set of key ranges, and each key range may be replicated on a different set of nodes. Repair therefore happens for each key range separately: the initiator finds the set of peers holding replicas of the range, tells each of them to prepare a Merkle tree, and finally compares the trees; when it finds a difference, it tells the conflicting nodes to exchange the data in the differing sub-range.
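The build-and-compare step can be sketched as follows. This is a simplified illustration, not Cassandra's MerkleTree class: the token-to-leaf mapping and the tiny leaf count are toy choices, and the comparison shortcut (check the root, then scan all leaves) stands in for the real top-down descent.

```python
import hashlib

LEAVES = 8  # Cassandra uses 2**15 leaves; kept tiny here for illustration

def leaf_hashes(partitions, leaves=LEAVES):
    """Bucket (token, data) pairs into `leaves` token ranges and hash
    each bucket's contents."""
    buckets = [hashlib.sha256() for _ in range(leaves)]
    for token, data in sorted(partitions.items()):
        buckets[token % leaves].update(data.encode())  # toy token->leaf mapping
    return [b.hexdigest() for b in buckets]

def build_tree(hashes):
    """Build interior levels bottom-up by hashing pairs of children;
    returns the levels from root (a single hash) down to the leaves."""
    levels = [hashes]
    while len(levels[0]) > 1:
        lower = levels[0]
        upper = [hashlib.sha256((lower[i] + lower[i + 1]).encode()).hexdigest()
                 for i in range(0, len(lower), 2)]
        levels.insert(0, upper)
    return levels

def differing_leaves(tree_a, tree_b):
    """Compare two trees of identical shape; return the indexes of the
    leaf token ranges whose hashes disagree and must be exchanged."""
    if tree_a[0] == tree_b[0]:  # roots match: the replicas agree
        return []
    leaves_a, leaves_b = tree_a[-1], tree_b[-1]
    return [i for i in range(len(leaves_a)) if leaves_a[i] != leaves_b[i]]
```

The point of the tree structure is that two in-sync replicas are confirmed identical by comparing a single root hash, while a disagreement is narrowed down to specific leaf ranges rather than the whole key range.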

Cassandra uses a Merkle tree of fixed size: a fixed depth of 15, i.e., 2^15 = 32768 leaf nodes. The tree therefore does not have per-partition resolution - this is a tradeoff between the tree's size and its precision. On a node containing a million partitions, each leaf covers about 30 partitions, so even a single damaged partition causes all ~30 partitions in its leaf to be streamed. The problem gets worse as the number of partitions per node grows, and can result in significant needless streaming, disk space usage, and compaction.
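The over-streaming arithmetic from the paragraph above can be checked directly:

```python
# With a fixed tree depth, each leaf covers partitions // leaves
# partitions, and one damaged partition streams its whole leaf range.
leaves = 2 ** 15        # fixed depth of 15 -> 32768 leaf nodes
partitions = 1_000_000  # partitions stored on the node
per_leaf = partitions // leaves
print(per_leaf)         # about 30 partitions streamed per damaged partition
```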

Repair options

nodetool repair has several options that determine what gets repaired, and how:

  1. -pr (--partitioner-range): repair only the primary partition range of this node. Useful for running routine repair on each node on a separate schedule - so we end up repairing each partition range exactly once.
  2. sequential repair (the default): When a range has three replicas, the repair is not done on all three at once; instead, a "snapshot" (hard link) is taken on each of the three, and the repairs are then performed pair by pair. The benefit is that at any time one of the three replicas is not busy with repair - so it maintains full performance, and the dynamic snitch can divert most requests to it.
  3. parallel (-par)
  4. incremental repair
  5. explicit range (-st <start-token> -et <end-token>).
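The property of sequential repair that one replica stays unburdened at any time can be seen by listing the pair-wise sessions. This is a simplified illustration of the scheduling idea described above, not Cassandra's actual session code:

```python
from itertools import combinations

def sequential_sessions(replicas):
    """Sequential repair works through the replica pairs one session at
    a time; in each session, the replicas not in the pair are idle and
    can keep serving reads at full speed."""
    return list(combinations(replicas, 2))

replicas = ["A", "B", "C"]
for pair in sequential_sessions(replicas):
    idle = [r for r in replicas if r not in pair]
    print(pair, "repairing; idle:", idle)
```

With three replicas this yields three sessions, and in every session exactly one replica is left idle - which is why the dynamic snitch can keep routing most requests to an unaffected node.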

Incremental Repair

See http://www.datastax.com/dev/blog/more-efficient-repairs and http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_repair_nodes_c.html

The relevant code

See also http://wiki.apache.org/cassandra/ArchitectureAntiEntropy

Further reading

  1. https://wiki.apache.org/cassandra/AntiEntropy
  2. http://wiki.apache.org/cassandra/ReadRepair
  3. https://wiki.apache.org/cassandra/HintedHandoff
  4. http://www.slideshare.net/ClmentLARDEUR/deep-into-cassandra-data-repair-mechanisms
  5. http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_repair_nodes_c.html