Engel A. Sanchez edited this page Jun 23, 2014 · 10 revisions

Also see the edoc for riak_ensemble.

Overview

Strongly consistent operations in Riak are implemented using the riak_ensemble library (see its Git Repo and Wiki for more details). This page describes the integration points with Riak KV. The main ones are the riak_kv_ensembles module, which manages ensembles in the cluster, and the riak_kv_ensemble_backend module, which connects the ensemble peers to KV vnodes and the AAE system.

NOTE: Integration points in riak_core ensure that riak_ensemble knows about nodes that join or leave the cluster.

Managing Ensembles

See riak_kv_ensembles

An ensemble is created for each primary preference list. Unlike eventually consistent Riak, fallback vnodes never participate in consistent operations. For example, if some buckets have a replication factor (n_val) of 2 and others 3, two ensembles will be created for each partition of the ring: for partition zero, {0, 2} and {0, 3}.
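The mapping above could be sketched as follows (the function name and argument are illustrative, not the real riak_kv_ensembles code, which walks the ring's primary preference lists instead of taking the n_vals directly):

```erlang
%% Illustrative sketch only: one ensemble per {Partition, NVal} pair.
ensembles_for_partition(Partition, NVals) ->
    [{Partition, N} || N <- lists:usort(NVals)].

%% ensembles_for_partition(0, [3, 2]) yields [{0, 2}, {0, 3}].
```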

The riak_kv_ensembles process polls the ring for changes at regular intervals (currently every 10 seconds). This happens only on the claimant node.
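A hedged sketch of that polling loop (the check_ensembles/2 helper is hypothetical; the ring access and claimant check use real riak_core calls):

```erlang
-define(TICK_MS, 10000).  %% poll interval: 10 seconds

schedule_tick() ->
    erlang:send_after(?TICK_MS, self(), tick).

handle_info(tick, State) ->
    {ok, Ring} = riak_core_ring_manager:get_my_ring(),
    %% Only the claimant node creates missing ensembles.
    case riak_core_ring:claimant(Ring) =:= node() of
        true  -> check_ensembles(Ring, State);  %% hypothetical helper
        false -> ok
    end,
    schedule_tick(),
    {noreply, State}.
```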

Consistent operations

Consistent operations are handled by riak_client, just like regular operations. If the bucket has the {consistent, true} property, the operation is routed through riak_ensemble_client. Legacy buckets (that is, those in the default bucket type) cannot be consistent (see validate_update_consistent_props/2 and validate_update_bucket_type/2 in riak_kv_bucket).

Consistent objects use riak_object too, but the vector clock information is fake: it actually stores the epoch and sequence number used by riak_ensemble to version operations, encoded as a 64-bit integer that serves as the count for a fake eseq actor id. Earlier versions used overwrite semantics for the ensemble put operation if the input object had no vector clock, and update semantics if it did. Joe decided that it was too common for clients to send an object without a vector clock, potentially causing unintended, unsafe overwrites. Now most writes are treated as updates and will fail if the object has been modified since the client read it. Passing the if_none_match option will cause the write to fail if a value already exists for that key. See riak_client:consistent_put_type/2.
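One way to picture the fake vector clock (riak_object vclocks are lists of {ActorId, {Count, Timestamp}} entries; the 32/32 bit split and the literal actor id below are assumptions for illustration, not the real encoding):

```erlang
%% Illustrative only: pack the ensemble epoch and sequence into a single
%% 64-bit counter and attach it to one synthetic "eseq" actor.
fake_vclock(Epoch, Seq) ->
    Counter = (Epoch bsl 32) bor Seq,
    [{eseq, {Counter, 0}}].
```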

Delete operations are similar. Riak client deletes map to an unchecked (similar to overwrite) ensemble delete, whereas client delete_vclock operations map to safe deletes that are similar to modify. Specifically, the safe variant is used in the sequence Obj = get(), delete(Obj), where Riak fails the delete if the object has changed since the get.
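In client terms, the safe delete path looks roughly like this sketch (assuming a riak_client instance C; Bucket and Key are placeholders):

```erlang
%% Checked delete: fails if the object changed since the get.
{ok, Obj} = C:get(Bucket, Key),
ok = C:delete_vclock(Bucket, Key, riak_object:vclock(Obj)).
```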

Ensemble backend implementation

The riak_kv_ensemble_backend module takes care of a few things:

  • Translate gets/puts to ensemble_get/ensemble_put messages to the riak_kv_vnode (see get/3, put/4)
  • Periodically update ensemble membership if ring ownership changes require it. (see riak_kv_ensemble_backend:tick/5)
  • Handle sync requests, replying when all possible AAE exchanges between the index/N partitions corresponding to the ensemble peers have recently finished (see sync/2, wait_for_sync/6)
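The vnode-facing half of the first bullet amounts to sending a plain message carrying the From address; a simplified sketch (the #state{} field and message shape are assumptions; see riak_kv_ensemble_backend:get/3 for the real code):

```erlang
%% Sketch: translate an ensemble get into an ensemble_get message to the
%% vnode (via its proxy), carrying From so the vnode can reply directly
%% to the client instead of routing back through the peer.
get(Key, From, #state{proxy = Proxy} = State) ->
    catch Proxy ! {ensemble_get, Key, From},
    State.
```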

Support in riak_kv_vnode

Messages to vnodes are all casts containing a From address, so the reply is sent directly to the client. Check out the clauses in handle_info for the ensemble_get, ensemble_put, and ensemble_repair commands.
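A sketch of the vnode side (the clause body and the do_backend_get/2 helper are hypothetical; riak_ensemble_backend:reply/2 is the real reply mechanism):

```erlang
%% Sketch: handle an ensemble_get message and reply straight to the
%% waiting client via the From address carried in the message.
handle_info({ensemble_get, Key, From}, State) ->
    Reply = do_backend_get(Key, State),  %% hypothetical helper
    riak_ensemble_backend:reply(From, Reply),
    {ok, State}.
```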

Notice that there is no special handling of folds, so those do not have any strong consistency guarantees. Writes that failed to reach a quorum may appear on a fold.

Anti-entropy on consistent data

The purpose of peer syncing is to detect and repair corrupted/missing ensemble backend data.

For riak_kv_ensemble_backend, syncing is implemented using AAE. Any time a vnode is restarted it is considered untrusted until it performs an AAE exchange with a majority of sibling vnodes.

To support AAE syncing, riak_kv_vnode was extended with a new ensemble_repair command, and riak_kv_exchange_fsm was extended to use this command to repair consistent keys. This change is needed because consistent data does not yet support read repair. Moreover, even once consistent data supports read repair, it will only do so when a given ensemble is stable. However, syncing is required to operate even when an ensemble is not stable (e.g. peers may be required to sync before they can proceed to elect a leader).

Notice that a failed SC repair operation kills the exchange, preventing the sync from completing. Also, repairs are currently not forwarded while the vnode is forwarding; they simply fail.

Known limitations/issues

  • Consistent data does not currently support secondary indexes.
  • Consistent operations do not yet fire any stats-related events, nor keep any stats of their own.
  • Consistent operations currently bypass the overload protection system and need their own overload protection to be added.
  • Deleting consistent data leaves around tombstones that are not currently harvested.