Martin Sucha edited this page Feb 17, 2020 · 4 revisions

Token

What is a token

A token is the hash value of a partition key. The hash value is produced by the partitioner, a hash algorithm, and determines which node or nodes the partition is stored on. Scylla uses MurmurHash3 as its partitioner algorithm. Scylla's partitioner hashes each partition key into a signed 64-bit integer, so tokens are in the range -2**63 .. 2**63 - 1. Multiple partition keys may share the same token, because a theoretically unlimited number of partition keys has to be mapped to a limited number of token values.
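As a rough illustration of the mapping from keys to signed tokens, here is a minimal Python sketch. The function name toy_token and its bits parameter are made up for this example, and BLAKE2b from the standard library is used only as a stand-in hash: it is not the MurmurHash3-based partitioner, so the values will not match real tokens.

```python
import hashlib


def toy_token(partition_key: bytes, bits: int = 64) -> int:
    """Map a key to a signed integer token that is `bits` wide.

    NOTE: BLAKE2b is a stand-in for illustration only. It is NOT the
    MurmurHash3-based partitioner, so results differ from real tokens.
    """
    digest = hashlib.blake2b(partition_key, digest_size=8).digest()
    unsigned = int.from_bytes(digest, "big") % (1 << bits)
    # Reinterpret as two's complement: [-2**(bits-1), 2**(bits-1) - 1].
    if unsigned >= 1 << (bits - 1):
        return unsigned - (1 << bits)
    return unsigned


# A 64-bit token always falls in -2**63 .. 2**63 - 1.
print(toy_token(b"user:42"))

# On a toy 4-bit ring there are only 16 token values, so 17 distinct keys
# cannot all receive distinct tokens.
small = [toy_token(str(i).encode(), bits=4) for i in range(17)]
print(len(set(small)), "distinct tokens for", len(small), "keys")
```

The 4-bit case makes the sharing of tokens visible; on the real 64-bit ring the same thing happens, just far less often.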

How are tokens distributed among nodes

Tokens are used to split the ring into multiple ranges. Each Cassandra node is responsible for one range if vnodes are disabled, or for multiple ranges if vnodes are enabled. We will explain vnodes shortly.

For example, assume we have a ring of size 16 whose tokens are [-8, -7, ..., 0, 1, ..., 7]. There are two nodes in the cluster. Node A has token -7 and Node B has token 1. In this case, Node A is responsible for the range (1, -7], which wraps around the end of the ring and covers tokens [2, 3, 4, 5, 6, 7, -8, -7], and Node B is responsible for the range (-7, 1], which covers tokens [-6, -5, -4, -3, -2, -1, 0, 1].

The token values stored in the system.peers and system.local tables are the end values of these ranges.

So the token range belonging to a node X is defined as starting from the token of the preceding node in the ring (exclusive) until the token of node X (inclusive).
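The ownership rule can be sketched in a few lines of Python. This is only an illustration of the (previous token, own token] rule on the 16-token example ring; the helper name owner_of and the dictionary layout are assumptions of this sketch, not how a driver or the database represents the ring.

```python
import bisect
from typing import Mapping


def owner_of(token: int, ring: Mapping[int, str]) -> str:
    """Return the node owning `token`; `ring` maps node token -> node name."""
    tokens = sorted(ring)
    # Find the first node token >= token. Range ends are inclusive, so a
    # token equal to a node's token belongs to that node (bisect_left).
    i = bisect.bisect_left(tokens, token)
    # Past the largest node token: wrap around to the smallest one.
    return ring[tokens[i % len(tokens)]]


ring = {-7: "A", 1: "B"}  # Node A has token -7, Node B has token 1
for t in range(-8, 8):
    print(t, owner_of(t, ring))
```

Token 1 resolves to Node B and token 2 to Node A, because the start of each range is exclusive and the end is inclusive.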

How are tokens assigned to a node

Before vnodes were introduced, the user had to assign a token to each node manually, through the initial_token config option in cassandra.yaml.

With vnodes, the user is not required to assign tokens manually. Instead, the user specifies the number of tokens a node will be assigned through the num_tokens config option in cassandra.yaml.

But what does num_tokens actually mean? How does Cassandra choose tokens automatically so that each node owns token ranges of similar total length and the workload is spread evenly across nodes? What if a new node is more powerful than the others and we want it to take a larger share of the workload?

It may seem like magic, but the idea is actually pretty simple.

  1. First, the number of tokens assigned to a node is configured manually.

If you want a node to take more of the workload, assign more tokens to it. E.g., by default 256 tokens are assigned, so you can assign 512 to a more powerful node.

  2. Second, each token is selected randomly.

This means no central place decides which node gets which tokens. But how can tokens be chosen randomly and still give each node token ranges of similar total length? Let's say num_tokens equals 2 and the ring size is 2^64, that is, tokens in [-2^63, 2^63).

Node A: t1, t2

Node B: t3, t4

So, these 4 tokens (points) split the ring (circle) into 4 ranges. Each node owns 2 ranges in total. Since t1-t4 are chosen randomly, there is a good chance that Node A and Node B end up responsible for very different amounts of the ring.

However, if we increase num_tokens, the probability of the ring being split evenly increases, because each node's share becomes the sum of many random ranges and the differences average out. Don't make num_tokens too small.
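The effect of num_tokens can be checked with a small simulation. The Python sketch below is an illustration under simple assumptions (tokens drawn uniformly at random, no two tokens equal); the names ownership and mean_imbalance are invented for this example and do not correspond to anything in Cassandra or Scylla.

```python
import random

RING_SIZE = 2**64
MIN_TOKEN = -2**63


def ownership(tokens_per_node, rng):
    """Return each node's share of the ring for randomly chosen tokens.

    tokens_per_node maps node name -> num_tokens. Assumes no two chosen
    tokens are equal, which is overwhelmingly likely on a 2^64 ring.
    """
    points = []
    for node, count in tokens_per_node.items():
        for _ in range(count):
            token = rng.randrange(MIN_TOKEN, MIN_TOKEN + RING_SIZE)
            points.append((token, node))
    points.sort()
    widths = {node: 0 for node in tokens_per_node}
    # Each token owns (previous token, this token]; index -1 wraps around.
    for i, (token, node) in enumerate(points):
        prev = points[i - 1][0]
        widths[node] += (token - prev - 1) % RING_SIZE + 1
    return {node: width / RING_SIZE for node, width in widths.items()}


def mean_imbalance(num_tokens, trials, rng):
    """Average |share of A - 0.5| for two nodes with num_tokens each."""
    total = 0.0
    for _ in range(trials):
        shares = ownership({"A": num_tokens, "B": num_tokens}, rng)
        total += abs(shares["A"] - 0.5)
    return total / trials


rng = random.Random(0)
for n in (1, 16, 256):
    print("num_tokens =", n, "mean imbalance =", mean_imbalance(n, 200, rng))

# A node with twice as many tokens should own roughly twice as much.
print(ownership({"big": 512, "small": 256}, rng))
```

The mean imbalance should shrink as num_tokens grows, and the node given 512 tokens should own about two thirds of the ring.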

We have one more issue to address. Each node chooses its tokens randomly and independently, so what if two nodes happen to choose the same token? This is avoided because a node first learns the other nodes' tokens when it joins the ring. When it generates its own tokens randomly, it skips tokens already used by other nodes.
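The skip-if-taken idea can be sketched as follows. This is a simplified Python illustration, not the actual token allocation code; pick_tokens and its lo/hi parameters are hypothetical, and the set of taken tokens stands in for what a joining node has learned about the ring.

```python
import random


def pick_tokens(num_tokens, taken, rng, lo=-2**63, hi=2**63):
    """Pick num_tokens random tokens in [lo, hi), skipping used ones.

    `taken` is the set of tokens already owned by other nodes.
    Assumes at least num_tokens free tokens exist in [lo, hi).
    """
    chosen = set()
    while len(chosen) < num_tokens:
        candidate = rng.randrange(lo, hi)
        # Skip tokens used by other nodes or already picked by this node.
        if candidate in taken or candidate in chosen:
            continue
        chosen.add(candidate)
    return sorted(chosen)


rng = random.Random(0)
# Toy 16-token ring where tokens -8..1 are already taken by other nodes.
taken = set(range(-8, 2))
print(pick_tokens(4, taken, rng, lo=-8, hi=8))
```

On the toy ring collisions are frequent, so the skipping is easy to see; on the full 2^64 ring a collision is extremely unlikely, but the check still guarantees uniqueness.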

How token information is shared in a cluster

Nodes in a cluster learn the mapping between node tokens and node IP addresses through Gossip. There is an application state for this, application_state::TOKENS. The local node injects its own token information into Gossip and learns the other nodes' token information from it.
