Florian Forster edited this page Nov 21, 2023 · 1 revision
Name: Hashed match
Type: match
Status: supported
FirstVersion: 4.9
Copyright: 2009 Florian octo Forster
License: GPLv2
Manpage: collectd.conf(5)
See also: List of Matches

The Hashed match calculates a hash value of the host part of the identifier and uses it to assign values to disjoint groups.

Synopsis

 <Chain "PreCache">
   <Rule>
     <Match "hashed">
       # There are six groups, groups 0–5.
       # Write groups #0 and #3 on this host.
       Match 0 6
       Match 3 6
     </Match>
     # Return and continue
     Target "return"
   </Rule>
   # Default target: stop processing immediately
   Target "stop"
 </Chain>

Description

The Hashed match is meant for doing load-balancing, i.e. writing each value on one or two hosts only, rather than writing every value on all hosts.

In the synopsis above, there are the following two lines:

 Match 0 6
 Match 3 6

This calculates a hash value of the hostname modulo six, resulting in a number between zero and five, inclusive. The size of this range is controlled by the second argument.

The first argument is the group number to match: the plugin treats the hostname as a positive match if the calculated value equals this argument. So if the value calculated from a hostname is zero, the first line tells the plugin to signal a match.

If multiple Match options are configured, they are combined by a logical “or”. So in the example above the plugin will match if the computed value is either zero or three.

In the synopsis, the rule tells collectd to continue processing the value, using the return target, when the hashed match signals a match. Otherwise, the value is discarded using the stop target. This means that values from about ⅓ of all hosts are processed and values from the remaining ⅔ are ignored.
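The matching logic described above can be sketched in a few lines of Python. This is only an illustration, not the plugin's code: the function name `is_match` and the representation of the Match options as (group, total) pairs are assumptions made for this example.

```python
def is_match(hash_value, match_options):
    """Return True if any configured Match option accepts the hash value.

    match_options is a list of (group, total) pairs, one per
    "Match <group> <total>" line; the pairs are combined with a logical "or".
    """
    return any(hash_value % total == group for group, total in match_options)


# The two options from the synopsis: "Match 0 6" and "Match 3 6".
synopsis_options = [(0, 6), (3, 6)]

# The residues 0..5 stand in for hash values; only groups 0 and 3 are accepted.
accepted = [r for r in range(6) if is_match(r, synopsis_options)]
print(accepted)  # [0, 3], i.e. 2 of 6 groups, about one third
```

Two of the six possible groups are accepted, which is where the ⅓ figure comes from.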

Uniform distribution

The hash function tries to distribute hosts into the given groups in a uniform manner, i.e. it tries to put a similar number of hosts in each group. To do this, it calculates a hash value like this:

 hash value = 0
 for each byte in hostname
   hash value = (hash value * 2184401929) + byte  (mod 2^32)

2,184,401,929 is just a large prime number. The appropriate group number is calculated using modulo, for example:

 3075075026 mod 6 = 2

The resulting unsigned 32-bit integer hash value will be sufficiently pseudo-random to distribute the hosts evenly over the configured groups.

(That prime number was chosen randomly and proved to work well for a realistic set of hostnames. Any other (reasonably large) prime number should work, too.)

At the same time, the value is calculated deterministically, so all nodes compute the same number independently of one another. A given host will therefore end up in the same group on all nodes.
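The pseudocode above can be sketched in Python as follows. This is an illustrative re-implementation under stated assumptions, not the plugin's source: the names `hash_hostname` and `group_of` are made up for this example, the hostname is assumed to be hashed as its UTF-8 bytes, and `node1.example.com` is a placeholder hostname.

```python
MULTIPLIER = 2184401929  # the prime from the pseudocode above
MODULUS = 2**32          # keeps the result in the unsigned 32-bit range


def hash_hostname(hostname):
    """Multiply-and-add hash over the bytes of the hostname."""
    hash_value = 0
    for byte in hostname.encode("utf-8"):  # assumption: UTF-8 encoded bytes
        hash_value = (hash_value * MULTIPLIER + byte) % MODULUS
    return hash_value


def group_of(hostname, total_groups):
    """Map a hostname to a group number in 0 .. total_groups - 1."""
    return hash_hostname(hostname) % total_groups


# Placeholder hostname, used only for illustration.
print(group_of("node1.example.com", 6))
```

Because the function depends only on the hostname and the number of groups, running it on different machines yields the same group number.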

Organizing groups

For simple load-balancing it is enough to handle each group on exactly one host. Let's say you want to split the data up into three groups to store on three hosts. To do that, you can use the following three configuration snippets:

 # Host 0
 Match 0 3
 # Host 1
 Match 1 3
 # Host 2
 Match 2 3

Now, the plugin on each of these nodes will match about ⅓ of all incoming data.
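To get a feel for how such a three-way split behaves, the following sketch applies the hash from the "Uniform distribution" section to a batch of hostnames and counts how many land in each group. The hostnames are made up for this example and the helper name `group_of` is hypothetical; real hostnames may distribute differently.

```python
from collections import Counter

MULTIPLIER = 2184401929  # prime from the "Uniform distribution" section
MODULUS = 2**32          # unsigned 32-bit arithmetic


def group_of(hostname, total_groups):
    """Hash the hostname bytes and reduce the result to a group number."""
    hash_value = 0
    for byte in hostname.encode("utf-8"):  # assumption: UTF-8 encoded bytes
        hash_value = (hash_value * MULTIPLIER + byte) % MODULUS
    return hash_value % total_groups


# Made-up hostnames, used only to illustrate the split.
hostnames = [f"web-{i:03d}.example.com" for i in range(300)]
counts = Counter(group_of(name, 3) for name in hostnames)

for group in range(3):
    print(f"group {group}: {counts[group]} hosts")
```

Each count corresponds to the number of hosts whose values the node configured with that group would accept.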

Of course, very soon you will want to guard against data loss: if one server goes down, one third of your data will be ignored by the other two servers and will thus be missing. One way around that is to store each value on two hosts, thus introducing a bit of redundancy. To configure this, add a second group to each server:

 # Host 0
 Match 0 3
 Match 1 3
 # Host 1
 Match 1 3
 Match 2 3
 # Host 2
 Match 2 3
 Match 0 3

Now each server stores approximately ⅔ of all data. Even when one server goes down, no data is lost, because every value is also stored by one of the other servers.
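The redundancy claim can be checked with a small sketch that models only the group assignments from the snippets above. The labels `host0` to `host2` and the helper `covered_groups` are assumptions made for this example.

```python
TOTAL_GROUPS = 3

# Groups matched on each server, taken from the snippets above.
server_groups = {
    "host0": {0, 1},
    "host1": {1, 2},
    "host2": {2, 0},
}


def covered_groups(servers_up):
    """Return the set of groups stored by at least one running server."""
    covered = set()
    for server in servers_up:
        covered |= server_groups[server]
    return covered


all_groups = set(range(TOTAL_GROUPS))

# Take each server down in turn and check that every group is still stored.
for failed in server_groups:
    survivors = [s for s in server_groups if s != failed]
    print(failed, "down, all groups covered:", covered_groups(survivors) == all_groups)
```

With this layout any single failure leaves all three groups covered, but two simultaneous failures do not: a lone surviving server stores only two of the three groups.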

Dependencies

  • none