Collision ratio comparison

Collision study

The tool provided in tests/collisions directly probes the collision efficiency of 64-bit hashes: it generates a massive number of hashes, then counts the collisions produced. The test requires a large amount of RAM and can take a long time to run.
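As a rough illustration of the "generate, then count" principle, here is a scaled-down sketch. It is not the code of the tool in tests/collisions: `truncated_hash` is a hypothetical stand-in (BLAKE2b from Python's `hashlib`, truncated to 24 bits), and the hash count is shrunk so that the run fits in a few megabytes instead of requiring a large amount of RAM.

```python
import hashlib

def truncated_hash(data: bytes, bits: int) -> int:
    """Hypothetical stand-in hash: BLAKE2b output truncated to `bits` bits."""
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "little") & ((1 << bits) - 1)

def count_collisions(hashes) -> int:
    """Count values that repeat an earlier one: sort, then compare neighbours."""
    ordered = sorted(hashes)
    return sum(1 for a, b in zip(ordered, ordered[1:]) if a == b)

n = 1 << 16     # number of generated hashes (scaled down for illustration)
bits = 24       # truncated hash width, chosen so that collisions actually occur
hashes = [truncated_hash(i.to_bytes(8, "little"), bits) for i in range(n)]

observed = count_collisions(hashes)
expected = n * n / 2 ** (bits + 1)   # birthday-paradox approximation
print(f"observed={observed} expected={expected:.1f}")
```

With these scaled-down parameters the expectation is 128 collisions, and a well-distributed hash should land in that vicinity.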

The generated input patterns are relatively close to each other (low Hamming distance), differing by only a few bits. This use case is expected to be challenging for low-quality hashes.
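One way to picture such inputs is a fixed base buffer with a handful of bits flipped. The generator below is only a sketch under that assumption; `nearby_inputs` and `hamming_distance` are hypothetical helpers, and the actual pattern generator used by the tool may work differently.

```python
from itertools import combinations, islice

def nearby_inputs(base: bytes, max_flips: int):
    """Yield copies of `base` with 1..max_flips bits flipped."""
    nbits = len(base) * 8
    for k in range(1, max_flips + 1):
        for positions in combinations(range(nbits), k):
            buf = bytearray(base)
            for p in positions:
                buf[p // 8] ^= 1 << (p % 8)   # flip bit p
            yield bytes(buf)

def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length buffers."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

base = bytes(256)                                   # 256-byte all-zero input
sample = list(islice(nearby_inputs(base, 2), 3000))
print(len(set(sample)), max(hamming_distance(base, s) for s in sample))
```

Every generated input is distinct yet differs from the base by at most two bits, which is the kind of near-duplicate data that exposes weak mixing.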

The test result indicates the expected number of collisions for the number of generated hash values. The tested hash algorithms do not have to produce exactly that number of collisions; being in the vicinity is good enough. For example, with 100 Gi 64-bit hashes, the expected average number of collisions is 312.5, but anything between about 280 and 350 collisions is statistically indistinguishable from expectation, so all such results are "good" and can't be ranked on this information alone.
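The 312.5 figure follows from the birthday-paradox approximation n² / 2^(b+1), for n hashes of b bits. The sketch below reproduces it and derives a plausible acceptance band, assuming the collision count is roughly Poisson-distributed (so its standard deviation is the square root of the mean). The two-sigma band is an assumption made here for illustration; the page itself only states the approximate 280 to 350 range.

```python
import math

def expected_collisions(nb_hashes: int, hash_bits: int = 64) -> float:
    """Birthday-paradox approximation: n^2 / 2^(bits+1)."""
    return nb_hashes ** 2 / 2 ** (hash_bits + 1)

GI = 1 << 30                       # "Gi" = 2^30

mean = expected_collisions(100 * GI)
sigma = math.sqrt(mean)            # Poisson assumption: variance equals mean
low, high = mean - 2 * sigma, mean + 2 * sigma

print(f"expected={mean} band=[{low:.0f}, {high:.0f}]")
print(f"25 Gi hashes: {expected_collisions(25 * GI):.1f}")
```

The same function gives the other "Expected" values used in the tables, such as 19.5 for 25 Gi hashes and 28.1 for 30 Gi hashes.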

Testing 64-bit hashes on large inputs :

An input of 256 bytes is presumed to always trigger the "long" mode for hash algorithms featuring multiple modes depending on input length.

Algorithm	Input Len	Nb Hashes	Expected	Nb Collisions	Notes
XXH3	256	100 Gi	312.5	314
XXH64	256	100 Gi	312.5	294
XXH128 low 64-bit	256	100 Gi	312.5	314
XXH128 high 64-bit	512	100 Gi	312.5	291
blake2b low 64-bit	256	100 Gi	312.5	296	very long test : 48h !
md5 low 64-bit	256	100 Gi	312.5	301	very long test : 58h !
sha256 low 64-bit	256	25 Gi	19.5	14	very long test
city64	256	100 Gi	312.5	303
fnv64	256	100 Gi	312.5	303	long test : 23h30, compared with 3h20 for `XXH3`
mmh64a	256	100 Gi	312.5	310
highwayhash64	256	100 Gi	312.5	321	very long test : 35h !
t1ha2	256	100 Gi	312.5	761	a bit too many, loses ~1bit of distribution
wyhash	256	100 Gi	312.5	1202	a bit too many, loses ~2bits of distribution
mumv2	256	100 Gi	312.5	1229	a bit too many, loses ~2bits of distribution
fletcher4	256	1 Gi	0.0	404908304	is that a hash ?

The last part of the table contains hash algorithms with less-than-ideal collision probabilities. t1ha2's distribution is closer to 63-bit, while mumv2 and wyhash are closer to 62-bit. This level of inefficiency is not "horrible", and these algorithms can still be considered "reasonable" for many practical uses. Algorithms listed below them in the table, on the other hand, should not be considered for any scenario.
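The "loses ~N bits" notes can be approximated from the collision counts: since the expected number of collisions halves with each additional bit of hash width, a hash producing twice the expected collisions behaves like one with one bit less. The helper `bits_lost` below is a hypothetical name for that estimate, applied to the figures from the table above.

```python
import math

def bits_lost(observed: float, expected: float) -> float:
    """Estimated loss of distribution, in bits: log2(observed / expected)."""
    return math.log2(observed / expected)

EXPECTED = 312.5   # 100 Gi hashes, 64-bit output

for name, observed in [("t1ha2", 761), ("wyhash", 1202), ("mumv2", 1229)]:
    lost = bits_lost(observed, EXPECTED)
    print(f"{name}: loses {lost:.2f} bits, about {64 - lost:.1f}-bit effective")
```

This is a point estimate from a single run, so it should be read with the same statistical tolerance as the collision counts themselves.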

Testing 64-bit non-portable hashes :

AES hashes are based on Intel's AES-NI instruction extension, introduced shortly after SSE4.2. These algorithms can't run on platforms that lack these instructions.

Algorithm	Type	Input Len	Nb Hashes	Expected	Nb Collisions	Notes
aquahash low64	AES	256	100 Gi	312.5	3488136	unusable
falkhash low64	AES	256	100 Gi	312.5	18430	not good
meowhash v0.4 low64	AES	256	100 Gi	312.5	18200	not good
ahash	AES	256	100 Gi	312.5	758	okay (loses ~1.5 bit dist.)
meowhash v0.5 low64	AES	256	100 Gi	312.5	300	fine

Since AES implies cryptographic properties, one could believe that it will be good enough for simpler properties such as dispersion and randomness. This assumption appears incorrect: AES does not offer any such guarantee "by default". The interesting Meowhash example shows that, with proper care, an AES-based hash can provide the expected number of collisions as predicted by the birthday paradox. But such an outcome should not be considered "trivial" to achieve, and one would be well advised to consult a cryptography expert in order to produce a good-quality hash using AES instructions.

Testing 64-bit hashes on small inputs :

The test with 8-byte inputs and 8-byte outputs can check whether some "space reduction" happens at this length. If there is no space reduction, the transformation is a perfect bijection, which guarantees that two different inputs never collide.
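The bijection property can be checked exhaustively on a smaller word size. The sketch below uses a 16-bit space and two hypothetical mixers that are not taken from any of the listed hashes: `mix16` is built only from invertible steps (xor-shift and multiplication by an odd constant), while `square16` is deliberately lossy.

```python
MASK16 = 0xFFFF

def mix16(x: int) -> int:
    """Hypothetical 16-bit mixer made only of invertible steps."""
    x ^= x >> 7
    x = (x * 0x2993) & MASK16   # odd multiplier: invertible modulo 2^16
    x ^= x >> 5
    return x

def square16(x: int) -> int:
    """Hypothetical lossy mixer: x and (2^16 - x) share the same square."""
    return (x * x) & MASK16

def distinct_outputs(fn) -> int:
    """Number of distinct outputs over the whole 16-bit input space."""
    return len({fn(x) for x in range(1 << 16)})

print(distinct_outputs(mix16), distinct_outputs(square16))
```

A function that reaches all 65536 outputs is a bijection and cannot collide on this input size, whereas the lossy one shows "space reduction". For 8-byte inputs the space is far too large to enumerate, which is why the test above samples it instead.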

Algorithm	Input Len	Nb Hashes	Expected	Nb Collisions	Notes
XXH3	8	100 Gi	312.5	0	`XXH3` is bijective for `len==8`. It guarantees no collision.
XXH64	8	100 Gi	312.5	0	`XXH64` is also bijective for `len==8`
mmh64a	8	100 Gi	312.5	0	bijective
city64	8	100 Gi	312.5	278	not bijective, but an acceptable collision ratio
mumv2	8	100 Gi	312.5	544	a bit too large, loses ~1bit of distribution
t1ha2	8	100 Gi	312.5	607	a bit too large, loses ~1bit of distribution
wyhash	8	100 Gi	312.5	652	a bit too large, loses ~1bit of distribution
ahash	8	30 Gi	28.1	3841	pretty bad, loses ~7bit of distribution

Other lengths :

Algorithm	Input Len	Nb Hashes	Expected	Nb Collisions
XXH3	16	100 Gi	312.5	332
XXH3	32	100 Gi	312.5	287
XXH3	1024	100 Gi	312.5	300

Testing 128-bit hashes :

The only acceptable score for these tests is 0. If the hash algorithm offers 128 bits of dispersion, the probability of a single collision showing up is smaller than that of winning the national lottery twice in a row. In contrast with the 64-bit tests, due to resource limitations, the test does not provide a precise 128-bit collision estimation. It merely identifies algorithms with less than 80 bits of effective distribution.

Algorithm	Input Len	Nb Hashes	Nb Collisions	Notes
XXH128	256	100 Gi	0
cityhash128	256	100 Gi	0
t1ha2 128-bit	256	100 Gi	14	distribution equivalent to ~70 bits
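The "equivalent to ~N bits" style of note can be approximated by inverting the birthday-paradox formula: given n hashes and an observed collision count c, the effective width is about log2(n² / 2c). The sketch below applies this to the figures above; `effective_bits` is a hypothetical helper, and the result is only a rough single-run estimate (here somewhat under 70 bits, in the same ballpark as the note in the table).

```python
import math

GI = 1 << 30   # "Gi" = 2^30

def effective_bits(nb_hashes: int, observed: float) -> float:
    """Invert expected = n^2 / 2^(bits+1) to estimate the effective width."""
    return math.log2(nb_hashes * nb_hashes / (2 * observed))

n = 100 * GI
print(f"14 collisions    -> about {effective_bits(n, 14):.1f} bits")
print(f"312.5 collisions -> about {effective_bits(n, 312.5):.1f} bits")

# For a true 128-bit distribution, the expected collision count is negligible.
print(f"expected at 128 bits: {n * n / 2 ** 129:.2e}")
```

The last line illustrates why 0 is the only acceptable score: at 128 bits the expected count for 100 Gi hashes is many orders of magnitude below one.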

Tests on small input lengths

Algorithm	Input Len	Nb Hashes	Notes
XXH128	16	25 Gi	range 9-16
XXH128	32	25 Gi	range 17-128
XXH128	100	13 Gi	range 17-128
XXH128	200	13 Gi	range 129-240
XXH128	1024	100 Gi
