Any known xxhash collision so far? #165

hiqbn · 2019-01-20T09:39:52Z

Any known xxhash collision so far? Need some data for reference.

Cyan4973 · 2019-01-20T16:54:42Z

xxhash is a non-cryptographic hash algorithm.
It must be assumed that collisions can be produced on demand.
32 and 64 bits are way too small anyway to provide any protection, and can be brute-forced.

hiqbn · 2019-01-21T02:53:29Z

I would like to know what values have this collision. does anyone have the values? as mentioned, need data for reference.

ramytarchichy · 2019-02-12T21:09:25Z

It's quite trivial to find some:

//add this to your compiler flags to run it faster: -O2 -march=native

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include "xxhash.h"

int main()
{
    const unsigned long long t = 1727396350;
    for(unsigned long s = 0; s < ULONG_MAX; ++s)
    {
        unsigned long long h = XXH32(&s, sizeof s, 0); //Replace XXH32 with XXH64 for XXH64 collisions
        if (h == t) printf("COLLISION!!!: %lu\n", s);
    }
}

I found a whole bunch of unsigned longs that all returned the same XXH32 value of 1727396350, some of which are listed down below:

90343688
4387316452
11971290027
16268262791
19557269070
25675257898
27143248113
33261236941
34729227156
40847215984
44136222263
48433195027
54551183855
56019174070
62137162898
65426169177
69723141941

The list goes on for a while... And I found all these in just a couple of minutes.

As for XXH64, it will take significantly more time to find a collision, which is to be expected since it has a 64 bit hash size, double that of XXH32. Not twice the amount of time mind you, but a lot more; since every bit doubles the amount of time required, it will (probably) take somewhere around 2^(64-32)=2^32=4294967296 times longer. That might sound like a lot, but considering I was able to find my first XXH32 collision in mere milliseconds, it might only take me a few years to find a collision on my laptop... without assembly optimization, or GPU acceleration, or FPGAs. An determined attacker could easily afford all of the above, and a bunch more computers to make things much, MUCH quicker. Heck, I could probably do that, if I didn't have better things to do. The most important part though is cryptanalysis: when an attack on this function is found (which should be dead-simple for any cryptographer out there), you'll probably be able to generate a collision in under a second on your 5 year-old smartphone, just like what happened to MD5.

Now for the usual security warning: XXHASH is NOT a cryptographic function. Do NOT use it anywhere where security matters. It is made for creating hashtables or hashsets, allowing for O(1) lookup in memory, just like djb2. They are made for one thing only: speed, and a lot of it. When looking up an item in a hash table, the program needs to hash the key as fast as possible so it can jump to the corresponding address. A collision there simply means that the program has to iterate over a few items until it finds what it's looking for, which is not a big deal.

Meanwhile, if even a hypothetical academic-grade reduction on a cryptographic hash function is found, it is a massive deal. Its only purpose is to withstand any attempts at generating a collision. Which is why, despite SHA1 requiring absolutely massive amounts of computing power to generate a collision, and only one collision found as of writing this (afaik), it is still widely considered insecure.

Modern cryptographic hash algorithms must also return hashes at least 128 bits in length, or 256 if you want to protect yourself from quantum computers, heck maybe even 384 if this whitepaper is to be believed (although it is disputed). Note that this does NOT mean that any hash function that returns a 128 bit hash is secure.

ytrezq · 2019-05-02T23:48:10Z

@OverclockedSanic however, I suppose the amount of data to process for xxhash64 makes things quite different. I mean in less than a year.

Hashing fonction is serial, quantum computing is no help except if the content of the data generated for matching the hash does not matter.

ramytarchichy · 2019-05-05T23:55:25Z

@ytrezq Not sure what you meant in the first part, but quantum computing, despite not breaking symmetric primitives like hash functions, definitely does affect them. An attack with grover's algorithm reduces the complexity for generating a collision from 2^n to 2^(n/2), meaning SHA3-256 in a post-quantum setting actually offers 128 bit of security as opposed to 256, or even ~85.33 if the whitepaper above is correct (in which case, you'd need to use at least SHA3-384 to be secure).

Again, doesn't break hash functions, since all you need to do is increase the size, unlike RSA and ECDSA which will become useless no matter the key size.

Sauce: https://en.wikipedia.org/wiki/SHA-3#Security_against_quantum_attacks
Grover's algorithm: https://en.wikipedia.org/wiki/Grover%27s_algorithm

sesse · 2020-07-03T09:22:28Z

The following program demonstrates an XXH64 collision:

#include <stdio.h>
#include <stdint.h>
#include <xxhash.h>

int main(void)
{
        const uint8_t str1[] = { 0xd0, 0x08, 0xad, 0x10, 0x02, 0x00, 0x00, 0x00, 0xd0, 0x08, 0xad, 0x10, 0x02, 0x00, 0x00, 0x00 };
        const uint8_t str2[] = { 0xa6, 0x03, 0x26, 0xc5, 0x01, 0x00, 0x00, 0x00, 0xa6, 0x03, 0x26, 0xc5, 0x01, 0x00, 0x00, 0x00 };
        printf("%lu\n", XXH64(str1, sizeof(str1), 156211));
        printf("%lu\n", XXH64(str2, sizeof(str2), 156211));
}

This was done simply by hashing lots of integers starting from zero, storing them all in a huge array, sorting said array and looking for duplicates. I couldn't get it to easily trigger without hashing the number twice, though; it seems there's some resistance to 64-bit collisions with only 64-bit inputs.

easyaspi314 · 2020-07-03T15:09:17Z

Or you could just use the known XXH64 multicollision:

#define XXH_INLINE_ALL
#include <xxhash.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <inttypes.h>

static void print_buf(const uint8_t *buffer, size_t len)
{
    printf("0x");
    for (size_t i = 0; i < len; i++) {
        printf("%02x", buffer[i]);
    }
}

// https://github.com/Cyan4973/xxHash/issues/54#issuecomment-414061026
int main(void)
{
    uint8_t real[64], collide[64], seedBuf[8];

    // Fill with random integers
    FILE *urandom = fopen("/dev/urandom", "rb");
    if (!urandom)
        return 1;
    if (fread(real,    sizeof(real),    1, urandom) != 1)
        return 1;
    if (fread(seedBuf, sizeof(seedBuf), 1, urandom) != 1)
        return 1;
    fclose(urandom);

    uint64_t seed = XXH_readLE64(seedBuf);

    // Copy the collision code
    memcpy(collide, real, sizeof(collide));

    uint64_t x = XXH64(real,    sizeof(real),    seed);
    printf("XXH64(");
    print_buf(real, 64);
    printf(", 64, %#016"PRIx64") = %#016"PRIx64"\n", seed, x);
    for (size_t i = 0; i < 256; i++) {
        // ((uint64_t *)collide)[i % 4] += 0x0BA79078168D4BAF;
        uint64_t a = XXH_readLE64(&collide[8 * i % 4]);
        a += 0x0BA79078168D4BAF;
        XXH_writeLE64(&collide[8 * i % 4], a);

        // ((uint64_t *)collide)[4 + i % 4] += 0x9C90005B80000000;
        uint64_t b = XXH_readLE64(&collide[32 + 8 * i % 4]);
        b += 0x9C90005B80000000;
        XXH_writeLE64(&collide[32 + 8 * i % 4], b);

        uint64_t y = XXH64(collide, sizeof(collide), seed);
        printf("XXH64(");
        print_buf(collide, 64);
        printf(", 64, %#016"PRIx64") = %#016"PRIx64"\n", seed, y);
    }
}

However, XXH64 is bijective for 64-bit inputs, so it is impossible for there to be a collision at that length.

yoeden · 2022-07-13T13:07:32Z

@ramytarchichy made for creating hashtables or hashsets, allowing for O(1) lookup in memory,
but what if I hash one of my datums and a collision occurs causing duplication of keys ?
it is necessary that each hash should be unique even in this scenario.

ramytarchichy · 2022-07-13T13:20:59Z

@Br4inFreze This happens all the time, and the hash table should be built in a way to handle such scenarios. Usually it's done by putting all values whose keys have the same hash in a list of key-value pairs. That way, the caller hashes the key, finds the list in O(1), then iterates through the list and compares keys to find the correct value.

In a best-case scenario, all lists have only 1 element in them and lookup is constant-time. In a worst-case scenario, all elements have the same hash, get put in the same list, and the time complexity for lookups goes up to O(n). But, as long as everything is designed properly, that is rarely the case, and we can assume O(1) amortized.

TL;DR: you should never assume that the hash function will not create collisions.

Cyan4973 added the question label Jan 20, 2019

Cyan4973 closed this as completed Mar 14, 2019

lmammino mentioned this issue Jul 12, 2019

Added support for xxHash32 fastify/fastify-etag#7

Closed

egmontkob-work mentioned this issue Aug 28, 2020

Xxhash128 firebuild/firebuild#248

Merged

Erotemic mentioned this issue Jan 23, 2021

Support hashings other than MD5 iterative/dvc#3069

Open

burdges mentioned this issue Nov 27, 2023

Remove unnecessary host functions allocator usage polkadot-fellows/RFCs#4

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Any known xxhash collision so far? #165

Any known xxhash collision so far? #165

hiqbn commented Jan 20, 2019

Cyan4973 commented Jan 20, 2019 •

edited

hiqbn commented Jan 21, 2019

ramytarchichy commented Feb 12, 2019 •

edited

ytrezq commented May 2, 2019 •

edited

ramytarchichy commented May 5, 2019 •

edited

sesse commented Jul 3, 2020

easyaspi314 commented Jul 3, 2020 •

edited

yoeden commented Jul 13, 2022 •

edited

ramytarchichy commented Jul 13, 2022 •

edited

Any known xxhash collision so far? #165

Any known xxhash collision so far? #165

Comments

hiqbn commented Jan 20, 2019

Cyan4973 commented Jan 20, 2019 • edited

hiqbn commented Jan 21, 2019

ramytarchichy commented Feb 12, 2019 • edited

ytrezq commented May 2, 2019 • edited

ramytarchichy commented May 5, 2019 • edited

sesse commented Jul 3, 2020

easyaspi314 commented Jul 3, 2020 • edited

yoeden commented Jul 13, 2022 • edited

ramytarchichy commented Jul 13, 2022 • edited

Cyan4973 commented Jan 20, 2019 •

edited

ramytarchichy commented Feb 12, 2019 •

edited

ytrezq commented May 2, 2019 •

edited

ramytarchichy commented May 5, 2019 •

edited

easyaspi314 commented Jul 3, 2020 •

edited

yoeden commented Jul 13, 2022 •

edited

ramytarchichy commented Jul 13, 2022 •

edited