Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Any known xxhash collision so far? #165

Closed
hiqbn opened this issue Jan 20, 2019 · 9 comments
Closed

Any known xxhash collision so far? #165

hiqbn opened this issue Jan 20, 2019 · 9 comments
Labels

Comments

@hiqbn
Copy link

hiqbn commented Jan 20, 2019

Any known xxhash collision so far? Need some data for reference.

@Cyan4973
Copy link
Owner

Cyan4973 commented Jan 20, 2019

xxhash is a non-cryptographic hash algorithm.
It must be assumed that collisions can be produced on demand.
32 and 64 bits are way too small anyway to provide any protection, and can be brute-forced.

@hiqbn
Copy link
Author

hiqbn commented Jan 21, 2019

I would like to know what values have this collision. does anyone have the values? as mentioned, need data for reference.

@ramytarchichy
Copy link

ramytarchichy commented Feb 12, 2019

It's quite trivial to find some:

//add this to your compiler flags to run it faster: -O2 -march=native

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include "xxhash.h"

int main()
{
    const unsigned long long t = 1727396350;
    for(unsigned long s = 0; s < ULONG_MAX; ++s)
    {
        unsigned long long h = XXH32(&s, sizeof s, 0); //Replace XXH32 with XXH64 for XXH64 collisions
        if (h == t) printf("COLLISION!!!: %lu\n", s);
    }
}

I found a whole bunch of unsigned longs that all returned the same XXH32 value of 1727396350, some of which are listed down below:

90343688
4387316452
11971290027
16268262791
19557269070
25675257898
27143248113
33261236941
34729227156
40847215984
44136222263
48433195027
54551183855
56019174070
62137162898
65426169177
69723141941

The list goes on for a while... And I found all these in just a couple of minutes.

As for XXH64, it will take significantly more time to find a collision, which is to be expected since it has a 64 bit hash size, double that of XXH32. Not twice the amount of time mind you, but a lot more; since every bit doubles the amount of time required, it will (probably) take somewhere around 2^(64-32)=2^32=4294967296 times longer. That might sound like a lot, but considering I was able to find my first XXH32 collision in mere milliseconds, it might only take me a few years to find a collision on my laptop... without assembly optimization, or GPU acceleration, or FPGAs. An determined attacker could easily afford all of the above, and a bunch more computers to make things much, MUCH quicker. Heck, I could probably do that, if I didn't have better things to do. The most important part though is cryptanalysis: when an attack on this function is found (which should be dead-simple for any cryptographer out there), you'll probably be able to generate a collision in under a second on your 5 year-old smartphone, just like what happened to MD5.

Now for the usual security warning: XXHASH is NOT a cryptographic function. Do NOT use it anywhere where security matters. It is made for creating hashtables or hashsets, allowing for O(1) lookup in memory, just like djb2. They are made for one thing only: speed, and a lot of it. When looking up an item in a hash table, the program needs to hash the key as fast as possible so it can jump to the corresponding address. A collision there simply means that the program has to iterate over a few items until it finds what it's looking for, which is not a big deal.

Meanwhile, if even a hypothetical academic-grade reduction on a cryptographic hash function is found, it is a massive deal. Its only purpose is to withstand any attempts at generating a collision. Which is why, despite SHA1 requiring absolutely massive amounts of computing power to generate a collision, and only one collision found as of writing this (afaik), it is still widely considered insecure.

Modern cryptographic hash algorithms must also return hashes at least 128 bits in length, or 256 if you want to protect yourself from quantum computers, heck maybe even 384 if this whitepaper is to be believed (although it is disputed). Note that this does NOT mean that any hash function that returns a 128 bit hash is secure.

@ytrezq
Copy link

ytrezq commented May 2, 2019

@OverclockedSanic however, I suppose the amount of data to process for xxhash64 makes things quite different. I mean in less than a year.

Hashing fonction is serial, quantum computing is no help except if the content of the data generated for matching the hash does not matter.

@ramytarchichy
Copy link

ramytarchichy commented May 5, 2019

@ytrezq Not sure what you meant in the first part, but quantum computing, despite not breaking symmetric primitives like hash functions, definitely does affect them. An attack with grover's algorithm reduces the complexity for generating a collision from 2^n to 2^(n/2), meaning SHA3-256 in a post-quantum setting actually offers 128 bit of security as opposed to 256, or even ~85.33 if the whitepaper above is correct (in which case, you'd need to use at least SHA3-384 to be secure).

Again, doesn't break hash functions, since all you need to do is increase the size, unlike RSA and ECDSA which will become useless no matter the key size.

Sauce: https://en.wikipedia.org/wiki/SHA-3#Security_against_quantum_attacks
Grover's algorithm: https://en.wikipedia.org/wiki/Grover%27s_algorithm

@sesse
Copy link

sesse commented Jul 3, 2020

The following program demonstrates an XXH64 collision:

#include <stdio.h>
#include <stdint.h>
#include <xxhash.h>

int main(void)
{
        const uint8_t str1[] = { 0xd0, 0x08, 0xad, 0x10, 0x02, 0x00, 0x00, 0x00, 0xd0, 0x08, 0xad, 0x10, 0x02, 0x00, 0x00, 0x00 };
        const uint8_t str2[] = { 0xa6, 0x03, 0x26, 0xc5, 0x01, 0x00, 0x00, 0x00, 0xa6, 0x03, 0x26, 0xc5, 0x01, 0x00, 0x00, 0x00 };
        printf("%lu\n", XXH64(str1, sizeof(str1), 156211));
        printf("%lu\n", XXH64(str2, sizeof(str2), 156211));
}

This was done simply by hashing lots of integers starting from zero, storing them all in a huge array, sorting said array and looking for duplicates. I couldn't get it to easily trigger without hashing the number twice, though; it seems there's some resistance to 64-bit collisions with only 64-bit inputs.

@easyaspi314
Copy link
Contributor

easyaspi314 commented Jul 3, 2020

Or you could just use the known XXH64 multicollision:

#define XXH_INLINE_ALL
#include <xxhash.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <inttypes.h>

static void print_buf(const uint8_t *buffer, size_t len)
{
    printf("0x");
    for (size_t i = 0; i < len; i++) {
        printf("%02x", buffer[i]);
    }
}

// https://github.com/Cyan4973/xxHash/issues/54#issuecomment-414061026
int main(void)
{
    uint8_t real[64], collide[64], seedBuf[8];

    // Fill with random integers
    FILE *urandom = fopen("/dev/urandom", "rb");
    if (!urandom)
        return 1;
    if (fread(real,    sizeof(real),    1, urandom) != 1)
        return 1;
    if (fread(seedBuf, sizeof(seedBuf), 1, urandom) != 1)
        return 1;
    fclose(urandom);

    uint64_t seed = XXH_readLE64(seedBuf);

    // Copy the collision code
    memcpy(collide, real, sizeof(collide));

    uint64_t x = XXH64(real,    sizeof(real),    seed);
    printf("XXH64(");
    print_buf(real, 64);
    printf(", 64, %#016"PRIx64") = %#016"PRIx64"\n", seed, x);
    for (size_t i = 0; i < 256; i++) {
        // ((uint64_t *)collide)[i % 4] += 0x0BA79078168D4BAF;
        uint64_t a = XXH_readLE64(&collide[8 * i % 4]);
        a += 0x0BA79078168D4BAF;
        XXH_writeLE64(&collide[8 * i % 4], a);

        // ((uint64_t *)collide)[4 + i % 4] += 0x9C90005B80000000;
        uint64_t b = XXH_readLE64(&collide[32 + 8 * i % 4]);
        b += 0x9C90005B80000000;
        XXH_writeLE64(&collide[32 + 8 * i % 4], b);

        uint64_t y = XXH64(collide, sizeof(collide), seed);
        printf("XXH64(");
        print_buf(collide, 64);
        printf(", 64, %#016"PRIx64") = %#016"PRIx64"\n", seed, y);
    }
}

However, XXH64 is bijective for 64-bit inputs, so it is impossible for there to be a collision at that length.

@yoeden
Copy link

yoeden commented Jul 13, 2022

@ramytarchichy made for creating hashtables or hashsets, allowing for O(1) lookup in memory,
but what if I hash one of my datums and a collision occurs causing duplication of keys ?
it is necessary that each hash should be unique even in this scenario.

@ramytarchichy
Copy link

ramytarchichy commented Jul 13, 2022

@Br4inFreze This happens all the time, and the hash table should be built in a way to handle such scenarios. Usually it's done by putting all values whose keys have the same hash in a list of key-value pairs. That way, the caller hashes the key, finds the list in O(1), then iterates through the list and compares keys to find the correct value.

In a best-case scenario, all lists have only 1 element in them and lookup is constant-time. In a worst-case scenario, all elements have the same hash, get put in the same list, and the time complexity for lookups goes up to O(n). But, as long as everything is designed properly, that is rarely the case, and we can assume O(1) amortized.

TL;DR: you should never assume that the hash function will not create collisions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants