
Add SipHash to benchmarks...? #162

Closed
boazsegev opened this issue Jan 3, 2019 · 6 comments

@boazsegev

I just discovered this repo, and I love it 👍

I'm very impressed by both the function and the performance.

I'm curious whether SipHash could be added to the benchmark suites? It has its own Wikipedia page and appears to be implemented on many platforms and used by some standard libraries.

I know the benchmarks are half implementation and half algorithm design, but both hash implementations appear mature enough to make me curious whether pushing Ruby and Python to adopt xxHash over SipHash might be a good idea.

Thanks!

@Cyan4973
Owner

Cyan4973 commented Jan 4, 2019

Sure, it's a good suggestion.

SipHash is portable (I believe), which is the main requirement for it to be comparable.
I suspect it is at a disadvantage for raw speed, though its main selling point is that it offers improved cryptographic protection regarding collision generation.

I need to find time to build a new benchmark platform for this though.
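
In the meantime, a rough throughput loop along these lines gives a first approximation (just a sketch: it assumes xxhash.h plus a linked-in SipHash implementation exposing the reference siphash() prototype, and it is not the benchmark code behind xxHash's published numbers):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include "xxhash.h"   /* XXH64() */

/* Reference SipHash entry point (veorq/SipHash); assumed to be linked in. */
int siphash(const void *in, size_t inlen, const void *k, uint8_t *out, size_t outlen);

#define BUF_SIZE (100 * 1024)
#define ITERS    20000

int main(void)
{
    static char buf[BUF_SIZE];
    uint8_t  key[16] = {0};
    uint64_t sink = 0;
    memset(buf, 0xA5, sizeof(buf));

    clock_t t0 = clock();
    for (int i = 0; i < ITERS; i++)
        sink ^= XXH64(buf, sizeof(buf), (uint64_t)i);
    double xxh_s = (double)(clock() - t0) / CLOCKS_PER_SEC;

    t0 = clock();
    for (int i = 0; i < ITERS; i++) {
        uint64_t h;
        key[0] = (uint8_t)i;   /* vary the key so the call can't be hoisted */
        siphash(buf, sizeof(buf), key, (uint8_t*)&h, sizeof(h));
        sink ^= h;
    }
    double sip_s = (double)(clock() - t0) / CLOCKS_PER_SEC;

    double mb = (double)BUF_SIZE * ITERS / (1024.0 * 1024.0);
    printf("XXH64   : %8.1f MB/s\n", mb / xxh_s);
    printf("SipHash : %8.1f MB/s\n", mb / sip_s);
    printf("(checksum to keep results live: %llu)\n", (unsigned long long)sink);
    return 0;
}
```

Buffer size and iteration count are arbitrary; for small-input behaviour the same loop would hash short keys instead of a 100 KB buffer.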

@boazsegev
Author

Actually, I found some initial tests at this SMHasher fork, showing xxHash is at least twice as fast for small inputs and far faster for longer inputs.

However, I think this won't be enough...

[SipHash's] main selling point is that it offers improved cryptographic protection regarding collision generation.

I think this is a bigger requirement than I realized, especially given the risks related to hash-flooding attacks (for anyone else reading this, here is a high-level explanation and a more detailed one).

I found this thread indicating that it's possible to produce collisions on demand.

I don't know if it's possible to use this weakness to create multi-collisions and mount hash-flooding attacks... but I think resistance to collision generation should be considered a requirement for widespread adoption.
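
To make the risk concrete: if an attacker can produce many keys that all land in the same bucket, a typical chained hash map degenerates into one long linked-list scan, so N insertions cost O(N^2) work. A toy sketch of that degenerate case (purely illustrative, not taken from any real implementation):

```c
#include <stdlib.h>
#include <string.h>

/* Toy chained hash map, for illustration only. */
typedef struct node { const char *key; struct node *next; } node;

#define NBUCKETS 1024
static node *buckets[NBUCKETS];

static size_t toy_hash(const char *key) {
    size_t h = 0;
    for (; *key; key++) h = h * 31 + (unsigned char)*key;
    return h;
}

/* Insert with a duplicate check, so every insert walks its whole chain. */
static void toy_insert(const char *key) {
    size_t b = toy_hash(key) % NBUCKETS;
    for (node *n = buckets[b]; n; n = n->next)      /* O(chain length) */
        if (strcmp(n->key, key) == 0) return;
    node *n = malloc(sizeof *n);
    n->key = key;                                   /* caller keeps ownership */
    n->next = buckets[b];
    buckets[b] = n;
}

/* With random keys the chains stay short and N inserts cost roughly O(N).
 * If an attacker can supply N keys that all land in one bucket, the scans
 * cost 0 + 1 + ... + (N-1) comparisons: O(N^2) total, which is exactly the
 * hash-flooding denial of service those explanations describe. */
```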

...

As a security-related side note, by setting the first 32 bytes of a message to a specific value, all vectors are equal to the seed after the first round.

This might be exploited to extract seed data or compromise the hash. I'm playing around with my own variation where I use irreversible manipulations to prevent this.
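
To spell out why such a value exists: XXH64 initializes its four accumulators as the seed plus fixed prime offsets, and because PRIME64_2 is odd (hence invertible mod 2^64), a fixed 32-byte prefix can cancel those offsets, leaving every accumulator at the same seed-derived value after the first round. A rough sketch with the published constants (inv64() and collapsing_prefix() are illustrative names, not code from xxHash or from my own experiments):

```c
#include <stdint.h>

#define PRIME64_1 0x9E3779B185EBCA87ULL
#define PRIME64_2 0xC2B2AE3D27D4EB4FULL

static uint64_t rotl64(uint64_t x, int r) { return (x << r) | (x >> (64 - r)); }

/* One XXH64 accumulator round, as specified: acc <- rotl(acc + lane*P2, 31) * P1 */
static uint64_t xxh64_round(uint64_t acc, uint64_t lane) {
    return rotl64(acc + lane * PRIME64_2, 31) * PRIME64_1;
}

/* Multiplicative inverse of an odd 64-bit integer mod 2^64 (Newton iteration). */
static uint64_t inv64(uint64_t a) {
    uint64_t x = a;                               /* correct to the low 3 bits */
    for (int i = 0; i < 5; i++) x *= 2 - a * x;   /* precision doubles each step */
    return x;
}

/* XXH64 initializes its accumulators as seed + {P1+P2, P2, 0, -P1}.  Because
 * PRIME64_2 is odd, the fixed lane -offset * inv(PRIME64_2) cancels each
 * offset for ANY seed, so after xxh64_round() all four accumulators hold the
 * same seed-derived value rotl64(seed, 31) * PRIME64_1.  Serialized as four
 * little-endian 64-bit words, these lanes form the 32-byte prefix in question. */
static void collapsing_prefix(uint64_t lanes[4]) {
    const uint64_t offsets[4] = { PRIME64_1 + PRIME64_2, PRIME64_2, 0,
                                  (uint64_t)0 - PRIME64_1 };
    const uint64_t invP2 = inv64(PRIME64_2);
    for (int i = 0; i < 4; i++)
        lanes[i] = (uint64_t)0 - offsets[i] * invP2;   /* seed-independent */
}
```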

@Cyan4973
Owner

As I'm going to run a new round of benchmarks for the next xxHash release, I may as well add SipHash to the list.

@boazsegev
Author

FYI:

I was inspired by xxHash's approach and used something similar in my own project / library (I named it RiskyHash; you can find the source code here).

BTW:

The more I read, the less convinced I am of the need for hash function "security" where hash maps are concerned. IMHO, the hash map implementation should handle security concerns, not the hashing function. Following this logic, using a faster hashing function could add a significant performance boost.
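
For example, the map itself can pick a random seed per instance and watch its own chain lengths; a sketch of that idea (map_t, map_hash() and map_looks_flooded() are made-up names, not from any particular library):

```c
#include <stdint.h>
#include <stddef.h>
#include "xxhash.h"   /* XXH64() */

/* Hypothetical map object: every instance owns a random seed, so a collision
 * set precomputed against one table does not transfer to another. */
typedef struct {
    uint64_t seed;        /* drawn from a real entropy source (getrandom, etc.) */
    size_t   nbuckets;
    size_t   max_chain;   /* threshold after which the map defends itself */
    /* ... bucket storage ... */
} map_t;

static uint64_t map_hash(const map_t *m, const void *key, size_t len) {
    return XXH64(key, len, m->seed);
}

/* Map-level defense: if one chain grows suspiciously long, the map reacts
 * (pick a fresh seed and rehash, or turn that bucket into a balanced tree),
 * no matter how "secure" the underlying hash function is. */
static int map_looks_flooded(const map_t *m, size_t chain_len) {
    return chain_len > m->max_chain;
}
```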

@easyaspi314
Contributor

Yes, IMO, SipHash is overkill. Even with the "fast" 1-3 variant, its performance is less than half that of XXH32.

You only need a decent hash:

  1. The hash must be seedable. Any unseeded hash can be precalculated. Hashes have a virtually infinite number of collisions; the only question is how easy they are to find.
  2. No seed-independent collisions. The MurmurHash incident was such a big deal because you could easily generate values which would collide regardless of the seed. It doesn't matter what seed you use; that hash is fucked.
  3. It must have good distribution. This makes sure that the buckets are filled evenly, and makes it so you can resize to a power of 2 and avoid a modulus (see the sketch after this list).
  4. It must be fast. The whole point of a hash table is to be as fast as possible. If you are spending 64 cycles a byte, why bother?
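
A quick sketch of what point 3 buys you in practice (bucket_index() is just an illustrative helper built on XXH64; any well-distributed hash works the same way):

```c
#include <stdint.h>
#include <stddef.h>
#include "xxhash.h"

/* With a well-distributed hash the table can stay sized as a power of 2,
 * and the bucket index is a single AND instead of a much slower modulo. */
static size_t bucket_index(const void *key, size_t len,
                           uint64_t seed, size_t nbuckets /* power of 2 */) {
    return (size_t)XXH64(key, len, seed) & (nbuckets - 1);
}
```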
