
XXH128 #104

Closed
tonda-kriz opened this issue Jun 8, 2017 · 33 comments

Comments

@tonda-kriz

Hi, do you have any plans for creating a 128-bit version (16 bytes for the resulting hash)?

@Cyan4973
Owner

Cyan4973 commented Jun 8, 2017

Yes

@tonda-kriz
Author

tonda-kriz commented Jun 8, 2017

Great, so will you need 2 days or 2 weeks to get it done?

@Cyan4973
Owner

Cyan4973 commented Jun 8, 2017

Probably more than that.

It's difficult to commit to a date, since other activities just get in the way.

I suppose I could find some time to work on it this coming Summer.

@tonda-kriz
Author

Thanks, I'm really glad to hear that.

@data-man

data-man commented Sep 7, 2017

@Cyan4973 Any progress? :)

@Cyan4973
Owner

Cyan4973 commented Sep 7, 2017

Heck no, my time slot for xxHash this summer just vanished in the face of other priorities.
The next slot is October / November.

@yaohanlan

@Cyan4973 Any progress? :)

@Cyan4973
Owner

Cyan4973 commented Apr 3, 2018

Sort of, but on the negative side.
I've tested several variants in the November / December timeframe, and basically ruled them out.
Quality was too low, or speed quickly dropped below the xxh64 level.

This leaves only a couple of options on the table.
And now I also need another time slot to work on these final candidates...

Repository owner deleted a comment from data-man Apr 3, 2018
@Cyan4973
Owner

Cyan4973 commented Apr 3, 2018

Avoid shameless plugs, please.

@bryc

bryc commented Apr 6, 2018

Could a well-distributed 128-bit hash just use the full states v0, v1, v2, v3? Or is there more to it?

@Cyan4973
Owner

Cyan4973 commented Apr 6, 2018

Yes,
this is, in essence, @JCash's proposal:
use the XXH64 "core" and just modify the finalization stage to generate a 128-bit output.

It works.
It's just that I was hoping to find a variant that would run even faster than xxh64.
Such a variant requires, in essence, vectorization instructions to consume data faster.
But the instruction set is fairly limited, so it's hard to preserve both quality and speed.

It could be this was a bad idea after all, and @JCash's approach, by virtue of being more straightforward, might prove the better one.

As stated, there are still a couple of ideas I want to try out before giving up on vectorization-based hashing.
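
A minimal sketch of that idea (not @JCash's actual code, nor the eventual XXH128; the type name and the mixing below are illustrative only): keep the four 64-bit accumulators of the XXH64 core and widen only the finalization.

```c
#include <stdint.h>

typedef struct { uint64_t low64, high64; } XXH128_sketch_t;  /* hypothetical type */

/* Same avalanche as XXH64's finalization (constants are PRIME64_2/3). */
static uint64_t avalanche64(uint64_t h)
{
    h ^= h >> 33;  h *= 0xC2B2AE3D27D4EB4FULL;
    h ^= h >> 29;  h *= 0x165667B19E3779F9ULL;
    h ^= h >> 32;
    return h;
}

/* v1..v4: the running accumulators after the XXH64 main loop.
 * Fold them into two 64-bit lanes instead of one; the exact
 * combination here is illustrative only. */
static XXH128_sketch_t finalize128_sketch(uint64_t v1, uint64_t v2,
                                          uint64_t v3, uint64_t v4,
                                          uint64_t totalLen)
{
    XXH128_sketch_t h128;
    h128.low64  = avalanche64(v1 + v3 + totalLen);
    h128.high64 = avalanche64(v2 + v4 + ~totalLen);
    return h128;
}
```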

@Bramzor

Bramzor commented Apr 6, 2018

What would the use case be for this 128-bit hash? At first I thought it would lower the probability of collisions, but I understood that this probability is already so low for a 64-bit hash that going to 128 bits wouldn't be efficient: it would take up double the storage and the gain is very small.

Am I missing something?

@bryc

bryc commented Apr 6, 2018

@Bramzor it becomes a consideration if you are hashing a large volume of data (billions of keys) and collision resistance is a high priority. It does not need to be expensive to compute, but storage is also a concern. The gains are larger than they look, since each additional hash bit doubles the output space. That is why a 256-bit hash is considered unbreakable in crypto.

@Bramzor

Bramzor commented Apr 6, 2018

@bryc is there any comparison of collision probabilities between 64-bit, 128-bit, or larger hashes? I'm looking at using xxhash for "fingerprinting" a very large amount of data, although false positives wouldn't be a big issue in my specific use case.

@Cyan4973
Owner

Cyan4973 commented Apr 6, 2018

The table displayed at https://stackoverflow.com/a/14210379/646947 is a good answer to this question.

As stated by @bryc, 128 bits becomes useful when there are millions of entries to store and the user has no collision-resolution mechanism in place, so they want to significantly reduce the chance of a single collision ever happening.
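
For a rough sense of scale, the standard birthday-bound approximation (a back-of-the-envelope estimate, not taken from that table) is p ≈ n² / 2^(b+1) for n entries and a b-bit hash. With n = 10^9 keys, a 64-bit hash gives p ≈ (10^9)² / 2^65 ≈ 2.7%, while a 128-bit hash gives p ≈ (10^9)² / 2^129 ≈ 1.5 × 10^-21.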

@Bramzor

Bramzor commented Jun 14, 2018

Can we accept the pull request for the XXH128 code by JCash for calculating 128-bit hashes? I don't want to use an unaccepted pull request, as it might cause issues if a different XXH128 version comes out later. But that won't happen if we accept this pull request as the valid (first) XXH128 version, right? If needed, it's always possible to add an XXH128v2 later if a better/faster method appears.

@Cyan4973
Owner

Even if I end up selecting a solution similar to @JCash's proposal,
which becomes more likely as my vectorization attempts fail one after another,
it will nonetheless be a different version.

The reason is that we've learned a lot since then,
and the benefits of this learning can be used to improve @JCash's proposal.

There are mainly 2 areas of improvement:

  • a friendlier design for 32-bit CPUs
  • a more efficient shortcut for short inputs

The second one is pretty important.

@Bulat-Ziganshin

Bulat-Ziganshin commented Jun 14, 2018

@Bramzor if you don't need any specific functionality of the xxHash framework, you can use one of the well-established 128-bit hashes: SpookyHash, MurmurHash3 or HighwayHash.

@Cyan4973 I think that my zzHash is 80% done as an x86 hash with 128-bit output. Unfortunately, I haven't finished it. The remaining items are:

  • I've started adding two levels of chunking so it can be computed efficiently even on GPUs. All other hashes are strictly sequential.
  • The finalization function. You have argued that it should be as fast as possible in order to compete with existing 32/64-bit hashes on small keys. Now I think that 128-bit hashes usually don't need that much speed, because they are used for long-term data storage (and SSD IOPS are much lower than the hashes/s we can deliver), and 128-bit hashes are usually computed on 256+ bit inputs, where zzHash is already pretty competitive. So I think that the full finalization code should be restored, even if it makes zzHash slower than xxHash32/64 on 1-16 byte inputs.

By 80% I mean 80% of the hash formula design, not counting the effort required for an actual implementation with incremental hashing and so on.

zzHash speed is 2.5 bpc (bytes per cycle), compared to 2 bpc for xxHash32 and 4 bpc for xxHash64. zzHash-wide, employing x64 instructions, reaches 5.5 bpc and can deliver up to 512-bit results, but doesn't have a concrete design yet.

@bryc

bryc commented Jun 14, 2018

@Cyan4973

a friendlier design for 32-bit CPUs

😮 How would that work? That would be very nice indeed!

I have made an attempt to adapt the 32-bit XXH32 code to produce XXH128 in JavaScript (the code is C-like), and it is still fast, but very likely a low-quality hash (I can't really use SMHasher, but it seems v3-v4 have more collisions). So I would like to port that to JS too ;).

@Cyan4973
Owner

@Bulat-Ziganshin :
I'm totally fine with having XXH128 based on a version of zzHash.
I'm also convinced that XXH128 needs to be at least as fast as XXH64,
otherwise it will be perceived as a regression by users.
Plus, it's fairly easy to create a derivative of XXH64 which just outputs 128 bits instead, which is in essence what @JCash proposes. So that's the competition.

Don't worry too much about small inputs; that is a different set of optimizations and must be dealt with in a different code path. I can take care of that.

How would that work?

This is a bit fuzzier, but one could start by using 32-bit constants instead of 64-bit ones.
This allows the use of instructions like imul, which can produce a 64-bit result from two 32-bit inputs. It's basically friendlier to emulate in 32-bit mode, and with vector instructions such as AVX2.
There is a side effect on the spread quality of the formula, which must be carefully weighed, though.

There are similar considerations to keep in mind when introducing shift or rotl operations.
Note that I'm not saying that a 32-bit system will run as fast as a 64-bit one: for that, one would need 32-bit instructions only (like XXH32), and that would drag down 64-bit performance.
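
For illustration (a sketch, not xxHash code): a 32x32 -> 64 multiply is a single instruction even on 32-bit targets and has a direct SSE2/AVX2 vector equivalent, whereas a full 64x64 multiply does not.

```c
#include <stdint.h>

/* 32x32 -> 64 multiply: one IMUL on x86 even in 32-bit mode,
 * and available in vector form as (v)pmuludq under SSE2/AVX2. */
static uint64_t mul32to64(uint32_t a, uint32_t b)
{
    return (uint64_t)a * (uint64_t)b;
}

/* 64x64 -> 64 multiply: on a 32-bit target this expands into
 * several 32-bit multiplies plus adds, and AVX2 has no direct
 * 64x64 multiply instruction. */
static uint64_t mul64(uint64_t a, uint64_t b)
{
    return a * b;
}
```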

@Bulat-Ziganshin

@Cyan4973 what about my idea of developing a multithreading-friendly hash? Do you want to include that in xxh? If you don't buy it, I can skip its development, although, being a perfectionist, I would like to see it implemented.

The idea is: when the input is longer than 4 KB, it's split into 4 KB chunks, and we compute the same 160-bit raw hash value for each chunk as we do for smaller-than-4-KB inputs. Then we hash this stream of 160-bit values. This allows using up to 4K/20 = 200 threads for hashing, making the overall computation up to 200x faster. For inputs larger than 800 KB, we can add a second layer the same way, allowing the use of up to 40K threads, which looks appropriate for modern GPUs.

Since the algorithm used to compute the raw 160-bit hash is the same for small inputs (smaller than 4 KB) and the first chunk of larger inputs, we don't need to buffer data in an incremental implementation. All we need is to store (in the hash state structure) an extra 160 bits of the current raw hash for each added hashing layer.
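
A minimal single-layer sketch of that scheme (hypothetical names, and a trivial stand-in for the real zzHash per-chunk kernel; error handling omitted):

```c
#include <stdint.h>
#include <stdlib.h>

#define CHUNK_SIZE 4096                         /* 4 KB chunks, as proposed */

typedef struct { uint32_t w[5]; } RawHash160;   /* 160-bit raw hash */

/* Trivial stand-in for the real per-chunk kernel (zzHash in the proposal):
 * NOT a quality hash, only here so the sketch is self-contained. */
static RawHash160 hash_chunk_160(const uint8_t* p, size_t len)
{
    RawHash160 h = {{ 0x9E3779B1u, 0x85EBCA77u, 0xC2B2AE3Du, 0x27D4EB2Fu, 0x165667B1u }};
    for (size_t i = 0; i < len; i++) {
        uint32_t* w = &h.w[i % 5];
        *w = (*w ^ p[i]) * 0x9E3779B1u;
        *w = (*w << 13) | (*w >> 19);
    }
    return h;
}

/* One extra layer: hash each 4 KB chunk independently (the part that can
 * run on many threads), then hash the stream of 160-bit chunk hashes. */
static RawHash160 hash_one_layer(const uint8_t* p, size_t len)
{
    if (len <= CHUNK_SIZE)
        return hash_chunk_160(p, len);          /* same path as small inputs */

    size_t nbChunks = (len + CHUNK_SIZE - 1) / CHUNK_SIZE;
    RawHash160* mid = (RawHash160*)malloc(nbChunks * sizeof(RawHash160));
    for (size_t i = 0; i < nbChunks; i++) {     /* trivially parallelizable */
        size_t off  = i * CHUNK_SIZE;
        size_t cLen = (len - off < CHUNK_SIZE) ? (len - off) : CHUNK_SIZE;
        mid[i] = hash_chunk_160(p + off, cLen);
    }
    RawHash160 result = hash_chunk_160((const uint8_t*)mid,
                                       nbChunks * sizeof(RawHash160));
    free(mid);
    return result;
}
```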

@Cyan4973
Owner

I like the idea.

Note though that the current hash speed is already pretty high,
meaning I doubt we can get a 40x boost by using 40 cores,
since memory bandwidth will be the limit anyway.

But even a more limited 2x/4x boost is still nice.

Thread synchronization is not free though.
At such throughput, it's actually dominant,
so you might want to consider a scheme which doesn't require a sync every 4 KB.

@Bramzor

Bramzor commented Jun 15, 2018

Premature optimization is the root of all evil....

That said, I can only speak for myself, but I think most people will have the same opinion: the speed of XXH64 is more than good enough (I only use it for hashing a small string, for example), and therefore people would be happy with the same performance but as a 128-bit hash. Why would people use XXH128? Because they like to use libraries they are already familiar with. The risk at this moment is that it will take some time for XXH128 to reach all the different languages, and that might "force" people to use another hashing library. I could go ahead with XXH64 for some time, but at some point I will need the 128-bit hash anyway, and if it is not here anytime soon, I will have to give up XXH (which I really like) for some other hash.

If xxHash wants to be the best (fastest) 128-bit hash there is, I can respect that, but then I would advise people (and myself) to go for something "that just works" and is commonly supported.

My 2 cents.

@Bulat-Ziganshin

@Cyan4973 You are right, the goal is to make memory bandwidth the only limit, since the computations are so simple.

On an Nvidia GPU, the instructions processing each 4 input bytes in zzHash (2 ADD + MUL + ROL) require about 10 cycles, resulting in a speed of 0.5-1 GB/s per thread. Even with 200 threads we can't fill the 900 GB/s memory bandwidth of a Tesla V100. So we should either further increase the chunk size or add a second layer of chunking. With two layers we can provide work for up to 40K threads, which is 20-40x more than required to fill the memory bandwidth. As time goes on, GPUs get more ALUs, hence the ratio of memory bandwidth to single-ALU speed increases, but this reserve should be enough for the next 20 years.

In order to fill the memory bandwidth of a V100 we need about 1K threads (1K threads × 4 KB per chunk ≈ 4 MB), so the hash computation will be memory-bound starting from a 4 MB input size.

On GPUs, starting one warp (32 threads) is about as expensive as calling a subroutine, i.e. it just requires copying parameters and initializing locals, so the price is negligible compared to the price of processing 32*4K bytes. OTOH, by compressing each 4 KB of input into 20 bytes which are then hashed again, we add an extra 20B/4KB = 0.5% of work for inputs larger than 4 KB, which is again a negligible price.

That's why I chose a 4 KB chunk size and, correspondingly, a 200*4KB = 800 KB second-layer chunk size.

Indeed, syncing on 4 KB chunks is inefficient on desktop CPUs/OSes. But we don't need to do that. Each thread in a CPU implementation can process larger blocks containing multiple chunks and then either 1) store the 160-bit raw hash of each chunk into memory for later processing, or 2) process 800 KB chunks entirely, computing their 160-bit hashes for the next hashing layer.

OTOH, a 32/64 KB chunk size would be a bit more efficient for desktop CPUs (no 0.5% overhead, 8-16x less intermediate data to store), but it would increase the minimum input size that a V100 can process efficiently from the above-mentioned 4 MB to 32-64 MB.

I just pushed the first version of this algorithm (with only one extra layer). The key part of the main hash function is here: https://github.com/Bulat-Ziganshin/FARSH/blob/master/SMHasher/xxHashTest.cpp#L270

@bryc

bryc commented Jun 15, 2018

This allows the use of instructions like imul, which can produce a 64-bit result from two 32-bit inputs. It's basically friendlier to emulate in 32-bit mode, and with vector instructions such as AVX2.

@Cyan4973
Is there no way to keep all states in 32-bit unsigned ints (uint32)? I've searched everywhere, and only the 128-bit x86 MurmurHash3 has that characteristic; it's very rare for non-32-bit hashes. Even some "x86 optimized" hash functions I found still use the uint64 type for multiplication steps, which sadly cannot be done natively in JS without a 64-bit multiplication shim that kills performance.
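
For context, this is roughly what such a shim has to do (a sketch in C of the splitting a JS implementation is forced into; not code from any particular library):

```c
#include <stdint.h>

/* When only 32-bit (or, in JS, 53-bit-safe double) arithmetic is
 * available, a 32x32 -> 64 multiply is built from 16-bit halves:
 * four multiplies plus several adds/shifts replace one native op. */
static void mul32x32_shim(uint32_t a, uint32_t b,
                          uint32_t* hi, uint32_t* lo)
{
    uint32_t aL = a & 0xFFFF, aH = a >> 16;
    uint32_t bL = b & 0xFFFF, bH = b >> 16;

    uint32_t ll = aL * bL;   /* partial products, each fits in 32 bits */
    uint32_t lh = aL * bH;
    uint32_t hl = aH * bL;
    uint32_t hh = aH * bH;

    uint32_t mid = (ll >> 16) + (lh & 0xFFFF) + (hl & 0xFFFF);
    *lo = (mid << 16) | (ll & 0xFFFF);
    *hi = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);
}
```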

@Bulat-Ziganshin

@bryc zzHash (which is a playground for one possible XXH128 implementation) has this feature too. Cyan, on the other hand, talked (in your citation) about a different hash that processes data in 64-bit chunks, thus employing 64-bit variables, but simplifies some operations in order to improve x86/SSE2 efficiency. Processing the input in 64-bit chunks doubles performance on 64-bit CPUs, so you have to choose.

@bryc

bryc commented Jun 16, 2018

@Bulat-Ziganshin

Yeah, that looks promising. Looking closer, it seems it can be ported to native JS since it uses pure 32-bit integers. I have tried, but my C++ ability is weak, so I can't even compile SMHasher in VS. But I like the ideas/changes in the combining/final mixing steps.

Anyway, 👍 for an x86 XXH128 ;) maybe even an x86 XXH64 with a little extra work.

@Cyan4973
Owner

Another possible solution:
keep the XXH32 and XXH64 "cores" as is,
and just change the finalization stage,
so that XXH32 can be used to produce a 64-bit hash,
and XXH64 can be used to produce a 128-bit hash.

This would be relatively trivial to implement.

So far, I've refrained from doing so because of the risk of confusion:
until now, the 32 in the name has referred to both the type of operations and the produced hash width.
This change would separate the 2 concepts.
But what to call these variants?
In particular, how to make sure an XXH32 variant producing a 64-bit output is not confused with XXH64?

It sounds like a simple problem, but it was enough of a deterrent.

I nonetheless like it, because it keeps the XXH128 name available for a later possibility: a hash function working on 128-bit units.

@za3k

za3k commented Oct 13, 2018

XXH32X2? A bit of a mouthful, I know.

@Cyan4973
Owner

Yeah, I'm now convinced that the confusion risk is significant enough that it's not worth it.

So no strange xxh32->64 in sight for the time being.

XXH128 will be a hash producing a 128-bit output,
and not necessarily a hash working on 128-bit units.

@Alain2019

If you want to avoid the confusion, you could also opt for another strategy:

XXH96 based on the 32-bit "engine".
XXH128 based on the 64-bit "engine".

@Alain2019

Over the past days I did some tests with SpookyHash (V2): the 128-bit hash version was a bit faster than xxHash64 (64-bit compilation optimized for Sandy Bridge), both close to 10 GB/s on a Sandy Bridge 2500K. I tested 4K and 256K buffers (256K cached and non-cached). SpookyHash also offers 64- and 32-bit versions of the resulting hash. For smaller buffers I would run a test first; I saw a performance increase of more than 20% going from 1K to 4K.

So for those who need a 128-bit hash running on 64-bit hardware, it's worthwhile to look at SpookyHash. SpookyHash was developed in 2012 by Bob Jenkins, who wrote about hash functions as far back as 1997 in Dr. Dobb's Journal.

@Cyan4973
Owner

Cyan4973 commented Mar 14, 2019

Patch currently under review in #174
