Just a place to store graphs and benchmark results #236

Closed · Cyan4973 opened this issue Jul 26, 2019 · 35 comments

@Cyan4973 (Owner) commented Jul 26, 2019

Performance graphs, comparing XXH3 with other portable non-cryptographic hashes:

Bandwidth on large data (~100 KB)
(image: H_bandwidth_bargraph)

Bandwidth on x64 on variable size data of len 2^N
(image: H_bandwidth_x64)

Bandwidth on x86 on variable size data of len 2^N (32-bit friendliness)
(image: H_bandwidth_x86)

The following benchmarks concentrate on small data:

Throughput on small data of fixed length N
Throughput tests generate as many hashes as possible. This is the easiest methodology for benchmark measurements, but it is mostly representative of "batch mode" scenarios.
(image: H_throughput_fixedS)

Throughput on small data of random length [1-N]
When the input size is random, it's much harder for the branch predictor to select the right path through the algorithm. The impact on performance can be major.
(image: H_throughput_randomS)

Latency on small data of fixed length N
Latency tests require the algorithm to complete one hash before starting the next one. This is generally more representative of situations where the hash is integrated into a larger algorithm.
(image: H_latency_fixedS)

Latency on small data of random length [1-N]
(image: H_latency_randomS)

Cyan4973 changed the title from "Fake issue, just to store graphs" to "Just a place to store graphs and benchmark results" on Aug 16, 2019
@Cyan4973 (Owner) commented Aug 16, 2019

Collision study

This is the first test I'm aware of that can directly probe the collision efficiency of 64-bit hashes (32-bit collision efficiency was already well covered by SMHasher). Unfortunately, 128-bit collision analysis is still way off.

The test generates a massive number of hashes and counts the collisions produced. It doesn't come cheap: it requires a lot of time and a lot of RAM.
The generated input patterns are relatively close to each other (low Hamming distance), with only a few bits of difference. This is supposed to be more difficult for low-quality hashes.
The test indicates the number of collisions expected for the given amount of hashes. The measured number of collisions does not have to be strictly identical; it's enough to be in the vicinity.

Testing 64-bit hashes on large inputs:

An input of 256 bytes is expected to always trigger the "long" mode for hash algorithms featuring multiple modes depending on input length.

| Algorithm | Input Len | Nb Hashes | Expected Collisions | Measured Collisions | Notes |
|---|---|---|---|---|---|
| XXH3 | 256 | 100 Gi | 312.5 | 326 | |
| XXH64 | 256 | 100 Gi | 312.5 | 294 | |
| XXH128 low 64-bit | 512 | 100 Gi | 312.5 | 321 | |
| XXH128 high 64-bit | 512 | 100 Gi | 312.5 | 325 | |
| mmh64a | 256 | 100 Gi | 312.5 | 310 | |
| city64 | 256 | 100 Gi | 312.5 | 303 | |
| t1ha2 | 256 | 100 Gi | 312.5 | 761 | a bit too large; loses ~1 bit of distribution |
| mumv2 | 256 | 100 Gi | 312.5 | 1229 | a bit too large; loses ~2 bits of distribution |
| wyhash | 256 | 100 Gi | 312.5 | 1202 | a bit too large; loses ~2 bits of distribution |
| fnv64 | 256 | 100 Gi | 312.5 | 303 | very long test: 23h30, compared with 3h20 for XXH3 |
| blake2b low 64-bit | 256 | 100 Gi | 312.5 | 296 | very long test: 48h! |
| md5 low 64-bit | 256 | 100 Gi | 312.5 | 301 | very long test: 58h! |

t1ha2 and mumv2 both feature a collision rate which is not ideal: t1ha2's distribution is closer to 63 bits, and mumv2's is closer to 62 bits. Note though that it's not "horrible"; both can be considered "reasonable" for most practical usages.
wyhash, being an unofficial variation of mumv2, unsurprisingly inherits the same distribution issues.

Testing 64-bit non-portable hashes:

AES-based hashes rely on Intel's AES instruction extension, introduced alongside SSE4.2. They do not run on other platforms.

| Algorithm | Type | Input Len | Nb Hashes | Expected Collisions | Measured Collisions | Notes |
|---|---|---|---|---|---|---|
| meowhash low 64-bit | AES | 256 | 100 Gi | 312.5 | 18200 | pretty bad |
| aquahash low 64-bit | AES | 256 | 100 Gi | 312.5 | 3488136 | unusable |

Since AES implies cryptographic quality, one would logically expect it to also be good enough for dispersion and randomness. This assumption appears incorrect: AES-based hashes feature surprisingly bad collision properties.

The collision ratio is too high to recommend using these algorithms as checksums.

Testing 64-bit hashes on small inputs:

The test with 8-byte inputs and 8-byte outputs checks whether some "space reduction" happens at this length. If there is no space reduction, then the transformation is a bijection, which guarantees no collision between any two different inputs.

| Algorithm | Input Len | Nb Hashes | Expected Collisions | Measured Collisions | Notes |
|---|---|---|---|---|---|
| XXH3 | 8 | 100 Gi | 312.5 | 0 | XXH3 is bijective for len==8. It guarantees no collision. |
| XXH64 | 8 | 100 Gi | 312.5 | 0 | XXH64 is also bijective for len==8 |
| mmh64a | 8 | 100 Gi | 312.5 | 0 | bijective |
| city64 | 8 | 100 Gi | 312.5 | 278 | |
| t1ha2 | 8 | 100 Gi | 312.5 | 607 | a bit too large; loses ~1 bit of distribution |
| mumv2 | 8 | 100 Gi | 312.5 | 544 | a bit too large; loses ~1 bit of distribution |
| wyhash | 8 | 100 Gi | 312.5 | 583 | a bit too large; loses ~1 bit of distribution |
| fnv64 | 8 | 100 Gi | 312.5 | 26 | |

Testing 128-bit hashes:

The only acceptable score for these tests is always 0. If the hash algorithm offers 128 bits of dispersion, the probability of a single collision showing up is smaller than that of winning the national lottery twice in a row.
In contrast with the 64-bit tests, due to resource limitations, the test does not provide a precise 128-bit collision estimation. It merely eliminates hashes with fewer than ~80 bits of effective distribution.

| Algorithm | Input Len | Nb Hashes | Expected Collisions | Measured Collisions | Notes |
|---|---|---|---|---|---|
| XXH128 | 256 | 100 Gi | 0.0 | 0 | |
| cityhash128 | 256 | 100 Gi | 0.0 | 0 | |
| t1ha2 128-bit | 256 | 100 Gi | 0.0 | 14 | distribution equivalent to ~70 bits |

Tests on small input lengths:

| Algorithm | Input Len | Nb Hashes | Expected Collisions | Measured Collisions | Notes |
|---|---|---|---|---|---|
| XXH128 | 16 | 25 Gi | 0.0 | 0 | range 9-16 |
| XXH128 | 32 | 25 Gi | 0.0 | 0 | range 17-128 |
| XXH128 | 100 | 13 Gi | 0.0 | 0 | range 17-128 |
| XXH128 | 200 | 13 Gi | 0.0 | 0 | range 129-240 |

@hungptit

@Cyan4973 How can I reproduce your benchmark results? And if possible, could you add wyhash to your graphs? xxHash is faster than wyhash above a certain size, i.e. about 256 bytes in my tests; it would be very helpful to let users know about that.


@Cyan4973 (Owner)

@hungptit, I haven't open-sourced the collision analyzer yet. This effort still needs to be done. Maybe I'll add it to the repository later, typically under the /tests directory.

For the time being, I'm just populating a few test results, trying to complete a picture.
I will certainly add aquahash, and maybe wyhash if you feel that's interesting too.
It takes some time though, as it typically requires several hours per measurement, on a special server that can only be accessed from time to time.

@hungptit

@Cyan4973 I had planned to send you a PM to ask about suggestions for a collision analyzer, and luckily you pointed me to this thread. I am looking forward to your collision analyzer, and I assume other users are also eager to try it, given the number of years you have been working on these excellent hashing applications. I think wyhash is an excellent hash, and it would be helpful to users if you could include it in your benchmark. AquaHash (created by Andrew Rogers) is new, has not been through any serious battle yet, and is not portable, so you do not need to worry about it for now. I can help add it to your collision analyzer once you open-source it.

@Cyan4973 (Owner) commented Aug 21, 2019

I will move the graphs and tables into the "wiki" section, where they are more appropriate as documentation, I suppose.

For the time being, the first page is here:
https://github.com/Cyan4973/xxHash/wiki/Performance-comparison

I will continue with the collision analysis results.

@Cyan4973 (Owner)

Btw @hungptit, if you are interested in the performance graphs, as opposed to the collision analyzer, the benchmark tool is already available in the repository at /tests/bench.
It's designed for (relatively) easy integration of other hashes.

@Cyan4973 (Owner)

I completed a 64-bit collision test for some AES-based algorithms.
The results are devastating, especially for AquaHash. I cannot recommend using this hash for anything.

@hungptit commented Aug 27, 2019

@Cyan4973 It is bad news for me, because I may not be able to use AquaHash for my local cache :) However, it is also good news: you have confirmed that AquaHash may not be good enough for normal use cases, and I need to figure out something else. Do you have any results for wyhash and clhash? And how about other AES-based algorithms, such as Meow hash?

I look forward to using your collision analysis tool in the near future, so I can quickly evaluate any new hash algorithm, i.e. its performance and quality. Let me know if you need help with testing and/or evaluating your analyzer.

@Cyan4973 (Owner)

> Do you have any result for wyhash and clhash?

I've updated the table with results from more hashes.
clhash still needs to be done.

@hungptit

@Cyan4973 I think your benchmark + analysis would be enough to write an excellent article. I would love to read about how you compare hash functions mathematically, with some benchmark numbers :)

@easyaspi314 (Contributor)

Slight error: AES is SSE4.2.

@Cyan4973 (Owner) commented Aug 30, 2019

On the Intel Intrinsics Guide, it's not even listed as part of SSE: it seems to be in its own category:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=aes

That may be why it has its own switch, -maes.

That being said, it was released "at the same time as", say, SSE4.2, so it's easy to make the equivalence.

@svpv commented Sep 26, 2019

Was it meowhash 0.5 or a previous version?

@svpv commented Sep 26, 2019

The 1-bit loss in t1ha2 is mostly due to the wide multiplication used as one of the last steps in mixing the final result. The thing is, the operation of the form `lo, hi = acc * K; lo ^ hi` is not invertible. In this gist I studied 32x32 multiplication exhaustively. The collision rate is worse with `lo ^ hi` than with `lo + hi`, but `lo + hi` retains more bit-level correlations. On the other hand, wide multiplication can be viewed as a security primitive: it seems that you can't get back to the data other than by brute force, which takes about 2^32 or 2^64 steps (I tried the z3 solver, and indeed I couldn't).

@easyaspi314 (Contributor)

You don't need that much brute force to understand that.

Just print out a small table, as multiplication properties scale.

      add       |       xor      
0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0
0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7
0 2 4 6 1 3 5 7 | 0 2 4 6 1 3 5 7
0 3 6 2 5 0 4 7 | 0 3 6 0 5 6 0 7
0 4 1 5 2 6 3 7 | 0 4 1 5 2 6 3 7
0 5 3 0 6 4 1 7 | 0 5 3 6 6 2 5 7
0 6 5 4 3 1 0 7 | 0 6 5 0 3 5 0 7
0 7 7 7 7 7 7 7 | 0 7 7 7 7 7 7 7
0:   18 (28.1%)   19 (29.7%)
1:    5 (7.8%)     3 (4.7%)
2:    4 (6.2%)     4 (6.2%)
3:    6 (9.4%)     6 (9.4%)
4:    6 (9.4%)     3 (4.7%)
5:    6 (9.4%)     8 (12.5%)
6:    6 (9.4%)     8 (12.5%)
7:   13 (20.3%)   13 (20.3%)

You can pretty clearly see that xor breaks up some of the patterns, but it has more bias on the high bits because it saturates instead of overflowing. It is also less symmetrical.

That is the main reason for xor folding those high bits every so often. That spreads stuff to the lower bits which don't fare well.

@Cyan4973 (Owner) commented Oct 4, 2019

Good news: I can finally measure the collision ratio of 128-bit hash functions, so I will start updating the tables with these results.

The primary intention is to check that XXH128() delivers a perfect collision score (0) at multiple input lengths, to validate each internal variant.

The collision generator cannot reach the full 128-bit level exactly; that would require an insane amount of resources. But it can get close to ~80 bits, and that's enough to break some hash algorithms which announce 128 bits of variance. It's an interesting result, because such limitations were invisible until now.

@easyaspi314 (Contributor) commented Nov 3, 2019

By the way, I thought I said this before but I guess I didn't.

> AES implying cryptographic quality, one would logically expect that it's also good enough for dispersion and randomness. This assumption appears incorrect: AES based hashes feature surprisingly bad collision properties.

That is because AES isn't a hash function. It is an encryption algorithm.

Why AES sucks for hashing

Encryption is just a fancy permutation, and it is designed to be reversible.

Hashing is more than a fancy permutation, and it is designed to be irreversible and to highlight one-bit changes by changing basically most of the bits.

AES doesn't do that; a single round fails miserably at the last property. Here is how one AES round (from _mm_aesenc_si128) reacts to single-bit changes:

Key:
f181bfe03e942b6332fb552248a63051
e9254323fdb11aff3b13d720f18bbfe2 189ec62f3f0478a6a65b841445333341
e9354323fdb11aff3b13d720f18bbfe2 189ec62f3f0478a6a65b8414a57a9ae8
  ^^

7ec24a13eead45e9036831a9988292be c1bfa5fb939a5feb39d9a85616cbe16d
7ec24a13eead45e9236831a9988292be c1bfa5fb939a5feb8384f5b116cbe16d
                ^^

b6bb76d0bab1e391d1d41c90b6280f79 044ddb11131aebbaab64b775807416b8
b6bb76d0bab1e391d1d41c90b6280779 044ddb11a0d49609ab64b775807416b8
                            ^^

226720675023cf81cc68035f26c815e1 252beff88b231b9b3cb597b86c0a07d9
226720675023cf01cc68035f26c815e1 252beff88b231b9b4cc507586c0a07d9
              ^^

e59cf0c82dc29d4a52537261cce58333 b4119d68bd4afeda180beec0164733c9
a59cf0c82dc29d4a52537261cce58333 11ce4212bd4afeda180beec0164733c9
^^

eb8329ba04f622075548d06a2a0f4fc6 3ac8338147a19f4b2bdada7f272c021e
eb8329ba04f622075548d06a2a0f4f46 d4261a4647a19f4b2bdada7f272c021e
                              ^^

17f3c1b43d1408b075d495f9089be48b 08dda6414c151a2ab0211e279645dd15
17f3c1b43d1408b055d495f9089be48b 08dda6414c151a2a72407f849645dd15
                ^^

9e90bbc1971009677a3376c9f976e0ed cf18f1a5f2c13132baf1ec941d21b721
9e90bbc1971089677a3376c9f976e0ed cf18f1a5f2c13132baf1ec94bbd0e087
            ^^

As you can see, only 32 bits are affected, since the state is treated as separate lanes.

Source for output (not my best code lol)
// _mm_aesenc_si128 equivalent
// https://github.com/veorq/aesenc-noNI, modified to not overwrite
#include <stdint.h>

static const uint8_t sbox[256] =
{ 0x63, 0x7c, 0x77, 0x7b, 0xf2, 0x6b, 0x6f, 0xc5, 0x30, 0x01, 0x67, 0x2b, 0xfe,
  0xd7, 0xab, 0x76, 0xca, 0x82, 0xc9, 0x7d, 0xfa, 0x59, 0x47, 0xf0, 0xad, 0xd4,
  0xa2, 0xaf, 0x9c, 0xa4, 0x72, 0xc0, 0xb7, 0xfd, 0x93, 0x26, 0x36, 0x3f, 0xf7,
  0xcc, 0x34, 0xa5, 0xe5, 0xf1, 0x71, 0xd8, 0x31, 0x15, 0x04, 0xc7, 0x23, 0xc3,
  0x18, 0x96, 0x05, 0x9a, 0x07, 0x12, 0x80, 0xe2, 0xeb, 0x27, 0xb2, 0x75, 0x09,
  0x83, 0x2c, 0x1a, 0x1b, 0x6e, 0x5a, 0xa0, 0x52, 0x3b, 0xd6, 0xb3, 0x29, 0xe3,
  0x2f, 0x84, 0x53, 0xd1, 0x00, 0xed, 0x20, 0xfc, 0xb1, 0x5b, 0x6a, 0xcb, 0xbe,
  0x39, 0x4a, 0x4c, 0x58, 0xcf, 0xd0, 0xef, 0xaa, 0xfb, 0x43, 0x4d, 0x33, 0x85,
  0x45, 0xf9, 0x02, 0x7f, 0x50, 0x3c, 0x9f, 0xa8, 0x51, 0xa3, 0x40, 0x8f, 0x92,
  0x9d, 0x38, 0xf5, 0xbc, 0xb6, 0xda, 0x21, 0x10, 0xff, 0xf3, 0xd2, 0xcd, 0x0c,
  0x13, 0xec, 0x5f, 0x97, 0x44, 0x17, 0xc4, 0xa7, 0x7e, 0x3d, 0x64, 0x5d, 0x19,
  0x73, 0x60, 0x81, 0x4f, 0xdc, 0x22, 0x2a, 0x90, 0x88, 0x46, 0xee, 0xb8, 0x14,
  0xde, 0x5e, 0x0b, 0xdb, 0xe0, 0x32, 0x3a, 0x0a, 0x49, 0x06, 0x24, 0x5c, 0xc2,
  0xd3, 0xac, 0x62, 0x91, 0x95, 0xe4, 0x79, 0xe7, 0xc8, 0x37, 0x6d, 0x8d, 0xd5,
  0x4e, 0xa9, 0x6c, 0x56, 0xf4, 0xea, 0x65, 0x7a, 0xae, 0x08, 0xba, 0x78, 0x25,
  0x2e, 0x1c, 0xa6, 0xb4, 0xc6, 0xe8, 0xdd, 0x74, 0x1f, 0x4b, 0xbd, 0x8b, 0x8a,
  0x70, 0x3e, 0xb5, 0x66, 0x48, 0x03, 0xf6, 0x0e, 0x61, 0x35, 0x57, 0xb9, 0x86,
  0xc1, 0x1d, 0x9e, 0xe1, 0xf8, 0x98, 0x11, 0x69, 0xd9, 0x8e, 0x94, 0x9b, 0x1e,
  0x87, 0xe9, 0xce, 0x55, 0x28, 0xdf, 0x8c, 0xa1, 0x89, 0x0d, 0xbf, 0xe6, 0x42,
  0x68, 0x41, 0x99, 0x2d, 0x0f, 0xb0, 0x54, 0xbb, 0x16 };

#define XT(x) (((x) << 1) ^ ((((x) >> 7) & 1) * 0x1b))

void aesenc (uint8_t *out, const uint8_t *s, const uint8_t *rk) {
    uint8_t i, t, u, v[4][4];
    for (i = 0; i < 16; ++i) v[((i / 4) + 4 - (i%4) ) % 4][i % 4] = sbox[s[i]];
    for (i = 0; i < 4; ++i) {
        t = v[i][0];
        u = v[i][0] ^ v[i][1] ^ v[i][2] ^ v[i][3];
        v[i][0] ^= u ^ XT(v[i][0] ^ v[i][1]);
        v[i][1] ^= u ^ XT(v[i][1] ^ v[i][2]);
        v[i][2] ^= u ^ XT(v[i][2] ^ v[i][3]);
        v[i][3] ^= u ^ XT(v[i][3] ^ t);
    }
    for (i = 0; i < 16; ++i) out[i] = v[i / 4][i % 4] ^ rk[i];
}

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static void print_buf(const uint8_t *buf)
{
    for (int i = 0; i < 16; i++) {
        printf("%02x", buf[i]);
    }
}

int main(void)
{
    srand(time(NULL));
    uint8_t input[16], output[16], key[16];

    for (int i = 0; i < 16; i++) {
        key[i] = rand();
    }
    printf("Key:\n");
    print_buf(key);
    for (int j = 0; j < 8; j++) {
        for (int i = 0; i < 16; i++) {
            input[i] = rand();
        }
        printf("\n");
        print_buf(input);
        printf(" ");
        aesenc(output, input, key);
        print_buf(output);
        int mask = rand();
        input[(mask / 8) % 16] ^= (1 << (mask % 8));

        printf("\n");
        print_buf(input);
        printf(" ");
        aesenc(output, input, key);
        print_buf(output);
        printf("\n%*s\n", ((mask / 8) % 16) * 2 + 2, "^^");
    }
}

Additionally, AES cancels itself out:

aesenc(x, aesenc(x, y)) == y

While this is fine for encryption, since you only use the input once and the key is constant, in hashing this is a terrible property, since you can just erase the existence of something.

So if you had a hash that looked like this....

uint128_t hash(uint128_t seed, const uint8_t *input, size_t len)
{
    uint128_t acc = seed;
    for (size_t i = 0; i < len / 16; i++) {
        acc ^= aesenc(read128(input + 16 * i), acc);
    }
    if (len % 16) {
        acc ^= aesenc(readpartial128(input + (len & ~(size_t)15), len & 15), acc);
    }
    return acc;
}

You have 30 seconds to find at least 340282366920938463463374607431768211456 (2^128) length extension collisions with this:

00112233445566778899AABBCCDDEEFF

Lastly, AES is terribly slow if you don't have AES-NI or ARM Crypto. While AES-NI is pretty common, with all Intel chips >= Nehalem and AMD chips >= Bulldozer (iirc), the ARM Crypto extension is fairly recent and optional, and many manufacturers leave it out for cost and simplicity (see also: integer division in ARMv7).

And we have more than proven that you can write a good, fast hash function without any non-portable instructions.

Just see all xxHash variants and basically any scalar hash.

@easyaspi314 (Contributor)

Yawn, hopefully I'm back from my hiatus, and I got a new PC!

AMD Ryzen 5 3600
16 GB RAM
Windows 10
Clang 9.0.0 (from VS2019)

Scalar

xxhsum.exe 0.7.2 (64-bits x86_64 little endian), Clang 9.0.0 (tags/RELEASE_900/final), by Yann Collet
Sample of 100 KB...
XXH32                    :     102400 ->    18963 it/s ( 1851.9 MB/s)
XXH32 unaligned          :     102400 ->    18904 it/s ( 1846.1 MB/s)
XXH64                    :     102400 ->    36571 it/s ( 3571.4 MB/s)
XXH64 unaligned          :     102400 ->    36572 it/s ( 3571.5 MB/s)
XXH3_64b                 :     102400 ->    25869 it/s ( 2526.2 MB/s)
XXH3_64b unaligned       :     102400 ->    26034 it/s ( 2542.4 MB/s)
XXH3_64b seeded          :     102400 ->    25815 it/s ( 2521.0 MB/s)
XXH3_64b seeded unaligne :     102400 ->    26034 it/s ( 2542.4 MB/s)
XXH128                   :     102400 ->    26034 it/s ( 2542.4 MB/s)
XXH128 unaligned         :     102400 ->    25815 it/s ( 2521.0 MB/s)
XXH128 seeded            :     102400 ->    25815 it/s ( 2521.0 MB/s)
XXH128 seeded unaligned  :     102400 ->    25815 it/s ( 2521.0 MB/s)

SSE2

xxhsum.exe 0.7.2 (64-bits x86_64 + SSE2 little endian), Clang 9.0.0 (tags/RELEASE_900/final), by Yann Collet
Sample of 100 KB...
XXH32                    :     102400 ->    18963 it/s ( 1851.9 MB/s)
XXH32 unaligned          :     102400 ->    19081 it/s ( 1863.4 MB/s)
XXH64                    :     102400 ->    36830 it/s ( 3596.7 MB/s)
XXH64 unaligned          :     102400 ->    37012 it/s ( 3614.5 MB/s)
XXH3_64b                 :     102400 ->   307200 it/s (30000.0 MB/s)
XXH3_64b unaligned       :     102400 ->   307200 it/s (30000.0 MB/s)
XXH3_64b seeded          :     102400 ->   302602 it/s (29551.0 MB/s)
XXH3_64b seeded unaligne :     102400 ->   307200 it/s (30000.0 MB/s)
XXH128                   :     102400 ->   279273 it/s (27272.7 MB/s)
XXH128 unaligned         :     102400 ->   279273 it/s (27272.7 MB/s)
XXH128 seeded            :     102400 ->   279273 it/s (27272.7 MB/s)
XXH128 seeded unaligned  :     102400 ->   279273 it/s (27272.7 MB/s)

AVX2 (hot damn!)

xxhsum.exe 0.7.2 (64-bits x86_64 + AVX2 little endian), Clang 9.0.0 (tags/RELEASE_900/final), by Yann Collet
Sample of 100 KB...
XXH32                    :     102400 ->    19081 it/s ( 1863.4 MB/s)
XXH32 unaligned          :     102400 ->    19021 it/s ( 1857.5 MB/s)
XXH64                    :     102400 ->    37463 it/s ( 3658.5 MB/s)
XXH64 unaligned          :     102400 ->    37463 it/s ( 3658.5 MB/s)
XXH3_64b                 :     102400 ->   768000 it/s (75000.0 MB/s)
XXH3_64b unaligned       :     102400 ->   768000 it/s (75000.0 MB/s)
XXH3_64b seeded          :     102400 ->   768000 it/s (75000.0 MB/s)
XXH3_64b seeded unaligne :     102400 ->   614400 it/s (60000.0 MB/s)
XXH128                   :     102400 ->   614400 it/s (60000.0 MB/s)
XXH128 unaligned         :     102400 ->   614400 it/s (60000.0 MB/s)
XXH128 seeded            :     102400 ->   614400 it/s (60000.0 MB/s)
XXH128 seeded unaligned  :     102400 ->   614400 it/s (60000.0 MB/s)

@Cyan4973 (Owner)

Nice new config @easyaspi314!

An AMD platform is welcome, as it provides a nice comparison point to measurements on Intel ones!

Two details stand out:

  • The scalar code path seems very slow, including for legacy XXH32 and XXH64. That's really unexpected. I'm wondering if there is an explanation.
  • Some of the produced figures, especially on the vector side, feel too "round". I'm wondering if there is some sort of side effect of limited timer resolution. I'm going to look into the code.
    In the meantime, I suspect that benchHash would provide more reliable figures, as its timer logic is more sophisticated.

@easyaspi314 (Contributor) commented Dec 10, 2019

Oh oof, I just realized that CMake was on RelWithDebInfo, which does /Ob1.

That would explain it, since the read functions are just static.

I'll get a new bench later.

@Cyan4973 (Owner) commented Dec 10, 2019

I've also updated the internal benchmark function, so that the risk of producing a significant rounding error, due to a measurement period too small compared to the timer's resolution, is now better controlled.

@easyaspi314 (Contributor) commented Dec 11, 2019

Timer inaccuracy would explain a lot.
My old Dell would get 4800.0 on XXH3 and randomly jump to 9600.0.

Since it definitely wasn't going into super saiyan mode (Core 2 has no Turbo Boost), the best explanation is an inaccurate timer.

There were other places where I got an even multiple, like the POWER9 which got 30000.0.

@easyaspi314 (Contributor)

This is a lot more believable!
Scalar

xxhsum.exe 0.7.2 (64-bits x86_64 little endian), Clang 9.0.0 (tags/RELEASE_900/final), by Yann Collet
Sample of 100 KB...
XXH32                    :     102400 ->    80603 it/s ( 7871.4 MB/s)
XXH32 unaligned          :     102400 ->    80858 it/s ( 7896.3 MB/s)
XXH64                    :     102400 ->   161041 it/s (15726.6 MB/s)
XXH64 unaligned          :     102400 ->   160885 it/s (15711.4 MB/s)
XXH3_64b                 :     102400 ->   107670 it/s (10514.6 MB/s)
XXH3_64b unaligned       :     102400 ->   106251 it/s (10376.1 MB/s)
XXH3_64b seeded          :     102400 ->   108521 it/s (10597.8 MB/s)
XXH3_64b seeded unaligne :     102400 ->   105597 it/s (10312.2 MB/s)
XXH128                   :     102400 ->    97262 it/s ( 9498.3 MB/s)
XXH128 unaligned         :     102400 ->    97017 it/s ( 9474.3 MB/s)
XXH128 seeded            :     102400 ->    97256 it/s ( 9497.7 MB/s)
XXH128 seeded unaligned  :     102400 ->    96612 it/s ( 9434.8 MB/s)

SSE2

xxhsum.exe 0.7.2 (64-bits x86_64 + SSE2 little endian), Clang 9.0.0 (tags/RELEASE_900/final), by Yann Collet
Sample of 100 KB...
XXH32                    :     102400 ->    81174 it/s ( 7927.1 MB/s)
XXH32 unaligned          :     102400 ->    81086 it/s ( 7918.6 MB/s)
XXH64                    :     102400 ->   160899 it/s (15712.8 MB/s)
XXH64 unaligned          :     102400 ->   161041 it/s (15726.7 MB/s)
XXH3_64b                 :     102400 ->   310023 it/s (30275.7 MB/s)
XXH3_64b unaligned       :     102400 ->   307509 it/s (30030.2 MB/s)
XXH3_64b seeded          :     102400 ->   308900 it/s (30166.0 MB/s)
XXH3_64b seeded unaligne :     102400 ->   305375 it/s (29821.8 MB/s)
XXH128                   :     102400 ->   275146 it/s (26869.7 MB/s)
XXH128 unaligned         :     102400 ->   272461 it/s (26607.6 MB/s)
XXH128 seeded            :     102400 ->   274429 it/s (26799.7 MB/s)
XXH128 seeded unaligned  :     102400 ->   274066 it/s (26764.3 MB/s)

AVX2

xxhsum.exe 0.7.2 (64-bits x86_64 + AVX2 little endian), Clang 9.0.0 (tags/RELEASE_900/final), by Yann Collet
Sample of 100 KB...
XXH32                    :     102400 ->    81895 it/s ( 7997.6 MB/s)
XXH32 unaligned          :     102400 ->    81801 it/s ( 7988.4 MB/s)
XXH64                    :     102400 ->   162077 it/s (15827.8 MB/s)
XXH64 unaligned          :     102400 ->   161465 it/s (15768.1 MB/s)
XXH3_64b                 :     102400 ->   629577 it/s (61482.1 MB/s)
XXH3_64b unaligned       :     102400 ->   628594 it/s (61386.1 MB/s)
XXH3_64b seeded          :     102400 ->   628080 it/s (61335.9 MB/s)
XXH3_64b seeded unaligne :     102400 ->   626429 it/s (61174.7 MB/s)
XXH128                   :     102400 ->   573136 it/s (55970.3 MB/s)
XXH128 unaligned         :     102400 ->   573135 it/s (55970.2 MB/s)
XXH128 seeded            :     102400 ->   571006 it/s (55762.3 MB/s)
XXH128 seeded unaligned  :     102400 ->   569946 it/s (55658.8 MB/s)

Everything lines up properly now, and the results are a lot more accurate.

Especially scalar, which fits the "roughly 1.5x the speed of XXH32" pattern I have typically seen.

@easyaspi314 (Contributor) commented Dec 11, 2019

Here is x86 scalar (with XXH3_hashLong_internal_loop forced inline)

xxhsum.exe 0.7.2 (32-bits i386 little endian), Clang 9.0.0 (tags/RELEASE_900/final), by Yann Collet
Sample of 100 KB...
XXH32                    :     102400 ->    81161 it/s ( 7925.9 MB/s)
XXH32 unaligned          :     102400 ->    80282 it/s ( 7840.0 MB/s)
XXH64                    :     102400 ->    29157 it/s ( 2847.4 MB/s)
XXH64 unaligned          :     102400 ->    29171 it/s ( 2848.8 MB/s)
XXH3_64b                 :     102400 ->    59114 it/s ( 5772.9 MB/s)
XXH3_64b unaligned       :     102400 ->    58609 it/s ( 5723.5 MB/s)
XXH3_64b seeded          :     102400 ->    58959 it/s ( 5757.7 MB/s)
XXH3_64b seeded unaligne :     102400 ->    58092 it/s ( 5673.1 MB/s)
XXH128                   :     102400 ->    62074 it/s ( 6061.9 MB/s)
XXH128 unaligned         :     102400 ->    61971 it/s ( 6051.8 MB/s)
XXH128 seeded            :     102400 ->    61136 it/s ( 5970.3 MB/s)
XXH128 seeded unaligned  :     102400 ->    60714 it/s ( 5929.1 MB/s)

This also matches the "3/4 the speed of XXH32 on 32-bit" pattern. Everything seems to check out.

I say definitely remove this hack:

#if defined(__clang__) && (XXH_VECTOR==0) && !defined(__AVX2__) && !defined(__arm__) && !defined(__thumb__)
static void
#else
XXH_FORCE_INLINE void
#endif

@Cyan4973 (Owner)

I seem to remember that XXH3 used to be slightly faster than XXH32 on 32-bit x86 (scalar code path) (see #236 (comment)).

So, somehow, this property was lost during later updates?

@easyaspi314 (Contributor)

I do see a few tweaks that change the performance, such as putting the multiply over the block, but I am thinking that it is just that XXH32 is faster in relation (AMD seems to have a better ROL), or compiler differences (you used GCC, no?).

I am going to try in msys2 with GCC.

@Cyan4973 (Owner) commented Dec 11, 2019

I just tested it on my laptop, which features an i7-8559U, through an Ubuntu VM.

On the latest dev branch, XXH3 is indeed slower than XXH32 in x86 (32-bit) mode when using the gcc v7.4 compiler.

XXH32                    :     102400 ->    81824 it/s ( 7990.6 MB/s)
XXH64                    :     102400 ->    38282 it/s ( 3738.5 MB/s)
XXH3_64b                 :     102400 ->    64518 it/s ( 6300.6 MB/s)
XXH128                   :     102400 ->    58534 it/s ( 5716.2 MB/s)

However, if I use clang v6.0.0 instead, then XXH3_64b becomes slightly faster:

XXH32                    :     102400 ->    79887 it/s ( 7801.4 MB/s)
XXH64                    :     102400 ->    52088 it/s ( 5086.7 MB/s)
XXH3_64b                 :     102400 ->    88511 it/s ( 8643.6 MB/s)
XXH128                   :     102400 ->    68324 it/s ( 6672.2 MB/s)

Note: I have to force -DXXH_VECTOR=0 at compilation time; otherwise, clang just auto-vectorizes the code with SSE2, and it is no longer comparable.

@easyaspi314 (Contributor) commented Dec 11, 2019

C:\Users\husse\source\repos\xxHash\xxhsum.exe 0.7.2 (32-bits i386 little endian), GCC 9.2.0, by Yann Collet
Sample of 100 KB...
XXH32                    :     102400 ->    82080 it/s ( 8015.6 MB/s)
XXH32 unaligned          :     102400 ->    81773 it/s ( 7985.6 MB/s)
XXH64                    :     102400 ->    29787 it/s ( 2908.9 MB/s)
XXH64 unaligned          :     102400 ->    29940 it/s ( 2923.8 MB/s)
XXH3_64b                 :     102400 ->    93970 it/s ( 9176.8 MB/s)
XXH3_64b unaligned       :     102400 ->    94028 it/s ( 9182.4 MB/s)
XXH3_64b seeded          :     102400 ->    95259 it/s ( 9302.7 MB/s)
XXH3_64b seeded unaligne :     102400 ->    94179 it/s ( 9197.2 MB/s)
XXH128                   :     102400 ->    75474 it/s ( 7370.5 MB/s)
XXH128 unaligned         :     102400 ->    72546 it/s ( 7084.6 MB/s)
XXH128 seeded            :     102400 ->    74622 it/s ( 7287.3 MB/s)
XXH128 seeded unaligned  :     102400 ->    72019 it/s ( 7033.1 MB/s)

thonk

GCC is clearly cheating, there is no way it is generating better code than Clang 😛

This is with the swapped multiply:

    XXH_ALIGN(XXH_ACC_ALIGN) xxh_u64* const xacc = (xxh_u64*) acc;    /* presumed aligned on 32-bytes boundaries, little hint for the auto-vectorizer */
    const xxh_u8* const xinput = (const xxh_u8*) input;  /* no alignment restriction */
    const xxh_u8* const xsecret  = (const xxh_u8*) secret;   /* no alignment restriction */
    size_t i;
    XXH_ASSERT(((size_t)acc & (XXH_ACC_ALIGN-1)) == 0);
    for (i=0; i < ACC_NB; i++) {
        xxh_u64 const data_val = XXH_readLE64(xinput + 8*i);
        xxh_u64 const data_key = data_val ^ XXH_readLE64(xsecret + i*8);
        xacc[i] += XXH_mult32to64(data_key & 0xFFFFFFFF, data_key >> 32);
        if (accWidth == XXH3_acc_64bits) {
            xacc[i] += data_val;
        } else {
            xacc[i ^ 1] += data_val; /* swap adjacent lanes */
        }
    }

@Cyan4973 (Owner)

It's funny that we get the gcc / clang speed difference reversed on our respective platforms...

@easyaspi314 (Contributor) commented Dec 11, 2019

Well, you are using Clang 6.0; perhaps there was a regression…

Are you using -mno-sse2?

IIRC with SSE2+XXH_VECTOR=0, Clang will half vectorize it.

@Cyan4973 (Owner)

Indeed.

clang v6.0.0 is the default version provided with my Ubuntu VM.
However, it's possible to install and test up to v8.0.0.

So I benchmarked xxh with clang v8.0.0, and indeed, performance is worse than with v6.0.0.

XXH32                    :     102400 ->    82794 it/s ( 8085.3 MB/s)
XXH64                    :     102400 ->    33493 it/s ( 3270.8 MB/s)
XXH3_64b                 :     102400 ->    55674 it/s ( 5437.0 MB/s)
XXH128                   :     102400 ->    45227 it/s ( 4416.7 MB/s)

> Are you using -mno-sse2?

That was it. I was only setting -DXXH_VECTOR=0.
After adding the -mno-sse2 flag, performance of clang v6.0.0 is now in line with v8.0.0:

./xxhsum32 0.7.2 (32-bits i386 little endian), Clang 6.0.0 (tags/RELEASE_600/final), by Yann Collet
XXH32                    :     102400 ->    79827 it/s ( 7795.6 MB/s)
XXH64                    :     102400 ->    37508 it/s ( 3662.9 MB/s)
XXH3_64b                 :     102400 ->    55968 it/s ( 5465.6 MB/s)
XXH128                   :     102400 ->    45588 it/s ( 4452.0 MB/s)

@easyaspi314 (Contributor)

I'm really curious now.

I mean, scalar x86 isn't that important since SSE2 is safe to assume, but the fact that GCC is generating significantly faster code piques my interest.

@hungptit commented Dec 11, 2019

@easyaspi314 gcc is better than clang at some optimizations. If you try `double y = 1.0 * x` in Compiler Explorer, you will see that gcc is smarter than both clang and msvc.

@easyaspi314 (Contributor)

/r/whoosh

@easyaspi314 (Contributor) commented Dec 11, 2019

We know that compilers optimize differently, and we have plenty of ugly hacks to correct misoptimizations/dumb expansions.

I was mostly making the joke because Clang is usually better at this stuff, and it is my favorite compiler.


4 participants