Just a place to store graphs and benchmark results #236

Closed · Cyan4973 opened this issue Jul 26, 2019 · 35 comments

@Cyan4973 (Owner) commented Jul 26, 2019

Performance graphs, comparing XXH3 with other portable non-cryptographic hashes:

Bandwidth on large data (~100 KB)
(image: H_bandwidth_bargraph)

Bandwidth on x64 on variable size data of len 2^N
(image: H_bandwidth_x64)

Bandwidth on x86 on variable size data of len 2^N (32-bit friendliness)
(image: H_bandwidth_x86)

The following benchmarks concentrate on small data:

Throughput on small data of fixed length N
Throughput tests generate as many hashes as possible. This is the easiest methodology for benchmark measurements, but it is mostly representative of "batch mode" scenarios.
(image: H_throughput_fixedS)

Throughput on small data of random length [1-N]
When the input size is random, it's much harder for the branch predictor to select the right path through the algorithm. The impact on performance can be major.
(image: H_throughput_randomS)

Latency on small data of fixed length N
Latency tests require the algorithm to complete one hash before starting the next one. This is generally more representative of situations where the hash is integrated into a larger algorithm.
(image: H_latency_fixedS)

Latency on small data of random length [1-N]
(image: H_latency_randomS)

Cyan4973 changed the title from "Fake issue, just to store graphs" to "Just a place to store graphs and benchmark results" on Aug 16, 2019
@Cyan4973 (Owner) commented Aug 16, 2019

Collision study

This is the first test I'm aware of that can directly probe the collision efficiency of 64-bit hashes (32-bit collision efficiency was already well covered by SMHasher). Unfortunately, 128-bit collision analysis is still way off.

The test generates a massive number of hashes and counts the collisions produced. It doesn't come cheap: it requires a lot of time and a lot of RAM.
The generated input patterns are relatively close to each other (low Hamming distance), with only a few bits of difference. This is supposed to be more difficult for low-quality hashes.
The test indicates the number of collisions expected for the given amount of hashes. The measured number of collisions does not have to be strictly identical; it's enough to be in the vicinity.

Testing 64-bit hashes on large inputs:

An input of 256 bytes is expected to always trigger the "long" mode for hash algorithms featuring multiple modes depending on input length.

| Algorithm | Input Len | Nb Hashes | Expected Collisions | Measured Collisions | Notes |
|---|---|---|---|---|---|
| XXH3 | 256 | 100 Gi | 312.5 | 326 | |
| XXH64 | 256 | 100 Gi | 312.5 | 294 | |
| XXH128 low 64-bit | 512 | 100 Gi | 312.5 | 321 | |
| XXH128 high 64-bit | 512 | 100 Gi | 312.5 | 325 | |
| mmh64a | 256 | 100 Gi | 312.5 | 310 | |
| city64 | 256 | 100 Gi | 312.5 | 303 | |
| t1ha2 | 256 | 100 Gi | 312.5 | 761 | a bit too large; loses ~1 bit of distribution |
| mumv2 | 256 | 100 Gi | 312.5 | 1229 | a bit too large; loses ~2 bits of distribution |
| wyhash | 256 | 100 Gi | 312.5 | 1202 | a bit too large; loses ~2 bits of distribution |
| fnv64 | 256 | 100 Gi | 312.5 | 303 | very long test: 23h30, compared with 3h20 for XXH3 |
| blake2b low 64-bit | 256 | 100 Gi | 312.5 | 296 | very long test: 48h! |
| md5 low 64-bit | 256 | 100 Gi | 312.5 | 301 | very long test: 58h! |

t1ha2 and mumv2 both feature a collision rate which is not ideal: t1ha2's distribution is closer to 63 bits, and mumv2's is closer to 62 bits. Note though that it's not "horrible"; both can be considered "reasonable" for most practical usages.
wyhash, being an unofficial variation of mumv2, unsurprisingly inherits the same distribution issues.

Testing 64-bit non-portable hashes:

AES-based hashes rely on Intel's AES instruction extension, introduced alongside SSE4.2. They do not run on other platforms.

| Algorithm | Type | Input Len | Nb Hashes | Expected Collisions | Measured Collisions | Notes |
|---|---|---|---|---|---|---|
| meowhash low 64-bit | AES | 256 | 100 Gi | 312.5 | 18200 | pretty bad |
| aquahash low 64-bit | AES | 256 | 100 Gi | 312.5 | 3488136 | unusable |

Since AES implies cryptographic quality, one would logically expect it to also be good enough for dispersion and randomness. This assumption appears incorrect: AES-based hashes feature surprisingly bad collision properties.

The collision ratio is too high to recommend using these algorithms as checksums.

Testing 64-bit hashes on small inputs:

The test with 8-byte inputs and 8-byte outputs checks whether some "space reduction" happens at this length. If there is no space reduction, then the transformation is a bijection, which guarantees no collision between any two different inputs.

| Algorithm | Input Len | Nb Hashes | Expected Collisions | Measured Collisions | Notes |
|---|---|---|---|---|---|
| XXH3 | 8 | 100 Gi | 312.5 | 0 | XXH3 is bijective for len==8. It guarantees no collision. |
| XXH64 | 8 | 100 Gi | 312.5 | 0 | XXH64 is also bijective for len==8 |
| mmh64a | 8 | 100 Gi | 312.5 | 0 | bijective |
| city64 | 8 | 100 Gi | 312.5 | 278 | |
| t1ha2 | 8 | 100 Gi | 312.5 | 607 | a bit too large; loses ~1 bit of distribution |
| mumv2 | 8 | 100 Gi | 312.5 | 544 | a bit too large; loses ~1 bit of distribution |
| wyhash | 8 | 100 Gi | 312.5 | 583 | a bit too large; loses ~1 bit of distribution |
| fnv64 | 8 | 100 Gi | 312.5 | 26 | |

Testing 128-bit hashes:

The only acceptable score for these tests is always 0. If the hash algorithm offers 128 bits of dispersion, the probability of a single collision showing up is smaller than that of winning the national lottery twice in a row.
In contrast with the 64-bit tests, due to resource limitations, the test does not provide a precise 128-bit collision estimation. It merely eliminates hashes with fewer than ~80 bits of effective distribution.

| Algorithm | Input Len | Nb Hashes | Expected Collisions | Measured Collisions | Notes |
|---|---|---|---|---|---|
| XXH128 | 256 | 100 Gi | 0.0 | 0 | |
| cityhash128 | 256 | 100 Gi | 0.0 | 0 | |
| t1ha2 128-bit | 256 | 100 Gi | 0.0 | 14 | distribution equivalent to ~70 bits |

Tests on small input lengths:

| Algorithm | Input Len | Nb Hashes | Expected Collisions | Measured Collisions | Notes |
|---|---|---|---|---|---|
| XXH128 | 16 | 25 Gi | 0.0 | 0 | range 9-16 |
| XXH128 | 32 | 25 Gi | 0.0 | 0 | range 17-128 |
| XXH128 | 100 | 13 Gi | 0.0 | 0 | range 17-128 |
| XXH128 | 200 | 13 Gi | 0.0 | 0 | range 129-240 |

@hungptit

@Cyan4973 How can I reproduce your benchmark results? And if possible, could you add wyhash to your graphs? xxHash is faster than wyhash above a certain size, i.e. about 256 bytes in my tests; it would be very helpful to let users know about that.


@Cyan4973 (Owner)

@hungptit, I haven't open-sourced the collision analyzer yet. This effort still needs to be done. Maybe I'll add it to the repository later, typically under the /tests directory.

For the time being, I'm just populating a few test results, trying to complete a picture.
I will certainly add aquahash, and maybe wyhash if you feel that's interesting too.
It takes some time though, as it typically requires several hours per measurement, on a special server that can only be accessed from time to time.

@hungptit

@Cyan4973 I had planned to send you a PM to ask about suggestions for a collision analyzer, and luckily you pointed me to this thread. I am looking forward to your collision analyzer, and I assume other users are also eager to try it, given the number of years you have been working on these excellent hashing applications. I think wyhash is an excellent hash, and it would be helpful to users if you could include it in your benchmark. AquaHash (created by Andrew Rogers) is new, has not been through any serious battle yet, and is not portable, so you do not need to worry about it for now. I can help add it to your collision analyzer once you open-source it.

@Cyan4973 (Owner) commented Aug 21, 2019

I will move the graphs and tables into the "wiki" section, where they are more appropriate as documentation, I suppose.

For the time being, the first page is here:
https://github.com/Cyan4973/xxHash/wiki/Performance-comparison

I will continue with the collision analysis results.

@Cyan4973 (Owner)

Btw @hungptit, if you are interested in the performance graphs, as opposed to the collision analyzer, the benchmark tool is already available in the repository at /tests/bench.
It's designed for (relatively) easy integration of other hashes.

@Cyan4973 (Owner)

I completed a 64-bit collision test for some AES-based algorithms.
The results are devastating, especially for AquaHash. I cannot recommend using this hash for anything.

@hungptit commented Aug 27, 2019

@Cyan4973 It is bad news for me, because I may not be able to use AquaHash for my local cache :) However, it is also good news: you have confirmed that AquaHash may not be good enough for normal use cases, and I need to figure out something else. Do you have any results for wyhash and clhash? And how about other AES-based algorithms, such as Meow hash?

I look forward to using your collision analysis tool in the near future, so I can quickly evaluate any new hash algorithm, i.e. its performance and quality. Let me know if you need help with testing and/or evaluating your analyzer.

@Cyan4973 (Owner)

> Do you have any result for wyhash and clhash?

I've updated the table with results from more hashes.
clhash still needs to be done.

@hungptit

@Cyan4973 I think your benchmark + analysis would be enough to write an excellent article. I would love to read about how you compare hash functions mathematically, with some benchmark numbers :)

@easyaspi314 (Contributor)

Slight error: AES is SSE4.2.

@Cyan4973 (Owner) commented Aug 30, 2019

On the Intel Intrinsics Guide, it's not even listed as part of SSE: it seems to be in its own category:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=aes

That may be why it has its own switch, -maes.

That being said, it was released "at the same time as", say, SSE4.2, so it's easy to make the equivalence.

@svpv commented Sep 26, 2019

Was it meowhash 0.5 or a previous version?

@svpv commented Sep 26, 2019

The 1-bit loss in t1ha2 is mostly due to the wide multiplication used as one of the last steps in mixing the final result. The thing is, the operation of the form `lo, hi = acc * K; lo ^ hi` is not invertible. In this gist I studied 32x32 multiplication exhaustively. The collision rate is worse with `lo ^ hi` than with `lo + hi`, but `lo + hi` retains more bit-level correlations. On the other hand, wide multiplication can be viewed as a security primitive: it seems that you can't get back to the data other than by brute force, which takes about 2^32 or 2^64 steps (I tried the z3 solver, and indeed I couldn't).

@easyaspi314 (Contributor)

You don't need that much brute force to understand that.

Just print out a small table, as multiplication properties scale.

      add       |       xor      
0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0
0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7
0 2 4 6 1 3 5 7 | 0 2 4 6 1 3 5 7
0 3 6 2 5 0 4 7 | 0 3 6 0 5 6 0 7
0 4 1 5 2 6 3 7 | 0 4 1 5 2 6 3 7
0 5 3 0 6 4 1 7 | 0 5 3 6 6 2 5 7
0 6 5 4 3 1 0 7 | 0 6 5 0 3 5 0 7
0 7 7 7 7 7 7 7 | 0 7 7 7 7 7 7 7
0:   18 (28.1%)   19 (29.7%)
1:    5 (7.8%)     3 (4.7%)
2:    4 (6.2%)     4 (6.2%)
3:    6 (9.4%)     6 (9.4%)
4:    6 (9.4%)     3 (4.7%)
5:    6 (9.4%)     8 (12.5%)
6:    6 (9.4%)     8 (12.5%)
7:   13 (20.3%)   13 (20.3%)

You can pretty clearly see that xor breaks up some of the patterns, but it has more bias on the high bits because it saturates instead of overflowing. It is also less symmetrical.

That is the main reason for xor folding those high bits every so often. That spreads stuff to the lower bits which don't fare well.

@Cyan4973 (Owner) commented Oct 4, 2019

Good news: I can finally measure the collision ratio of 128-bit hash functions, so I will start updating the tables with these results.

The primary intention is to check that XXH128() delivers a perfect collision score (0) at multiple input lengths, to validate each internal variant.

The collision generator cannot reach the full 128-bit level exactly; that would require an insane amount of resources. But it can get close to ~80 bits, and that's enough to break some hash algorithms which announce 128 bits of variance. It's an interesting result, because such limitations were invisible until now.

@easyaspi314 (Contributor) commented Nov 3, 2019

By the way, I thought I said this before but I guess I didn't.

> AES implying cryptographic quality, one would logically expect that it's also good enough for dispersion and randomness. This assumption appears incorrect: AES based hashes feature surprisingly bad collision properties.

That is because AES isn't a hash function. It is an encryption algorithm.

Why AES sucks for hashing

Encryption is just a fancy permutation, and it is designed to be reversible.

Hashing is more than a fancy permutation, and it is designed to be irreversible and to highlight one-bit changes by changing basically most of the bits.

AES doesn't do that; a single round fails miserably at the last property. Here is how one AES round (from _mm_aesenc_si128) reacts to single-bit changes:

Key:
f181bfe03e942b6332fb552248a63051
e9254323fdb11aff3b13d720f18bbfe2 189ec62f3f0478a6a65b841445333341
e9354323fdb11aff3b13d720f18bbfe2 189ec62f3f0478a6a65b8414a57a9ae8
  ^^

7ec24a13eead45e9036831a9988292be c1bfa5fb939a5feb39d9a85616cbe16d
7ec24a13eead45e9236831a9988292be c1bfa5fb939a5feb8384f5b116cbe16d
                ^^

b6bb76d0bab1e391d1d41c90b6280f79 044ddb11131aebbaab64b775807416b8
b6bb76d0bab1e391d1d41c90b6280779 044ddb11a0d49609ab64b775807416b8
                            ^^

226720675023cf81cc68035f26c815e1 252beff88b231b9b3cb597b86c0a07d9
226720675023cf01cc68035f26c815e1 252beff88b231b9b4cc507586c0a07d9
              ^^

e59cf0c82dc29d4a52537261cce58333 b4119d68bd4afeda180beec0164733c9
a59cf0c82dc29d4a52537261cce58333 11ce4212bd4afeda180beec0164733c9
^^

eb8329ba04f622075548d06a2a0f4fc6 3ac8338147a19f4b2bdada7f272c021e
eb8329ba04f622075548d06a2a0f4f46 d4261a4647a19f4b2bdada7f272c021e
                              ^^

17f3c1b43d1408b075d495f9089be48b 08dda6414c151a2ab0211e279645dd15
17f3c1b43d1408b055d495f9089be48b 08dda6414c151a2a72407f849645dd15
                ^^

9e90bbc1971009677a3376c9f976e0ed cf18f1a5f2c13132baf1ec941d21b721
9e90bbc1971089677a3376c9f976e0ed cf18f1a5f2c13132baf1ec94bbd0e087
            ^^

As you can see, only 32 bits are affected, since the state is treated as separate lanes.

Source for output (not my best code lol)
// _mm_aesenc_si128 equivalent
// https://github.com/veorq/aesenc-noNI, modified to not overwrite
#include <stdint.h>

static const uint8_t sbox[256] =
{ 0x63, 0x7c, 0x77, 0x7b, 0xf2, 0x6b, 0x6f, 0xc5, 0x30, 0x01, 0x67, 0x2b, 0xfe,
  0xd7, 0xab, 0x76, 0xca, 0x82, 0xc9, 0x7d, 0xfa, 0x59, 0x47, 0xf0, 0xad, 0xd4,
  0xa2, 0xaf, 0x9c, 0xa4, 0x72, 0xc0, 0xb7, 0xfd, 0x93, 0x26, 0x36, 0x3f, 0xf7,
  0xcc, 0x34, 0xa5, 0xe5, 0xf1, 0x71, 0xd8, 0x31, 0x15, 0x04, 0xc7, 0x23, 0xc3,
  0x18, 0x96, 0x05, 0x9a, 0x07, 0x12, 0x80, 0xe2, 0xeb, 0x27, 0xb2, 0x75, 0x09,
  0x83, 0x2c, 0x1a, 0x1b, 0x6e, 0x5a, 0xa0, 0x52, 0x3b, 0xd6, 0xb3, 0x29, 0xe3,
  0x2f, 0x84, 0x53, 0xd1, 0x00, 0xed, 0x20, 0xfc, 0xb1, 0x5b, 0x6a, 0xcb, 0xbe,
  0x39, 0x4a, 0x4c, 0x58, 0xcf, 0xd0, 0xef, 0xaa, 0xfb, 0x43, 0x4d, 0x33, 0x85,
  0x45, 0xf9, 0x02, 0x7f, 0x50, 0x3c, 0x9f, 0xa8, 0x51, 0xa3, 0x40, 0x8f, 0x92,
  0x9d, 0x38, 0xf5, 0xbc, 0xb6, 0xda, 0x21, 0x10, 0xff, 0xf3, 0xd2, 0xcd, 0x0c,
  0x13, 0xec, 0x5f, 0x97, 0x44, 0x17, 0xc4, 0xa7, 0x7e, 0x3d, 0x64, 0x5d, 0x19,
  0x73, 0x60, 0x81, 0x4f, 0xdc, 0x22, 0x2a, 0x90, 0x88, 0x46, 0xee, 0xb8, 0x14,
  0xde, 0x5e, 0x0b, 0xdb, 0xe0, 0x32, 0x3a, 0x0a, 0x49, 0x06, 0x24, 0x5c, 0xc2,
  0xd3, 0xac, 0x62, 0x91, 0x95, 0xe4, 0x79, 0xe7, 0xc8, 0x37, 0x6d, 0x8d, 0xd5,
  0x4e, 0xa9, 0x6c, 0x56, 0xf4, 0xea, 0x65, 0x7a, 0xae, 0x08, 0xba, 0x78, 0x25,
  0x2e, 0x1c, 0xa6, 0xb4, 0xc6, 0xe8, 0xdd, 0x74, 0x1f, 0x4b, 0xbd, 0x8b, 0x8a,
  0x70, 0x3e, 0xb5, 0x66, 0x48, 0x03, 0xf6, 0x0e, 0x61, 0x35, 0x57, 0xb9, 0x86,
  0xc1, 0x1d, 0x9e, 0xe1, 0xf8, 0x98, 0x11, 0x69, 0xd9, 0x8e, 0x94, 0x9b, 0x1e,
  0x87, 0xe9, 0xce, 0x55, 0x28, 0xdf, 0x8c, 0xa1, 0x89, 0x0d, 0xbf, 0xe6, 0x42,
  0x68, 0x41, 0x99, 0x2d, 0x0f, 0xb0, 0x54, 0xbb, 0x16 };

#define XT(x) (((x) << 1) ^ ((((x) >> 7) & 1) * 0x1b))

void aesenc (uint8_t *out, const uint8_t *s, const uint8_t *rk) {
    uint8_t i, t, u, v[4][4];
    for (i = 0; i < 16; ++i) v[((i / 4) + 4 - (i%4) ) % 4][i % 4] = sbox[s[i]];
    for (i = 0; i < 4; ++i) {
        t = v[i][0];
        u = v[i][0] ^ v[i][1] ^ v[i][2] ^ v[i][3];
        v[i][0] ^= u ^ XT(v[i][0] ^ v[i][1]);
        v[i][1] ^= u ^ XT(v[i][1] ^ v[i][2]);
        v[i][2] ^= u ^ XT(v[i][2] ^ v[i][3]);
        v[i][3] ^= u ^ XT(v[i][3] ^ t);
    }
    for (i = 0; i < 16; ++i) out[i] = v[i / 4][i % 4] ^ rk[i];
}

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static void print_buf(const uint8_t *buf)
{
    for (int i = 0; i < 16; i++) {
        printf("%02x", buf[i]);
    }
}

int main(void)
{
    srand(time(NULL));
    uint8_t input[16], output[16], key[16];

    for (int i = 0; i < 16; i++) {
        key[i] = rand();
    }
    printf("Key:\n");
    print_buf(key);
    for (int j = 0; j < 8; j++) {
        for (int i = 0; i < 16; i++) {
            input[i] = rand();
        }
        printf("\n");
        print_buf(input);
        printf(" ");
        aesenc(output, input, key);
        print_buf(output);
        int mask = rand();
        input[(mask / 8) % 16] ^= (1 << (mask % 8));

        printf("\n");
        print_buf(input);
        printf(" ");
        aesenc(output, input, key);
        print_buf(output);
        printf("\n%*s\n", ((mask / 8) % 16) * 2 + 2, "^^");
    }
}

Additionally, AES cancels itself out:

aesenc(x, aesenc(x, y)) == y

While this is fine for encryption, since you only use the input once and the key is constant, in hashing this is a terrible property, since you can just erase the existence of something.

So if you had a hash that looked like this....

uint128_t hash(uint128_t seed, const uint8_t *input, size_t len)
{
    uint128_t acc = seed;
    for (size_t i = 0; i < len / 16; i++) {
        acc ^= aesenc(read128(input + 16 * i), acc);
    }
    if (len % 16) {
        acc ^= aesenc(readpartial128(input + (len & ~(size_t)15), len & 15), acc);
    }
    return acc;
}

You have 30 seconds to find at least 340282366920938463463374607431768211456 (2^128) length extension collisions with this:

00112233445566778899AABBCCDDEEFF

Lastly, AES is terribly slow if you don't have AES-NI or ARM Crypto. While AES-NI is pretty common, with all Intel chips >= Nehalem and AMD chips >= Bulldozer (iirc), the ARM Crypto extension is fairly recent and optional, and many manufacturers leave it out for cost and simplicity (see also: integer division in ARMv7).

And we have more than proven that you can write a good, fast hash function without any non-portable instructions.

Just see all xxHash variants and basically any scalar hash.

@easyaspi314 (Contributor)

Yawn, hopefully I'm back from my hiatus, and I got a new PC!

AMD Ryzen 5 3600
16 GB RAM
Windows 10
Clang 9.0.0 (from VS2019)

Scalar

xxhsum.exe 0.7.2 (64-bits x86_64 little endian), Clang 9.0.0 (tags/RELEASE_900/final), by Yann Collet
Sample of 100 KB...
XXH32                    :     102400 ->    18963 it/s ( 1851.9 MB/s)
XXH32 unaligned          :     102400 ->    18904 it/s ( 1846.1 MB/s)
XXH64                    :     102400 ->    36571 it/s ( 3571.4 MB/s)
XXH64 unaligned          :     102400 ->    36572 it/s ( 3571.5 MB/s)
XXH3_64b                 :     102400 ->    25869 it/s ( 2526.2 MB/s)
XXH3_64b unaligned       :     102400 ->    26034 it/s ( 2542.4 MB/s)
XXH3_64b seeded          :     102400 ->    25815 it/s ( 2521.0 MB/s)
XXH3_64b seeded unaligne :     102400 ->    26034 it/s ( 2542.4 MB/s)
XXH128                   :     102400 ->    26034 it/s ( 2542.4 MB/s)
XXH128 unaligned         :     102400 ->    25815 it/s ( 2521.0 MB/s)
XXH128 seeded            :     102400 ->    25815 it/s ( 2521.0 MB/s)
XXH128 seeded unaligned  :     102400 ->    25815 it/s ( 2521.0 MB/s)

SSE2

xxhsum.exe 0.7.2 (64-bits x86_64 + SSE2 little endian), Clang 9.0.0 (tags/RELEASE_900/final), by Yann Collet
Sample of 100 KB...
XXH32                    :     102400 ->    18963 it/s ( 1851.9 MB/s)
XXH32 unaligned          :     102400 ->    19081 it/s ( 1863.4 MB/s)
XXH64                    :     102400 ->    36830 it/s ( 3596.7 MB/s)
XXH64 unaligned          :     102400 ->    37012 it/s ( 3614.5 MB/s)
XXH3_64b                 :     102400 ->   307200 it/s (30000.0 MB/s)
XXH3_64b unaligned       :     102400 ->   307200 it/s (30000.0 MB/s)
XXH3_64b seeded          :     102400 ->   302602 it/s (29551.0 MB/s)
XXH3_64b seeded unaligne :     102400 ->   307200 it/s (30000.0 MB/s)
XXH128                   :     102400 ->   279273 it/s (27272.7 MB/s)
XXH128 unaligned         :     102400 ->   279273 it/s (27272.7 MB/s)
XXH128 seeded            :     102400 ->   279273 it/s (27272.7 MB/s)
XXH128 seeded unaligned  :     102400 ->   279273 it/s (27272.7 MB/s)

AVX2 (hot damn!)

xxhsum.exe 0.7.2 (64-bits x86_64 + AVX2 little endian), Clang 9.0.0 (tags/RELEASE_900/final), by Yann Collet
Sample of 100 KB...
XXH32                    :     102400 ->    19081 it/s ( 1863.4 MB/s)
XXH32 unaligned          :     102400 ->    19021 it/s ( 1857.5 MB/s)
XXH64                    :     102400 ->    37463 it/s ( 3658.5 MB/s)
XXH64 unaligned          :     102400 ->    37463 it/s ( 3658.5 MB/s)
XXH3_64b                 :     102400 ->   768000 it/s (75000.0 MB/s)
XXH3_64b unaligned       :     102400 ->   768000 it/s (75000.0 MB/s)
XXH3_64b seeded          :     102400 ->   768000 it/s (75000.0 MB/s)
XXH3_64b seeded unaligne :     102400 ->   614400 it/s (60000.0 MB/s)
XXH128                   :     102400 ->   614400 it/s (60000.0 MB/s)
XXH128 unaligned         :     102400 ->   614400 it/s (60000.0 MB/s)
XXH128 seeded            :     102400 ->   614400 it/s (60000.0 MB/s)
XXH128 seeded unaligned  :     102400 ->   614400 it/s (60000.0 MB/s)

@Cyan4973 (Owner)

Nice new config @easyaspi314!

An AMD platform is welcome, as it provides a nice comparison point to measurements on Intel ones!

Two details stand out:

  • The scalar code path seems very slow, including for legacy XXH32 and XXH64. That's really unexpected. I'm wondering if there is an explanation.
  • Some of the produced figures, especially on the vector side, feel too "round". I'm wondering if there is some sort of side effect of limited timer resolution. I'm going to look into the code.
    In the meantime, I suspect that benchHash would provide more reliable figures, as its timer logic is more sophisticated.

@easyaspi314 (Contributor) commented Dec 10, 2019

Oh oof, I just realized that CMake was on RelWithDebInfo, which does /Ob1.

That would explain it, since the read functions are just static.

I'll get a new bench later.

@Cyan4973 (Owner) commented Dec 10, 2019

I've also updated the internal benchmark function, so that the risk of producing a significant rounding error, due to a measurement period too small compared to the timer's resolution, is now better controlled.

@easyaspi314 (Contributor) commented Dec 11, 2019

Timer inaccuracy would explain a lot.
My old Dell would get 4800.0 on XXH3 and randomly jump to 9600.0.

Since it definitely wasn't going into super saiyan mode (Core 2 has no Turbo Boost), the best explanation is an inaccurate timer.

There were other places where I got an even multiple, like the POWER9 which got 30000.0.

@easyaspi314 (Contributor)

This is a lot more believable!
Scalar

xxhsum.exe 0.7.2 (64-bits x86_64 little endian), Clang 9.0.0 (tags/RELEASE_900/final), by Yann Collet
Sample of 100 KB...
XXH32                    :     102400 ->    80603 it/s ( 7871.4 MB/s)
XXH32 unaligned          :     102400 ->    80858 it/s ( 7896.3 MB/s)
XXH64                    :     102400 ->   161041 it/s (15726.6 MB/s)
XXH64 unaligned          :     102400 ->   160885 it/s (15711.4 MB/s)
XXH3_64b                 :     102400 ->   107670 it/s (10514.6 MB/s)
XXH3_64b unaligned       :     102400 ->   106251 it/s (10376.1 MB/s)
XXH3_64b seeded          :     102400 ->   108521 it/s (10597.8 MB/s)
XXH3_64b seeded unaligne :     102400 ->   105597 it/s (10312.2 MB/s)
XXH128                   :     102400 ->    97262 it/s ( 9498.3 MB/s)
XXH128 unaligned         :     102400 ->    97017 it/s ( 9474.3 MB/s)
XXH128 seeded            :     102400 ->    97256 it/s ( 9497.7 MB/s)
XXH128 seeded unaligned  :     102400 ->    96612 it/s ( 9434.8 MB/s)

SSE2

xxhsum.exe 0.7.2 (64-bits x86_64 + SSE2 little endian), Clang 9.0.0 (tags/RELEASE_900/final), by Yann Collet
Sample of 100 KB...
XXH32                    :     102400 ->    81174 it/s ( 7927.1 MB/s)
XXH32 unaligned          :     102400 ->    81086 it/s ( 7918.6 MB/s)
XXH64                    :     102400 ->   160899 it/s (15712.8 MB/s)
XXH64 unaligned          :     102400 ->   161041 it/s (15726.7 MB/s)
XXH3_64b                 :     102400 ->   310023 it/s (30275.7 MB/s)
XXH3_64b unaligned       :     102400 ->   307509 it/s (30030.2 MB/s)
XXH3_64b seeded          :     102400 ->   308900 it/s (30166.0 MB/s)
XXH3_64b seeded unaligne :     102400 ->   305375 it/s (29821.8 MB/s)
XXH128                   :     102400 ->   275146 it/s (26869.7 MB/s)
XXH128 unaligned         :     102400 ->   272461 it/s (26607.6 MB/s)
XXH128 seeded            :     102400 ->   274429 it/s (26799.7 MB/s)
XXH128 seeded unaligned  :     102400 ->   274066 it/s (26764.3 MB/s)

AVX2

xxhsum.exe 0.7.2 (64-bits x86_64 + AVX2 little endian), Clang 9.0.0 (tags/RELEASE_900/final), by Yann Collet
Sample of 100 KB...
XXH32                    :     102400 ->    81895 it/s ( 7997.6 MB/s)
XXH32 unaligned          :     102400 ->    81801 it/s ( 7988.4 MB/s)
XXH64                    :     102400 ->   162077 it/s (15827.8 MB/s)
XXH64 unaligned          :     102400 ->   161465 it/s (15768.1 MB/s)
XXH3_64b                 :     102400 ->   629577 it/s (61482.1 MB/s)
XXH3_64b unaligned       :     102400 ->   628594 it/s (61386.1 MB/s)
XXH3_64b seeded          :     102400 ->   628080 it/s (61335.9 MB/s)
XXH3_64b seeded unaligne :     102400 ->   626429 it/s (61174.7 MB/s)
XXH128                   :     102400 ->   573136 it/s (55970.3 MB/s)
XXH128 unaligned         :     102400 ->   573135 it/s (55970.2 MB/s)
XXH128 seeded            :     102400 ->   571006 it/s (55762.3 MB/s)
XXH128 seeded unaligned  :     102400 ->   569946 it/s (55658.8 MB/s)

Everything lines up properly now, and the results are a lot more accurate.

Especially scalar, which fits the "roughly 1.5x the speed of XXH32" pattern I have typically seen.

@easyaspi314 (Contributor) commented Dec 11, 2019

Here is x86 scalar (with XXH3_hashLong_internal_loop forced inline)

xxhsum.exe 0.7.2 (32-bits i386 little endian), Clang 9.0.0 (tags/RELEASE_900/final), by Yann Collet
Sample of 100 KB...
XXH32                    :     102400 ->    81161 it/s ( 7925.9 MB/s)
XXH32 unaligned          :     102400 ->    80282 it/s ( 7840.0 MB/s)
XXH64                    :     102400 ->    29157 it/s ( 2847.4 MB/s)
XXH64 unaligned          :     102400 ->    29171 it/s ( 2848.8 MB/s)
XXH3_64b                 :     102400 ->    59114 it/s ( 5772.9 MB/s)
XXH3_64b unaligned       :     102400 ->    58609 it/s ( 5723.5 MB/s)
XXH3_64b seeded          :     102400 ->    58959 it/s ( 5757.7 MB/s)
XXH3_64b seeded unaligne :     102400 ->    58092 it/s ( 5673.1 MB/s)
XXH128                   :     102400 ->    62074 it/s ( 6061.9 MB/s)
XXH128 unaligned         :     102400 ->    61971 it/s ( 6051.8 MB/s)
XXH128 seeded            :     102400 ->    61136 it/s ( 5970.3 MB/s)
XXH128 seeded unaligned  :     102400 ->    60714 it/s ( 5929.1 MB/s)

This also matches the "3/4 the speed of XXH32 on 32-bit" pattern. Everything seems to check out.

I say definitely remove this hack:

#if defined(__clang__) && (XXH_VECTOR==0) && !defined(__AVX2__) && !defined(__arm__) && !defined(__thumb__)
static void
#else
XXH_FORCE_INLINE void
#endif

@Cyan4973 (Owner)

I seem to remember that XXH3 used to be slightly faster than XXH32 on 32-bit x86 (scalar code path) (see #236 (comment)).

So, somehow, this property was lost during later updates?

@easyaspi314 (Contributor)

I do see a few tweaks that change the performance, such as putting the multiply over the block, but I am thinking that it is just that XXH32 is faster in relation (AMD seems to have a better ROL), or compiler differences (you used GCC, no?).

I am going to try in msys2 with GCC.

@Cyan4973 (Owner) commented Dec 11, 2019

I just tested it on my laptop, which features an i7-8559U, through an Ubuntu VM.

On the latest dev branch, XXH3 is indeed slower than XXH32 in x86 (32-bit) mode when using the gcc v7.4 compiler.

XXH32                    :     102400 ->    81824 it/s ( 7990.6 MB/s)
XXH64                    :     102400 ->    38282 it/s ( 3738.5 MB/s)
XXH3_64b                 :     102400 ->    64518 it/s ( 6300.6 MB/s)
XXH128                   :     102400 ->    58534 it/s ( 5716.2 MB/s)

However, if I use clang v6.0.0 instead, then XXH3_64b becomes slightly faster:

XXH32                    :     102400 ->    79887 it/s ( 7801.4 MB/s)
XXH64                    :     102400 ->    52088 it/s ( 5086.7 MB/s)
XXH3_64b                 :     102400 ->    88511 it/s ( 8643.6 MB/s)
XXH128                   :     102400 ->    68324 it/s ( 6672.2 MB/s)

Note: I have to force -DXXH_VECTOR=0 at compilation time; otherwise, clang just auto-vectorizes the code with SSE2, and it is no longer comparable.

@easyaspi314 (Contributor) commented Dec 11, 2019

C:\Users\husse\source\repos\xxHash\xxhsum.exe 0.7.2 (32-bits i386 little endian), GCC 9.2.0, by Yann Collet
Sample of 100 KB...
XXH32                    :     102400 ->    82080 it/s ( 8015.6 MB/s)
XXH32 unaligned          :     102400 ->    81773 it/s ( 7985.6 MB/s)
XXH64                    :     102400 ->    29787 it/s ( 2908.9 MB/s)
XXH64 unaligned          :     102400 ->    29940 it/s ( 2923.8 MB/s)
XXH3_64b                 :     102400 ->    93970 it/s ( 9176.8 MB/s)
XXH3_64b unaligned       :     102400 ->    94028 it/s ( 9182.4 MB/s)
XXH3_64b seeded          :     102400 ->    95259 it/s ( 9302.7 MB/s)
XXH3_64b seeded unaligne :     102400 ->    94179 it/s ( 9197.2 MB/s)
XXH128                   :     102400 ->    75474 it/s ( 7370.5 MB/s)
XXH128 unaligned         :     102400 ->    72546 it/s ( 7084.6 MB/s)
XXH128 seeded            :     102400 ->    74622 it/s ( 7287.3 MB/s)
XXH128 seeded unaligned  :     102400 ->    72019 it/s ( 7033.1 MB/s)

thonk

GCC is clearly cheating, there is no way it is generating better code than Clang 😛

This is with the swapped multiply:

    XXH_ALIGN(XXH_ACC_ALIGN) xxh_u64* const xacc = (xxh_u64*) acc;    /* presumed aligned on 32-bytes boundaries, little hint for the auto-vectorizer */
    const xxh_u8* const xinput = (const xxh_u8*) input;  /* no alignment restriction */
    const xxh_u8* const xsecret  = (const xxh_u8*) secret;   /* no alignment restriction */
    size_t i;
    XXH_ASSERT(((size_t)acc & (XXH_ACC_ALIGN-1)) == 0);
    for (i=0; i < ACC_NB; i++) {
        xxh_u64 const data_val = XXH_readLE64(xinput + 8*i);
        xxh_u64 const data_key = data_val ^ XXH_readLE64(xsecret + i*8);
        xacc[i] += XXH_mult32to64(data_key & 0xFFFFFFFF, data_key >> 32);
        if (accWidth == XXH3_acc_64bits) {
            xacc[i] += data_val;
        } else {
            xacc[i ^ 1] += data_val; /* swap adjacent lanes */
        }
    }

@Cyan4973 (Owner)

It's funny that we get the gcc / clang speed difference reversed on our respective platforms...

@easyaspi314 (Contributor) commented Dec 11, 2019

Well, you are using Clang 6.0; perhaps there was a regression…

Are you using -mno-sse2?

IIRC with SSE2+XXH_VECTOR=0, Clang will half vectorize it.

@Cyan4973 (Owner)

Indeed.

clang v6.0.0 is the default version provided with my Ubuntu VM.
However, it's possible to install and test up to v8.0.0.

So I benchmarked xxh with clang v8.0.0, and indeed, performance is worse than with v6.0.0.

XXH32                    :     102400 ->    82794 it/s ( 8085.3 MB/s)
XXH64                    :     102400 ->    33493 it/s ( 3270.8 MB/s)
XXH3_64b                 :     102400 ->    55674 it/s ( 5437.0 MB/s)
XXH128                   :     102400 ->    45227 it/s ( 4416.7 MB/s)

> Are you using -mno-sse2?

That was it. I was only setting -DXXH_VECTOR=0.
After adding the -mno-sse2 flag, performance of clang v6.0.0 is now in line with v8.0.0:

./xxhsum32 0.7.2 (32-bits i386 little endian), Clang 6.0.0 (tags/RELEASE_600/final), by Yann Collet
XXH32                    :     102400 ->    79827 it/s ( 7795.6 MB/s)
XXH64                    :     102400 ->    37508 it/s ( 3662.9 MB/s)
XXH3_64b                 :     102400 ->    55968 it/s ( 5465.6 MB/s)
XXH128                   :     102400 ->    45588 it/s ( 4452.0 MB/s)

@easyaspi314 (Contributor)

I'm really curious now.

I mean, scalar x86 isn't that important since SSE2 is safe to assume, but the fact that GCC is generating significantly faster code piques my interest.

@hungptit commented Dec 11, 2019

@easyaspi314 gcc is better than clang at some optimizations. If you try `double y = 1.0 * x` in Compiler Explorer, you will see that gcc is smarter than both clang and msvc.

@easyaspi314 (Contributor)

/r/whoosh

@easyaspi314 (Contributor) commented Dec 11, 2019

We know that compilers optimize differently, and we have plenty of ugly hacks to correct misoptimizations/dumb expansions.

I was mostly making the joke because Clang is usually better at this stuff, and it is my favorite compiler.


4 participants