Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggestions list for future evolutions #458

Open
Cyan4973 opened this issue Oct 1, 2020 · 17 comments
Open

Suggestions list for future evolutions #458

Cyan4973 opened this issue Oct 1, 2020 · 17 comments
Assignees

Comments

@Cyan4973
Copy link
Owner

Cyan4973 commented Oct 1, 2020

List updated June 30th 2023 :

Objectives for v0.8.2 - Completed

Objectives for v0.8.3

Objectives for v0.9.0

  • Stabilize XXH_generateSecret()
  • Remove legacy XXH_OLDNAME (or maybe v1.0 ?)
  • Split xxhash.h into multiple smaller files, and ship an ability to create a single amalgamated file from them ? (tentative)
    • Complete that with an ability to be selective about which part of the library is actually included in the amalgamated artifact, for source and binary size control.
    • provide a more direct way to selectively compile each algorithm, such as XXH_ENABLE_XXH32, XXH_ENABLE_XXH64 and XXH_ENABLE_XXH3
      • Currently, it's possible to compile only XXH32 (with XXH_NO_LONG_LONG), or XXH32 + XXH64 (with XXH_NO_XXH3) or everything. It's not possible to compile selectively only XXH64 for example.
  • Ship libxxhash dynamic library with runtime vector extension detection enabled by default (xxh_x64dispatch) ? (tentative)
    • Note : already possible on demand, by providing DISPATCH=1 makefile parameter. In which case, it unconditionally adds xxh_x86dispatch.c to the unit list, which crashes on non-x86 targets. Making this option default requires a safe capability to detect x86 targets at build or compilation time.
    • But in this case, the objective would be to remain transparent for user applications. DISPATCH must therefore substitute its symbols to original ones.
@easyaspi314
Copy link
Contributor

easyaspi314 commented Oct 1, 2020

Another one I was thinking was an option to disable streaming.

The streaming API takes up a good chunk of the binary size:

Clang 10.0.1, aarch64, Termux

$ clang -Oz -shared -fPIC xxhash.c -s -o libxxh_stream.so
$ clang -Oz -shared -fPIC xxhash.c -s -o libxxh_nostream.so -DXXH_NO_STREAM
$ size -A *.so | grep -E '(:|Total)'
libxxh_nostream.so  :
Total                 7219
libxxh_stream.so  :
Total                 12848

Edit: -O3:

libxxh_nostream.so  :
Total                 21966
libxxh_stream.so  :
Total                 35067

@easyaspi314
Copy link
Contributor

easyaspi314 commented Oct 3, 2020

Another cheat optimization is to use __attribute__((__pure__)) on as many xxHash functions as possible.

It says, roughly:

  1. This function has no side effects
  2. This function will return the same value given that the parameters, memory pointed to by the parameters, and global variables are not modified.
  3. This function can safely have repeated calls eliminated by the compiler as long as nothing else changes.

Basically, __pure__ is the magic behind the strlen optimization. (Excluding the second optimization where the compiler treats it as a built-in function and inlines/const props it)

const char *bad_strchr(const char *s, int c)
{
    for (size_t i = 0; i < strlen(s) /* EEK */; i++) {
        if ((unsigned char)s[i] == (unsigned char)c) {
            return &s[i];
        }
    }
    return NULL;
}

Any decent compiler will change it to this:

const char *bad_strchr(const char *s, int c)
{
    const size_t len = strlen(s); // strlen is pure, we only need to call it once
    for (size_t i = 0; i < len; i++) {
        if ((unsigned char)s[i] == (unsigned char)c) {
            return &s[i];
        }
    }
    return NULL;
}

Although it is easiest to see with this code:

size_t strlenx2(const char *s)
{
     return strlen(s) + strlen(s);
}

Equivalent code with how Clang shuffles registers on x86_64:

// optimized
size_t strlenx2(const char *s)
{
    size_t len = strlen(s);
    len += len;
    return len;
}
// unoptimized
size_t strlenx2(const char *s)
{
    size_t tmp_len; // r14
    const char *tmp_s; // rbx
    tmp_s = s;
    size_t len = strlen(s);
    tmp_len = len;
    s = tmp_s;
    len = strlen(s);
    len += tmp_len;
    return len;
}

This could possibly improve performance on some hash tables depending on how they are used. Primarily thinking of code that looks like this:

    table[key].foo = "Foo";
    table[key].bar = "Bar";

Note that the compiler sometimes can figure this out on its own if xxHash is inlined, but this applies to both inline and extern functions.

I would have an option added to explicitly disable it for a fair benchmark.

Other ideas:

  • add the overloads to allow at least XXH3_64bits to drop in as a replacement for common std::hash uses in C++
  • Add type safe overload inlines for C++, which can be opted out of
    • Checks if the pointed objects are sane types to hash (std::is_trivial, std::has_unique_object_representations in C++17, check len % sizeof(T))
  • add a noexcept/__attribute__((__nothrow__)) specifier since the single shot functions will never throw an exception allowing the compiler to leave out unwind tables and shrink code size a bit.

@Cyan4973
Copy link
Owner Author

Cyan4973 commented Oct 3, 2020

Yes, I like pure functions, so I'm all for it.
Also note the existence of const functions in gcc, with an even stricter set of restrictions.

In general, I would expect -O3 to spend enough cpu to discover which functions are pure, so it's unclear if xxhash will receive a measurable boost to performance with these function qualifiers,
that being said, I like them even if the only impact is to provide better code documentation.
Also, in the context of library linking, this is an information that the linker can't guess from just the function signature, so it could end up being actually useful on the user side.

@easyaspi314
Copy link
Contributor

In general, I would expect -O3 to spend enough cpu to discover which functions are pure, so it's unclear if xxhash will receive a measurable boost to performance with these function qualifiers.

It appears that with XXH_INLINE_ALL, Clang and GCC can't tell that XXH3 is pure without the annotation, but it can figure out XXH32 and XXH64.

@easyaspi314
Copy link
Contributor

easyaspi314 commented Oct 10, 2020

A few months ago, you mentioned how XXH3 is threadable. Obviously this would be an opt-in feature, as some programs like compilers (which I know at least Clang and GNU LD use XXH64) are designed to remain on a single thread to parallelize with make.

With some experimentation, it seems to be beneficial to spawn a second thread once you get to ~8-16MB.

On my phone (haven't tested on my PC yet because I have yet to master Windows threads), 6-8 MB seems to be the range where it is beneficial, with a max speed of 7.3 GB/s compared to 5.2 GB/s on one thread.

The implementation would be pretty simple; the most complicated thing here is dealing with the pthread struct and the accumulate loop which should probably be outlined to its own function to avoid copypasta.

I believe we can do a similar thing with CreateThread on Windows.

Note that this would technically conflict with the __attribute__((__pure__)) idea, although if we talk about the end effects, it will still be pure.

Code I was testing
#ifdef XXH_PTHREADS
#include <pthread.h>
/*
 * A data block for pthreads.
 * XXX: Some of these fields can probably be removed.
 */
typedef struct {
    xxh_u64 acc[XXH_ACC_NB];
    const xxh_u8 *input;
    const xxh_u8 *secret;
    size_t secretSize;
    size_t nbStripes;
    size_t block_len;
    size_t nb_blocks;
    XXH3_f_accumulate_512 f_acc512;
    XXH3_f_scrambleAcc f_scramble;
} XXH3_pthreadData_t;

/*
 * The length at which xxHash should spawn a pthread.
 * Spawning threads has significant overhead and is a waste
 * of time for short hashes. 
 *
 * The most optimal value varies significantly on the platform.
 */
#ifndef XXH_PTHREADS_THRESHOLD
#  define XXH_PTHREADS_THRESHOLD (8 * 1048576U) /* 8 MiB */
#endif

XXH_NO_INLINE
void* XXH3_accumulate_pthread(void* d)
{
    XXH3_pthreadData_t* data = (XXH3_pthreadData_t *)d;
    size_t n;
    /* TODO: DRY, this loop is copied multiple times. */
    for (n = 0; n < data->nb_blocks; n++) {
        XXH3_accumulate(data->acc, data->input + n*data->block_len, data->secret, data->nbStripes, data->f_acc512);
        (data->f_scramble)(data->acc, data->secret + data->secretSize - XXH_STRIPE_LEN);
    }
    pthread_exit((void*)data);
}
#endif

/* Note: move XXH_alignedMalloc/XXH_alignedFree above this or give protos */

XXH_FORCE_INLINE void
XXH3_hashLong_internal_loop(xxh_u64* XXH_RESTRICT acc,
                      const xxh_u8* XXH_RESTRICT input, size_t len,
                      const xxh_u8* XXH_RESTRICT secret, size_t secretSize,
                            XXH3_f_accumulate_512 f_acc512,
                            XXH3_f_scrambleAcc f_scramble)
{
    size_t const nbStripesPerBlock = (secretSize - XXH_STRIPE_LEN) / XXH_SECRET_CONSUME_RATE;
    size_t const block_len = XXH_STRIPE_LEN * nbStripesPerBlock;
    size_t const nb_blocks = (len - 1) / block_len;

    size_t n = 0;

    XXH_ASSERT(secretSize >= XXH3_SECRET_SIZE_MIN);
#ifdef XXH_PTHREADS
    /*
     * XXH3 operates in blocks which are added together.
     *
     * Normally, this is constantly added to the acc array on the fly, like so;
     *   acc = acc + sum[0->N] { accumulate(N) };
     *
     * Due to the properties of addition, we can actually calculate blocks in
     * parallel if we start with a second acc starting zeroed:
     *   acc = (acc + sum[0->N/2] { accumulate(N) })
     *       + (  0 + sum[N/2->N] { accumulate(N) })
     *
     * Spawning a single pthread to process half of the data is
     * beneficial after about 8 MiB (*very* platform dependent).
     * There is not much use in spawning any more threads; it already takes
     * hundreds of thousands of iterations for there to be a benefit,
     * and it would get very messy and add even more overhead.
     */
    if (len >= XXH_PTHREADS_THRESHOLD) {
        /*
         * Using malloc is faster for some reason. Likely aliasing rules.
         */
        XXH3_pthreadData_t* threadData = (XXH3_pthreadData_t*)XXH_alignedMalloc(sizeof(XXH3_pthreadData_t), 64);
        /*
         * If malloc succeeds, try to start a thread, otherwise fall back to
         * the single threaded loop after the #endif.
         */
        if (threadData != NULL) {
            pthread_t thread;
            int threadLaunched;
            void* status = (void *)threadData;

            /* Fill the struct and set it to process the second half. */
            memset(threadData->acc, 0, sizeof(threadData->acc));
            threadData->input = input + ((nb_blocks - (nb_blocks / 2)) * block_len);
            threadData->secret = secret;
            threadData->secretSize = secretSize;
            threadData->nbStripes = nbStripesPerBlock;
            threadData->nb_blocks = nb_blocks - (nb_blocks / 2);
            threadData->f_acc512 = f_acc512;
            threadData->f_scramble = f_scramble;

            /*
             * Launch the thread on the second half of the input.
             *
             * We don't care about whether it actually started until later.
             */
            threadLaunched = !pthread_create(&thread, NULL, &XXH3_accumulate_pthread, status);

            /* Process the first half on the main thread */
            for (; n < nb_blocks / 2; n++) {
                XXH3_accumulate(acc, input + n*block_len, secret, nbStripesPerBlock, f_acc512);
                f_scramble(acc, secret + secretSize - XXH_STRIPE_LEN);
            }

            /*
             * If we have launched our thread, finish it, merge its acc with
             " the main thread's acc and mark the section as completed.
             * If it failed, we process the second half after the #endif
             * on the main thread. 
             */
            if (threadLaunched && !pthread_join(thread, &status)) {
                size_t i;
                /* Merge the acc fragments */
                for (i = 0; i < XXH_ACC_NB; i++) {
                    /* associative property */
                    acc[i] += threadData->acc[i];
                }
                /* Mark that the thread successfully processed the second half */
                n += threadData->nb_blocks;
            }
            XXH_alignedFree(threadData);
        }
    }
#endif /* XXH_PTHREADS */
    for (; n < nb_blocks; n++) {
        XXH3_accumulate(acc, input + n*block_len, secret, nbStripesPerBlock, f_acc512);
        f_scramble(acc, secret + secretSize - XXH_STRIPE_LEN);
    }
    /* ... */
}

As I mention in the comment, I don't see any reason to spawn more than one helper thread, as we waste hundreds of thousands of possible accumulate loops by setting up each pthread, meaning 4 threads would likely require a ridiculous 64-128 MB and a much more complicated error handling routine.

@easyaspi314
Copy link
Contributor

easyaspi314 commented Oct 15, 2020

So I was wondering if we should start doing Doxygen? We don't necessarily have to set up a server for it.

Especially since xxhash.h is massive now, having a little Doxygen site might help and would probably be easier than writing xxh3_spec.md

It also gives us some opportunity to document the internals because we can group them.

Here are some examples:

XXH64 doxygen 2
xxh_force_memory_access Doxygen 4

Also, didn't we plan on switching XXH64_hash_t to unsigned long on 64-bit Linux?

@Cyan4973
Copy link
Owner Author

Cyan4973 commented Oct 15, 2020

I was wondering if we should start doing Doxygen?

Yes, that's a good idea.
Moving code comments to Doxygen parsing convention can be done progressively.

didn't we plan on switching XXH64_hash_t to unsigned long on 64-bit Linux?

I don't see a benefit in such a change

@easyaspi314
Copy link
Contributor

easyaspi314 commented Oct 15, 2020

I don't see a benefit in such a change

uint64_t is unsigned long on LP64, so it would be consistent.

@easyaspi314
Copy link
Contributor

easyaspi314 commented Oct 15, 2020

$ cat test.c
#include <xxhash.h>
#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    printf("%#016" PRIx64 "\n", XXH64("hello", 5, 0));
    return 0;
}
$ gcc -std=gnu99 -O2 -Wall -c test.c -I xxHash
// Ok
$ g++ -std=gnu++11 -O2 -Wall -c -xc++ test.c -I xxHash
// Ok
$ gcc -std=gnu90 -O2 -Wall -c test.c -I xxHash -Wno-long-long
test.c: In function 'main':
test.c:7:12: warning: format '%lx' expects argument of type 'long unsigned int', but argument 2 has type 'XXH64_hash_t' {aka 'long long unsigned int'} [-Wformat=]
    7 |     printf("%#016" PRIx64 "\n", XXH64("Hello", 5, 0));
      |            ^~~~~~~              ~~~~~~~~~~~~~~~~~~~~
      |                                 |
      |                                 XXH64_hash_t {aka long long unsigned int}
In file included from test.c:3:
/usr/include/inttypes.h:127:34: note: format string is defined here
  127 | #define PRIx64   __PRI_64_prefix"x"  /* uint64_t */
$

In gnu90 mode, -Wformat fires off because on LP64, PRIx64 is "lx".

The reverse is true if you do "%llx", it will fire a warning on C++ and C99.

@easyaspi314
Copy link
Contributor

Some Doxygen documentation added in #462

@Cyan4973
Copy link
Owner Author

Cyan4973 commented Aug 9, 2021

Declaring relevant functions as pure and const would be a good follow up.

@easyaspi314
Copy link
Contributor

easyaspi314 commented Dec 14, 2021

  1. XXH_SIZEOPT config option
  • ==0: normal
  • ==1: Disables forceinline and manual unrolling
  • ==2: Reuse streaming API for single shot, other dirty size hacks?
  1. Potential speed boost: We can hash multiple acc blocks at a time until a scramble. So something like this:
accumulate_doubleAcc()
{
    xxh_u64 acc2[8] = {0};
    size_t n = 0;
    // alternative: duffs device but that might harm
    // interleaving
    if (nbStripes % 2 == 1) {
        accumulate_512(acc2);
        n++;
    }
    while (n < nbStripes) {
        accumulate_512(acc);
        n++;
        accumulate_512(acc2);
        n++;
    }
    for (n = 0; n < 8; n++) {
        acc[n] += acc2[n];
    }
}

I didn't see any benefit on NEON AArch64 no matter how many NEON lanes I chose, and ARMv7 and SSE2 don't have enough registers.

However, I think that AVX2 and AVX512 would likely benefit since their loops are tighter. I will benchmark on Ryzen when I get a chance.

Edit: Clang has no difference on Ryzen 5 3600 and GCC clearly gets confused.

@Cyan4973 Cyan4973 changed the title Suggestions list Suggestions list for future evolutions May 6, 2022
@Cyan4973
Copy link
Owner Author

Updated list of objectives for v0.8.2

@Cyan4973 Cyan4973 self-assigned this Jul 12, 2023
@Cyan4973
Copy link
Owner Author

I was considering (as a tentative objective) shipping DISPATCH=1 by default for the generation of xxhsum CLI, but there is still a significant amount of work to do to reach this stage safely, and I'm concerned it would delay v0.8.2 release by too long.
So, instead, I've pushed this objective to an hypothetical v0.8.3 future release.
(will be folded into v0.9.0 if need be).

@t-mat
Copy link
Contributor

t-mat commented Jul 12, 2023

As for XXH_OLD_NAMES, should we add deprecation message for it in v0.8.2?

@Cyan4973
Copy link
Owner Author

As for XXH_OLD_NAMES, should we add deprecation message for it in v0.8.2?

This seems like a good idea.
If I understand properly, XXH_OLD_NAMES is disabled by default, so the warning could be triggered just on detecting if it's set.

@t-mat
Copy link
Contributor

t-mat commented Jul 15, 2023

We can add "Milestone" via right side pane of issue view. And we also can assign/reuse it to other issues to indicate the milestone. It'll be useful at the future review of issues.
For example, we can assign v0.9.0 milestone to #483.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants