Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] cache secret array in assembly code for aarch64 SVE #693

Open
hzhuang1 opened this issue Mar 14, 2022 · 2 comments
Open

[RFC] cache secret array in assembly code for aarch64 SVE #693

hzhuang1 opened this issue Mar 14, 2022 · 2 comments

Comments

@hzhuang1
Copy link
Contributor

hzhuang1 commented Mar 14, 2022

@Cyan4973 @easyaspi314
I try to cache a whole secret array in assembly code. And the performance is better. The maximum performance data could read 18xxx. How do you think about this idea? (https://github.com/hzhuang1/xxHash/tree/dirty_v0.2.1)
Of course, current code is still ugly. Let's discuss the idea first.

=== benchmarking 4 hash functions ===
benchmarking large inputs : from 512 bytes (log9) to 128 MB (log27)
xxh3 , 3516, 6130, 9465, 12771, 15576, 17283, 18393, 14563, 13763, 13596, 13689, 13672, 13775, 12043, 5684, 4912, 4834, 4923, 4874
XXH32 , 1338, 1436, 1486, 1521, 1534, 1542, 1546, 1536, 1505, 1507, 1507, 1507, 1508, 1458, 1264, 1206, 1213, 1212, 1208
XXH64 , 2509, 2799, 2973, 3066, 3122, 3102, 3154, 3136, 3057, 3063, 3066, 3066, 3064, 2899, 2197, 2005, 2009, 2007, 1993
XXH128 , 3392, 5864, 9421, 12495, 15354, 17187, 18306, 14565, 13733, 13822, 14105, 14172, 14208, 12229, 5779, 5016, 4898, 4898, 4851
Throughput small inputs of fixed size (from 2 to 2 bytes):
xxh3 , 72821732

@easyaspi314
Copy link
Contributor

easyaspi314 commented Mar 14, 2022

Ok I'm having a little trouble following the assembly code (I'll need some time to digest it 😅), but by caching do you mean something like a ring buffer?

void XXH3_accumulate(...)
{
    u64 secret_cache[8];
    for (int i = 0; i < 8; i++) {
        secret_cache[i] = XXH_read64(secret);
        secret += XXH_SECRET_CONSUME_RATE;
    }
    int cache_idx = 0;
    for (int i = 0; i < nbStripes; i++) {
        for (int j = 0; j < 8; j++) {
            XXH_accumulate_512_round(..., secret_cache[(cache_idx + j) % 8];
        }
        secret_cache[cache_idx] = XXH_read64(secret);
        secret += XXH_SECRET_CONSUME_RATE;
        cache_idx = (cache_idx + 1) % 8;
        ...
    }
}

Edit I see it now.

@hzhuang1
Copy link
Contributor Author

Ok I'm having a little trouble following the assembly code (I'll need some time to digest it sweat_smile), but by caching do you mean something like a ring buffer?

No, it's not a ring buffer. There're 31 Z registers in SVE. So use some to them to store the value of secret array.
The logic is that secret array is shared among accumulating and scrambling. SVE costs a lot on accessing I/O. If I could make accessing I/O only once, it could improve performance. And it's proved by the performance result.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants