__uint128_t is not portable to 32-bit #170
Yes, you are right @easyaspi314, and this is a known issue.
Actually, the scalar one doesn't look too bad. GCC still hates the code and insists on using the stack (or an ugly partial vectorization for the first one), but Clang emits clean code for every target. It emits the best code on ARMv7, mainly because of the accumulate instructions:
```c
/* Models ARMv7's UMAAL instruction: {RdLo, RdHi} = Rn * Rm + RdLo + RdHi */
void umaal(uint32_t *RdLo, uint32_t *RdHi, uint32_t Rn, uint32_t Rm)
{
    uint64_t prod = (uint64_t)Rn * (uint64_t)Rm;
    prod += (uint64_t)*RdLo;
    prod += (uint64_t)*RdHi;
    *RdLo = (uint32_t)(prod & 0xFFFFFFFF);
    *RdHi = (uint32_t)(prod >> 32);
}
```

Here's a comparison of a few implementations I found online. The second one might be better because GCC doesn't vectorize it, and it isn't terrible on ARM (it only adds an extra instruction). On MSVC, we want to use the intrinsic instead of the cast, because MSVC is stupid and will try to do a full 64-bit multiply.
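For anyone following along, the idea behind the accumulate trick generalizes: a full 64×64→128 multiply can be built from four 32×32→64 partial products, and the two 64-bit halves then folded together with an addition. Here is a hedged portable sketch of that (my own illustration, not the actual patch; the function name is made up):

```c
#include <stdint.h>

/* Illustrative sketch: 64x64 -> 128 multiply via four 32x32 -> 64
 * partial products, then fold the two 64-bit halves with an addition.
 * Not the actual patch; name is hypothetical. */
static uint64_t mul128_fold64_portable(uint64_t lhs, uint64_t rhs)
{
    uint64_t lo_lo = (uint64_t)(uint32_t)lhs * (uint32_t)rhs;
    uint64_t hi_lo = (lhs >> 32) * (uint32_t)rhs;
    uint64_t lo_hi = (uint64_t)(uint32_t)lhs * (rhs >> 32);
    uint64_t hi_hi = (lhs >> 32) * (rhs >> 32);

    /* Sum the middle column; this cannot overflow 64 bits. */
    uint64_t cross = (lo_lo >> 32) + (uint32_t)hi_lo + lo_hi;
    uint64_t upper = (hi_lo >> 32) + (cross >> 32) + hi_hi;
    uint64_t lower = (cross << 32) | (uint32_t)lo_lo;

    /* Fold the 128-bit product {upper, lower} into 64 bits. */
    return lower + upper;
}
```

The carry handling is the part compilers struggle to recognize, which is why the hand-written `umaal` sequence wins on ARMv7.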
I've uploaded it. It also contains a dedicated path for ARM.
https://gcc.godbolt.org/z/PRtMJy

Boom. Money. 37-48 instructions on x86, 8-9 instructions on ARMv6+. I had to use inline assembly for ARM because I couldn't get Clang or GCC to emit what I was asking for:

```asm
umull r12, lr, r0, r2 @ {r12, lr} = (U64)r0 * (U64)r2
mov   r5, #0          @ r5 = 0
mov   r4, #0          @ r4 = 0
umaal r5, lr, r1, r2  @ {r5, lr} = ((U64)r1 * (U64)r2) + r5 + lr
umaal r5, r4, r0, r3  @ {r5, r4} = ((U64)r0 * (U64)r3) + r5 + r4
umaal lr, r4, r1, r3  @ {lr, r4} = ((U64)r1 * (U64)r3) + lr + r4
adds  r0, lr, r12     @ <-.
                      @ {r0, r1} = (U64){lr, r4} + (U64){r12, r5}
adc   r1, r4, r5      @ <-'
```

I don't think I can optimize it more than that; it's only 8 instructions. I still think the x86 version can be optimized more, although it might just be the fixed output registers that cause all that shuffling. It actually runs quite fast:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    uint32_t val32[4];
    uint64_t *val = (uint64_t *)val32; /* view the buffer as two 64-bit values */
    uint64_t sum = 0;
    srand(0);
    double start = (double)clock();
    for (int i = 0; i < 10000000; i++) {
        val32[0] = rand();
        val32[1] = rand();
        val32[2] = rand();
        val32[3] = rand();
        sum += XXH_mul128AndFold_32(val[0], val[1]);
    }
    double end = (double)clock();
    printf("%llu %lfs\n", (unsigned long long)sum, (end - start) / CLOCKS_PER_SEC);
    return 0;
}
```
For comparison, this is in 64-bit mode with
Only 35% slower, considering how much more work it does.

Side note: on aarch64, Clang optimizes the multiply-and-fold down to two instructions:

```asm
umulh x8, x1, x0     // x8 = ((__uint128_t)x1 * x0) >> 64
madd  x0, x1, x0, x8 // x0 = (x1 * x0) + x8
```
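For reference, those two instructions correspond to this 64-bit-only C, sketched here assuming GCC/Clang's `unsigned __int128` (the function name is made up):

```c
#include <stdint.h>

/* 64-bit-only sketch of what Clang lowers to umulh + madd:
 * fold the 128-bit product by adding its high half to its low half.
 * Hypothetical name, for illustration only. */
static uint64_t mul128_fold64_u128(uint64_t a, uint64_t b)
{
    unsigned __int128 product = (unsigned __int128)a * b;
    /* umulh yields product >> 64; madd adds it to the low 64 bits */
    return (uint64_t)product + (uint64_t)(product >> 64);
}
```

This is exactly the shape that can't be expressed portably on 32-bit targets, hence this issue.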
Looks great!
It's definitely slower on ARM despite using significantly fewer instructions, but that is mainly because the rand() call is pretty sophisticated on Bionic.
Replace it with uninitialised data, force it in a register, and use that instead.
I noticed that you added `__uint128_t` to the xxh3. That is not portable to 32-bit; the type is only defined on 64-bit targets, and as you can see below, for a pretty decent reason. I suggest this:
However, I do want to mention that GCC doesn't really like this code, and it half-vectorizes it.
I'm trying out a potential SSE2/NEON32 version, though, which may turn out faster or slower. The first part I got from MSDN, but the second part confuses me.
If I do it with vector extensions, Clang gives me this:
However, in the middle it switches to scalar. I kinda wish I had NEON's half registers…
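In case it helps anyone reproduce this: by "vector extensions" I mean GCC/Clang generic vectors, where masking both operands of a 64-bit lane multiply down to their low 32 bits is the pattern compilers can lower to a single widening multiply (SSE2's `pmuludq`). A small made-up sketch, not the actual patch:

```c
#include <stdint.h>

/* GCC/Clang generic vector type (hypothetical name, illustration only). */
typedef uint64_t u64x2 __attribute__((vector_size(16)));

/* Masking both operands to their low 32 bits lets the compiler emit
 * a single 32x32 -> 64 widening multiply per lane (pmuludq on SSE2). */
static u64x2 mul_even_lanes(u64x2 a, u64x2 b)
{
    return (a & 0xFFFFFFFFULL) * (b & 0xFFFFFFFFULL);
}
```

The scalar fallback in the middle is presumably where Clang gives up on keeping everything in vector registers.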