XXH3 performance issue on ARM & WASM with >128byte inputs #211
Thanks @aras-p, that's very instructive. 128 bytes is the limit between the "small input" and "large input" modes. However, WebAssembly is unlikely to be able to generate vector instructions, at least not yet. The rewritten "scalar" mode is expected to perform as well or better.
It is triggered properly, so 🤷‍♀️. I'll test the dev branch soon.
Thanks for the macro suggestion @aras-p, it will be very useful for Visual. Indeed, I will look again at ARM performance; it could be disappointing at this stage. One issue with ARM performance is that it's so different from one chip to another, and from one compiler to another. It's much more difficult to draw conclusions from a single test unfortunately, which complicates development.

Regarding Wasm: thanks for this very interesting test! We know that the scalar path is slower than …

Regarding the performance gap: I'll look at this possibility, and see if it can be integrated smoothly into the current design.
My understanding is that Wasm has native 64-bit integer types (see https://github.com/WebAssembly/design/blob/master/Semantics.md#64-bit-integer-operators) that probably map almost 1:1 to hardware instructions, but pointer sizes are 32-bit (see https://github.com/WebAssembly/design/blob/master/Semantics.md#addressing), so in that respect it's a 32-bit platform. But I think the 64-bit operations should be "fast" when running on an actual 64-bit CPU.
So it would be somewhat equivalent to the …
By the way, it seems that even …
Oooh, turns out Xcode by default does … Sorry for the confusion! The MSVC and WebAssembly points still stand, since they are compiled with …
Does it sound reasonable to attempt enabling the SIMD flag for WebAssembly?
Maybe, but without some additional work that won't do magic, since it requires the code to not use SSE2 or NEON et al. intrinsics, but rather Clang's special "vector types" extension (which looks very much like shader languages). I'm not sure what the browser support for that SIMD feature is either, since today it still looks "experimental". Maybe I'll play around with it someday soon, stay tuned!
Wow, good point. There is an interesting word about auto-vectorization: …

I mention it because I remember being impressed by LLVM auto-vectorization with recent versions of XXH3. Notice the huge speed difference for the scalar path, which I interpret as "LLVM was able to auto-vectorize the scalar path".
as suggested by @aras-p: #211 (comment). `__SSE2__` is not defined with Visual, which therefore misses the `SSE2` code path. Extended the detection macro to also include x64 detection (`_M_X64`, `_M_AMD64`), since any x64 CPU necessarily supports SSE2.
Updated …
As suggested by your graph, it might be preferable to extend the "short mode" to longer distances. It seems there won't be a "good size for everybody": the "ideal" threshold varies a lot depending on target and compiler version. For the time being, this feature branch settles on 384 bytes. It might be a bit high, since I can measure a loss of performance in the 250-384 range on my macOS laptop when using the default compiler. The situation, though, is different when using …

Anyway, I would appreciate any feedback on this topic, should you feel that the threshold would be better positioned at some different value.
It really does matter for the compiler. Clang really excels at this code, but I found both MSVC and GCC struggling to generate decent code. IDK why, it is pretty damn obvious what code we want... 😕 I have considered doing a manual partial unroll like we do for XXH32/64. Two iterations per branch seem to be beneficial on most platforms, and anything larger seems inconsistent. x86, however, seems to prefer the entire thing to be unrolled. 🤔 I'll try to get back to work on this next week or so; I still have a few classes to finish up. As for the threshold, I think there is nothing wrong with bumping it up a bit.
OK, so I've been running a long series of performance tests over the weekend, to target a more precise "threshold" between the short and long modes. With … For … For …

So, based on these figures, I'm currently considering updating the threshold to something like 288 bytes, leaving only a reduced section where performance is detrimental for both …

For Visual, compared to the current threshold (128 bytes), this will only help a little, by delaying the point at which the performance cliff happens, and making it smaller (<2x). But it will still be there. I suspect that Visual requires a dedicated investigation, in order to tame, or solve, the performance cliff observed on reaching the …

To give an idea of the performance gap between compilers, at 300 bytes input, presuming long mode (as now) using …

note: in contrast, performance differences for the "short" mode are much lighter: still presuming 300 bytes input, throughput tests put …
Great investigation! BTW, which MSVC version did you use for the tests? My tests were on VS2017 15.9; looking at the release notes, they supposedly improved SIMD-related optimizations quite a bit in VS2019 (but I haven't checked that myself). And yeah, a 384-byte cutoff for long mode sounds like a good balance.
I was using Visual Studio 2017, though I did not check the exact version. I think I have another VM with Visual Studio 2019, so I could check.

edit: from what I can tell, Visual 2019 doesn't help much with the performance impact of switching to … I've found the MS blog where they announce vector code optimizations for VS2019: … I couldn't compare VS2017 vs VS2019 directly. It might be that 2019 is indeed better; it's just still pretty bad in terms of startup time. The "steady" performance is decent though (22 GB/s, vs 26 GB/s for …).
Hi @aras-p, what would you recommend to test Wasm performance?
For information, … A side effect of this change is that the speed on Visual Studio 2019 has been greatly increased when enabling … I still have to integrate the "midsize" patch. I must first ensure this extension doesn't introduce bad-quality hashes on the considered range.
My little testbed is here: https://github.com/aras-p/HashFunctionsTest -- I can try the latest …
The scalar code has not changed much since the last test, so results are expected to be similar. One objective would be to introduce the …
I tried using …
The drop at 1 KB is really strange, and really strong. The one thing that comes to mind is that it may correspond to the "scrambler" stage, which happens every KB when using all default parameters. It's unclear why it would nonetheless cause such a drop. It must have come from a recent modification, since it wasn't the case in previous measurements. I'll investigate.
So, for information, … It's one situation where it's difficult to have it all.
On another front, Visual Studio clearly has a preference for … I haven't been able to test Wasm performance (both systems where I installed …). Life would have been too simple if there was one setting good for everyone...
For Wasm, inlining vs static vs non-inlining does not seem to affect things much. What does seem to help (as in, no more performance dip for >1 KB data sizes) is removing the loop-unrolling pragma from …
That makes sense: less code to translate.
OK, I guess we could special-case Wasm, and remove the unrolling statement in that case. It's unexpected though that this is the reason, since the unrolling statement has been there for a while now, and was already present in … Why it would now be a problem is mysterious. Maybe some combination of …
Strangely enough, for … I will nonetheless proceed with extending the "short mode" to longer lengths, since the performance drop at 129 bytes is clearly visible in native code.
With the latest updates of the 128-bit variant merged, … It integrates the mid-size strategy discussed here, to reduce the performance gap after 128 bytes. If you wish @aras-p, you may be interested in benchmarking this latest version (in …).
Ok so here's a comparison on various platforms/compilers; I've only tested the 128-bit variant since that's what I'm interested in :) In all the graphs I've included the previous xxHash dev branch variant from June 6 (…).

MacBookPro 2018 (Core i9 2.9GHz), Xcode 10.2.1, x64. The perf dip moved to larger input sizes, and at larger sizes the new XXH3 seems to be a bit slower than before.

PC (AMD TR 1950X 3.4GHz), Visual Studio 2017 (15.9), x64. Similar to Mac: the perf dip moved to larger sizes, and the new XXH3 seems to be a bit slower than before, up until even larger sizes where it becomes faster.

Same as above, just with Visual Studio 2019 (16.2). XXH3 results are improved a bit across the whole range of input sizes, I suspect due to SIMD optimizer codegen improvements in 2019. General performance profile similar.

iPhone/ARM (Apple A9), ARM64. The perf dip at midsizes is gone, but at larger input sizes hashing is slower than before.

Now for 32-bit-ish platforms.

WebAssembly, on 2018 MacBookPro, Chrome 75. General perf profile similar, hashing slightly slower at larger input sizes.

PC (AMD TR 1950X 3.4GHz), Visual Studio 2017 (15.9), x86 i.e. 32-bit. Something in XXH3 has changed that makes it much faster now at larger sizes, nice!

Same as above, just with Visual Studio 2019 (16.2). Very similar results.
Thanks for this very thorough and insightful analysis @aras-p. Some thoughts:

Speed of new version: I expect the new XXH128 to be ~10% slower than the previous one. This is the consequence of altering the design to answer a grievance about the previous version, namely, that a combination of 64-bit mixers doesn't make a 128-bit hash. While I would agree if the goal was a cryptographic-level hash, I'm not convinced it's as important for a fast hash merely offering good dispersion qualities, because it takes a very coordinated scenario to degrade the 128-bit hash into a 64-bit one. Yet, there is some polarization on the topic, and I don't want to waste energy on it. Therefore, now, every input impacts 128 bits of accumulator immediately. This translates into some additional runtime cost, which feels reasonable, but is still a cost. I'm also surprised that it can sometimes be faster than the previous version, but maybe something is better "streamlined" in the new version.

Performance dip: the 128-bit final mixer has a larger fixed cost than the 64-bit one, expectedly. Therefore, it takes more time to dilute that cost, meaning the mid-size performance dip is larger. But the new mid-size mode, introduced for …

Comparison with AES-based algorithm: …

Wasm performance: yep, we still have this issue that, at its core, XXH3 is a vectorizable algorithm using 32-bit operations, and it's comparatively less efficient when facing a 64-bit instruction set without vectorization capability. This generally does not happen (…).

Visual Studio x86 performance: I presume your new detection macro makes it possible for Visual to pick up the SSE2 mode; that's likely the reason for this great speed-up :) Thanks!
Yup, it all makes sense, thanks for the detailed explanation! And indeed the comparison with Meow hash is not fair, especially since I haven't compiled XXH3 with AVX2 either.
Hmm. …

For starters, let's force the manual 128-bit multiply on wasm. Clang defines … This not only has the overhead of a function call, but it also results in two redundant 64->64 multiplies by zero. While this only wastes a few cycles on 64-bit targets, you are probably looking at 25+ cycles per multiply in a 32-bit browser.

Obviously, benchmarks will tell the truth. @aras-p, can you try adding …? Also, try disabling the unroll pragma.
I've tested this suggestion, and it seems to work well, at least on my laptop.

update: done
It's supposed to be already done in this version, though the check uses the …
I guess this topic can be closed now, since the introduction of the "middle-range" algorithm for len <= 240 has improved the situation. Of course, it can be re-opened if need be.
I was playing around with XXH3 from the 0.7.0 release, and there's a curious performance drop with data sizes >128 bytes on Arm64 and on WebAssembly, but not on x64.
Here's x64 (Xcode 10.1 Clang, on a MacBookPro), things look great:
However on Arm64 (iPhone SE, compiled with Xcode 10.1 Clang) -- notice the big drop after 128 bytes input size:
On WebAssembly (Firefox 67, running in the same MacBookPro), there's a drop after 128 bytes too, where XXH3 becomes much slower than previous xxHash: