Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specialize SIMD SHA-512 for half-length input, iterated hashing #5425

Merged
merged 24 commits into from Jan 17, 2024

Conversation

solardiz
Copy link
Member

This introduces specialized versions of SIMDSHA512body() and makes use of them in a few formats, including in the recently added Armory, where it provides a speedup of 4% on i7-4770K and some smaller speedups on some other machines I tested.

There are 3 specialized functions now, but I mostly care about the case of SSEi_HALF_IN|SSEi_OUTPUT_AS_INP_FMT (as that's what's used in some iterated formats now) and the original case from prior to my changes here (not to introduce performance regressions via increased code size). The third case of having SSEi_HALF_IN along with other or no flags is specialized merely to avoid code size increase in the original case. We could alternatively unsupport SSEi_HALF_IN except in SSEi_HALF_IN|SSEi_OUTPUT_AS_INP_FMT, but for now I opted to have it supported separately as well.

Further optimization potential is doing similar for SHA-256 (should be easy now) and maybe for others (it'd be different there), and introducing function specializations also for the previously existing flags (to reduce code size in the functions that many/most formats use).

@solardiz
Copy link
Member Author

The SIMDSHA512body(): Build 3 specialized versions of the function commit here reduces the need to have manually optimized the STEP macro in the first commit here, if we also add a memset(&w[k][9], 0, 6 * sizeof(vtype));, as that lets the compiler figure out that certain values are always zero in certain builds of the function.

@solardiz solardiz force-pushed the sha512opt branch 3 times, most recently from 4db6069 to 3e89333 Compare January 15, 2024 19:32
@solardiz
Copy link
Member Author

SIMDSHA512body() and calls: Fix SIMD_PARA_SHA512 > 1 support

Testing SIMD_PARA_SHA512 = 2 after this commit, everything works on up to AVX2, including all of our GitHub Actions (https://github.com/solardiz/john/actions/runs/7534465934), however in my own testing on AVX-512 there are failures:

$ OMP_NUM_THREADS=1 ../run/john -test -form=office
Warning: OpenMP is disabled; a non-OpenMP build may be faster
NOTE: This is a debug build, speed will be lower than normal
Benchmarking: Office, 2007/2010/2013 [SHA1 512/512 AVX512BW 16x / SHA512 512/512 AVX512BW 8x2 AES]... FAILED (get_key(3))
$ OMP_NUM_THREADS=2 ../run/john -test -form=office
Will run 2 OpenMP threads
NOTE: This is a debug build, speed will be lower than normal
Benchmarking: Office, 2007/2010/2013 [SHA1 512/512 AVX512BW 16x / SHA512 512/512 AVX512BW 8x2 AES]... (2xOMP) FAILED (cmp_all(6))
$ OMP_NUM_THREADS=3 ../run/john -test -form=office
Will run 3 OpenMP threads
NOTE: This is a debug build, speed will be lower than normal
Benchmarking: Office, 2007/2010/2013 [SHA1 512/512 AVX512BW 16x / SHA512 512/512 AVX512BW 8x2 AES]... (3xOMP) FAILED (get_key(87))
$ OMP_NUM_THREADS=4 ../run/john -test -form=office
Will run 4 OpenMP threads
NOTE: This is a debug build, speed will be lower than normal
Benchmarking: Office, 2007/2010/2013 [SHA1 512/512 AVX512BW 16x / SHA512 512/512 AVX512BW 8x2 AES]... (4xOMP) FAILED (get_key(87))

The above is in a build with ASan. Exact same test errors without ASan. And exact same faulty indices remain across runs.

3 other formats fail tests as well, I haven't looked at them yet.

@solardiz
Copy link
Member Author

3 other formats fail tests as well, I haven't looked at them yet.

They are:

Testing: Blackberry-ES10 (101x) [SHA-512 512/512 AVX512BW 8x2]... (4xOMP) FAILED (cmp_all(2))
Testing: eCryptfs (65536 iterations) [SHA512 512/512 AVX512BW 8x2]... (4xOMP) FAILED (cmp_all(2))
Testing: Drupal7, $S$ (x16385) [SHA512 512/512 AVX512BW 8x2]... (4xOMP) FAILED (cmp_all(2))

Out of these, Blackberry-ES10 and eCryptfs are affected by changes in this PR (and got good speedup when they work), but Drupal7 shouldn't be. I have no idea whether they worked with SIMD_PARA_SHA512 = 2 on AVX-512 before this PR; we don't normally set this on any arch currently.

@solardiz
Copy link
Member Author

solardiz commented Jan 15, 2024

Testing SIMD_PARA_SHA512 = 2 [...] on AVX-512 there are failures:

Apparently, the AVX-512 specific portion of SIMDSHA512body(): Optimize SSEi_FLAT_OUT here is buggy for that case. Reverting this commit makes these 4 formats pass the same tests. I'll review and revise it.

@solardiz solardiz force-pushed the sha512opt branch 2 times, most recently from 0f49e5a to 148f28d Compare January 16, 2024 01:20
And turn off the manual optimization since it started causing a major
sha512crypt performance regression on AVX with RHEL6's old gcc after the
iterated hashing commit.  There appear to be no regressions from turning
this off now, meaning that the compiler is hopefully able to figure the
zeroes out now that they're written to w[] by this very function.
This avoids uninitialized warnings with RHEL6's old gcc, and is
hopefully optimized out by more reasonable compilers
We still waste memory on the second halves.  We could want to clean that
up separately.
This avoids uninitialized warnings with RHEL6's old gcc, and is hopefully
optimized out by more reasonable compilers.  This new code could also be
used without SSEi_LOOP, but trying to do so causes a major performance
regression for e.g. sha512crypt with same RHEL6's old gcc, so we
continue with memcpy() in the else path for now.
@solardiz solardiz changed the title SIMDSHA512body(): Optimize for elements 9 to 14 being all zeroes Specialize SIMD SHA-512 for half-length input, iterated hashing Jan 16, 2024
@solardiz solardiz merged commit e0a147f into openwall:bleeding-jumbo Jan 17, 2024
31 of 32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant