Specialize SIMD SHA-512 for half-length input, iterated hashing #5425

solardiz · 2024-01-14T02:20:05Z

This introduces specialized versions of SIMDSHA512body() and makes use of them in a few formats, including in the recently added Armory, where it provides a speedup of 4% on i7-4770K and some smaller speedups on some other machines I tested.

There are now 3 specialized functions, but I mostly care about two cases: SSEi_HALF_IN|SSEi_OUTPUT_AS_INP_FMT (which is what some iterated formats now use) and the original case from before my changes here (so as not to introduce performance regressions via increased code size). The third case, SSEi_HALF_IN with other flags or none, is specialized merely to avoid a code size increase in the original case. We could alternatively drop support for SSEi_HALF_IN except as part of SSEi_HALF_IN|SSEi_OUTPUT_AS_INP_FMT, but for now I opted to support it separately as well.
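As a rough illustration of the specialization idea described above, here is a minimal sketch in scalar C. It is not the code from this PR: the flag names and values, the toy body, and the wrapper names are all hypothetical, and the real SIMDSHA512body() operates on SIMD vectors. The sketch only shows one generic inline body being instantiated three times with partly constant flags, so that the compiler can drop branches that are impossible in each build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values for illustration; not the real SSEi_* constants. */
#define FLAG_HALF_IN        0x1u /* caller promises words 8..15 are zero */
#define FLAG_OUT_AS_INP_FMT 0x2u /* stand-in for an output-format option */

/* Toy stand-in for the hash body. It is static inline so that, when the
 * compiler inlines it into a wrapper, the flag tests can be resolved at
 * compile time wherever the wrapper fixes those flag bits. */
static inline uint64_t body_generic(const uint64_t *in, unsigned flags)
{
    unsigned n = (flags & FLAG_HALF_IN) ? 8 : 16; /* skip known-zero words */
    uint64_t acc = 0;
    for (unsigned i = 0; i < n; i++)
        acc += in[i] * (2 * i + 1); /* zero words would contribute nothing */
    if (flags & FLAG_OUT_AS_INP_FMT)
        acc = ~acc;
    return acc;
}

/* Original case: the half-input flag is forced off. */
static uint64_t body_full(const uint64_t *in, unsigned flags)
{
    return body_generic(in, flags & ~FLAG_HALF_IN);
}

/* Half-length input with other flags or none: the flag is forced on. */
static uint64_t body_half(const uint64_t *in, unsigned flags)
{
    return body_generic(in, flags | FLAG_HALF_IN);
}

/* The one combination used for iterated hashing: all flags are constant. */
static uint64_t body_half_iter(const uint64_t *in)
{
    return body_generic(in, FLAG_HALF_IN | FLAG_OUT_AS_INP_FMT);
}

/* Public entry point: a small dispatcher that picks a specialization. */
uint64_t hash_body(const uint64_t *in, unsigned flags)
{
    const unsigned both = FLAG_HALF_IN | FLAG_OUT_AS_INP_FMT;
    assert(in != NULL);
    if ((flags & both) == both)
        return body_half_iter(in);
    if (flags & FLAG_HALF_IN)
        return body_half(in, flags);
    return body_full(in, flags);
}
```

Forcing the flag bits inside each wrapper, rather than merely testing them in the dispatcher, is what gives the compiler a constant to work with in each copy of the body.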

Further optimization potential: doing the same for SHA-256 (should be easy now) and maybe for other hashes (it would be different there), and introducing function specializations for the previously existing flags as well (to reduce code size in the functions that many or most formats use).

solardiz · 2024-01-14T14:17:15Z

The "SIMDSHA512body(): Build 3 specialized versions of the function" commit here reduces the need for the manual optimization of the STEP macro made in the first commit here, provided we also add a memset(&w[k][9], 0, 6 * sizeof(vtype));, as that lets the compiler figure out that certain values are always zero in certain builds of the function.
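To make the effect of those known zeroes concrete, here is a small scalar sketch of the SHA-512 message schedule. It is an illustration only: the function names are made up for this example, and the real code works on vtype vectors in w[k][...] rather than on plain uint64_t. The "folded" variant spells out by hand what a compiler could derive after the memset(): any schedule term that reads elements 9 to 14 vanishes, because both sigma functions map zero to zero.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint64_t rotr(uint64_t x, int n) { return (x >> n) | (x << (64 - n)); }
/* SHA-512 small sigma functions; both return 0 for input 0. */
static uint64_t s0(uint64_t x) { return rotr(x, 1) ^ rotr(x, 8) ^ (x >> 7); }
static uint64_t s1(uint64_t x) { return rotr(x, 19) ^ rotr(x, 61) ^ (x >> 6); }

/* Generic schedule expansion over all 80 words. */
static void expand_generic(uint64_t w[80])
{
    for (int t = 16; t < 80; t++)
        w[t] = s1(w[t - 2]) + w[t - 7] + s0(w[t - 15]) + w[t - 16];
}

/* Zero elements 9..14 first, then run the unchanged generic code. */
static void expand_after_memset(uint64_t w[80])
{
    memset(&w[9], 0, 6 * sizeof(w[0]));
    expand_generic(w);
}

static int known_zero(int i) { return i >= 9 && i <= 14; }

/* Hand-folded variant: never reads elements 9..14 at all. */
static void expand_folded(uint64_t w[80])
{
    for (int t = 16; t < 80; t++) {
        uint64_t x = 0;
        if (!known_zero(t - 2))  x += s1(w[t - 2]);
        if (!known_zero(t - 7))  x += w[t - 7];
        if (!known_zero(t - 15)) x += s0(w[t - 15]);
        if (!known_zero(t - 16)) x += w[t - 16];
        w[t] = x;
    }
}

/* Returns 1 if both variants produce the same w[16..79] for a
 * pseudo-random input derived from 'seed'. */
int schedules_agree(uint64_t seed)
{
    uint64_t a[80], b[80];
    for (int i = 0; i < 16; i++) {
        seed = seed * 6364136223846793005ULL + 1442695040888963407ULL;
        a[i] = b[i] = seed;
    }
    for (int i = 9; i <= 14; i++)
        b[i] = 0xdeadbeefULL; /* garbage: the folded code must ignore it */
    expand_after_memset(a);
    expand_folded(b);
    return memcmp(&a[16], &b[16], 64 * sizeof(a[0])) == 0;
}
```

For t from 16 to 21 and from 24 to 30, at least one of the four terms drops out, which is the kind of saving the manual STEP optimization was after.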

solardiz · 2024-01-15T22:29:02Z

SIMDSHA512body() and calls: Fix SIMD_PARA_SHA512 > 1 support

Testing SIMD_PARA_SHA512 = 2 after this commit, everything works on up to AVX2, including all of our GitHub Actions runs (https://github.com/solardiz/john/actions/runs/7534465934). However, in my own testing on AVX-512 there are failures:

$ OMP_NUM_THREADS=1 ../run/john -test -form=office
Warning: OpenMP is disabled; a non-OpenMP build may be faster
NOTE: This is a debug build, speed will be lower than normal
Benchmarking: Office, 2007/2010/2013 [SHA1 512/512 AVX512BW 16x / SHA512 512/512 AVX512BW 8x2 AES]... FAILED (get_key(3))
$ OMP_NUM_THREADS=2 ../run/john -test -form=office
Will run 2 OpenMP threads
NOTE: This is a debug build, speed will be lower than normal
Benchmarking: Office, 2007/2010/2013 [SHA1 512/512 AVX512BW 16x / SHA512 512/512 AVX512BW 8x2 AES]... (2xOMP) FAILED (cmp_all(6))
$ OMP_NUM_THREADS=3 ../run/john -test -form=office
Will run 3 OpenMP threads
NOTE: This is a debug build, speed will be lower than normal
Benchmarking: Office, 2007/2010/2013 [SHA1 512/512 AVX512BW 16x / SHA512 512/512 AVX512BW 8x2 AES]... (3xOMP) FAILED (get_key(87))
$ OMP_NUM_THREADS=4 ../run/john -test -form=office
Will run 4 OpenMP threads
NOTE: This is a debug build, speed will be lower than normal
Benchmarking: Office, 2007/2010/2013 [SHA1 512/512 AVX512BW 16x / SHA512 512/512 AVX512BW 8x2 AES]... (4xOMP) FAILED (get_key(87))

The above is from a build with ASan. The exact same test errors occur without ASan, and the failing indices stay the same across runs.

3 other formats fail tests as well, I haven't looked at them yet.

solardiz · 2024-01-15T22:33:52Z

> 3 other formats fail tests as well, I haven't looked at them yet.

They are:

Testing: Blackberry-ES10 (101x) [SHA-512 512/512 AVX512BW 8x2]... (4xOMP) FAILED (cmp_all(2))
Testing: eCryptfs (65536 iterations) [SHA512 512/512 AVX512BW 8x2]... (4xOMP) FAILED (cmp_all(2))
Testing: Drupal7, $S$ (x16385) [SHA512 512/512 AVX512BW 8x2]... (4xOMP) FAILED (cmp_all(2))

Out of these, Blackberry-ES10 and eCryptfs are affected by changes in this PR (and got a good speedup when they work), but Drupal7 shouldn't be. I have no idea whether they worked with SIMD_PARA_SHA512 = 2 on AVX-512 before this PR; we don't currently set this on any arch.

solardiz · 2024-01-15T22:41:05Z

> Testing SIMD_PARA_SHA512 = 2 [...] on AVX-512 there are failures:

Apparently, the AVX-512 specific portion of the "SIMDSHA512body(): Optimize SSEi_FLAT_OUT" commit here is buggy for that case. Reverting this commit makes these 4 formats pass the same tests. I'll review and revise it.

And turn off the manual optimization since it started causing a major sha512crypt performance regression on AVX with RHEL6's old gcc after the iterated hashing commit. There appear to be no regressions from turning this off now, meaning that the compiler is hopefully able to figure the zeroes out now that they're written to w[] by this very function.

This avoids uninitialized warnings with RHEL6's old gcc, and is hopefully optimized out by more reasonable compilers

We still waste memory on the second halves. We could want to clean that up separately.

This avoids uninitialized warnings with RHEL6's old gcc, and is hopefully optimized out by more reasonable compilers. This new code could also be used without SSEi_LOOP, but trying to do so causes a major performance regression for e.g. sha512crypt with same RHEL6's old gcc, so we continue with memcpy() in the else path for now.

This way, we start fetching all including the last hash sooner, and the fetches complete sooner.

solardiz mentioned this pull request Jan 14, 2024

Specialize SIMD SHA-256 for half-length input, iterated hashing #5426

Open

solardiz requested a review from magnumripper January 14, 2024 02:43

solardiz force-pushed the sha512opt branch 3 times, most recently from 4db6069 to 3e89333 Compare January 15, 2024 19:32

solardiz force-pushed the sha512opt branch 2 times, most recently from 0f49e5a to 148f28d Compare January 16, 2024 01:20

solardiz added 19 commits January 16, 2024 22:40

SIMDSHA512body(): Optimize for elements 9 to 14 being all zeroes

484f3eb

SIMDSHA512body(): Build 3 specialized versions of the function

5cad484

SIMDSHA512body() calls: Claim "elements 9 to 14 are 0" where we can

c8ba91e

SIMDSHA512body() and calls: Implement and use iterated hashing

5717f04

SIMDSHA512body(): Reset impossible flags to help the compiler optimize

a639a7d

SIMDSHA512body(): Add redundant initialization of w[] with SSEi_HALF_IN

2a856a3

This avoids uninitialized warnings with RHEL6's old gcc, and is hopefully optimized out by more reasonable compilers

SIMDSHA512body() calls: With SSEi_HALF_IN, do not initialize second half

c90b6fa

We still waste memory on the second halves. We could want to clean that up separately.

SIMDSHA512body() calls: With SSEi_HALF_IN, do not allocate second half

e38056d

SIMDSHA512body() and calls: Fix SIMD_PARA_SHA512 > 1 support

1f6500c

SIMD*body(): Optimize the case of SIMD_PARA_* == 1

493d3b3

SIMDSHA512body(): Optimize SSEi_FLAT_OUT

430b5b5

SIMDSHA512body() and Drupal7 format: Implement and use iterated hashing

c192421

SIMDSHA512body(): Exclude SSEi_LOOP from the generic SSEi_HALF_IN code

efe016b

SIMDSHA*body(): Exclude SSEi_FLAT_RELOAD_SWAPLAST support except for SHA-256

28e4a93

SIMDSHA512body(): Specialized version for sha512crypt and some dynamics

402d249

simd-intrinsics.c: Copyright years update

a2f13d9

sha256crypt, sha512crypt CPU formats: Enable most ex-DEBUG test vectors

8681398

solardiz added 5 commits January 16, 2024 22:40

bench.c: Print multi-test/benchmark summary line even when aborted

0bcd2a1

bench.c: Print two fractional digits for numbers below 10

e7a35da

relbench: Update for current messages, handle more special cases

e34e3f5

Armory format: Fetch a part of each hash first, then the rest of each

885d141

This way, we start fetching all including the last hash sooner, and the fetches complete sooner.

Keplr format: Benchmark only Raw, print algorithm detail

5543601

solardiz force-pushed the sha512opt branch from 0a6f47a to 5543601 Compare January 16, 2024 22:04

solardiz changed the title ~~SIMDSHA512body(): Optimize for elements 9 to 14 being all zeroes~~ Specialize SIMD SHA-512 for half-length input, iterated hashing Jan 16, 2024

solardiz merged commit e0a147f into openwall:bleeding-jumbo Jan 17, 2024
31 of 32 checks passed
