
John on big/little architectures #5435

Open
ukasz opened this issue Feb 11, 2024 · 5 comments

Comments

@ukasz
Contributor

ukasz commented Feb 11, 2024

I ran some benchmarks on an Intel 13900 and I have a feeling that the way john spreads work among threads is not optimal on machines with two types of cores. It seems to assume that each set of hashes takes a similar amount of time to compute, which is not true when we have two very different types of cores. I suspect the P cores finish first and then wait at a synchronization point for the E cores. The difference in performance is 2x, so they wait for half of the time.

I am not sure if this is john's or OpenMP's fault, to be honest, without more digging, but I wanted to leave a note before I forget about it.

All cores:
OMP_NUM_THREADS=32 ./john -test -format=raw-sha256
Will run 32 OpenMP threads
Benchmarking: Raw-SHA256 [SHA256 256/256 AVX2 8x]... (32xOMP) DONE
Raw: 108770K c/s real, 3432K c/s virtual

P cores only - no HT (116K with HT)
OMP_PLACES='{0},{2},{4},{6},{8},{10},{12},{14}' OMP_NUM_THREADS=8 ./john -test -format=raw-sha256
Will run 8 OpenMP threads
Benchmarking: Raw-SHA256 [SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Raw: 118947K c/s real, 14868K c/s virtual

E cores only:
OMP_PLACES='{16},{17},{18},{19},{20},{21},{22},{23},{24},{25},{26},{27},{28},{29},{30},{31}' OMP_NUM_THREADS=16 ./john -test -format=raw-sha256
Will run 16 OpenMP threads
Benchmarking: Raw-SHA256 [SHA256 256/256 AVX2 8x]... (16xOMP) DONE
Raw: 80871K c/s real, 5045K c/s virtual

@solardiz
Member

Hi @ukasz. Great to see you playing with John again. Of course, we do have this issue. We let --fork processes run asynchronously, so that's a workaround, but that's the other extreme - those processes don't communicate at all (except for removing cracked hashes via "pot sync"), so they end up terminating at different times. If the OS scheduler shuffles processes between cores, I suppose those process termination times wouldn't be very different, so it could still be usable on CPUs like that. Also, for fast hashes like raw-sha256, our OpenMP scaling is quite poor even with cores of the same type, so --fork is recommended. We should probably flag more of these formats with FMT_OMP_BAD.

There's no easy fix given our current architecture. Ideally, we'd let threads or processes run asynchronously, yet re-allocate work between them once in a while, or have them allocate candidate password batches from a shared pool dynamically. We need this not only for supporting big.LITTLE.

@solardiz
Member

I am not sure if this is john's or OpenMP's fault, to be honest

John's usage of OpenMP is straightforward, and this is how OpenMP behaves for parallel for loops by default. John could use OpenMP differently, not via parallel for loops, but then why use OpenMP at all (rather than, say, use pthreads directly)?

For parallel loops, OpenMP defaults to static scheduling, meaning each thread is allocated the same share of work, and then threads just wait (passively or actively, which you can control). It may also support dynamic scheduling, which is similar to what I wrote above in "let threads [...] run asynchronously, yet re-allocate work between them once in a while" (but at a lower level and higher re-allocation frequency and thus higher overhead than I'd prefer). You can try OMP_SCHEDULE=dynamic along with our --tune=auto (as I suspect this may need higher OMP_SCALE than we had tuned with static scheduling). I previously suggested this in https://www.openwall.com/lists/john-users/2019/06/14/6 but there were no follow-ups, so I don't know if it actually helped.
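
For illustration, here is a minimal sketch (not John's actual loop; hash_one_candidate is a placeholder) of the kind of parallel for loop in question. Note that the OMP_SCHEDULE environment variable only takes effect for loops that use the schedule(runtime) clause, so whether it applies depends on how a given format's loop is written:

extern void hash_one_candidate(int i);  /* placeholder for the per-candidate work */

void crypt_all_sketch(int count)
{
    int i;

    /* With the usual static schedule, each thread gets a fixed count/nthreads
     * slice and then waits at the implicit barrier at the end of the loop,
     * which is where the P-cores lose time waiting for the E-cores.
     * schedule(runtime) defers the choice to OMP_SCHEDULE, e.g.
     * OMP_SCHEDULE="dynamic,16" hands out chunks of 16 iterations to
     * whichever thread finishes first. */
#pragma omp parallel for schedule(runtime)
    for (i = 0; i < count; i++)
        hash_one_candidate(i);
}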

@ukasz
Contributor Author

ukasz commented Feb 14, 2024

Hi @solardiz, I wanted to play with sha256 specifically because I noticed on Agner Fog's website some improvement in the latency of the sha-ni instructions on newer architectures (compared to the initial release). I was wondering how that would stack up against john's implementation.
Some initial testing showed that, due to the low register usage of sha256, we should be able to calculate two hashes at once using the sha extensions, and this results in about 1.5x the performance of calculating just one hash.
So in general the question was which is faster: 1.5x with sha-ni or john's AVX implementation. Unfortunately, I don't know yet.

Moving back to the scaling, do you know the reason behind the poor scaling of those formats?
Unfortunately, I did not observe any improvement from using a different OpenMP scheduling, which sounded the most promising.
The best result I got for this particular format was by reducing OMP_NUM_THREADS to 24 (120520K c/s).

OpenMP still gives a much simpler programming interface than pthreads, so with larger keys-per-crypt values and the assumption that threads are bound to the CPUs, we could measure the performance of each thread between crypt_all calls (a moving average of the last few calls) to better split the work manually.
In the general case, what you described is probably the best solution, because it does not make any assumptions about scheduling and performance at all.
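
A hypothetical illustration of that per-thread measurement idea (all names are placeholders, and this is not John's formats interface), assuming each OpenMP thread stays bound to one core:

#include <omp.h>

#define MAX_THREADS 64

static double ema_rate[MAX_THREADS];       /* smoothed candidates/second per thread */

extern void hash_range(int from, int to);  /* placeholder for the real hashing */

void crypt_all_weighted(int count)
{
#pragma omp parallel
    {
        int t = omp_get_thread_num();
        int nt = omp_get_num_threads();
        double total = 0.0, mine, t0, dt;
        int i, my_count;

        /* Turn the smoothed rates into this thread's share of count.
         * (A real version would also need a prefix sum over the shares to
         * get each thread's start offset and to absorb rounding leftovers.) */
        for (i = 0; i < nt; i++)
            total += ema_rate[i] > 0.0 ? ema_rate[i] : 1.0;
        mine = ema_rate[t] > 0.0 ? ema_rate[t] : 1.0;
        my_count = (int)(count * mine / total);

        t0 = omp_get_wtime();
        hash_range(0, my_count);           /* start offsets omitted for brevity */
        dt = omp_get_wtime() - t0;

        /* Exponential moving average; the 0.3 weight is an arbitrary choice. */
        if (dt > 0.0)
            ema_rate[t] = 0.7 * ema_rate[t] + 0.3 * (my_count / dt);
    }
}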

@solardiz
Member

I took the SHA-NI topic to its own issue, #5437.

Moving back to the scaling, do you know the reason behind the poor scaling of those formats?

Yes, it's primarily that our candidate password generation is single-threaded and scalar, and it runs sequentially with the hashing rather than overlapping it. So, e.g., for SHA-256 on AVX2, even with just 1 thread, we first sequentially do at least 8 set_key calls (times a scaling/buffering factor), then one crypt_all. With more threads, there are correspondingly more sequential single-threaded set_key calls per crypt_all.
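
Roughly, that calling pattern looks like this (names simplified; not the exact formats interface):

extern char *get_next_candidate(void);        /* candidate generation, scalar */
extern void set_key(char *key, int index);    /* copy one key into the format */
extern int crypt_all(int *count, void *salt); /* OpenMP parallel for inside */

void one_batch_sketch(int max_keys_per_crypt, void *salt)
{
    int i, count = max_keys_per_crypt;

    /* All of this runs on one thread, before any hashing starts. */
    for (i = 0; i < max_keys_per_crypt; i++)
        set_key(get_next_candidate(), i);

    /* Only this call fans out across the OpenMP team. */
    crypt_all(&count, salt);
}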

One way to mitigate this without a complete redesign would be to enhance our formats interface to support two sets of keys/hashes and alternate between them, and have a candidate password generation / set_key thread run in parallel with the rest of the threads being in crypt_all for the previous set. This would also have way too limited scalability, but for small thread counts it can hide the problem fully (if the hash comparisons are also addressed, see below). It would also slightly speed up even the slower formats when using GPUs/FPGAs and there are few or no different salts.
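
A hypothetical sketch of that two-key-set idea (not the actual formats interface; generate_batch, crypt_all_set, done, and PLAINTEXT_LENGTH are placeholders), using a plain pthread as the producer so that the hashing can keep its own OpenMP parallel loop inside:

#include <pthread.h>

#define PLAINTEXT_LENGTH 55                      /* placeholder value */

struct key_set {
    char (*keys)[PLAINTEXT_LENGTH + 1];          /* candidates for one crypt_all */
    int count;
};

extern void generate_batch(struct key_set *ks);  /* candidate generation + set_key */
extern void crypt_all_set(struct key_set *ks);   /* OpenMP parallel for inside */
extern int done(void);

static void *producer(void *arg)
{
    generate_batch((struct key_set *)arg);
    return NULL;
}

void crack_loop_sketch(struct key_set *a, struct key_set *b)
{
    generate_batch(a);                           /* prime the first set */
    while (!done()) {
        pthread_t gen;

        /* Fill set b in the background ... */
        pthread_create(&gen, NULL, producer, b);
        /* ... while the OpenMP team hashes set a. */
        crypt_all_set(a);
        pthread_join(gen, NULL);

        /* Swap roles: a is now free to be refilled, b is ready to hash. */
        struct key_set *tmp = a; a = b; b = tmp;
    }
}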

We also have a similar bottleneck with hash comparisons. Technically, we can have a parallel for loop in cmp_all too, but this isn't obviously beneficial since the individual comparisons are so fast. What works better is doing the comparisons inside the main parallel for loop in crypt_all when the number of loaded hashes is small (or else we need to duplicate/invoke the bitmap code also in/from there), like I did in DES_bs_b.c (where it was especially reasonable to do due to bitslicing).
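
Something like the following rough sketch (not the DES_bs_b.c code; hash_candidate is a placeholder), for the case where only a handful of hashes are loaded:

#include <stdint.h>

static volatile int any_match;          /* later reported by cmp_all */

extern uint32_t hash_candidate(int i);  /* placeholder for the real hashing */

void crypt_all_with_cmp(int count, const uint32_t *loaded, int n_loaded)
{
    int i;

    any_match = 0;
#pragma omp parallel for
    for (i = 0; i < count; i++) {
        uint32_t h = hash_candidate(i);
        int j;

        /* Compare right here, inside the parallel loop, instead of a
         * separate single-threaded pass over all computed hashes. */
        for (j = 0; j < n_loaded; j++)
            if (h == loaded[j])
                any_match = 1;
    }
}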

A fuller solution is to implement built-in mask mode support in CPU formats like this one, similar to the on-device mask mode support we currently have in some GPU and FPGA formats. This moves most candidate password generation and hash comparisons into the format's parallel processing. It is efficient, but it only works for mask mode (including when mask is used on top of another mode), and it requires per-format code (or formats invoking shared code).

Unfortunately, I did not observe any improvement from using a different OpenMP scheduling, which sounded the most promising.
The best result I got for this particular format was by reducing OMP_NUM_THREADS to 24 (120520K c/s).

There's really no point OpenMP'ing this format as it currently is, on current CPUs. Ditto for other unsalted fast hash formats. For your experiments, I recommend that you use an iterated format. If you want to play specifically with SHA-256, then use, e.g., PBKDF2-HMAC-SHA256 and multiply the speed by its iteration count and by 2 (for HMAC) to obtain SHA-256 speed.
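
(For example, with made-up numbers: a --test result of 50000 c/s for PBKDF2-HMAC-SHA256 at 1000 iterations corresponds to roughly 50000 * 1000 * 2 = 100M SHA-256 computations per second.)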

OpenMP still gives a much simpler programming interface than pthreads, so with larger keys-per-crypt values and the assumption that threads are bound to the CPUs, we could measure the performance of each thread between crypt_all calls (a moving average of the last few calls) to better split the work manually.

With OpenMP, both measuring the performance of each thread and splitting the work manually are tricky. While I can think of hacks to do that, OpenMP's simplicity is lost if we use them.

@solardiz
Member

We should probably flag more of these formats with FMT_OMP_BAD.

Actually, rawSHA256_fmt_plug.c already has that flag. However, we only print the warning and suggestion to use --fork when starting a cracking session, not on --test. This makes sense, because we don't support --fork with --test.

We may want to add the #if !FAST_FORMATS_OMP logic to this and the raw SHA-512 format, and maybe others. We had initially spared them, but with growing typical SIMD vector widths and core counts, this makes less sense now. We could also consider SIMD in the compile-time logic (leaving OpenMP enabled by default in scalar builds), but that could be confusing.
