Add BLAKE3 hashing algorithm (single-threaded C-based implementation) #10600

Closed
wants to merge 1 commit into from


silvanshade commented Apr 23, 2024

Motivation

This PR adds BLAKE3 to the available hashing algorithms.

NOTE: Although this PR is complete (and I would appreciate any feedback), I'm adding it as a draft for now because I would also like to try working on an alternative PR based on the multi-threaded Rust implementation (more below).

Context

The change is relatively small and non-invasive.

I added the BLAKE3 source tarball as a derivation and used this to define a NIX_BLAKE3_SRC environment variable which is then referenced in the makefiles.

The recommended way to build the BLAKE3 C implementation is to add the source files directly to the build system rather than compiling them separately as a library, which is why I fetch the tarball.

In order to handle building source files from NIX_BLAKE3_SRC (which is read-only) I needed to add some custom build rules specifically for those files.

I also added platform detection for ARM and x86_64 (along with detection for Darwin, Linux, and Windows) and use this to conditionally compile the appropriate SIMD implementations for the given platform.
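
The detection in this PR lives in the makefiles. As a rough analogue only, the same architecture and OS distinctions can be expressed with compiler-predefined macros. This is a minimal sketch, not the PR's code: the function names `detected_arch` and `detected_os` are hypothetical, and the macros are the ones predefined by GCC, Clang, and MSVC.

```c
/* Hypothetical helper: names the implementation family a build would pick.
 * Relies only on macros predefined by GCC/Clang (__*) and MSVC (_M_*). */
const char *detected_arch(void) {
#if defined(__x86_64__) || defined(_M_X64)
    return "x86_64";   /* the x86 SIMD assembly files apply */
#elif defined(__aarch64__) || defined(_M_ARM64) || defined(__arm__)
    return "arm";      /* the NEON implementation applies */
#else
    return "portable"; /* no SIMD implementation selected */
#endif
}

/* Hypothetical helper: the OS matters because the assembly comes in
 * Unix, Windows MSVC, and Windows GNU flavors. */
const char *detected_os(void) {
#if defined(_WIN32)
    return "windows";
#elif defined(__APPLE__)
    return "darwin";
#elif defined(__linux__)
    return "linux";
#else
    return "other";
#endif
}
```

Doing the check in the build system instead, as this PR does, keeps the unused assembly files out of the compile entirely.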

I use the assembly files directly rather than the C-based intrinsics versions since that is also the recommended approach:

> For each of the x86 SIMD instruction sets, four versions are available: three flavors of assembly (Unix, Windows MSVC, and Windows GNU) and one version using C intrinsics. The assembly versions are generally preferred. They perform better, they perform more consistently across different compilers, and they build more quickly. On the other hand, the assembly versions are x86_64-only, and you need to select the right flavor for your target platform.

The BLAKE3 dispatcher will automatically fall back to the portable implementation if a hardware accelerated implementation is unavailable.
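
To illustrate the general pattern of runtime dispatch with a portable fallback, here is a toy sketch. It is not BLAKE3's dispatcher (which lives in `blake3_dispatch.c`): the byte-sum kernel and the names `sum_fn`, `sum_portable`, `sum_avx2`, `HAVE_AVX2_BUILD`, and `select_sum` are all hypothetical, and the accelerated path assumes GCC or Clang on x86_64.

```c
#include <stddef.h>
#include <stdint.h>

/* Signature shared by every implementation of the toy kernel. */
typedef uint32_t (*sum_fn)(const uint8_t *data, size_t len);

/* Portable implementation: always compiled, always safe to call. */
uint32_t sum_portable(const uint8_t *data, size_t len) {
    uint32_t acc = 0;
    for (size_t i = 0; i < len; i++) acc += data[i];
    return acc;
}

#if defined(__GNUC__) && defined(__x86_64__)
/* Same loop, but the compiler is allowed to emit AVX2 instructions for
 * this one function, so it must only run on CPUs that support AVX2. */
__attribute__((target("avx2")))
uint32_t sum_avx2(const uint8_t *data, size_t len) {
    uint32_t acc = 0;
    for (size_t i = 0; i < len; i++) acc += data[i];
    return acc;
}
#define HAVE_AVX2_BUILD 1
#endif

/* Pick the accelerated version only if it was built AND the CPU
 * reports support at runtime; otherwise fall back to portable. */
sum_fn select_sum(void) {
#ifdef HAVE_AVX2_BUILD
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_portable;
}
```

A caller would fetch the function pointer once with `select_sum()` and reuse it. Both paths must return identical results, which is what makes the fallback transparent.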

Performance

I have run benchmarks of the implementation which I detail below.

First, though, some important things to note:

Apple M3 Max

  • compiled with: CFLAGS="-O3 -mcpu=apple-m2" configurePhase (and OPTIMIZE=1)

BLAKE3

hyperfine --warmup 3 './outputs/out/bin/nix --extra-experimental-features nix-command hash file --type blake3 ~/Downloads/largefile.bin'
Benchmark 1: ./outputs/out/bin/nix --extra-experimental-features nix-command hash file --type blake3 ~/Downloads/largefile.bin
  Time (mean ± σ):      2.701 s ±  0.006 s    [User: 2.345 s, System: 0.350 s]
  Range (min … max):    2.692 s …  2.714 s    10 runs

SHA256

hyperfine --warmup 3 './outputs/out/bin/nix --extra-experimental-features nix-command hash file --type sha256 ~/Downloads/largefile.bin'
Benchmark 1: ./outputs/out/bin/nix --extra-experimental-features nix-command hash file --type sha256 ~/Downloads/largefile.bin
  Time (mean ± σ):      2.105 s ±  0.005 s    [User: 1.748 s, System: 0.352 s]
  Range (min … max):    2.097 s …  2.110 s    10 runs

SHA512

hyperfine --warmup 3 './outputs/out/bin/nix --extra-experimental-features nix-command hash file --type sha512 ~/Downloads/largefile.bin'
Benchmark 1: ./outputs/out/bin/nix --extra-experimental-features nix-command hash file --type sha512 ~/Downloads/largefile.bin
  Time (mean ± σ):      3.327 s ±  0.006 s    [User: 2.964 s, System: 0.356 s]
  Range (min … max):    3.321 s …  3.338 s    10 runs

AMD Zen 4 Ryzen 9 7950X

  • compiled with: CFLAGS="-O3 -march=znver4" configurePhase (and OPTIMIZE=1)

BLAKE3

hyperfine --warmup 3 './outputs/out/bin/nix --extra-experimental-features nix-command hash file --type blake3 ~/Downloads/largefile.bin'
Benchmark 1: ./outputs/out/bin/nix --extra-experimental-features nix-command hash file --type blake3 ~/Downloads/largefile.bin
  Time (mean ± σ):     915.0 ms ±  11.2 ms    [User: 573.2 ms, System: 339.8 ms]
  Range (min … max):   898.6 ms … 932.4 ms    10 runs

SHA256

hyperfine --warmup 3 './outputs/out/bin/nix --extra-experimental-features nix-command hash file --type sha256 ~/Downloads/largefile.bin'
Benchmark 1: ./outputs/out/bin/nix --extra-experimental-features nix-command hash file --type sha256 ~/Downloads/largefile.bin
  Time (mean ± σ):      2.344 s ±  0.016 s    [User: 1.980 s, System: 0.359 s]
  Range (min … max):    2.320 s …  2.371 s    10 runs

SHA512

hyperfine --warmup 3 './outputs/out/bin/nix --extra-experimental-features nix-command hash file --type sha512 ~/Downloads/largefile.bin'
Benchmark 1: ./outputs/out/bin/nix --extra-experimental-features nix-command hash file --type sha512 ~/Downloads/largefile.bin
  Time (mean ± σ):      4.529 s ±  0.039 s    [User: 4.156 s, System: 0.363 s]
  Range (min … max):    4.490 s …  4.593 s    10 runs
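
To make the comparison easier to read, the mean times above can be turned into ratios. This is a small sketch for that arithmetic only; the `means` struct, the `RUNS` table, and the `speedup` helper are hypothetical names, and the numbers are copied from the hyperfine output in this description.

```c
#include <assert.h>

/* Mean wall-clock times in seconds, copied from the hyperfine runs above. */
struct means {
    const char *machine;
    double blake3, sha256, sha512;
};

const struct means RUNS[2] = {
    {"Apple M3 Max",          2.701, 2.105, 3.327},
    {"AMD Ryzen 9 7950X",     0.915, 2.344, 4.529},
};

/* Ratio of baseline time to candidate time.
 * A value above 1 means the candidate is faster than the baseline. */
double speedup(double baseline_s, double candidate_s) {
    assert(candidate_s > 0.0);
    return baseline_s / candidate_s;
}
```

With these numbers, `speedup(RUNS[1].sha256, RUNS[1].blake3)` is about 2.56 and `speedup(RUNS[1].sha512, RUNS[1].blake3)` is about 4.95 on the 7950X. On the M3 Max the same calls give about 0.78 against SHA256 (BLAKE3 is slower in this run) and about 1.23 against SHA512.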

@silvanshade
Author

I closed this because I now have the Rust Rayon-based implementation working locally, which is several times faster. I'll open a new PR soon.

theoparis commented May 10, 2024

I don't want to have to rely on rust to build nix 😕 A better solution would probably be implementing multithreading in the blake3 C implementation...

@silvanshade
Author

> I don't want to have to rely on rust to build nix 😕

Do you have a technical reason why?

Rust is used in the Linux kernel, the Windows kernel and APIs, Firefox, and Chrome; it is being integrated into GCC; and it is used in many other mission-critical, established projects. And then there's Nickel. I don't think reliability is an issue at this point.

> A better solution would probably be implementing multithreading in the blake3 C implementation...

I disagree. There are good reasons why this hasn't happened.

In order for a multithreaded implementation in C++ to match the performance of the Rust Rayon-based implementation, one would need to use something like Kokkos, OpenMP, SYCL, or some similar framework.

Those frameworks often have non-trivial external dependencies and setup, and are generally less portable than Rust.

Then, aside from portability, there's the overall complexity of using those frameworks, and the additional maintenance burden of a completely separate multithreaded implementation.

Some of these issues have been addressed further in the following posts from the BLAKE3 maintainer:

Also a lot of additional context in these threads:
