Skip to content
This repository has been archived by the owner on Dec 17, 2022. It is now read-only.

Vectorized, Accelerated Rescue Prime Hash Function Implementation, using OpenCL

License

Notifications You must be signed in to change notification settings

itzmeanjan/vectorized-rescue-prime

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

90 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Warning

I've stopped maintaining this project !

vectorized-rescue-prime

Vectorized, Accelerated Rescue Prime Hash Function Implementation, using OpenCL

Motivation

I've already implemented ZkSTARK-friendly Rescue Prime Hash function for 64-bit prime field 2^64 - 2^32 + 1, using SYCL/ DPC++, targeting accelerators, running in Data Parallel environments. But in that implementation Rescue Hash state is represented in terms of array of 12 field elements. So I can't make use of vector intrinsics for operating on state, while applying Rescue permutation rounds. Anyway Rescue Prime hash function itself is not parallel, so it's only useful if we've lots of same width input to operate on, we can offload computation onto accelerators. Which is why I decided to take advantage of available vector intrinsics for representing and operating on Rescue Hash state. But note, Rescue Hash function has a fixed state width of 12 and in OpenCL, I can't have a vector of length 12. So instead I'm using ulong16, which is a 16-element wide vector with each element being unsigned 64-bit integer, for representing hash state. This comes with an extra cost, as you've guessed. When I've M x N -many independent Rescue Prime Hashes to be computed, each of those M x N -many OpenCL work-items need to spend 4 * 64 bits = 256-bits for padding Rescue Hash state, so that I can use ulong16, but only first 12 elements are useful. I'm required to waste M x N x 32 -bytes of memory, when computing M x N -many independent hashes. But computation itself should be fast, because when using vector intrinsics for operating on Hash state, OpenCL device compiler should generate machine code which can make use of SIMD instructions very well.

As an application of Rescue Prime Hash function, I've implemented OpenCL accelerated Merkle Tree Construction function, where tree leaves are expected and all intermediate nodes are computed in parallel ( though somewhat involved due to complicated data dependency )

64-bit Prime field of interest.

You may want to read about Rescue Prime Hash function.

My implementation collects quite a lot of motivation from here

My other implementation of Rescue Prime Hash, in SYCL/ DPC++, non-vectorized.

Benchmark results of SYCL/ DPC++ Rescue Prime Hash function.

Prerequisite

  • You must have OpenCL development headers and ICD installed. Take a look here
  • You may also install clinfo, just to get an overview of your OpenCL installation.
sudo apt-get install clinfo
  • GCC is required for compiling host code.
$ gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
  • You'll also need to have make, clang-format.

Usage

  • Just run 👇 to compile host code; device code compilation is done at runtime.
make
  • Run executable using
./run
  • It should run some standard test cases against kernels and finally a benchmark suite is run on Rescue Prime hash_elements and merge kernel function, along with an application of rescue prime hash, where I compute intermediate nodes of merkle tree, given all leaves, using build_merkle_nodes function.

You probably want to take a look at vectorized hash_elements/ merge OpenCL kernels, if you want to use it in your project.

For accelerated Merkle Tree construction routine, check out this

You can find relevant examples here

Benchmark

For setting up benchmark in data parallel environment, I use one 2D computational grid of size M x N, and launch hash_elements, merge kernels with input of size M x N x 8. So each work-item will read 8 randomly generated 64-bit prime field elements and total of M x N-many independent hashes to be computed/ merging of hash digests to be performed. Output to be written into global memory with OpenCL vector store intrinsics. For writing output I provide with one buffer of size M x N x 4, so that each work-item can produce output of width 4-prime field elements i.e. 256-bits. In following benchmark results I only show time to compute hash/ merge by enabling profiling in command queue, host-device and device-host data transfer costs are not included, though they are performed just to ensure compiler doesn't end up optimizing too much so that benchmark suite doesn't run kernels as I expect it to be run.

I also benchmark routine where I compute all intermediate nodes of Merkle Tree, using Rescue Prime Hash function. I supply kernel with N-many random leaves of Merkle Tree, where N = power of 2. Due to somewhat complicated access patterns and data dependencies, Merkle Tree construction routine enqueues multiple kernel dispatches. In following benchmark results, only execution of all those kernel dispatches are computed ( i.e. summed together ), no device <-> host data transfer cost in considered, though performed ( explicitly ).

Intel CPU with OpenCL

make && ./run
running on Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz

kernel build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Options used by backend compiler: -cl-std=CL2.0 -w
Device build done
Kernel <test_apply_sbox> was not vectorized
Kernel <test_reduce_sum_vec2> was successfully vectorized (8)
Kernel <test_apply_mds> was not vectorized
Kernel <test_apply_inv_sbox> was not vectorized
Kernel <test_apply_rescue_permutation> was not vectorized
Kernel <hash_elements> was not vectorized
Kernel <merge> was not vectorized
Kernel <build_merkle_tree_tip_seq> was not vectorized
Done.

passed apply_sbox tests !
passed apply_inv_sbox tests !
passed apply_rescue_permutation tests !
passed apply_mds tests !
passed reduce_sum_vec2 tests !
passed merge tests !
passed build_merkle_nodes tests !

Rescue Prime Hash Benchmark

  hash_elements		  128 x   128		              345.25 ms		       47455.50 hashes/ sec
  hash_elements		  256 x   256		             1382.75 ms		       47395.51 hashes/ sec
  hash_elements		  512 x   512		             5473.14 ms		       47896.49 hashes/ sec
  hash_elements		 1024 x  1024		            22283.28 ms		       47056.63 hashes/ sec

Rescue Prime Merge Benchmark

          merge		  128 x   128		              342.02 ms		       47903.70 merges/ sec
          merge		  256 x   256		             1369.43 ms		       47856.36 merges/ sec
          merge		  512 x   512		             5509.15 ms		       47583.40 merges/ sec
          merge		 1024 x  1024		            21831.19 ms		       48031.09 merges/ sec

Rescue Prime Merkle Tree Benchmark

       merklize		   1048576 leaves		            21846.56 ms
       merklize		   2097152 leaves		            43669.84 ms
       merklize		   4194304 leaves		            87320.22 ms
       merklize		   8388608 leaves		            175640.44 ms

Nvidia Tesla V100 GPU with OpenCL

make && ./run
running on Tesla V100-SXM2-16GB

kernel build log:



passed apply_sbox tests !
passed apply_inv_sbox tests !
passed apply_rescue_permutation tests !
passed apply_mds tests !
passed reduce_sum_vec2 tests !
passed merge tests !
passed build_merkle_nodes tests !

Rescue Prime Hash Benchmark

  hash_elements		  128 x   128		                2.05 ms		     7992007.99 hashes/ sec
  hash_elements		  256 x   256		                7.62 ms		     8599838.75 hashes/ sec
  hash_elements		  512 x   512		               28.02 ms		     9357067.14 hashes/ sec
  hash_elements		 1024 x  1024		              102.22 ms		    10257848.66 hashes/ sec

Rescue Prime Merge Benchmark

          merge		  128 x   128		                1.93 ms		     8474576.27 merges/ sec
          merge		  256 x   256		                7.15 ms		     9169054.44 merges/ sec
          merge		  512 x   512		               26.40 ms		     9930948.87 merges/ sec
          merge		 1024 x  1024		              105.11 ms		     9975548.22 merges/ sec

Rescue Prime Merkle Tree Benchmark

       merklize		   1048576 leaves		              130.50 ms
       merklize		   2097152 leaves		              235.47 ms
       merklize		   4194304 leaves		              445.62 ms
       merklize		   8388608 leaves		              864.30 ms

About

Vectorized, Accelerated Rescue Prime Hash Function Implementation, using OpenCL

Topics

Resources

License

Stars

Watchers

Forks