Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge celo/zexe w/ arkworks #127

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

jon-chuang
Copy link
Contributor

@jon-chuang jon-chuang commented Dec 11, 2020

Features:

  • Batch Affine ops
  • Batch and ordinary w-NAF
  • GLV framework. GLV precomputation script.
  • GLV impl for both batch affine and projective mul
  • Batch bucket addition tree
  • Batched MSM (1.4-1.7x bump)
  • Batch subgroup verification based on addition to random buckets
  • 12-limb .s asm impl for montgomery mul and constant-time add/sub
    2D test matrix feature-gating with axes (curve, test type)
    Refactor tests to use common template
    Code timing instrumentation with fine grained control of functions and callers

Combined speedup (with w-NAF) for:

  • Batched mul (with GLV): 2.3-2.7x.
  • Non-batch projective mul (with GLV): 2.1x.
  • Without GLV: 1.65-1.8x and 1.4x resp.

Speedup for subgroup verification:

  • Over old impl 25-80x.
  • Over new batch mul: 10-30x.

Spec sheet (also included in code comments): https://hackmd.io/@celo/S10UeWUBD

I am not willing to break this PR into smaller PRs as that was already done in Zexe, and now I have had to repeat work to merge again to arkworks, so it's too much work, no thank you.

TODOs:

  • Add glv generation script.
  • Sister PR for curves which implement glv, cuda, test feature gating, tests for new features etc.
  • Feature gate for BigInt768 asm (.s rather than ff-asm).

Before we can merge this PR, please make sure that all the following items have been
checked off. If any of the checklist items are not applicable, please leave them but
write a little note why.

  • Targeted PR against correct branch (main)
  • Linked to Github issue with discussion and accepted design OR have an explanation in the PR that describes this work.
  • Wrote unit tests
  • Updated relevant documentation in the code
  • Added a relevant changelog entry to the Pending section in CHANGELOG.md
  • Re-reviewed Files changed in the Github PR explorer

// Checking the modulus at compile time
let first_bit_set = P::MODULUS.0[$limbs - 1] >> 63 != 0;
let mut all_bits_set = P::MODULUS.0[$limbs - 1] == !0 - (1 << 63);
for i in 1..$limbs {
all_bits_set &= P::MODULUS.0[$limbs - i - 1] == core::u64::MAX;
all_bits_set &= P::MODULUS.0[$limbs - i - 1] == !0u64;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
all_bits_set &= P::MODULUS.0[$limbs - i - 1] == !0u64;
all_bits_set &= P::MODULUS.0[$limbs - i - 1] == core::u64::MAX;

@Pratyush
Copy link
Member

Thanks for the PR and adapting it to arkworks!

Quick question: is the GPU scalar multiplication used to implement any operations (fixed/variable base MSMs/batch subgroup checks/batch muls)? I couldn't figure it out at a quick glance.

Also, I'd really appreciate it if you could you could point out the parts that need the closest review. (Since this is a large PR, if you could suggest a "review plan" so that I can figure out how to navigate the PR, that would also be great.)

@ValarDragon
Copy link
Member

ValarDragon commented Dec 11, 2020

Thanks for working on adapting these great changes to arkworks and upstreaming them!

I'd like to ideally split this PR into many smaller PR's for easier review, would you like me to take a stab at doing this, or would you like to?

To start with, I'd like to move the ark-serialize changes into its own PR, since that should be modular
Similary, a separate PR for the bw6 ASM

@ValarDragon
Copy link
Member

Ah sorry, didn't see the comment regarding not being willing to split up this PR. Is it fine if Pratyush and I go through and split up this into smaller PRs, and review/merge the smaller ones accordingly?

@@ -0,0 +1,181 @@
// void modadd768(const uint64_t x[12], const uint64_t y[12], const uint64_t m[13], uint64_t z[12])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there any automated process for how you got this assembly, or was it hand-written?

If there was an automated process, which was then edited for optimizations, could you share it please?

If it was hand-written, would it be possible to add some comments for how you derived whats going on?
(And also what features its doing, that leads it to be faster than native in expectation)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that there was automatic process, and we eventually want to convert all the asm code to using .S, I think, as personally I really dislike inline ASM and also its the wrong usecase, since inline asm is meant for very short segments of code that one does indeed intend to be "inlined". Further, this would, at least notionally, remove the use of unsafe for asm, although, extern "C" technically is also unsafe, is it not?

@kobigurk should know more details on the automated process done by dag arne.

@ValarDragon ValarDragon mentioned this pull request Dec 13, 2020
6 tasks
@ValarDragon ValarDragon marked this pull request as draft December 16, 2020 16:19
@jon-chuang
Copy link
Contributor Author

Hi thanks all for the comments. I apologise that I have been feeling rather fatigued wrt to the PR due to the tediousness and also its size.

With regards to areas about which I have some doubts,

  • the CUDA code could be improved, maybe some thinking about how we would like to organise the code/expose interfaces/conditionally compile CUDA support. Currently, there is a lot of intermixing of locations and dependencies, and no clean separation of duties. One thing I think could make the difference is to have a CUDAKernel trait that is only a bound for relevant parameters etc when conditionally compiled with feature = "cuda". This is very satisfactory to me

To be continued.

@jon-chuang
Copy link
Contributor Author

jon-chuang commented Jan 21, 2021

Thanks for the PR and adapting it to arkworks!

Quick question: is the GPU scalar multiplication used to implement any operations (fixed/variable base MSMs/batch subgroup checks/batch muls)? I couldn't figure it out at a quick glance.

So, this is specifically for batch multiplication. One thing I would like to improve is to create a more general framework for incorporating GPU kernels of all sorts, (including variablebaseMSM etc), which require instantiation at the parameter level due to needing to compile for specific curves without generics.

In the future, perhaps something so ugly will not have to be relied upon, e.g. if something like EmbarkStudios/rust-gpu#387 comes to be.

Also, I'd really appreciate it if you could you could point out the parts that need the closest review. (Since this is a large PR, if you could suggest a "review plan" so that I can figure out how to navigate the PR, that would also be great.)

I will do this.

@jon-chuang jon-chuang force-pushed the jonch/celo-merge branch 3 times, most recently from a613ac8 to fe7faef Compare March 15, 2021 13:08
@jon-chuang
Copy link
Contributor Author

Currently I'm restarting this PR and will break into 4 components:

  1. batch inversion group arithmetic
  2. glv framework and scripts
  3. things using batch inversion-based methods, like batch verification and batch MSM
  4. cuda batch scalar mul

Number 4 is low priority as the code is spaghetti and also the API is very uncertain.

This PR is considered number 4, and I will stage the PRs so they build atop one another. Since this introduces breaking changes with curves, the corresponding curves changes will also be made.

I've decided I will complete this PR, as sickening as it is. I've already wasted enough time on porting this to arkworks that I decided I should waste even more time to at least get it done.

I'm very sickened that instead of having to write code and get things done, I have to waste time doing stupid refactoring and changing things everywhere. What a useless waste of time.

@jon-chuang
Copy link
Contributor Author

I'm thinking of migrating the GPU segments of the code towards rust-gpu, partly because

  1. I don't feel this segment is particularly urgent
  2. The workarounds to use accel and also the modifications requiring maintaining a segment of it are not worthwhile.
  3. accel = spaghetti code
  4. rust-gpu is obviously the future ;P

So there you have it. The timeline for rust-gpu to become mature enough coincides with my imagined timeline for the development of a robust gpu component to arkworks.

The major targets are compute intensive workloads like proving (MSM), batch scalar muls (currently implemented), batch verifications (on curve checks, signatures). Currently, such functionality centers around small number of functions.

Other potential things are FFTs.

Something else I would like to look into is how to combine something like rayon with rust-gpu.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants