
GPU implementation for multiexp and fft #35

Open

chenhuan14 wants to merge 327 commits into master
Conversation

chenhuan14

I have implemented GPU acceleration for multiexp and FFT, motivated by the Filecoin implementation, which can greatly improve the efficiency of the prover. I used this implementation to accelerate the recursive SNARKs of PLONK: the native implementation of recursive_aggregation_circuit needs nearly 6 hours to generate a merge proof for 2 proofs, while with our GPU implementation it takes only 10 minutes.
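A minimal sketch of the shape such an integration typically takes (all names here are illustrative stand-ins rather than this PR's actual API, with `u64` replacing the curve arithmetic so the snippet compiles on its own): try the GPU path when a `gpu` feature is on, and keep the existing CPU path as a fallback.

```rust
/// Stand-in for the existing CPU multiexp kernel.
fn cpu_multiexp(bases: &[u64], scalars: &[u64]) -> u64 {
    bases.iter().zip(scalars).map(|(b, s)| b.wrapping_mul(*s)).sum()
}

/// Stand-in for the new GPU kernel; errors (no device, out of memory, ...)
/// bubble up so the caller can fall back.
#[cfg(feature = "gpu")]
fn gpu_multiexp(bases: &[u64], scalars: &[u64]) -> Result<u64, String> {
    Ok(cpu_multiexp(bases, scalars)) // the real GPU launch would go here
}

/// Use the GPU when available, but keep working on machines without one,
/// so a build with `--features gpu` degrades gracefully.
pub fn multiexp(bases: &[u64], scalars: &[u64]) -> u64 {
    #[cfg(feature = "gpu")]
    {
        if let Ok(acc) = gpu_multiexp(bases, scalars) {
            return acc;
        }
    }
    cpu_multiexp(bases, scalars)
}
```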

@shamatar
Member

Hey @chenhuan14

Before I even start to review it, can you tell me how you got 6 hours of proving time for merging two proofs? If you mean recursive aggregation, it should be around 4m gates to aggregate two proofs, which is provable in minutes on a laptop with 6 physical cores. I hope you didn't run the prover in a debug build? That can easily result in 6 hours.

@chenhuan14
Author

Hey @chenhuan14

Before I even start to review it, can you tell me how you got 6 hours of proving time for merging two proofs? If you mean recursive aggregation, it should be around 4m gates to aggregate two proofs, which is provable in minutes on a laptop with 6 physical cores. I hope you didn't run the prover in a debug build? That can easily result in 6 hours.

Thanks for your reply. I'm a beginner with the Rust language, and I ran the prover in debug mode. In release mode, the GPU acceleration only gains about 70% compared to the CPU implementation.

@shamatar
Member

I actually made the same mistake myself in the beginning. What you can expect is well below 30m proving time for any PLONK circuit over BN254 (the Ethereum curve) if you use a machine with 16 physical cores.

@shamatar
Member

I'm not a GPU specialist, but I have started to review the FFT part and will have some comments. Multiexp is even more challenging, so most likely I'll get to it sometime during the holidays.

@shamatar
Member

Ok, here are my comments so far:

  • Can you make a PR against the "dev" branch? That is the one actively developed, and it will replace "master" completely.
  • Can you also mark all the new dependencies as "optional" under a "gpu" feature?
  • I do not have a PC with a GPU available for testing, so can you also make a few scripts that allow running it with 1-2 command line instructions in the cloud?
  • Ideally, can you make benchmarks of the isolated operations for CPU vs GPU?
  • Can you also make a multiexp that is friendly to a universal trusted setup: one that loads the points only once and can then use different scalars? It would require some stateful proxy (see the sketch after this list).
  • Separate the G1 and G2 operations (I'm not sure whether creating separate buffers for them matters, but I would like to not use more memory than required if we are only interested in G1).
  • Right now the GPU multiexp is a sync function. Can you make it a WorkerFuture, so one can simultaneously use the CPUs for other work without thread::spawn?
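On the stateful proxy for a universal setup, a rough sketch of the idea (the names are illustrative, not this PR's API, and a plain dot product over `u64` stands in for the curve arithmetic so the snippet compiles on its own): the bases are loaded once and then reused across many scalar vectors, which is exactly the access pattern a universal trusted setup allows.

```rust
use std::sync::Arc;

/// Stateful proxy: the bases are loaded once (in a real GPU kernel this is
/// the single host-to-device copy of the points) and then reused across
/// many multiexp calls with different scalars.
pub struct MultiexpKernel {
    bases: Arc<Vec<u64>>, // stand-in for G1 points resident in device memory
}

impl MultiexpKernel {
    pub fn new(bases: Arc<Vec<u64>>) -> Self {
        // A real kernel would perform its one-time upload here.
        MultiexpKernel { bases }
    }

    /// One multiexp against the cached bases; per call, only the scalars
    /// would need to cross to the device.
    pub fn multiexp(&self, scalars: &[u64]) -> u64 {
        assert!(scalars.len() <= self.bases.len());
        self.bases.iter().zip(scalars).map(|(b, s)| b.wrapping_mul(*s)).sum()
    }
}

fn main() {
    let kernel = MultiexpKernel::new(Arc::new(vec![1, 2, 3, 4]));
    assert_eq!(kernel.multiexp(&[1, 1, 1, 1]), 10); // 1 + 2 + 3 + 4
    assert_eq!(kernel.multiexp(&[0, 0, 0, 2]), 8);  // same bases, new scalars
}
```

The same handle would also be a natural home for the other asks: separate G1 and G2 kernels each own only the buffers they need, and `multiexp` can return a future instead of blocking so the CPUs stay busy in the meantime.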

@shamatar
Member

Kind of a separate concern: you use lockfiles to ensure exclusive access to the device. I'm not sure about the implementation, so what would happen if:

  • two binaries are started from different folders?
  • there are >1 GPUs in the server? Can one binary use one of them and another binary use another?
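Both failure modes can be avoided by keying the lock on the device rather than on the working directory. A minimal sketch under that assumption (pure std, with atomic create-new as the lock primitive; the path and file naming are made up for illustration):

```rust
use std::fs::{self, OpenOptions};
use std::path::PathBuf;

/// Claims one GPU by exclusively creating a lock file in the system-wide
/// temp directory, so binaries started from different folders still see
/// each other's locks. The lock is released on drop.
struct GpuLock {
    path: PathBuf,
    device_index: usize,
}

impl GpuLock {
    /// Try device indices 0..num_devices and claim the first free one,
    /// letting several binaries each grab a different GPU.
    fn acquire(num_devices: usize) -> Option<GpuLock> {
        for i in 0..num_devices {
            let path = std::env::temp_dir().join(format!("bellman.gpu.{}.lock", i));
            // `create_new` fails if the file already exists, which gives
            // an atomic "test and claim" with no extra dependencies.
            match OpenOptions::new().write(true).create_new(true).open(&path) {
                Ok(_) => return Some(GpuLock { path, device_index: i }),
                Err(_) => continue, // device i is taken, try the next one
            }
        }
        None // every device is currently claimed
    }
}

impl Drop for GpuLock {
    fn drop(&mut self) {
        let _ = fs::remove_file(&self.path); // free the device again
    }
}

fn main() {
    match GpuLock::acquire(2) {
        Some(lock) => println!("claimed GPU {}", lock.device_index),
        None => println!("no free GPU, falling back to CPU"),
    }
}
```

One caveat with plain lock files: they go stale if the process is killed. OS advisory locks (for example the fs2 crate's try_lock_exclusive) are released by the kernel automatically and would be more robust.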

@chenhuan14
Author

Thanks for your good advice. I will try to optimize this work in the near future.
