
GPU implementation for multiexp and fft #35

Open

chenhuan14 wants to merge 327 commits into master
Conversation

chenhuan14

I have implemented GPU acceleration for multiexp and FFT, motivated by the Filecoin implementation, which can greatly improve the efficiency of the prover. I used this implementation to accelerate the recursive SNARKs of PLONK: the native implementation of recursive_aggregation_circuit needs nearly 6 hours to generate a merge proof for 2 proofs, while with our GPU implementation it takes only 10 minutes.
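A minimal sketch of the shape such an integration typically takes (all names here are illustrative stand-ins rather than this PR's actual API, with `u64` replacing the curve arithmetic so the snippet compiles on its own): try the GPU path when a `gpu` feature is on, and keep the existing CPU path as a fallback.

```rust
/// Stand-in for the existing CPU multiexp kernel.
fn cpu_multiexp(bases: &[u64], scalars: &[u64]) -> u64 {
    bases.iter().zip(scalars).map(|(b, s)| b.wrapping_mul(*s)).sum()
}

/// Stand-in for the new GPU kernel; errors (no device, out of memory, ...)
/// bubble up so the caller can fall back.
#[cfg(feature = "gpu")]
fn gpu_multiexp(bases: &[u64], scalars: &[u64]) -> Result<u64, String> {
    Ok(cpu_multiexp(bases, scalars)) // the real GPU launch would go here
}

/// Use the GPU when available, but keep working on machines without one,
/// so a build with `--features gpu` degrades gracefully.
pub fn multiexp(bases: &[u64], scalars: &[u64]) -> u64 {
    #[cfg(feature = "gpu")]
    {
        if let Ok(acc) = gpu_multiexp(bases, scalars) {
            return acc;
        }
    }
    cpu_multiexp(bases, scalars)
}
```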

@shamatar
Member

Hey @chenhuan14

Before I even start to review it, can you tell me how you got 6 hours of proving time for merging two proofs? If you mean recursive aggregation, it should be around 4m gates to aggregate two proofs, which is provable in minutes on a laptop with 6 physical cores. I hope you didn't run the prover in a debug build? That can easily result in 6 hours.

@chenhuan14
Author

Hey @chenhuan14

Before I even start to review it, can you tell me how you got 6 hours of proving time for merging two proofs? If you mean recursive aggregation, it should be around 4m gates to aggregate two proofs, which is provable in minutes on a laptop with 6 physical cores. I hope you didn't run the prover in a debug build? That can easily result in 6 hours.

Thanks for your reply. I'm a beginner with the Rust language, and I ran the prover in debug mode. In release mode, the GPU acceleration only gains about 70% compared to the CPU implementation.

@shamatar
Member

I actually made the same mistake myself in the beginning. What you can expect is well below 30m proving time for any PLONK circuit over BN254 (the Ethereum curve) if you use a machine with 16 physical cores.

@shamatar
Member

I'm not a GPU specialist, but I have started to review the FFT part and will have some comments. Multiexp is even more challenging, so most likely I'll get to it sometime during the holidays.

@shamatar
Member

Ok, here are my comments so far:

  • Can you make a PR against the "dev" branch? That is the one actively developed, and it will replace "master" completely.
  • Can you also mark all the new dependencies as "optional" under a "gpu" feature?
  • I do not have a PC with a GPU available for testing, so can you also make a few scripts that allow running it with 1-2 command line instructions in the cloud?
  • Ideally, can you make benchmarks of the isolated operations for CPU vs GPU?
  • Can you also make a multiexp that is friendly to a universal trusted setup: one that loads the points only once and can then use different scalars? It would require some stateful proxy (see the sketch after this list).
  • Separate the G1 and G2 operations (I'm not sure whether creating separate buffers for them matters, but I would like to not use more memory than required if we are only interested in G1).
  • Right now the GPU multiexp is a sync function. Can you make it a WorkerFuture, so one can simultaneously use the CPUs for other work without thread::spawn?
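On the stateful proxy for a universal setup, a rough sketch of the idea (the names are illustrative, not this PR's API, and a plain dot product over `u64` stands in for the curve arithmetic so the snippet compiles on its own): the bases are loaded once and then reused across many scalar vectors, which is exactly the access pattern a universal trusted setup allows.

```rust
use std::sync::Arc;

/// Stateful proxy: the bases are loaded once (in a real GPU kernel this is
/// the single host-to-device copy of the points) and then reused across
/// many multiexp calls with different scalars.
pub struct MultiexpKernel {
    bases: Arc<Vec<u64>>, // stand-in for G1 points resident in device memory
}

impl MultiexpKernel {
    pub fn new(bases: Arc<Vec<u64>>) -> Self {
        // A real kernel would perform its one-time upload here.
        MultiexpKernel { bases }
    }

    /// One multiexp against the cached bases; per call, only the scalars
    /// would need to cross to the device.
    pub fn multiexp(&self, scalars: &[u64]) -> u64 {
        assert!(scalars.len() <= self.bases.len());
        self.bases.iter().zip(scalars).map(|(b, s)| b.wrapping_mul(*s)).sum()
    }
}

fn main() {
    let kernel = MultiexpKernel::new(Arc::new(vec![1, 2, 3, 4]));
    assert_eq!(kernel.multiexp(&[1, 1, 1, 1]), 10); // 1 + 2 + 3 + 4
    assert_eq!(kernel.multiexp(&[0, 0, 0, 2]), 8);  // same bases, new scalars
}
```

The same handle would also be a natural home for the other asks: separate G1 and G2 kernels each own only the buffers they need, and `multiexp` can return a future instead of blocking so the CPUs stay busy in the meantime.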

@shamatar
Member

Kind of a separate concern: you use lockfiles to ensure exclusive access to the device. I'm not sure about the implementation, so what would happen if:

  • two binaries are started from different folders?
  • there are >1 GPUs in the server? Can one binary use one of them and another binary use another?
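Both failure modes can be avoided by keying the lock on the device rather than on the working directory. A minimal sketch under that assumption (pure std, with atomic create-new as the lock primitive; the path and file naming are made up for illustration):

```rust
use std::fs::{self, OpenOptions};
use std::path::PathBuf;

/// Claims one GPU by exclusively creating a lock file in the system-wide
/// temp directory, so binaries started from different folders still see
/// each other's locks. The lock is released on drop.
struct GpuLock {
    path: PathBuf,
    device_index: usize,
}

impl GpuLock {
    /// Try device indices 0..num_devices and claim the first free one,
    /// letting several binaries each grab a different GPU.
    fn acquire(num_devices: usize) -> Option<GpuLock> {
        for i in 0..num_devices {
            let path = std::env::temp_dir().join(format!("bellman.gpu.{}.lock", i));
            // `create_new` fails if the file already exists, which gives
            // an atomic "test and claim" with no extra dependencies.
            match OpenOptions::new().write(true).create_new(true).open(&path) {
                Ok(_) => return Some(GpuLock { path, device_index: i }),
                Err(_) => continue, // device i is taken, try the next one
            }
        }
        None // every device is currently claimed
    }
}

impl Drop for GpuLock {
    fn drop(&mut self) {
        let _ = fs::remove_file(&self.path); // free the device again
    }
}

fn main() {
    match GpuLock::acquire(2) {
        Some(lock) => println!("claimed GPU {}", lock.device_index),
        None => println!("no free GPU, falling back to CPU"),
    }
}
```

One caveat with plain lock files: they go stale if the process is killed. OS advisory locks (for example the fs2 crate's try_lock_exclusive) are released by the kernel automatically and would be more robust.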

@chenhuan14
Author

Thanks for your good advice. I will try to optimize this work in the near future.
