
Adding GPU Support #128

Open
varun19299 opened this issue Jun 13, 2021 · 12 comments


varun19299 commented Jun 13, 2021

Early ideas:

  1. Accept device flag for GF instances. If CUDA, use a cupy array.

  2. Use cupy.get_array_module for device agnostic code where possible.

  3. PyTorch-like .to(device): allow transferring arrays between host and device(s). Internally, this would just be a numpy/cupy asarray or Array.view(np/cp.ndarray) call.

  4. Most numpy functions in galois/field/linalg.py have corresponding cupy ones with identical syntax.

  5. Numba jit functions and ufuncs may require separate GPU implementations, especially if thread and block index need to be accessed.
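The ideas above (a device flag, get_array_module dispatch, and a PyTorch-style .to(device)) could be sketched roughly as follows. This is only a sketch; the helper names (to_device, matmul) are hypothetical and not part of galois, and the code falls back to NumPy when CuPy is unavailable:

```python
import numpy as np

try:
    import cupy as cp
except ImportError:  # CPU-only fallback when CuPy is not installed
    cp = None

def get_array_module(*arrays):
    """Return the array namespace (numpy or cupy) matching the inputs."""
    if cp is not None:
        return cp.get_array_module(*arrays)
    return np

def to_device(arr, device):
    """Hypothetical .to(device)-style transfer between host and GPU."""
    if device == "cuda":
        if cp is None:
            raise RuntimeError("CuPy is required for CUDA support")
        return cp.asarray(arr)   # host -> device copy
    if cp is not None and isinstance(arr, cp.ndarray):
        return cp.asnumpy(arr)   # device -> host copy
    return np.asarray(arr)

def matmul(a, b):
    """Device-agnostic matmul: runs wherever the inputs already live."""
    xp = get_array_module(a, b)
    return xp.matmul(a, b)
```

The same get_array_module pattern would let most of galois/field/linalg.py stay device-agnostic, with only the Numba-compiled ufuncs needing separate CUDA implementations.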

@varun19299
Author

I'll try adding a provision for 1) in the FieldArray class.

I will be testing it on Colab. Here's a notebook to get started (it installs CuPy; Colab already has NumPy, SciPy, and Numba). Please check that you have turned on a GPU runtime.


peter64 commented Sep 17, 2021

Just wanted to say I'd be interested in trying GPU support for cupy.matmul with 32-bit GF FieldArrays, to see how they compare speed-wise with regular cupy.matmul on uint32 data types. I was thinking of maybe using this library to matmul some large arrays in real time, but it's too slow as it currently stands. I expect that even with CuPy acceleration it will still be an order of magnitude slower than operating on native data types like uint32.

@mhostetter
Owner

@peter64 thanks for the feedback. I'm not too surprised that matmul isn't the fastest currently.

Some clarifying questions:

  • What kind of finite field? GF(2^m) or GF(p^m)? The latter is significantly slower, and has some performance left to squeeze out, even on CPU.
  • Can you give me example matrix dimensions (e.g., (1000,2000) x (2000,3000))? I'd like to run some speed tests too.

What is your current slowdown compared to normal integer matmul? 10x? 100x? I've seen that GF(2^8) matrix multiplication is ~10x slower than normal integer matrix multiplication, as discussed here.
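A minimal harness for this kind of speed test, using plain NumPy uint32 matmul as the integer baseline (bench is a hypothetical helper, not part of galois; swap GF arrays in place of A and B to measure the field overhead):

```python
import time
import numpy as np

def bench(fn, *args, repeats=3):
    """Best-of-N wall-clock time of fn(*args), in seconds."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

# Integer baseline at a modest size; scale the shapes up as needed.
A = np.random.randint(0, 256, (512, 512), dtype=np.uint32)
B = np.random.randint(0, 256, (512, 512), dtype=np.uint32)
baseline = bench(np.matmul, A, B)
```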


peter64 commented Sep 17, 2021

I'm using GF(p^m), I think; it's 2**8 (256). Honestly, I just need this to do binary matrix multiplication modulo 2 for an entropy extractor I'm trying to reproduce from a paper. Here's some output from the tests I just ran using GF(256), since GF(2^32) wasn't completing in any reasonable period of time and the docs said that a smaller p^m value might allow lookup tables, so I gave it a try. In reality I would prefer to use GF(2^32), I guess.

>>> import numpy as np
>>> import galois
>>> import datetime
>>> GF = galois.GF(2**8)
>>> extractor_output_size = 1024
>>> input_data = GF.Random((2048,2048));
>>> extractor = GF.Random((2048,extractor_output_size));
>>> start = datetime.datetime.now()
>>> y = np.matmul(input_data, extractor)
>>> print(datetime.datetime.now()-start)
0:06:02.313386


>>> start = datetime.datetime.now()
>>> input_data_np, extractor_np = input_data.view(np.ndarray), extractor.view(np.ndarray)
>>> y_np = np.matmul(input_data_np, extractor_np)
>>> print(datetime.datetime.now()-start)
0:00:07.198755

>>> import cupy as cp
>>> start = datetime.datetime.now()
>>> input_data_cp, extractor_cp = cp.array(input_data_np), cp.array(extractor_np)
>>> y_cp = np.matmul(input_data_cp, extractor_cp)
>>> cp.cuda.Stream.null.synchronize()
>>> print(datetime.datetime.now()-start)
0:00:01.247846

>>> import cupy as cp
>>> start = datetime.datetime.now()
>>> input_data_cp, extractor_cp = cp.array(input_data_np), cp.array(extractor_np)
>>> y_cp = np.matmul(input_data_cp, extractor_cp)
>>> cp.cuda.Stream.null.synchronize()
>>> print(datetime.datetime.now()-start)
0:00:00.017759

>>> start = datetime.datetime.now()
>>> input_data_np, extractor_np = input_data.view(np.ndarray), extractor.view(np.ndarray)
>>> y_np = np.matmul(input_data_np, extractor_np)
>>> print(datetime.datetime.now()-start)
0:00:07.692863

So GF is about 50x slower than NumPy (6 minutes vs. ~7 seconds). CuPy is about 5x faster than NumPy on its first run (a GTX 1050 vs. an i7 quad-core), but after the first warm-up run CuPy ends up roughly 400x faster than NumPy.
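For an extractor that only needs multiplication modulo 2, one possible workaround (a sketch that bypasses galois entirely; it works identically with cupy arrays) is an ordinary integer matmul followed by a mod-2 reduction. With 0/1 entries, each dot product sums at most as many ones as the inner dimension, far below the uint32 limit, so there is no overflow:

```python
import numpy as np

rng = np.random.default_rng(0)
# 0/1 matrices standing in for GF(2) data; scale shapes up as needed.
input_data = rng.integers(0, 2, size=(512, 512), dtype=np.uint32)
extractor = rng.integers(0, 2, size=(512, 256), dtype=np.uint32)

# Over GF(2), a matrix product is a plain integer matmul reduced mod 2.
y = (input_data @ extractor) % 2
```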

@mhostetter
Owner

@peter64 thanks for the example. Yes, that is slower than I would expect (which is ~10x slower than NumPy). Let me run some speed tests later today and maybe test a few potential speed ups. I'll report back.

@mhostetter
Owner

@peter64 can you confirm which version you're using?


peter64 commented Sep 17, 2021

>>> print(galois.__version__)
0.0.21

Perhaps it's because my arrays are large enough that they don't fit in the CPU cache or something...

@mhostetter
Owner

@varun19299 and @peter64, I now have a GPU to test against. I'm considering starting work on GPU support. Do you have any updated thoughts on a desired API interface regarding transfer to/from GPU, etc? If not, I'll use my best judgment. Just wondering if you have given it any thought. Thanks.


peter64 commented Oct 19, 2021

Hey @mhostetter, thanks so much for asking, but I have no thoughts regarding a desired API. I can't promise I will end up using the library in the end, but I am very curious to see how it will perform, and I will be happy to test its performance! Thanks again for writing this library and for being willing to add GPU support!

@geostergiop

Hi Matt, any news on this one? GPU support would be great for high-order calculations!

@mhostetter
Owner

No update as of yet. It's going to be a big change, and just one I haven't embarked on yet. Perhaps soon.

Just curious, @geostergiop, what are you looking to speed up? I doubt "large" finite fields (those using dtype=np.object_) will improve with GPU support, because currently I can't even JIT compile their ufuncs. Instead, they use pure-Python ufuncs that operate on Python integers (which have unlimited size).
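The object-dtype limitation can be seen with plain NumPy, independent of galois: elements of a field like GF(2^256) exceed 64 bits, so they must be stored as arbitrary-precision Python ints in an object array, and every operation falls back to per-element Python calls that neither Numba nor a GPU kernel can accelerate:

```python
import numpy as np

# Values larger than 64 bits force dtype=object (Python ints).
x = np.array([2**255 + 1, 2**200 + 3], dtype=np.object_)

# Arithmetic still works, but it dispatches to pure-Python int methods
# element by element; there is no vectorized C (or CUDA) kernel to call.
y = x * 3
```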

@geostergiop

Well, I am currently calculating about 2,000 * 19,993 * 18,431 field_traces() and the corresponding norms over 2^256 and 2^233 elements, so it takes some time, to say the least :-) I hoped to speed up the np.arange and exponentiation calculations.
