
Adding GPU Support #128

Open
varun19299 opened this issue Jun 13, 2021 · 12 comments


varun19299 commented Jun 13, 2021

Early ideas:

  1. Accept device flag for GF instances. If CUDA, use a cupy array.

  2. Use cupy.get_array_module for device agnostic code where possible.

  3. PyTorch-like .to(device): allow transferring arrays between host and device(s). Internally, this would just be a numpy/cupy asarray or Array.view(np/cp.ndarray) call.

  4. Most numpy functions in galois/field/linalg.py have corresponding cupy ones with identical syntax.

  5. Numba jit functions and ufuncs may require separate GPU implementations, especially if thread and block index need to be accessed.
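The ideas above (a device flag, get_array_module dispatch, and a PyTorch-style .to(device)) could be sketched roughly as follows. This is only a sketch; the helper names (to_device, matmul) are hypothetical and not part of galois, and the code falls back to NumPy when CuPy is unavailable:

```python
import numpy as np

try:
    import cupy as cp
except ImportError:  # CPU-only fallback when CuPy is not installed
    cp = None

def get_array_module(*arrays):
    """Return the array namespace (numpy or cupy) matching the inputs."""
    if cp is not None:
        return cp.get_array_module(*arrays)
    return np

def to_device(arr, device):
    """Hypothetical .to(device)-style transfer between host and GPU."""
    if device == "cuda":
        if cp is None:
            raise RuntimeError("CuPy is required for CUDA support")
        return cp.asarray(arr)   # host -> device copy
    if cp is not None and isinstance(arr, cp.ndarray):
        return cp.asnumpy(arr)   # device -> host copy
    return np.asarray(arr)

def matmul(a, b):
    """Device-agnostic matmul: runs wherever the inputs already live."""
    xp = get_array_module(a, b)
    return xp.matmul(a, b)
```

The same get_array_module pattern would let most of galois/field/linalg.py stay device-agnostic, with only the Numba-compiled ufuncs needing separate CUDA implementations.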

@varun19299
Author

I'll try adding a provision for 1) in the FieldArray class.

I will be testing it on Colab. Here's a notebook to get started (it installs CuPy; Colab already has NumPy, SciPy, and Numba). Please check that you have turned on a GPU runtime.


peter64 commented Sep 17, 2021

Just wanted to say I'd be interested in trying GPU support for cupy.matmul with 32-bit GF FieldArrays, to see how they compare speed-wise with regular cupy.matmul on uint32 data types. I was thinking of maybe using this library to matmul some large arrays in real time, but it's too slow as it currently stands. I expect that even with CuPy acceleration it will still be an order of magnitude slower than operating on native data types like uint32.

@mhostetter
Owner

@peter64 thanks for the feedback. I'm not too surprised that matmul isn't the fastest currently.

Some clarifying questions:

  • What kind of finite field? GF(2^m) or GF(p^m)? The latter is significantly slower, and has some performance left to squeeze out, even on CPU.
  • Can you give me example matrix dimensions (e.g., (1000,2000) x (2000,3000))? I'd like to run some speed tests too.

What is your current slowdown compared to normal integer matmul? 10x? 100x? I've seen that GF(2^8) matrix multiplication is ~10x slower than normal integer matrix multiplication, as discussed here.
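A minimal harness for this kind of speed test, using plain NumPy uint32 matmul as the integer baseline (bench is a hypothetical helper, not part of galois; swap GF arrays in place of A and B to measure the field overhead):

```python
import time
import numpy as np

def bench(fn, *args, repeats=3):
    """Best-of-N wall-clock time of fn(*args), in seconds."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

# Integer baseline at a modest size; scale the shapes up as needed.
A = np.random.randint(0, 256, (512, 512), dtype=np.uint32)
B = np.random.randint(0, 256, (512, 512), dtype=np.uint32)
baseline = bench(np.matmul, A, B)
```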


peter64 commented Sep 17, 2021

I'm using GF(p^m), I think; it's 2**8 (256). Honestly, I just need this to do binary matrix multiplication modulo 2 for an entropy extractor I'm trying to reproduce from a paper. Here's some output from the tests I just ran using GF(256), since GF(2^32) wasn't completing in any reasonable period of time and the docs said that a smaller p^m value might allow lookup tables, so I gave it a try. In reality I would prefer to use GF(2^32), I guess.

>>> import numpy as np
>>> import galois
>>> import datetime
>>> GF = galois.GF(2**8)
>>> extractor_output_size = 1024
>>> input_data = GF.Random((2048,2048));
>>> extractor = GF.Random((2048,extractor_output_size));
>>> start = datetime.datetime.now()
>>> y = np.matmul(input_data, extractor)
>>> print(datetime.datetime.now()-start)
0:06:02.313386


>>> start = datetime.datetime.now()
>>> input_data_np, extractor_np = input_data.view(np.ndarray), extractor.view(np.ndarray)
>>> y_np = np.matmul(input_data_np, extractor_np)
>>> print(datetime.datetime.now()-start)
0:00:07.198755

>>> import cupy as cp
>>> start = datetime.datetime.now()
>>> input_data_cp, extractor_cp = cp.array(input_data_np), cp.array(extractor_np)
>>> y_cp = np.matmul(input_data_cp, extractor_cp)
>>> cp.cuda.Stream.null.synchronize()
>>> print(datetime.datetime.now()-start)
0:00:01.247846

>>> import cupy as cp
>>> start = datetime.datetime.now()
>>> input_data_cp, extractor_cp = cp.array(input_data_np), cp.array(extractor_np)
>>> y_cp = np.matmul(input_data_cp, extractor_cp)
>>> cp.cuda.Stream.null.synchronize()
>>> print(datetime.datetime.now()-start)
0:00:00.017759

>>> start = datetime.datetime.now()
>>> input_data_np, extractor_np = input_data.view(np.ndarray), extractor.view(np.ndarray)
>>> y_np = np.matmul(input_data_np, extractor_np)
>>> print(datetime.datetime.now()-start)
0:00:07.692863

So GF is about 50x slower than NumPy (6 minutes vs. ~7 seconds). CuPy is about 5x faster than NumPy on its first run (a GTX 1050 vs. an i7 quad-core), but after the first warm-up run CuPy ends up roughly 400x faster than NumPy.
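For an extractor that only needs multiplication modulo 2, one possible workaround (a sketch that bypasses galois entirely; it works identically with cupy arrays) is an ordinary integer matmul followed by a mod-2 reduction. With 0/1 entries, each dot product sums at most as many ones as the inner dimension, far below the uint32 limit, so there is no overflow:

```python
import numpy as np

rng = np.random.default_rng(0)
# 0/1 matrices standing in for GF(2) data; scale shapes up as needed.
input_data = rng.integers(0, 2, size=(512, 512), dtype=np.uint32)
extractor = rng.integers(0, 2, size=(512, 256), dtype=np.uint32)

# Over GF(2), a matrix product is a plain integer matmul reduced mod 2.
y = (input_data @ extractor) % 2
```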

@mhostetter
Owner

@peter64 thanks for the example. Yes, that is slower than I would expect (which is ~10x slower than NumPy). Let me run some speed tests later today and maybe test a few potential speed ups. I'll report back.

@mhostetter
Owner

@peter64 can you confirm which version you're using?


peter64 commented Sep 17, 2021

>>> print(galois.__version__)
0.0.21

Perhaps it's because my arrays are large enough that they don't fit in the CPU cache or something...

@mhostetter
Owner

@varun19299 and @peter64, I now have a GPU to test against. I'm considering starting work on GPU support. Do you have any updated thoughts on a desired API interface regarding transfer to/from GPU, etc? If not, I'll use my best judgment. Just wondering if you have given it any thought. Thanks.


peter64 commented Oct 19, 2021

Hey @mhostetter, thanks so much for asking, but I have no thoughts regarding a desired API. I can't promise I will end up using the library in the end, but I am very curious to see how it will perform, and I will be happy to test its performance! Thanks again for writing this library and for being willing to add GPU support!

@geostergiop

Hi Matt, any news on this one? GPU support would be great for high-order calculations!

@mhostetter
Owner

No update as of yet. It's going to be a big change, and just one I haven't embarked on yet. Perhaps soon.

Just curious, @geostergiop, what are you looking to speed up? I doubt "large" finite fields (those using dtype=np.object_) will improve with GPU support, because currently I can't even JIT compile their ufuncs. Instead, they use pure-Python ufuncs that operate on Python integers (which have unlimited size).
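The object-dtype limitation can be seen with plain NumPy, independent of galois: elements of a field like GF(2^256) exceed 64 bits, so they must be stored as arbitrary-precision Python ints in an object array, and every operation falls back to per-element Python calls that neither Numba nor a GPU kernel can accelerate:

```python
import numpy as np

# Values larger than 64 bits force dtype=object (Python ints).
x = np.array([2**255 + 1, 2**200 + 3], dtype=np.object_)

# Arithmetic still works, but it dispatches to pure-Python int methods
# element by element; there is no vectorized C (or CUDA) kernel to call.
y = x * 3
```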

@geostergiop

Well, I am currently calculating about 2,000 * 19,993 * 18,431 field_traces() and the corresponding norms over 2^256 and 2^233 elements, so it takes some time, to say the least :-) I hoped to speed up the np.arange and exponentiation calculations.
