
GPU computing of get_distance_matrix? #371

Open
maclariz opened this issue Aug 3, 2022 · 14 comments
Labels
enhancement New feature or request

Comments

@maclariz

maclariz commented Aug 3, 2022

I was wondering if get_distance_matrix could go faster by using a GPU, which seems to be a possibility with dask functions. What do you think?

@hakonanes
Member

Orientation.get_distance_matrix() runs in parallel with both NumPy (lazy=False) and Dask (lazy=True) on my machine; does it not do so on yours?

With my use of the method, I usually fill the available memory before becoming impatient...
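To make the discussion concrete, here is a hedged, symmetry-free sketch of the kind of computation a distance matrix like `Orientation.get_distance_matrix()` performs: for unit quaternions, the rotation angle between a pair is `2 * arccos(|q_i . q_j|)`. The function name and layout here are illustrative only; orix additionally accounts for crystal symmetry.

```python
import numpy as np

def angle_matrix(quats):
    # Pairwise rotation angles (radians) between unit quaternions.
    # Symmetry-free sketch, not orix's actual implementation.
    dots = np.abs(quats @ quats.T)   # |q_i . q_j|, shape (n, n)
    dots = np.clip(dots, 0.0, 1.0)   # guard against rounding error
    return 2.0 * np.arccos(dots)

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 4))
q /= np.linalg.norm(q, axis=1, keepdims=True)  # normalise to unit quaternions
D = angle_matrix(q)                            # symmetric, zero diagonal
```

Note the full matrix is dense and symmetric, which is where the quadratic memory cost discussed below comes from.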

@maclariz
Author

maclariz commented Aug 3, 2022 via email

@hakonanes
Member

The method reference is always a good place to see what the method can do!

Please close the issue if you're happy. If not, is there anything we should fix or improve?

@maclariz
Author

maclariz commented Aug 4, 2022

I tried lazy=False. The thing failed instantly due to memory requirements! This trick probably only works for very small datasets. I will have a think about the mathematics one day when I have time and see if there is a strategy that could be used to make this more efficient. I will close for now, but this seems an area where improvement should be possible.

@maclariz maclariz closed this as completed Aug 4, 2022
@hakonanes
Member

This trick probably only works for very small datasets

Yes, this is unfortunately true. One simple approach is to allow a reduced floating-point precision of 32-bit instead of the current 64-bit; I think seven decimals should be enough. This is not something the user can do by themselves; the code needs to change.
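A quick back-of-the-envelope check of the 32-bit suggestion, with a hypothetical orientation count:

```python
import numpy as np

n = 2000  # hypothetical number of orientations
bytes64 = n * n * np.dtype(np.float64).itemsize  # current 64-bit matrix
bytes32 = n * n * np.dtype(np.float32).itemsize  # proposed 32-bit matrix
# float32 carries roughly seven significant decimal digits and
# exactly halves the footprint of the dense n x n matrix
```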

I will have a think about the mathematics one day when I have time and see if there is a strategy that could be used to make this more efficient

That would be great!

@maclariz maclariz reopened this Aug 5, 2022
@maclariz
Author

maclariz commented Aug 5, 2022

@hakonanes One way you could speed this up is by adding a CUDA variable to the function with a Boolean input.

If False, works as it did.

If True, replace all calls on array operations from np to cp (with import cupy as cp).

So for example, cp.tensordot.

Obviously only helps if you have a GPU set up for processing, but would help some users of higher end systems.
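The suggested boolean flag could be sketched as below. This is a hedged illustration of the general np/cp dispatch pattern, not orix code; the function name, the `cuda` parameter, and the fallback behaviour are all hypothetical.

```python
import numpy as np

try:
    import cupy as cp  # optional; requires an NVIDIA GPU and CUDA
except ImportError:
    cp = None

def pairwise_dots(a, b, cuda=False):
    # Choose the array module once, write the maths against it, and
    # copy GPU results back to host memory at the end.
    xp = cp if (cuda and cp is not None) else np
    a, b = xp.asarray(a), xp.asarray(b)
    out = xp.tensordot(a, b, axes=([1], [1]))  # row-by-row dot products
    return cp.asnumpy(out) if xp is cp else out
```

With `cuda=False` (or CuPy absent) this is plain NumPy; with `cuda=True` on a CUDA machine the same arithmetic runs on the GPU, which is the whole point of the drop-in `np` to `cp` swap.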

@hakonanes
Member

I agree that supporting some computations on GPU using CuPy would be beneficial. Perhaps a good approach is to start small, supporting it only in this method, and then develop a framework over time. I don't know.

CuPy needs an NVIDIA GPU, and I have an Intel graphics card, meaning I cannot test this and would not benefit from working on this. I would be happy to review a PR, though.

If we start to support GPU computations with CuPy, it should be an optional dependency (via an orix[gpu] pip selector).

@maclariz
Author

maclariz commented Aug 5, 2022

I have an NVIDIA GPU and can test. I am running other GPU enabled functions using cupy and seeing big speedups.

Perhaps draw up a list of functions that need updating if we do this.

@maclariz
Author

maclariz commented Aug 5, 2022

So, the following functions are supported by cp and can be ported:

einsum
arccos
nan_to_num
zeros
round
outer

I presume this is then possible.
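The name parity behind this list can be checked mechanically; the snippet below verifies only the NumPy side, since CuPy needs an NVIDIA GPU, but CuPy deliberately mirrors these names in its own namespace.

```python
import numpy as np

# Functions listed above that a drop-in `np` -> `cp` swap would rely on.
ported = ["einsum", "arccos", "nan_to_num", "zeros", "round", "outer"]
missing = [name for name in ported if not hasattr(np, name)]
# an empty `missing` means every listed function exists under numpy
```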

@maclariz
Author

maclariz commented Aug 5, 2022

@hakonanes On starting small: in my limited experience with orix, this really is the only memory- and processor-hungry operation there. Everything else is really quick for me, even on a laptop. But this function takes hours at the best of times...

So, this is perhaps the one obvious point for me where parallel processing on CUDA is really worthwhile.

@hakonanes
Member

I have an NVIDIA GPU and can test

That’s good. Actually, I have to backtrack, sorry, but I cannot review such a PR by myself as I don’t have an NVIDIA GPU to test the function… Would need help from someone who does (@harripj?).

And yes, supporting GPU computation in this function only is a good place to start in my opinion.

I’m unfamiliar with GPU processing using CUDA (only know pyopencl), but I assume this will not reduce memory use in any way.

@hakonanes hakonanes changed the title Parallel computing of get_distance_matrix? GPU computing of get_distance_matrix? Aug 5, 2022
@hakonanes hakonanes added the enhancement New feature or request label Aug 5, 2022
@maclariz
Author

maclariz commented Aug 5, 2022

On memory use, if a larger map is used, as I found out the other day, the memory required could run to hundreds of GB. There is no way this works in a single chunk on most machines, so lazy processing with Dask will be necessary for most cases. You could do it by chunking into moderate-size chunks (e.g. 8-10 GB chunks for our GPU, which can take up to 12 GB), and each chunk goes much faster because it uses cupy.
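The chunking idea can be sketched in plain NumPy by computing the matrix one row-block at a time, so only a `(block, n)` slab of dot products is resident at once. This is a symmetry-free illustration of the strategy, not orix's or Dask's implementation; the function name and the `block` parameter are hypothetical.

```python
import numpy as np

def blocked_angle_matrix(quats, block=512):
    # Row-block version of the pairwise quaternion-angle computation.
    # Peak temporary memory is bounded by the chunk size, and on a GPU
    # each slab's arithmetic could run through cupy instead of numpy.
    n = quats.shape[0]
    out = np.empty((n, n), dtype=np.float32)  # 32-bit result, as above
    for i in range(0, n, block):
        dots = np.abs(quats[i:i + block] @ quats.T)
        out[i:i + block] = 2.0 * np.arccos(np.clip(dots, 0.0, 1.0))
    return out
```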

It might be useful to have a wee prep function just to find out how much memory is needed for a given chunk size, to allow the user to choose a reasonable chunk size for the later computation that fits the memory they actually have available.
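Such a prep function could be as simple as the hypothetical helper below (not an orix function): a dense n x n matrix needs n squared entries times the bytes per entry.

```python
def distance_matrix_memory_gb(n_orientations, itemsize=8):
    # Rough upper bound for holding a dense n x n distance matrix;
    # itemsize=8 for float64, 4 for float32.
    return n_orientations ** 2 * itemsize / 1e9

# e.g. 100 000 orientations at float64 imply an 80 GB matrix,
# which motivates the chunked approach above
estimate = distance_matrix_memory_gb(100_000)
```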

If you want testing, we can certainly help with our server. If needed, I could ask about a guest login from outside.

@maclariz
Author

maclariz commented Aug 5, 2022

I also had another idea. Working out literally every misorientation pair in the whole image is really overdoing the problem and probably totally unnecessary. Perhaps sampling every _n_th orientation in the dataset for comparison to the full set of pixels would reduce memory requirements and computing time by a factor of n. So, you could safely do every second pixel, and probably every third or fourth, and still get essentially the same results.
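The subsampling suggestion amounts to computing a rectangular `(n/step, n)` matrix instead of the full square one. A symmetry-free sketch, with a hypothetical function name and `step` parameter:

```python
import numpy as np

def subsampled_angle_matrix(quats, step=2):
    # Compare every `step`-th orientation against the full set,
    # cutting both memory and compute by a factor of `step`.
    sub = quats[::step]
    dots = np.abs(sub @ quats.T)
    return 2.0 * np.arccos(np.clip(dots, 0.0, 1.0))
```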

@maclariz
Author

maclariz commented Aug 5, 2022

Basically the same thing that makes SVD work: the problem is often rather oversampled and the same features turn up often in the dataset, so subsampling will still find the same features as analysing every data point.
