
GPU computing of get_distance_matrix? #371

Open
maclariz opened this issue Aug 3, 2022 · 14 comments
Labels
enhancement New feature or request

Comments

@maclariz

maclariz commented Aug 3, 2022

I was wondering if get_distance_matrix could go faster by using a GPU, which seems to be a possibility with dask functions. What do you think?

@hakonanes
Member

Orientation.get_distance_matrix() runs in parallel with both NumPy (lazy=False) and Dask (lazy=True) on my machine; does it not do so on yours?

With my use of the method, I usually fill the available memory before becoming impatient...
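To make the discussion concrete, here is a hedged, symmetry-free sketch of the kind of computation a distance matrix like `Orientation.get_distance_matrix()` performs: for unit quaternions, the rotation angle between a pair is `2 * arccos(|q_i . q_j|)`. The function name and layout here are illustrative only; orix additionally accounts for crystal symmetry.

```python
import numpy as np

def angle_matrix(quats):
    # Pairwise rotation angles (radians) between unit quaternions.
    # Symmetry-free sketch, not orix's actual implementation.
    dots = np.abs(quats @ quats.T)   # |q_i . q_j|, shape (n, n)
    dots = np.clip(dots, 0.0, 1.0)   # guard against rounding error
    return 2.0 * np.arccos(dots)

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 4))
q /= np.linalg.norm(q, axis=1, keepdims=True)  # normalise to unit quaternions
D = angle_matrix(q)                            # symmetric, zero diagonal
```

Note the full matrix is dense and symmetric, which is where the quadratic memory cost discussed below comes from.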

@maclariz
Author

maclariz commented Aug 3, 2022 via email

@hakonanes
Member

The method reference is always a good place to see what the method can do!

Please close the issue if you're happy. If not, is there anything we should fix or improve?

@maclariz
Author

maclariz commented Aug 4, 2022

I tried lazy=False. The thing failed instantly due to memory requirements! This trick probably only works for very small datasets. I will have a think about the mathematics one day when I have time and see if there is a strategy that could be used to make this more efficient. I will close for now, but this seems an area where improvement should be possible.

@maclariz maclariz closed this as completed Aug 4, 2022
@hakonanes
Member

This trick probably only works for very small datasets

Yes, this is unfortunately true. One simple approach is to allow a reduced floating-point precision of 32-bit instead of the current 64-bit; I think seven decimals should be enough. This is not something the user can do by themselves; the code needs to change.
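A quick back-of-the-envelope check of the 32-bit suggestion, with a hypothetical orientation count:

```python
import numpy as np

n = 2000  # hypothetical number of orientations
bytes64 = n * n * np.dtype(np.float64).itemsize  # current 64-bit matrix
bytes32 = n * n * np.dtype(np.float32).itemsize  # proposed 32-bit matrix
# float32 carries roughly seven significant decimal digits and
# exactly halves the footprint of the dense n x n matrix
```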

I will have a think about the mathematics one day when I have time and see if there is a strategy that could be used to make this more efficient

That would be great!

@maclariz maclariz reopened this Aug 5, 2022
@maclariz
Author

maclariz commented Aug 5, 2022

@hakonanes One way you could speed this up is by adding a CUDA variable to the function with a Boolean input.

If False, works as it did.

If True, replace all calls on array operations from np to cp (with import cupy as cp).

So for example, cp.tensordot.

Obviously only helps if you have a GPU set up for processing, but would help some users of higher end systems.
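The suggested boolean flag could be sketched as below. This is a hedged illustration of the general np/cp dispatch pattern, not orix code; the function name, the `cuda` parameter, and the fallback behaviour are all hypothetical.

```python
import numpy as np

try:
    import cupy as cp  # optional; requires an NVIDIA GPU and CUDA
except ImportError:
    cp = None

def pairwise_dots(a, b, cuda=False):
    # Choose the array module once, write the maths against it, and
    # copy GPU results back to host memory at the end.
    xp = cp if (cuda and cp is not None) else np
    a, b = xp.asarray(a), xp.asarray(b)
    out = xp.tensordot(a, b, axes=([1], [1]))  # row-by-row dot products
    return cp.asnumpy(out) if xp is cp else out
```

With `cuda=False` (or CuPy absent) this is plain NumPy; with `cuda=True` on a CUDA machine the same arithmetic runs on the GPU, which is the whole point of the drop-in `np` to `cp` swap.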

@hakonanes
Member

I agree that supporting some computations on GPU using CuPy would be beneficial. Perhaps a good approach is to start small, supporting it only in this method, and then develop a framework over time. I don't know.

CuPy needs an NVIDIA GPU, and I have an Intel graphics card, meaning I cannot test this and would not benefit from working on this. I would be happy to review a PR, though.

If we start to support GPU computations with CuPy, it should be an optional dependency (via an orix[gpu] pip selector).

@maclariz
Author

maclariz commented Aug 5, 2022

I have an NVIDIA GPU and can test. I am running other GPU enabled functions using cupy and seeing big speedups.

Perhaps draw up a list of functions that need updating if we do this.

@maclariz
Author

maclariz commented Aug 5, 2022

So, the following functions are supported by cp and can be ported:

einsum
arccos
nan_to_num
zeros
round
outer

I presume this is then possible.
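The name parity behind this list can be checked mechanically; the snippet below verifies only the NumPy side, since CuPy needs an NVIDIA GPU, but CuPy deliberately mirrors these names in its own namespace.

```python
import numpy as np

# Functions listed above that a drop-in `np` -> `cp` swap would rely on.
ported = ["einsum", "arccos", "nan_to_num", "zeros", "round", "outer"]
missing = [name for name in ported if not hasattr(np, name)]
# an empty `missing` means every listed function exists under numpy
```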

@maclariz
Author

maclariz commented Aug 5, 2022

@hakonanes On starting small: in my limited experience with orix, this really is the only memory- and processor-hungry operation there. Everything else is really quick for me, even on a laptop. But this function takes hours at the best of times...

So, this is perhaps the one obvious point for me where parallel processing on CUDA is really worthwhile.

@hakonanes
Member

I have an NVIDIA GPU and can test

That’s good. Actually, I have to backtrack, sorry, but I cannot review such a PR by myself as I don’t have an NVIDIA GPU to test the function… Would need help from someone who does (@harripj?).

And yes, supporting GPU computation in this function only is a good place to start in my opinion.

I’m unfamiliar with GPU processing using CUDA (only know pyopencl), but I assume this will not reduce memory use in any way.

@hakonanes hakonanes changed the title Parallel computing of get_distance_matrix? GPU computing of get_distance_matrix? Aug 5, 2022
@hakonanes hakonanes added the enhancement New feature or request label Aug 5, 2022
@maclariz
Author

maclariz commented Aug 5, 2022

On memory use, if a larger map is used, as I found out the other day, the memory required could run to hundreds of GB. There is no way this works in a single chunk on most machines, so lazy processing with Dask will be necessary for most cases. You could do it by chunking into moderate-size chunks (e.g. 8-10 GB chunks for our GPU, which can take up to 12 GB), and each chunk goes much faster because it uses cupy.
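The chunking idea can be sketched in plain NumPy by computing the matrix one row-block at a time, so only a `(block, n)` slab of dot products is resident at once. This is a symmetry-free illustration of the strategy, not orix's or Dask's implementation; the function name and the `block` parameter are hypothetical.

```python
import numpy as np

def blocked_angle_matrix(quats, block=512):
    # Row-block version of the pairwise quaternion-angle computation.
    # Peak temporary memory is bounded by the chunk size, and on a GPU
    # each slab's arithmetic could run through cupy instead of numpy.
    n = quats.shape[0]
    out = np.empty((n, n), dtype=np.float32)  # 32-bit result, as above
    for i in range(0, n, block):
        dots = np.abs(quats[i:i + block] @ quats.T)
        out[i:i + block] = 2.0 * np.arccos(np.clip(dots, 0.0, 1.0))
    return out
```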

It might be useful to have a wee prep function just to find out how much memory is needed for a given chunk size, to allow the user to choose a reasonable chunk size for the later computation that fits the memory they actually have available.
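Such a prep function could be as simple as the hypothetical helper below (not an orix function): a dense n x n matrix needs n squared entries times the bytes per entry.

```python
def distance_matrix_memory_gb(n_orientations, itemsize=8):
    # Rough upper bound for holding a dense n x n distance matrix;
    # itemsize=8 for float64, 4 for float32.
    return n_orientations ** 2 * itemsize / 1e9

# e.g. 100 000 orientations at float64 imply an 80 GB matrix,
# which motivates the chunked approach above
estimate = distance_matrix_memory_gb(100_000)
```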

If you want testing, we can certainly help with our server. If needed, I could ask about a guest login from outside.

@maclariz
Author

maclariz commented Aug 5, 2022

I also had another idea. Working out literally every misorientation pair in the whole image is really overdoing the problem and probably totally unnecessary. Perhaps sampling every _n_th orientation in the dataset for comparison to the full set of pixels would reduce memory requirements and computing time by a factor of n. So, you could safely do every second pixel, and probably every third or fourth, and still get essentially the same results.
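The subsampling suggestion amounts to computing a rectangular `(n/step, n)` matrix instead of the full square one. A symmetry-free sketch, with a hypothetical function name and `step` parameter:

```python
import numpy as np

def subsampled_angle_matrix(quats, step=2):
    # Compare every `step`-th orientation against the full set,
    # cutting both memory and compute by a factor of `step`.
    sub = quats[::step]
    dots = np.abs(sub @ quats.T)
    return 2.0 * np.arccos(np.clip(dots, 0.0, 1.0))
```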

@maclariz
Author

maclariz commented Aug 5, 2022

Basically the same thing that makes SVD work: the problem is often rather oversampled and the same features turn up often in the dataset, so subsampling will still find the same features as analysing every data point.
