
Communication with Cupy #100

Open
BigBSB opened this issue Dec 9, 2021 · 14 comments

Comments

@BigBSB

BigBSB commented Dec 9, 2021

Is it possible for arrays already stored on the GPU as cupy.ndarray objects to be used in the fitting routines? This is using the Python wheel.

@jkfindeisen
Collaborator

The location of the input/output data can only be specified in the C++ interface, not in the C interface, so I guess that will not work. It could work, though, with some changes to the code.

@BigBSB
Author

BigBSB commented Dec 9, 2021

Cupy has an option to run custom CUDA kernels; would this feature be useful to get it working?
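For context, CuPy's custom-kernel feature looks roughly like this; a minimal sketch using cupy.RawKernel, where the kernel itself is only an illustrative example and not a Gpufit model function:

```python
import cupy as cp

# Illustrative element-wise kernel compiled at runtime by CuPy.
scale_kernel = cp.RawKernel(r'''
extern "C" __global__
void scale(const float* x, float* y, const float a, const int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i];
    }
}
''', 'scale')

n = 1 << 20
x = cp.arange(n, dtype=cp.float32)
y = cp.empty_like(x)

threads = 256
blocks = (n + threads - 1) // threads
# Launch: grid size, block size, then the argument tuple.
scale_kernel((blocks,), (threads,), (x, y, cp.float32(2.0), cp.int32(n)))
```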

@SBresler

SBresler commented Jul 13, 2022

Hey I ended up figuring this out. Is there a way to update the master branch?

Basically this lets you do all of your data processing on the GPU, and then immediately push it to GPUfit to do the fits without having to transfer the data back to the CPU.

The next thing I want to figure out is how to JIT-compile fit models, so that you can construct a fit model based on various parameters and then have it run through GPUfit. I don't know if this is possible.

@superchromix
Collaborator

Sounds great. The best way to do this is to fork the repository, include your changes, and submit a pull request.

@SBresler

SBresler commented Jul 18, 2022

OK, well, the cupy interfacing works.

I actually don't think I care about the JIT stuff; it looks like y'all tried that and there were speed hits.

How about fits that involve complex numbers? Would that be a difficult addition?

@casparvitch
Contributor

@SBresler I don't see any changes to your fork. Can you share with us how you implemented the cupy interfacing? I (and I imagine others) would find this very interesting/useful! Cheers.

@SBresler

SBresler commented Oct 1, 2022 via email

@SBresler

SBresler commented Dec 2, 2022

Alright, I have come back to this.

What I did was cast a pointer to the cupy object in the python interface.

This allowed me to put in a cupy object as an argument to the fit function call.
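For anyone trying the same thing, the cast is essentially this (a minimal sketch; only the pointer step is shown, not the surrounding fit call):

```python
import ctypes
import cupy as cp

# Data laid out as n_fits x n_points, already resident on the GPU.
data = cp.zeros((10000, 25), dtype=cp.float32)

# data.data.ptr is the raw device address of the CuPy allocation;
# wrapping it in c_void_p makes it usable as a C pointer argument.
device_ptr = ctypes.c_void_p(data.data.ptr)
```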

Looking at the traces through NVTX, there is still a lot of copying and other work happening during that block.

I was looking at jaxFit, which I could more realistically modify since I have a lot more Python knowledge than C++ at this point, but to me that program is more focused on extremely large, complex fits, whereas gpufit is all about doing a ton of small fits at once.

This might be personal bias because it's exactly my use case, but my feeling is that if you have a ton of small datasets like this that you want to fit, the most obvious improvement for gpufit at the moment is to allow access to data that is already in global memory on the device.

At the moment I am doing ~3 GB/s transfers to the GPU for FFTs and then some reduction operations, and it's working relatively well, but the bottlenecks are always the transfer times.
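For what it's worth, keeping that kind of preprocessing on the device with CuPy looks something like this (a sketch of the general pattern, not my actual pipeline; the shapes and the factor-of-4 reduction are made up for illustration):

```python
import cupy as cp

# Stand-in for a batch of digitizer records already on the GPU.
raw = cp.random.rand(4096, 1024).astype(cp.float32)

# Batched FFT, one transform per record (rfft of 1024 points -> 513 bins).
spectra = cp.abs(cp.fft.rfft(raw, axis=1))

# Example reduction: average blocks of 4 records, leaving 1024 small datasets.
reduced = spectra.reshape(1024, 4, 513).mean(axis=1)

# 'reduced' is still a cupy.ndarray in device memory, so its pointer could be
# handed to Gpufit's CUDA interface without a round trip through host RAM.
```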

I just thought of this - maybe it's easier to do it the other way around and put all of my preprocessing into gpufit instead.

I am streaming a LOT of data through a digitizer at the moment (3 GB/s), and have gotten fits running continuously for about 10 seconds. I am fairly certain that eliminating one or both of these copies blows the problem apart (RDMA for the digitizer takes away one transfer, accessing global memory for the gpufit calls takes away two transfers, and my reduction is about a factor of 2.5).

@SBresler

SBresler commented Dec 2, 2022

Another thought -

What if you want to use RDMA to get the data to the GPU faster, bypassing the whole process of reading the data into CPU RAM over the PCIe bus (from a hard drive or otherwise), pinning the address, transferring, et cetera?

Without device-pointer support, that would mean you fundamentally cannot use gpufit and RDMA in the same application.

@SBresler

SBresler commented Dec 2, 2022

Another idea - add a preprocessing section that lets you supply your own kernel for whatever you want to do before the fit.

This could work as a stopgap.

@superchromix
Collaborator

Hi. Fitting data that is already stored in the GPU memory is already implemented in Gpufit. The docs are here: https://gpufit.readthedocs.io/en/latest/gpufit_api.html#gpufit-cuda-interface .

As you found out, when working with Python, you need to obtain a pointer to a GPU memory location to use the gpufit_cuda_interface call. Gpufit knows nothing about python or numpy arrays, etc.

The pre-processing you're talking about could be implemented as a separate routine. You can do anything you want with the data stored on the GPU before and after calling Gpufit. Gpufit is simply meant to handle the fit step.

Finally, we tried real-time compilation of fit model functions, and this caused major performance bottlenecks. It would clearly be a great feature to have. This topic may be revisited in the future.

@SBresler

SBresler commented Dec 6, 2022

Wow, this is why you have to be persistent and keep asking!

So either this is new, or I was just going off information found in other posts that wasn't entirely accurate. I don't see a way to look at old versions of the docs, but that would be interesting to find out.

Thanks so much for the information. I can work with this. It was blowing my mind that this wasn't a feature, and it turns out it totally is.

@SBresler

SBresler commented Dec 6, 2022

Interesting.

When you say "major performance bottlenecks", are you talking about more than an order of magnitude speed decrease?

I think that scientists are generally hungry for faster fitting routines and almost anything beats the speed of LMfit.

@SBresler

SBresler commented Dec 16, 2022

I have a version now which does the following (a rough sketch is shown below):

  1. sets up the method in the DLL corresponding to the constrained CUDA interface function
  2. checks that input cupy arrays are C-contiguous
  3. gets the pointers for the cupy arrays
  4. sends those over to gpufit
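Roughly, the sketch looks like this. The library file name, the exported function name, and the elided argument list are assumptions; the exact symbol and the full argtypes have to be taken from gpufit.h for whichever build is loaded:

```python
import ctypes
import cupy as cp

# Step 1: bind the CUDA-interface entry point from the Gpufit library
# (file name and symbol name are placeholders; full argtypes omitted).
lib = ctypes.CDLL("Gpufit.dll")
cuda_fit = lib.gpufit_constrained_cuda_interface
cuda_fit.restype = ctypes.c_int

def as_device_ptr(arr, dtype=cp.float32):
    """Steps 2 and 3: force C-contiguity, then expose the raw device pointer."""
    arr = cp.ascontiguousarray(arr, dtype=dtype)
    # Return the (possibly copied) array as well, so the allocation stays
    # alive for as long as the pointer is in use.
    return arr, ctypes.c_void_p(arr.data.ptr)

data, data_ptr = as_device_ptr(cp.zeros((10000, 25)))
initial, initial_ptr = as_device_ptr(cp.zeros((10000, 4)))

# Step 4: pass the device pointers through to Gpufit
# (argument order and remaining arguments per gpufit.h):
# status = cuda_fit(..., data_ptr, ..., initial_ptr, ...)
```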

So I think this is a lot closer to what I want. I will do a PR at some point for this.

An idea I have been toying with is to expose all of the functions through pybind11 rather than using ctypes - it seems to be the tool of choice for a lot of people.

This would give you access to pytest for unit testing in gpufit - I think that the Python interface is by far the most important aspect of this project for any sort of widespread adoption.
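For example, a pytest unit test for a CuPy-aware binding could look something like this (pygpufit_cupy and fit_cuda are hypothetical names for the wrapper described above, not part of the shipped package):

```python
import cupy as cp

# Hypothetical CuPy-aware wrapper; not part of the existing pyGpufit interface.
from pygpufit_cupy import fit_cuda


def test_fit_accepts_device_arrays():
    """Fits should run directly on device arrays, and results should stay on the GPU."""
    n_fits, n_points, n_params = 128, 25, 4
    data = cp.random.rand(n_fits, n_points).astype(cp.float32)
    initial = cp.ones((n_fits, n_params), dtype=cp.float32)

    parameters, states, chi_squares = fit_cuda(data, initial)

    assert isinstance(parameters, cp.ndarray)
    assert parameters.shape == (n_fits, n_params)
    assert states.shape == (n_fits,)
```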
