
Multi gpu #729 (Closed)

frobnitzem wants to merge 2 commits

Conversation

frobnitzem (Contributor)

This PR adds support for CupyOps.device_id by creating an Ops.context() interface that wraps a cupy.cuda.Device.
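For concreteness, here is a minimal sketch of what such a context() interface could look like; the class names and the contextlib fallback are illustrative only, not the actual thinc code:

import contextlib

class NumpyOpsSketch:
    def context(self):
        # CPU backend: there is no device to switch, so the context is a no-op.
        return contextlib.nullcontext()

class CupyOpsSketch:
    def __init__(self, device_id=0):
        self.device_id = device_id

    def context(self):
        # GPU backend: make this ops object's device current while its math runs.
        import cupy  # imported lazily so the CPU path needs no GPU stack
        return cupy.cuda.Device(self.device_id)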

It was tested with a multi-GPU chain model here:
https://code.ornl.gov/99R/autoencoder.

The above test works with the current thinc after one change to CupyOps.asarray -- forcing return self.xp.asarray(data, dtype) for cupy inputs.
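For illustration, a standalone sketch of that asarray behaviour, assuming cupy is installed and a GPU is visible (the helper name is hypothetical; the real change lives inside CupyOps.asarray):

import cupy

def asarray_on_device(data, dtype=None):
    # cupy.asarray hands back an already-matching cupy array unchanged,
    # so routing cupy inputs through it avoids a round-trip via the host.
    return cupy.asarray(data, dtype)

x = cupy.arange(4, dtype="float32")
assert asarray_on_device(x) is x  # no copy when the dtype already matches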

After merging this PR, the above example simplifies to: https://code.ornl.gov/99R/autoencoder/-/tree/multi_gpu_thinc.

The example outputs for a 3-GPU run are:

# multi_gpu branches, num_gpu = 3, size = 7000
$ jsrun -n1 -g3 -c21 -bpacked:21 python3 autoencoder.py
Initialized model with input dimension nI=7000 and output dimension nO=7000
0: iteration time = 3.619584083557129 mem = 697 M / gpu 0
1: iteration time = 3.61960506439209 mem = 969 M / gpu 1
2: iteration time = 3.6196165084838867 mem = 647 M / gpu 2
0: iteration time = 0.3242201805114746 mem = 0 M / gpu 0
1: iteration time = 0.32422947883605957 mem = 0 M / gpu 1
2: iteration time = 0.3242321014404297 mem = 0 M / gpu 2
0: iteration time = 0.31447744369506836 mem = 0 M / gpu 0
1: iteration time = 0.31448888778686523 mem = 0 M / gpu 1
2: iteration time = 0.3144950866699219 mem = 0 M / gpu 2
0: iteration time = 0.31055688858032227 mem = 0 M / gpu 0
1: iteration time = 0.31056880950927734 mem = 0 M / gpu 1
2: iteration time = 0.310575008392334 mem = 0 M / gpu 2
0: main time = 9.183293342590332 mem = 961 M / gpu 0
1: main time = 9.183251857757568 mem = 1211 M / gpu 1
2: main time = 9.183244705200195 mem = 808 M / gpu 2

This PR wraps cupy array math in these device contexts, which removes warnings (encountered during testing) about accesses that are not bound to the owning cupy.cuda.Device() context:

thinc/thinc/backends/cupy_ops.py:256: PerformanceWarning: The device where the array resides (1) is different from the current device (0). Peer access has been activated automatically.
  gradient *= cupy.minimum(threshold, grad_norm) / grad_norm
thinc/thinc/backends/cupy_ops.py:343: PerformanceWarning: The device where the array resides (1) is different from the current device (0). Peer access has been activated automatically.
  adam_kernel(
...
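To illustrate the fix, here is a rough sketch (assuming cupy and at least two GPUs) of running the math with the array's owning device active, which is what removes the peer-access warning:

import cupy

with cupy.cuda.Device(1):
    gradient = cupy.ones(1024, dtype="float32")

# With the owning device current, the kernels launch where the data lives
# instead of activating peer access from device 0.
with cupy.cuda.Device(gradient.device.id):
    grad_norm = cupy.linalg.norm(gradient)
    gradient *= cupy.minimum(10.0, grad_norm) / grad_norm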

closes #713

@honnibal (Member)

Thanks for looking at this!

While I understand that multi-GPU models are necessary for some workloads, it's not trivial to make sure that we come out ahead of a single GPU in terms of performance. In the meantime, changes like this can add overhead to both the code and its execution, which affects the single-GPU case that almost everyone will be using.

We therefore want to think carefully and do our own experiments before we take on this topic. We'd likely target data-parallel multi-GPU workloads rather than model parallelism, as that's the most common requirement people will have for the main usage of Thinc, which is models for spaCy.

Unfortunately that means this isn't really a good target for user contribution. It's a change that will have a significant impact on the library, so we can only get to it when we're really ready to work on it.

@frobnitzem (Contributor, Author)

I expect the performance impact to come primarily from Python function call overhead when entering and exiting the context (the __enter__ and __exit__ calls). These happen once for each ops primitive.

For CPU, the context itself does nothing. For GPU, both methods end up calling cudaSetDevice. It is fairly common to see codes call this on every BLAS operation (e.g. page 2 of the SLATE port to AMD and Intel platforms). NVIDIA's docs describe it as:

This function will do no synchronization with the previous or new device, and should be considered a very low overhead call.

The HIP docs go further:

This function does no synchronization with the previous or new device, and has very little runtime overhead. Applications can use hipSetDevice to quickly switch the default device before making a HIP runtime call which uses the default device.
The default device is stored in thread-local-storage for each thread. Thread-pool implementations may inherit the default device of the previous thread. A good practice is to always call hipSetDevice at the start of HIP coding sequence to establish a known standard device.
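A rough way to check that this enter/exit cost really is negligible (assuming cupy and one visible GPU) is to time an empty cupy.cuda.Device context, which is essentially a pair of cudaSetDevice calls:

import timeit
import cupy

def enter_exit():
    # Entering and leaving the context switches the current device and back.
    with cupy.cuda.Device(0):
        pass

n = 100_000
print(timeit.timeit(enter_exit, number=n) / n, "seconds per enter/exit")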

Note also that the change to CupyOps.asarray adds no overhead: both cupy.asarray and numpy.asarray document that no copy is performed when the array already matches the requested dtype.
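This is easy to confirm with numpy (cupy mirrors the same semantics): asarray returns the input array itself whenever it already satisfies the requested dtype.

import numpy as np

a = np.arange(8, dtype="float32")
assert np.asarray(a) is a                       # same object, no copy
assert np.asarray(a, dtype="float32") is a      # matching dtype: still no copy
assert np.asarray(a, dtype="float64") is not a  # only a mismatch allocates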

@svlandeg added the feat / ops (Backends and maths) label on Jul 25, 2022
@svlandeg (Member)

Considering @honnibal's earlier reply, I'll go ahead and close this particular PR, but we'll add the idea to our backlog of things to explore further. Thanks again!

@svlandeg closed this on Jul 25, 2022
Labels: feat / ops (Backends and maths)
Successfully merging this pull request may close: Model.to_gpu is not usable