
Multi gpu #729 (Closed)

frobnitzem wants to merge 2 commits

Conversation

frobnitzem (Contributor)

This PR adds support for CupyOps.device_id by creating an Ops.context() interface that wraps a cupy.cuda.Device.
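For concreteness, here is a minimal sketch of what such a context() interface could look like; the class names and the contextlib fallback are illustrative only, not the actual thinc code:

import contextlib

class NumpyOpsSketch:
    def context(self):
        # CPU backend: there is no device to switch, so the context is a no-op.
        return contextlib.nullcontext()

class CupyOpsSketch:
    def __init__(self, device_id=0):
        self.device_id = device_id

    def context(self):
        # GPU backend: make this ops object's device current while its math runs.
        import cupy  # imported lazily so the CPU path needs no GPU stack
        return cupy.cuda.Device(self.device_id)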

It was tested with a multi-GPU chain model here:
https://code.ornl.gov/99R/autoencoder.

The above test works with the current thinc after one change to CupyOps.asarray -- forcing return self.xp.asarray(data, dtype) for cupy inputs.
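For illustration, a standalone sketch of that asarray behaviour, assuming cupy is installed and a GPU is visible (the helper name is hypothetical; the real change lives inside CupyOps.asarray):

import cupy

def asarray_on_device(data, dtype=None):
    # cupy.asarray hands back an already-matching cupy array unchanged,
    # so routing cupy inputs through it avoids a round-trip via the host.
    return cupy.asarray(data, dtype)

x = cupy.arange(4, dtype="float32")
assert asarray_on_device(x) is x  # no copy when the dtype already matches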

After merging this PR, the above example simplifies to: https://code.ornl.gov/99R/autoencoder/-/tree/multi_gpu_thinc.

The example outputs for a 3-GPU run are:

# multi_gpu branches, num_gpu = 3, size = 7000
$ jsrun -n1 -g3 -c21 -bpacked:21 python3 autoencoder.py
Initialized model with input dimension nI=7000 and output dimension nO=7000
0: iteration time = 3.619584083557129 mem = 697 M / gpu 0
1: iteration time = 3.61960506439209 mem = 969 M / gpu 1
2: iteration time = 3.6196165084838867 mem = 647 M / gpu 2
0: iteration time = 0.3242201805114746 mem = 0 M / gpu 0
1: iteration time = 0.32422947883605957 mem = 0 M / gpu 1
2: iteration time = 0.3242321014404297 mem = 0 M / gpu 2
0: iteration time = 0.31447744369506836 mem = 0 M / gpu 0
1: iteration time = 0.31448888778686523 mem = 0 M / gpu 1
2: iteration time = 0.3144950866699219 mem = 0 M / gpu 2
0: iteration time = 0.31055688858032227 mem = 0 M / gpu 0
1: iteration time = 0.31056880950927734 mem = 0 M / gpu 1
2: iteration time = 0.310575008392334 mem = 0 M / gpu 2
0: main time = 9.183293342590332 mem = 961 M / gpu 0
1: main time = 9.183251857757568 mem = 1211 M / gpu 1
2: main time = 9.183244705200195 mem = 808 M / gpu 2

This PR wraps cupy array math in these device contexts, which removes warnings (encountered during testing) about accesses that are not bound to the owning cupy.cuda.Device() context:

thinc/thinc/backends/cupy_ops.py:256: PerformanceWarning: The device where the array resides (1) is different from the current device (0). Peer access has been activated automatically.
  gradient *= cupy.minimum(threshold, grad_norm) / grad_norm
thinc/thinc/backends/cupy_ops.py:343: PerformanceWarning: The device where the array resides (1) is different from the current device (0). Peer access has been activated automatically.
  adam_kernel(
...
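To illustrate the fix, here is a rough sketch (assuming cupy and at least two GPUs) of running the math with the array's owning device active, which is what removes the peer-access warning:

import cupy

with cupy.cuda.Device(1):
    gradient = cupy.ones(1024, dtype="float32")

# With the owning device current, the kernels launch where the data lives
# instead of activating peer access from device 0.
with cupy.cuda.Device(gradient.device.id):
    grad_norm = cupy.linalg.norm(gradient)
    gradient *= cupy.minimum(10.0, grad_norm) / grad_norm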

closes #713

@honnibal (Member)

Thanks for looking at this!

While I understand that multi-GPU models are necessary for some workloads, it's not trivial to make sure that we come out ahead of a single GPU in terms of performance. In the meantime, changes like this can add overhead to both the code and its execution, which affects the single-GPU case that almost everyone will be using.

We therefore want to think carefully and do our own experiments before we take on this topic. We'd likely target data-parallel multi-GPU workloads rather than model parallelism, as that's the most common requirement people will have for the main usage of Thinc, which is models for spaCy.

Unfortunately that means this isn't really a good target for user contribution. It's a change that will have a significant impact on the library, so we can only get to it when we're really ready to work on it.

@frobnitzem (Contributor, Author)

I expect the performance impact to come primarily from Python function call overhead when entering and exiting the context (the __enter__ and __exit__ calls). These happen once for each ops primitive.

For CPU, the context itself does nothing. For GPU, both methods end up calling cudaSetDevice. It is fairly common to see codes call this on every BLAS operation (e.g. page 2 of the SLATE port to AMD and Intel platforms). NVIDIA's docs describe it as:

This function will do no synchronization with the previous or new device, and should be considered a very low overhead call.

The HIP docs go further:

This function does no synchronization with the previous or new device, and has very little runtime overhead. Applications can use hipSetDevice to quickly switch the default device before making a HIP runtime call which uses the default device.
The default device is stored in thread-local-storage for each thread. Thread-pool implementations may inherit the default device of the previous thread. A good practice is to always call hipSetDevice at the start of HIP coding sequence to establish a known standard device.
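A rough way to check that this enter/exit cost really is negligible (assuming cupy and one visible GPU) is to time an empty cupy.cuda.Device context, which is essentially a pair of cudaSetDevice calls:

import timeit
import cupy

def enter_exit():
    # Entering and leaving the context switches the current device and back.
    with cupy.cuda.Device(0):
        pass

n = 100_000
print(timeit.timeit(enter_exit, number=n) / n, "seconds per enter/exit")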

Note also that the change to CupyOps.asarray adds no overhead: both cupy.asarray and numpy.asarray document that no copy is performed when the array already matches the requested dtype.
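This is easy to confirm with numpy (cupy mirrors the same semantics): asarray returns the input array itself whenever it already satisfies the requested dtype.

import numpy as np

a = np.arange(8, dtype="float32")
assert np.asarray(a) is a                       # same object, no copy
assert np.asarray(a, dtype="float32") is a      # matching dtype: still no copy
assert np.asarray(a, dtype="float64") is not a  # only a mismatch allocates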

@svlandeg added the feat / ops (Backends and maths) label on Jul 25, 2022
@svlandeg (Member)

Considering @honnibal's earlier reply, I'll go ahead and close this particular PR, but we'll add the idea to our backlog of things to explore further. Thanks again!

@svlandeg closed this on Jul 25, 2022
Labels: feat / ops (Backends and maths)
Successfully merging this pull request may close: Model.to_gpu is not usable