Configure GPU index for 'astra_cuda', select GPU currently used by PyTorch in OperatorModule #1546

Open · jleuschn wants to merge 4 commits into master
Conversation

jleuschn (Contributor)

This pull request implements two new features:

  • Choice of GPU for 'astra_cuda' via settable property 'gpu_index' in RayTransformBase
  • Automatically select GPU currently used by PyTorch in OperatorModule by setting 'gpu_index'

The second feature feels a little hacky, since it assumes a special role for the 'gpu_index' property whenever it exists on an Operator instance wrapped by the OperatorModule. However, this seems to me to be the least invasive way to implement this behaviour, since otherwise RayTransformBase would probably have to know about torch, e.g. by offering gpu_index = 'torch_current'.
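For illustration, a minimal usage sketch of the two features; the space and geometry below are only placeholders, and 'gpu_index' is the property proposed in this PR:

    import odl
    from odl.contrib.torch import OperatorModule

    # Placeholder space and geometry; any ASTRA-compatible setup works.
    space = odl.uniform_discr([-20, -20], [20, 20], [100, 100])
    geometry = odl.tomo.parallel_beam_geometry(space, num_angles=180)
    ray_trafo = odl.tomo.RayTransform(space, geometry, impl='astra_cuda')

    # Feature 1: choose the GPU used by the ASTRA backend explicitly
    # ('gpu_index' is the settable property added in this PR).
    ray_trafo.gpu_index = 1

    # Feature 2: when wrapped, OperatorModule selects the GPU currently
    # used by PyTorch by setting 'gpu_index' before evaluation.
    ray_trafo_module = OperatorModule(ray_trafo)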

kohr-h (Member) commented Mar 30, 2020

Hey @jleuschn, thanks a lot for your effort! From a coding point of view, it looks very good.

However, I'm afraid that this solution is too hacky and not future-proof. That's mostly due to circumstances and not your fault. Let me give my reasoning here.

  1. We have already opted a couple of times against introducing writable properties in Operator (and the majority of other classes). It's a recipe for bugs and makes it harder to reason about a given instance. Instead, we have chosen to always generate a new instance from scratch with the relevant parts changed (as part of the input parameters). See for instance TensorSpace.astype. That comes with its own troubles, but in the worst case a new attribute won't be copied, which is usually not as bad as an old one not being cleared.

  2. We don't have cloning semantics for Operator. For some operators it would be easy, but more than a few build up internal state, and what do we do with that? You also realized that you need to clear the cache of a ray transform to make it work, and any change to the caching in RayTransform would have to be reflected in the cloning function. That's a maintenance burden I'd like to avoid.
    Another reason for not having cloning semantics is the cost it would incur. It looks like a cheap operation on the surface, but it can be very expensive numerically, e.g., MultiplyOperator on a large space. The fact that we don't have cloning semantics is a hint that cloning might not be a good idea.

  3. Having a gpu_index parameter in RayTransform directly is somewhat problematic since it's only a valid option if a GPU-based implementation is chosen. In principle, a call like RayTransform(..., impl='astra_cpu', gpu_index=0) doesn't make sense and should fail. That would be ugly but could be done.
    What's also problematic is the prospect of a potentially conflicting property in the supplied spaces. There is a semi-complete attempt to support CuPy as a backend for arrays (see ENH: add cupy tensor space #1231; it has been dead for a long time but will be revived), and there the GPU index is tied to the tensor space. Since I don't anticipate any use for a ray transform that has input and output on GPU 0 but computes on GPU 1, those indices would be in conflict.
    So I'd like to find a way to manage the compute devices in a compatible and conflict-free manner. One way I could imagine would be to mimic CuPy's Device API in a minimal way, i.e., just the id part and the context manager (see the sketch after this list). Later on it would be easy to drop in CuPy's Device instead. We could make it an "unofficial" API somewhere in astra_cuda.py for now.

  4. Finally, as you say, the solution for OperatorModule is quite hacky in itself and relies on more hacky things. That's a bit too much for my taste, even if we're in a contrib package.
    The main trouble is that for an already existing operator, it's hard to find out how to re-create it, and even harder to figure out how to create a valid variant of it. But there's no reason why it has to be that way. Creating an ODL Operator outside and then passing it to OperatorModule for mere wrapping is in no way better than the "obvious" alternative, namely passing a constructor plus arguments to OperatorModule, with the instantiation happening inside. That way it would be much easier (not trivial but easier) to create a new version of the same operator when needed.
    In principle, both modes could be supported, e.g. with a signature __init__(self, op_or_fact, *args, **kwargs). Or, which would be the cleaner solution IMO, by keeping the current __init__ and implementing an alternative constructor like

    @classmethod
    def from_op_factory(cls, op_fact, args=(), kwargs=None):
        if kwargs is None:
            kwargs = {}
        op = op_fact(*args, **kwargs)  # construct `Operator` instance
        instance = cls(op)  # construct `OperatorModule` instance
        # Add constructor details to instance (set to `None` in `__init__`)
        instance._op_fact = op_fact
        instance._op_args = args
        instance._op_kwargs = kwargs
        return instance

    This would be used as OperatorModule.from_op_factory(RayTransform, ...), and in _replicate_for_data_parallel the operators would be created anew. And finally, to make that work as expected, each of these constructor calls would be run in a with Device(dev_id) context to make them use the correct GPU ID.
    The final piece to make that work is to support call parameters in RayTransform that are passed on to the backend. That's a more generic and flexible approach compared to an immediate gpu_index, and it doesn't leak backend details into RayTransform.
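
To make point 3 concrete, here is a minimal sketch of what such an unofficial Device API in astra_cuda.py could look like; everything below is hypothetical and only mimics the id attribute and the context-manager part of cupy.cuda.Device:

    import threading

    # Thread-local storage for the currently selected GPU index.
    _local = threading.local()


    class Device(object):
        """Minimal stand-in for cupy.cuda.Device: just `id` + context manager."""

        def __init__(self, device_id=0):
            self.id = device_id

        def __enter__(self):
            self._prev_id = getattr(_local, 'device_id', 0)
            _local.device_id = self.id
            return self

        def __exit__(self, exc_type, exc_value, traceback):
            _local.device_id = self._prev_id


    def current_device_id():
        """GPU index that 'astra_cuda' calls should currently use."""
        return getattr(_local, 'device_id', 0)

The 'astra_cuda' backend would then read current_device_id() when creating its projectors, and dropping in cupy.cuda.Device later would mostly amount to changing an import.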


Okay, this has become way longer than I expected, and I realize it will not be a trivial change. I will look into it tonight and make a suggestion. But I have no way to test with multiple GPUs, so you would have to help me out, @jleuschn. Does that sound okay?

jleuschn (Contributor, Author)

Thanks, @kohr-h, for reviewing the request and pointing out the issues above!
I agree with your points. Points 1 to 3 seem quite possible to overcome. Regarding point 4, the factory-based approach seems nice. Selecting the "correct" GPU seems a bit more difficult to me, though:

While parameters and buffers are broadcast to the GPUs in PyTorch's replicate call, modules are replicated using _replicate_for_data_parallel, which is agnostic of the target GPU ID (so we cannot assign the correct GPU index there trivially). Later, in parallel_apply, the call to the wrapped module is wrapped in a with torch.cuda.device(device) context. (This context looks quite similar to the CuPy one, BTW.)
Thus, we would need to recreate the Operator in the forward and backward calls, where we can get the current torch device from torch.cuda.current_device(). Of course, one could introduce a cache of Operator instances, one per GPU, so the constructor is effectively called only once for each GPU.
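A rough sketch of how that could look, reusing the hypothetical from_op_factory details and the minimal Device context manager sketched above (all names are illustrative, not existing API):

    import torch

    # Hypothetical per-device cache of Operator instances, keyed by GPU index.
    _op_cache = {}


    def operator_for_current_device(op_fact, op_args=(), op_kwargs=None):
        """Create (or fetch from the cache) the operator for torch's current GPU."""
        if op_kwargs is None:
            op_kwargs = {}
        dev_id = torch.cuda.current_device()
        if dev_id not in _op_cache:
            # First call on this GPU: construct the operator inside a Device
            # context (see the sketch above) so the backend uses this index.
            with Device(dev_id):
                _op_cache[dev_id] = op_fact(*op_args, **op_kwargs)
        return _op_cache[dev_id]

OperatorModule.forward (and the backward pass) would then fetch the operator through something like this instead of holding on to a single fixed instance.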

Yes, I can help out with testing on multiple GPUs!
However, if you feel this PR is not worth the time ATM because we are not using VRAM directly, I'm okay with that too. Otherwise I could contribute by implementing (parts of) it based on your suggestions :)

kohr-h (Member) commented Mar 30, 2020

Good point about the replication thing! Hm, so the first call to forward is the first opportunity to create the operator with the appropriate GPU index. And with the reasonable expectation that the next call will be on the same device, we could cache the operator 🤔
Okay, that sounds slightly hacky, but acceptable. We would still need to special-case the ray transform to pass the right extra args to the constructor, but that's at least easy to extend.

Regarding the larger question of whether it's worth the effort: currently I have my doubts. The whole thing is quite inefficient anyway since each ray transform does the whole roundtrip CPU->GPU->CPU no matter what, and if it's wrapped into an OperatorModule on the (or a) GPU, we get the nice chain GPU->CPU->GPU->CPU->GPU. That will, of course, be much more interesting once the intermediate transfers can be eliminated. But at the moment, the speedup from parallelizing the very middle part is the only possible gain. It might not be worth it. Did you run any speed tests?

jleuschn (Contributor, Author)

Yes, I ran some speed tests. It seems to only make a difference if the GPUs are already heavily used by the rest of the network. TBH, I don't fully understand why, considering the chain mentioned above; maybe the chains are not in sync between the different GPUs, so some ray trafo runs in parallel to another layer?
Back to the testing: I artificially made a learned primal-dual model with stupidly many parameters in order to get high GPU utilization, and then running 'astra_cuda' on different GPUs saved 1/3 of the computation time. That is notable, but not that essential IMHO. So maybe it would be better to put this on hold for now.

kohr-h (Member) commented Mar 31, 2020

Very good, thanks for doing the speed test! Indeed, the gain is not nothing, but certainly not what you would hope for when throwing N times the compute power at the problem. So I agree, for now it's not necessary to invest time, but it's good to know about this limitation and that we need to think about solutions at some point.
I suggest we keep both the issue and the PR open and come back to them later.

PR status: Work in progress