This repository has been archived by the owner on Mar 19, 2024. It is now read-only.

torch_xla compatibility option? #555

Open
adamcatto opened this issue Jun 21, 2022 · 1 comment

@adamcatto

🚀 Feature

An option to train models on a TPU or TPU pod using the torch_xla package.

Motivation & Examples

Motivation: speed up training, utilize best available resources.

Example: in vissl/vissl/trainer/trainer_main.py, start by changing SelfSupervisionTrainer.setup_distributed(self, use_gpu) to something like SelfSupervisionTrainer.setup_distributed(self, device), then put the TPU-specific setup behind an if device == 'TPU' branch, or something along these lines (see the sketch below). Relevant changes to the other functions could follow afterwards.
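A rough sketch of what that branching could look like (not actual VISSL code; the attribute names, the device strings, and the GPU/CPU branches are only placeholders for illustration):

```python
# Rough sketch only, not actual VISSL code: the real setup_distributed() in
# vissl/trainer/trainer_main.py does a lot more (process groups, NCCL options,
# etc.). This just illustrates the proposed device-based branching.
import torch
import torch.distributed as dist


class SelfSupervisionTrainer:
    def setup_distributed(self, device: str):
        """Initialize the training device; `device` replaces the current
        `use_gpu` boolean and could be "GPU", "CPU", or "TPU"."""
        if device == "TPU":
            # torch_xla would be an optional dependency, so import it lazily.
            import torch_xla.core.xla_model as xm

            # Each TPU process gets its own XLA device; rank and world size
            # come from the XLA runtime rather than torch.distributed.
            self.device = xm.xla_device()
            self.distributed_rank = xm.get_ordinal()
            self.world_size = xm.xrt_world_size()
        elif device == "GPU":
            dist.init_process_group(backend="nccl")
            self.device = torch.device("cuda", torch.cuda.current_device())
            self.distributed_rank = dist.get_rank()
            self.world_size = dist.get_world_size()
        else:
            dist.init_process_group(backend="gloo")
            self.device = torch.device("cpu")
            self.distributed_rank = dist.get_rank()
            self.world_size = dist.get_world_size()
```

The rest of the trainer could then key off self.device instead of a use_gpu flag.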

(Note: I will likely start working on this; I am new to VISSL, so I figure a regular contributor might be better-equipped to handle this, but I can give it a go nonetheless.)

@QuentinDuval
Contributor

Hey @adamcatto,

Thanks a lot for raising the point :)

To be fair, we did take a look at PyTorch/XLA last year to see if we could get something out of it, but we did not move forward for several reasons: PyTorch/XLA was still relatively new, and at that time we were training ConvNets, for which GPUs are actually pretty good. But now that Vision Transformers are in the codebase, it might indeed be worth looking into.

For the moment, however, I am not familiar enough with PyTorch/XLA and the TPU ecosystem to drive such changes myself (my understanding is that running on TPU takes more than just changing the device: the data loader, the way the model is saved, the way data is fetched, and even how jobs are launched on GCP would all have to be adapted). It is however part of my personal goals to play with those technologies, so that might change.
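For illustration, the kind of TPU-specific glue I have in mind looks roughly like this (standard torch_xla idioms as far as I understand them; the functions below are dummies, not anything that exists in VISSL today):

```python
# Sketch of standard torch_xla idioms (not VISSL code): it shows why TPU
# support touches more than the device -- the data loader, the optimizer
# step, and checkpointing all change.
import torch.nn.functional as F
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl


def train_one_epoch(model, loader, optimizer, device):
    # MpDeviceLoader wraps a regular DataLoader and moves batches to the
    # TPU in the background; it also steps the lazy XLA graph per batch.
    tpu_loader = pl.MpDeviceLoader(loader, device)
    for images, targets in tpu_loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(images), targets)
        loss.backward()
        # optimizer_step() all-reduces the gradients across TPU cores
        # before applying the update.
        xm.optimizer_step(optimizer)


def save_checkpoint(model, path):
    # xm.save() moves tensors to CPU and writes from the master ordinal
    # only, replacing the usual torch.save() call.
    xm.save(model.state_dict(), path)
```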

If you feel up for it, we can start discussing what would need to be changed, which test case you would like to move forward with first, etc.

What do you think?
Quentin
