
Multiple GPUs Available? #9

Open
KleinXin opened this issue Mar 18, 2022 · 7 comments

Comments

@KleinXin

It seems the code only supports single-GPU training.
Is it possible to train on multiple GPUs?
Thanks.

@chou141253
Copy link
Owner

Yes. If you want to use multi-GPU training, please wrap the model for parallel computing: https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html

Add

model = nn.DataParallel(model)

before

model.to(device)
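For reference, a minimal sketch of the suggested change (using a toy model here as a stand-in, since the repository's model class is not shown in this thread):

```python
import torch
import torch.nn as nn

# Toy stand-in for the repository's model (hypothetical).
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Wrap for data parallelism BEFORE moving the model to the device.
# nn.DataParallel splits each input batch across the visible GPUs,
# runs the replicas in parallel, and gathers the outputs on device 0.
# On a CPU-only machine it simply falls through to the wrapped module.
model = nn.DataParallel(model)
model.to(device)

x = torch.randn(6, 8).to(device)
out = model(x)
print(out.shape)  # torch.Size([6, 4])
```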

@KleinXin
Author

KleinXin commented Mar 24, 2022

model = nn.DataParallel(model)

Following your suggestion, I used two GPUs to train the swin-vit-p4w12 model, but it gives the error below:

Traceback (most recent call last):
  File "train.py", line 414, in <module>
    train(args, epoch, model, scaler, optimizer, schedule, train_loader, save_distrubution=save_dist)
  File "train.py", line 153, in train
    losses, accuracys = model(datas, labels)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 169, in forward
    return self.gather(outputs, self.output_device)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 181, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 78, in gather
    res = gather_map(outputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 73, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 70, in gather_map
    for k in out))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 70, in <genexpr>
    for k in out))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 73, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: zip argument #1 must support iteration

Maybe Line 35 in train.py should also be changed because two GPUs are used.

def set_environment(args):

    args.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    train_set = ImageDataset(istrain=True, 
                            root=args.train_root,
                            data_size=args.data_size,
                            return_index=True)

Do you have any suggestions? Thanks.
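As an aside on the `set_environment` snippet above: `cuda:0` can usually stay as the chosen device even with two GPUs, because `nn.DataParallel` scatters each batch from device 0 to all visible devices and gathers the results back to it. A hedged sketch of what the device setup might look like (the function name and structure here are assumptions, not the repository's actual code):

```python
import torch
import torch.nn as nn

def set_environment_sketch(model):
    # Device 0 stays the "output device"; DataParallel handles the rest.
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    if torch.cuda.device_count() > 1:
        # device_ids defaults to all visible GPUs; listed here for clarity.
        ids = list(range(torch.cuda.device_count()))
        model = nn.DataParallel(model, device_ids=ids)
    return model.to(device), device

model, device = set_environment_sketch(nn.Linear(4, 2))
out = model(torch.randn(2, 4).to(device))
print(out.shape)  # torch.Size([2, 2])
```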

@chou141253
Owner

Oh! We need to revise the model class so you can use the parallel function! We will work on this. Please wait for several days.
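For anyone hitting the same traceback: my reading (an inference from the stack trace, not from the repository's actual fix) is that `DataParallel`'s gather step can only recombine tensors, or tuples/lists/dicts of tensors; if a replica's forward returns Python floats, `None`, or other plain objects, the zip-based `gather_map` raises exactly this "must support iteration" TypeError. A toy sketch of a forward that returns gatherable per-sample tensors:

```python
import torch
import torch.nn as nn

class LossReturningModel(nn.Module):
    """Toy model whose forward returns losses and accuracies,
    mimicking `losses, accuracys = model(datas, labels)` in train.py
    (the real model class is not shown here)."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 3)

    def forward(self, x, labels):
        logits = self.fc(x)
        # Return per-sample TENSORS, not Python floats: DataParallel's
        # gather step concatenates tensor outputs across replicas.
        # Returning e.g. loss.item() from each replica is the kind of
        # output that triggers "zip argument #1 must support iteration".
        loss = nn.functional.cross_entropy(logits, labels, reduction="none")
        acc = (logits.argmax(1) == labels).float()
        return loss, acc

model = nn.DataParallel(LossReturningModel())
x, y = torch.randn(6, 8), torch.randint(0, 3, (6,))
losses, accs = model(x, y)
print(losses.shape, accs.shape)  # torch.Size([6]) torch.Size([6])
```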

@KleinXin
Author

Oh! We need to revise the model class so you can use the parallel function! We will work on this. Please wait for several days.

Waiting for your good news!

@chou141253
Owner

Sorry to keep you waiting. The new version supports multi-GPU training.

@KleinXin
Author

Sorry to keep you waiting. The new version supports multi-GPU training.

Thanks! I will give it a try.

@KleinXin
Author

KleinXin commented May 1, 2022

Sorry to keep you waiting. The new version supports multi-GPU training.

Thank you for your work making multi-GPU training available! I still have two questions.

  1. Could you please provide a config file for resnet50?

  2. Is it possible for the Swin Transformer to support training on higher-resolution images?

     I am trying to train the model on a fine-grained image dataset, and a larger input resolution may help accuracy.

     I tried setting data_size to 960, but it gives the error below.

File "main.py", line 297, in <module>
    main(args, tlogger)
  File "main.py", line 249, in main
    train(args, epoch, model, scaler, amp_context, optimizer, schedule, train_loader)
  File "main.py", line 137, in train
    outs = model(datas)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/work/Algorithms/Classification/FineGrained/PI_202204/models/pim_module/pim_module.py", line 407, in forward
    x = self.forward_backbone(x)
  File "/work/Algorithms/Classification/FineGrained/PI_202204/models/pim_module/pim_module.py", line 386, in forward_backbone
    return self.backbone(x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/work/Algorithms/Classification/FineGrained/PI_202204/timm/models/swin_transformer.py", line 544, in forward
    x = self.forward_features(x)
  File "/work/Algorithms/Classification/FineGrained/PI_202204/timm/models/swin_transformer.py", line 534, in forward_features
    l1 = self.layers[0](x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/work/Algorithms/Classification/FineGrained/PI_202204/timm/models/swin_transformer.py", line 413, in forward
    x = blk(x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/work/Algorithms/Classification/FineGrained/PI_202204/timm/models/swin_transformer.py", line 280, in forward
    x = x.view(B, H, W, C)
RuntimeError: shape '[1, 96, 96, 192]' is invalid for input of size 11059200

Could you please tell me which input resolutions the code supports? Thanks.
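If I am reading the numbers in the traceback correctly (this is an inference, not taken from the repository), the model expects 96×96 patch tokens (i.e. it was built for 384px inputs with patch size 4), while a 960px input produces 240×240 tokens: 240 · 240 · 192 = 11,059,200, the reported tensor size. Beyond rebuilding the backbone with a matching `img_size`, Swin's window partitioning needs every stage's feature map to tile evenly by the window size. A small arithmetic check, assuming patch_size=4, window_size=12, and 4 stages (the "p4w12" configuration):

```python
# Check whether an input size keeps every stage's feature map
# divisible by the window size. Assumptions: patch_size=4,
# window_size=12, 4 stages with patch merging between them.
PATCH, WINDOW, STAGES = 4, 12, 4

def swin_size_ok(img_size: int) -> bool:
    if img_size % PATCH:
        return False
    side = img_size // PATCH          # tokens per side after patch embed
    for _ in range(STAGES):
        if side % WINDOW:             # window partition needs exact tiling
            return False
        side //= 2                    # patch merging halves the grid
    return True

# 960px gives 240 tokens/side, which stops dividing by 12 at the last
# stage (240 -> 120 -> 60 -> 30), so 960 fails; multiples of
# 384 (= 4 * 12 * 2**3), such as 384 and 768, work.
print(swin_size_ok(960), swin_size_ok(384), swin_size_ok(768))  # False True True
```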
