
Multiple GPUs Available? #9

Open
KleinXin opened this issue Mar 18, 2022 · 7 comments

Comments

@KleinXin

It seems the code only supports single-GPU training.
Is it possible to train on multiple GPUs?
Thanks.

@chou141253
Copy link
Owner

Yes. If you want to use multi-GPU training, please wrap the model for parallel computing: https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html

Add

model = nn.DataParallel(model)

before

model.to(device)
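For reference, a minimal sketch of the suggested change (using a toy model here as a stand-in, since the repository's model class is not shown in this thread):

```python
import torch
import torch.nn as nn

# Toy stand-in for the repository's model (hypothetical).
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Wrap for data parallelism BEFORE moving the model to the device.
# nn.DataParallel splits each input batch across the visible GPUs,
# runs the replicas in parallel, and gathers the outputs on device 0.
# On a CPU-only machine it simply falls through to the wrapped module.
model = nn.DataParallel(model)
model.to(device)

x = torch.randn(6, 8).to(device)
out = model(x)
print(out.shape)  # torch.Size([6, 4])
```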

@KleinXin
Author

KleinXin commented Mar 24, 2022

model = nn.DataParallel(model)

Following your suggestion, I used two GPUs to train the swin-vit-p4w12 model, but it gives the error below:

Traceback (most recent call last):
  File "train.py", line 414, in <module>
    train(args, epoch, model, scaler, optimizer, schedule, train_loader, save_distrubution=save_dist)
  File "train.py", line 153, in train
    losses, accuracys = model(datas, labels)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 169, in forward
    return self.gather(outputs, self.output_device)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 181, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 78, in gather
    res = gather_map(outputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 73, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 70, in gather_map
    for k in out))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 70, in <genexpr>
    for k in out))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 73, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: zip argument #1 must support iteration

Maybe Line 35 in train.py should also be changed because two GPUs are used.

def set_environment(args):

    args.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    train_set = ImageDataset(istrain=True, 
                            root=args.train_root,
                            data_size=args.data_size,
                            return_index=True)

Do you have any suggestions? Thanks.
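As an aside on the `set_environment` snippet above: `cuda:0` can usually stay as the chosen device even with two GPUs, because `nn.DataParallel` scatters each batch from device 0 to all visible devices and gathers the results back to it. A hedged sketch of what the device setup might look like (the function name and structure here are assumptions, not the repository's actual code):

```python
import torch
import torch.nn as nn

def set_environment_sketch(model):
    # Device 0 stays the "output device"; DataParallel handles the rest.
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    if torch.cuda.device_count() > 1:
        # device_ids defaults to all visible GPUs; listed here for clarity.
        ids = list(range(torch.cuda.device_count()))
        model = nn.DataParallel(model, device_ids=ids)
    return model.to(device), device

model, device = set_environment_sketch(nn.Linear(4, 2))
out = model(torch.randn(2, 4).to(device))
print(out.shape)  # torch.Size([2, 2])
```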

@chou141253
Owner

Oh! We need to revise the model class so you can use the parallel function! We will work on this. Please wait for several days.
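For anyone hitting the same traceback: my reading (an inference from the stack trace, not from the repository's actual fix) is that `DataParallel`'s gather step can only recombine tensors, or tuples/lists/dicts of tensors; if a replica's forward returns Python floats, `None`, or other plain objects, the zip-based `gather_map` raises exactly this "must support iteration" TypeError. A toy sketch of a forward that returns gatherable per-sample tensors:

```python
import torch
import torch.nn as nn

class LossReturningModel(nn.Module):
    """Toy model whose forward returns losses and accuracies,
    mimicking `losses, accuracys = model(datas, labels)` in train.py
    (the real model class is not shown here)."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 3)

    def forward(self, x, labels):
        logits = self.fc(x)
        # Return per-sample TENSORS, not Python floats: DataParallel's
        # gather step concatenates tensor outputs across replicas.
        # Returning e.g. loss.item() from each replica is the kind of
        # output that triggers "zip argument #1 must support iteration".
        loss = nn.functional.cross_entropy(logits, labels, reduction="none")
        acc = (logits.argmax(1) == labels).float()
        return loss, acc

model = nn.DataParallel(LossReturningModel())
x, y = torch.randn(6, 8), torch.randint(0, 3, (6,))
losses, accs = model(x, y)
print(losses.shape, accs.shape)  # torch.Size([6]) torch.Size([6])
```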

@KleinXin
Author

Oh! We need to revise the model class so you can use the parallel function! We will work on this. Please wait for several days.

Waiting for your good news!

@chou141253
Owner

Sorry to keep you waiting. The new version supports multi-GPU training.

@KleinXin
Author

Sorry to keep you waiting. The new version supports multi-GPU training.

Thanks! I will give it a try.

@KleinXin
Author

KleinXin commented May 1, 2022

Sorry to keep you waiting. The new version supports multi-GPU training.

Thank you for your work making multi-GPU training available! I still have two questions.

  1. Could you please provide a config file for resnet50?

  2. Is it possible for the Swin Transformer to support training on higher-resolution images?

     I am trying to train the model on a fine-grained image dataset, and a larger input resolution may help accuracy.

     I tried setting data_size to 960, but it gives the error below.

File "main.py", line 297, in <module>
    main(args, tlogger)
  File "main.py", line 249, in main
    train(args, epoch, model, scaler, amp_context, optimizer, schedule, train_loader)
  File "main.py", line 137, in train
    outs = model(datas)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/work/Algorithms/Classification/FineGrained/PI_202204/models/pim_module/pim_module.py", line 407, in forward
    x = self.forward_backbone(x)
  File "/work/Algorithms/Classification/FineGrained/PI_202204/models/pim_module/pim_module.py", line 386, in forward_backbone
    return self.backbone(x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/work/Algorithms/Classification/FineGrained/PI_202204/timm/models/swin_transformer.py", line 544, in forward
    x = self.forward_features(x)
  File "/work/Algorithms/Classification/FineGrained/PI_202204/timm/models/swin_transformer.py", line 534, in forward_features
    l1 = self.layers[0](x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/work/Algorithms/Classification/FineGrained/PI_202204/timm/models/swin_transformer.py", line 413, in forward
    x = blk(x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/work/Algorithms/Classification/FineGrained/PI_202204/timm/models/swin_transformer.py", line 280, in forward
    x = x.view(B, H, W, C)
RuntimeError: shape '[1, 96, 96, 192]' is invalid for input of size 11059200

Could you please tell me which input resolutions the code supports? Thanks.
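If I am reading the numbers in the traceback correctly (this is an inference, not taken from the repository), the model expects 96×96 patch tokens (i.e. it was built for 384px inputs with patch size 4), while a 960px input produces 240×240 tokens: 240 · 240 · 192 = 11,059,200, the reported tensor size. Beyond rebuilding the backbone with a matching `img_size`, Swin's window partitioning needs every stage's feature map to tile evenly by the window size. A small arithmetic check, assuming patch_size=4, window_size=12, and 4 stages (the "p4w12" configuration):

```python
# Check whether an input size keeps every stage's feature map
# divisible by the window size. Assumptions: patch_size=4,
# window_size=12, 4 stages with patch merging between them.
PATCH, WINDOW, STAGES = 4, 12, 4

def swin_size_ok(img_size: int) -> bool:
    if img_size % PATCH:
        return False
    side = img_size // PATCH          # tokens per side after patch embed
    for _ in range(STAGES):
        if side % WINDOW:             # window partition needs exact tiling
            return False
        side //= 2                    # patch merging halves the grid
    return True

# 960px gives 240 tokens/side, which stops dividing by 12 at the last
# stage (240 -> 120 -> 60 -> 30), so 960 fails; multiples of
# 384 (= 4 * 12 * 2**3), such as 384 and 768, work.
print(swin_size_ok(960), swin_size_ok(384), swin_size_ok(768))  # False True True
```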
