torch.distributed.DistBackendError when training on multiple GPUs #2635

CytoLiorG opened this issue Mar 28, 2024 · 0 comments

scvi crashes when trying to train on multiple GPUs (2x Tesla P100-PCIE-16GB)

In an attempt to work around Lightning-AI/pytorch-lightning#17212, strategy='ddp_find_unused_parameters_true' was set (an explicit DDPStrategy form of the same setting is sketched after the snippet below).

# generate_seed_labels and get_train_ratio are project-local helpers not shown here.
from typing import Dict

import pandas as pd
import scanpy as sc
import scvi
import torch
from anndata import AnnData


def annotate(adata: AnnData, geneset: Dict, out_dir: str = 'out', epochs: int = None, visualize: bool = False, save: bool = False, random_seed=None) -> pd.Series:
    normalized = adata.copy()
    sc.pp.normalize_total(normalized, target_sum=1e4)
    sc.pp.log1p(normalized)
    normalized = normalized[:, geneset['gene_subset']].copy()
    sc.pp.scale(normalized)
    adata.obs["seed_labels"] = generate_seed_labels(normalized, geneset['cell_geneset'])
    base_model_train_ratio, transfer_model_train_ratio = get_train_ratio(adata, unconstrained_train_ratio=0.9)
    torch.set_float32_matmul_precision("medium")
    scvi.model.SCVI.setup_anndata(adata, batch_key=None, labels_key="seed_labels")
    if random_seed is not None:
        scvi.settings.seed = random_seed
    scvi_model = scvi.model.SCVI(adata, n_latent=30, n_layers=2)
    scvi_model.train(max_epochs=epochs, train_size=base_model_train_ratio, accelerator='gpu', devices=-1, strategy='ddp_find_unused_parameters_true')

    scanvi_model = scvi.model.SCANVI.from_scvi_model(scvi_model, 'unknown')
    print('Training transfer model:')
    scanvi_model.train(max_epochs=epochs, train_size=transfer_model_train_ratio, accelerator='gpu', devices=-1, strategy='ddp_find_unused_parameters_true')
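
Side note, not part of the failing run: the same DDP configuration can also be expressed with an explicit DDPStrategy object, which additionally exposes the 30-minute collective timeout that shows up in the watchdog message further down. A minimal sketch, assuming scvi-tools forwards the strategy kwarg to the Lightning Trainer the same way it does for the string form above:

from datetime import timedelta

from lightning.pytorch.strategies import DDPStrategy

# Equivalent to strategy='ddp_find_unused_parameters_true', plus a longer NCCL
# collective timeout than the default 30 minutes (the 1800000 ms in the log below).
ddp_strategy = DDPStrategy(
    find_unused_parameters=True,
    timeout=timedelta(hours=1),
)

# scvi_model.train(..., accelerator='gpu', devices=-1, strategy=ddp_strategy)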
Output of the run above:

Training transfer model:

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

Training:   0%|          | 0/100 [00:00<?, ?it/s]
Epoch 1/100:   0%|          | 0/100 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.

Training:   0%|          | 0/100 [00:00<?, ?it/s]
Epoch 1/100:   0%|          | 0/100 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/scvi/module/_scanvae.py:304: UserWarning: The value argument must be within the support of the distribution
  reconst_loss = -px.log_prob(x).sum(-1)
/usr/local/lib/python3.10/dist-packages/scvi/module/_scanvae.py:304: UserWarning: The value argument must be within the support of the distribution
  reconst_loss = -px.log_prob(x).sum(-1)
/usr/local/lib/python3.10/dist-packages/scvi/module/_scanvae.py:304: UserWarning: The value argument must be within the support of the distribution
  reconst_loss = -px.log_prob(x).sum(-1)

Epoch 1/100:   1%|          | 1/100 [00:15<26:13, 15.89s/it]/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:433: It is recommended to use `self.log('train_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.

Epoch 1/100:   1%|          | 1/100 [00:18<30:16, 18.35s/it][rank1]:[W socket.cpp:432] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
Traceback (most recent call last):
  File "/work/demo.py", line 8, in <module>
    run(args)
  File "/work/src/main.py", line 40, in run
    pred = annotate(
  File "/work/src/annotate.py", line 181, in annotate
    scanvi_model.train(
  File "/usr/local/lib/python3.10/dist-packages/scvi/model/_scanvi.py", line 438, in train
    return runner()
  File "/usr/local/lib/python3.10/dist-packages/scvi/train/_trainrunner.py", line 98, in __call__
    self.trainer.fit(self.training_plan, self.data_splitter)
  File "/usr/local/lib/python3.10/dist-packages/scvi/train/_trainer.py", line 219, in fit
    super().fit(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 1035, in _run_stage
    self.fit_loop.run()
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 203, in run
    self.on_advance_end()
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 372, in on_advance_end
    call._call_callback_hooks(trainer, "on_train_epoch_end", monitoring_callbacks=False)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 208, in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/scvi/train/_progress.py", line 85, in on_train_epoch_end
    self.main_progress_bar.set_postfix(self.get_metrics(trainer, pl_module))
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/progress/progress_bar.py", line 195, in get_metrics
    pbar_metrics = trainer.progress_bar_metrics
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 1653, in progress_bar_metrics
    return self._logger_connector.progress_bar_metrics
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 245, in progress_bar_metrics
    metrics = self.metrics["pbar"]
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 226, in metrics
    return self.trainer._results.metrics(on_step)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 475, in metrics
    value = self._get_cache(result_metric, on_step)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 439, in _get_cache
    result_metric.compute()
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 284, in wrapped_func
    self._computed = compute(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 252, in compute
    return self.value.compute()
  File "/usr/local/lib/python3.10/dist-packages/torchmetrics/metric.py", line 1135, in compute
    val_a = self.metric_a.compute() if isinstance(self.metric_a, Metric) else self.metric_a
  File "/usr/local/lib/python3.10/dist-packages/torchmetrics/metric.py", line 1135, in compute
    val_a = self.metric_a.compute() if isinstance(self.metric_a, Metric) else self.metric_a
  File "/usr/local/lib/python3.10/dist-packages/torchmetrics/metric.py", line 610, in wrapped_func
    with self.sync_context(
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/torchmetrics/metric.py", line 581, in sync_context
    self.sync(
  File "/usr/local/lib/python3.10/dist-packages/torchmetrics/metric.py", line 530, in sync
    self._sync_dist(dist_sync_fn, process_group=process_group)
  File "/usr/local/lib/python3.10/dist-packages/torchmetrics/metric.py", line 434, in _sync_dist
    output_dict = apply_to_collection(
  File "/usr/local/lib/python3.10/dist-packages/lightning_utilities/core/apply_func.py", line 70, in apply_to_collection
    return {k: function(v, *args, **kwargs) for k, v in data.items()}
  File "/usr/local/lib/python3.10/dist-packages/lightning_utilities/core/apply_func.py", line 70, in <dictcomp>
    return {k: function(v, *args, **kwargs) for k, v in data.items()}
  File "/usr/local/lib/python3.10/dist-packages/torchmetrics/utilities/distributed.py", line 122, in gather_all_tensors
    return _simple_gather_all_tensors(result, group, world_size)
  File "/usr/local/lib/python3.10/dist-packages/torchmetrics/utilities/distributed.py", line 93, in _simple_gather_all_tensors
    torch.distributed.all_gather(gathered_result, result, group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2617, in all_gather
    work = group.allgather([tensor_list], [tensor])
torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1', but store->get('1') got error: Socket Timeout
Exception raised from doWait at ../torch/csrc/distributed/c10d/TCPStore.cpp:550 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fab06181d87 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x15c0e57 (0x7faaee93fe57 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7faaf2c0ece2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7faaf2c0fb11 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7faaf2bc4f81 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7faaf2bc4f81 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7faaf2bc4f81 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7faabbe02c69 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #8: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x22b (0x7faabbe09c5b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #9: c10d::ProcessGroupNCCL::allgather(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllgatherOptions const&) + 0xb5c (0x7faabbe2005c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #10: <unknown function> + 0x583a31d (0x7faaf2bb931d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x5844218 (0x7faaf2bc3218 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x4e893cc (0x7faaf22083cc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x1a08a88 (0x7faaeed87a88 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x584ba33 (0x7faaf2bcaa33 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x5856e1f (0x7faaf2bd5e1f in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0xca3fae (0x7fab0545afae in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #17: <unknown function> + 0x413ea4 (0x7fab04bcaea4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #18: <unknown function> + 0x15a10e (0x59d15614610e in /usr/bin/python3)
frame #19: _PyObject_MakeTpCall + 0x25b (0x59d15613ca7b in /usr/bin/python3)
frame #20: <unknown function> + 0x168acb (0x59d156154acb in /usr/bin/python3)
frame #21: _PyEval_EvalFrameDefault + 0x614a (0x59d156134cfa in /usr/bin/python3)
frame #22: _PyFunction_Vectorcall + 0x7c (0x59d1561469fc in /usr/bin/python3)
frame #23: _PyEval_EvalFrameDefault + 0x2a27 (0x59d1561315d7 in /usr/bin/python3)
frame #24: _PyFunction_Vectorcall + 0x7c (0x59d1561469fc in /usr/bin/python3)
frame #25: _PyEval_EvalFrameDefault + 0x614a (0x59d156134cfa in /usr/bin/python3)
frame #26: _PyFunction_Vectorcall + 0x7c (0x59d1561469fc in /usr/bin/python3)
frame #27: _PyEval_EvalFrameDefault + 0x6bd (0x59d15612f26d in /usr/bin/python3)
frame #28: _PyFunction_Vectorcall + 0x7c (0x59d1561469fc in /usr/bin/python3)
frame #29: PyObject_Call + 0x122 (0x59d156155492 in /usr/bin/python3)
frame #30: _PyEval_EvalFrameDefault + 0x2a27 (0x59d1561315d7 in /usr/bin/python3)
frame #31: _PyFunction_Vectorcall + 0x7c (0x59d1561469fc in /usr/bin/python3)
frame #32: _PyEval_EvalFrameDefault + 0x6bd (0x59d15612f26d in /usr/bin/python3)
frame #33: _PyFunction_Vectorcall + 0x7c (0x59d1561469fc in /usr/bin/python3)
frame #34: _PyEval_EvalFrameDefault + 0x198c (0x59d15613053c in /usr/bin/python3)
frame #35: <unknown function> + 0x1687f1 (0x59d1561547f1 in /usr/bin/python3)
frame #36: _PyEval_EvalFrameDefault + 0x198c (0x59d15613053c in /usr/bin/python3)
frame #37: <unknown function> + 0x1687f1 (0x59d1561547f1 in /usr/bin/python3)
frame #38: _PyEval_EvalFrameDefault + 0x198c (0x59d15613053c in /usr/bin/python3)
frame #39: <unknown function> + 0x200175 (0x59d1561ec175 in /usr/bin/python3)
frame #40: <unknown function> + 0x15ac59 (0x59d156146c59 in /usr/bin/python3)
frame #41: _PyEval_EvalFrameDefault + 0x6bd (0x59d15612f26d in /usr/bin/python3)
frame #42: <unknown function> + 0x168a51 (0x59d156154a51 in /usr/bin/python3)
frame #43: _PyEval_EvalFrameDefault + 0x266d (0x59d15613121d in /usr/bin/python3)
frame #44: _PyFunction_Vectorcall + 0x7c (0x59d1561469fc in /usr/bin/python3)
frame #45: _PyEval_EvalFrameDefault + 0x614a (0x59d156134cfa in /usr/bin/python3)
frame #46: <unknown function> + 0x1687f1 (0x59d1561547f1 in /usr/bin/python3)
frame #47: _PyEval_EvalFrameDefault + 0x614a (0x59d156134cfa in /usr/bin/python3)
frame #48: <unknown function> + 0x1687f1 (0x59d1561547f1 in /usr/bin/python3)
frame #49: _PyEval_EvalFrameDefault + 0x614a (0x59d156134cfa in /usr/bin/python3)
frame #50: <unknown function> + 0x168a51 (0x59d156154a51 in /usr/bin/python3)
frame #51: _PyEval_EvalFrameDefault + 0x2a27 (0x59d1561315d7 in /usr/bin/python3)
frame #52: _PyFunction_Vectorcall + 0x7c (0x59d1561469fc in /usr/bin/python3)
frame #53: _PyEval_EvalFrameDefault + 0x614a (0x59d156134cfa in /usr/bin/python3)
frame #54: _PyFunction_Vectorcall + 0x7c (0x59d1561469fc in /usr/bin/python3)
frame #55: _PyEval_EvalFrameDefault + 0x614a (0x59d156134cfa in /usr/bin/python3)
frame #56: _PyFunction_Vectorcall + 0x7c (0x59d1561469fc in /usr/bin/python3)
frame #57: _PyEval_EvalFrameDefault + 0x8ac (0x59d15612f45c in /usr/bin/python3)
frame #58: <unknown function> + 0x16600e (0x59d15615200e in /usr/bin/python3)
frame #59: _PyObject_GenericGetAttrWithDict + 0x468 (0x59d1561448a8 in /usr/bin/python3)
frame #60: PyObject_GetAttr + 0x4d (0x59d156142e3d in /usr/bin/python3)
frame #61: _PyEval_EvalFrameDefault + 0x5dc1 (0x59d156134971 in /usr/bin/python3)
frame #62: <unknown function> + 0x16600e (0x59d15615200e in /usr/bin/python3)
frame #63: _PyObject_GenericGetAttrWithDict + 0x468 (0x59d1561448a8 in /usr/bin/python3)
. This may indicate a possible application crash on rank 0 or a network set up issue.

Epoch 1/100:   1%|          | 1/100 [30:19<50:01:35, 1819.15s/it]
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=115017, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=1800000) ran for 1800859 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=115017, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=1800000) ran for 1800859 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7d37dfb81d87 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7d37958026e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7d3795805c3d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7d3795806839 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdc253 (0x7d38071b9253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7d381d459ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126a40 (0x7d381d4eba40 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=115017, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=1800000) ran for 1800859 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7d37dfb81d87 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7d37958026e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7d3795805c3d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7d3795806839 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdc253 (0x7d38071b9253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7d381d459ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126a40 (0x7d381d4eba40 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7d37dfb81d87 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xdf6b11 (0x7d379555cb11 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xdc253 (0x7d38071b9253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x94ac3 (0x7d381d459ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126a40 (0x7d381d4eba40 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
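
For debugging, a minimal standalone NCCL all_gather check (a hypothetical helper written for this report, not something scvi-tools ships) can show whether the two P100s complete the same collective outside of Lightning; if this also hangs, the problem is more likely in the NCCL/driver setup for the GPU pair than in scvi-tools.

# nccl_check.py (hypothetical) - run with: torchrun --nproc_per_node=2 nccl_check.py
import os

import torch
import torch.distributed as dist


def main():
    # torchrun provides RANK/WORLD_SIZE/LOCAL_RANK via environment variables.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Mirror the collective that times out in the traceback above.
    value = torch.tensor([float(dist.get_rank())], device="cuda")
    gathered = [torch.zeros_like(value) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, value)
    print(f"rank {dist.get_rank()}: gathered {[t.item() for t in gathered]}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()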

Versions:

  • CUDA 11.8
  • scvi-tools 1.1.1
  • jaxlib 0.4.23+cuda11.cudnn86
CytoLiorG added the bug label Mar 28, 2024