
Cannot launch DDP training using distributed/ddp-tutorial-series/multigpu.py #1199

480284856 opened this issue Nov 10, 2023

I'm on the main branch, checked out at the latest commit: c67bbab.
My launch command is:

python multigpu.py 30 5

The error output is below:

[W socket.cpp:663] [c10d] The client socket has failed to connect to [localhost]:12355 (errno: 99 - Cannot assign requested address).
Traceback (most recent call last):
  File "/workspace/distributed/ddp-tutorial-series/multigpu.py", line 104, in <module>
    mp.spawn(main, args=(world_size, args.save_every, args.total_epochs, args.batch_size), nprocs=world_size)
  File "/root/miniconda3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 163, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/workspace/distributed/ddp-tutorial-series/multigpu.py", line 90, in main
    trainer = Trainer(model, train_data, optimizer, rank, save_every)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/distributed/ddp-tutorial-series/multigpu.py", line 38, in __init__
    self.model = DDP(model, device_ids=[gpu_id])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 795, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/utils.py", line 265, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1695392026823/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'invalid argument'
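
For context, the [localhost]:12355 in the first warning comes from the process-group setup in multigpu.py, which hard-codes the rendezvous address. Roughly (a sketch based on the tutorial; the exact code at commit c67bbab may differ):

import os
import torch
from torch.distributed import init_process_group

def ddp_setup(rank: int, world_size: int):
    # Rendezvous address/port shared by all spawned processes; this is
    # the [localhost]:12355 that the socket warning above refers to.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    # Bind this process to its GPU before creating the NCCL process group;
    # if this step is missing, every rank defaults to device 0, which is a
    # common source of "invalid argument" CUDA failures under NCCL.
    torch.cuda.set_device(rank)
    init_process_group(backend="nccl", rank=rank, world_size=world_size)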

My environment:
python 3.11.5 h955ad1f_0
pytorch 2.1.0 py3.11_cuda11.8_cudnn8.7.0_0 pytorch
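
A quick way to confirm PyTorch can see the GPUs inside the container (assuming the conda env above is active):

python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"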

Container launch command:

docker run -itd \
        --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all \
        --shm-size '32g' \
        --volume ${PWD}:/workspace \
        --workdir /workspace \
        --name pytorch_examples \
        nvidia/cuda:11.8.0-devel-ubuntu22.04
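
As the error message itself suggests, re-running with NCCL debug logging enabled should surface more detail about the failing CUDA call:

NCCL_DEBUG=INFO python multigpu.py 30 5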

Thanks in advance :)
