About the problem of multi-node running stuck #1185

Open
AntyRia opened this issue Aug 24, 2023 · 0 comments
AntyRia commented Aug 24, 2023

My setup: 2 machines with different IPs and 2 available GPUs on each machine.

When I use the multigpu_torchrun.py example and launch it with these two commands:
torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=172.xx.1.150:29603 multi_node_torchrun.py 50 10
and
torchrun --nproc_per_node=2 --nnodes=2 --node_rank=1 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=172.xx.1.150:29603 multi_node_torchrun.py 50 10
After starting, the program gets stuck at self.model = DDP(self.model, device_ids=[self.local_rank]) and stops making progress. However, nvidia-smi shows that the processes on both machines have been created and are already occupying GPU memory. I wonder why.
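For reference, below is a minimal sketch of the initialization pattern the torchrun examples follow, assuming the environment variables torchrun sets (LOCAL_RANK etc.) and the NCCL backend; it is not the exact code from the repo. The DDP constructor broadcasts the model parameters from rank 0 to all other ranks, so a hang at that line usually means the ranks cannot reach each other over the network interface NCCL picked. The interface name in the comments is only an illustrative assumption.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_and_wrap(model: torch.nn.Module) -> DDP:
    # torchrun sets LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)         # pin this process to one GPU
    dist.init_process_group(backend="nccl")   # reads rank/world size from the env
    model = model.to(local_rank)
    # The DDP constructor is a collective operation (parameter broadcast),
    # so every rank on every node must be reachable before this returns.
    return DDP(model, device_ids=[local_rank])

# Diagnostics that can be exported before torchrun (values here are assumptions):
#   NCCL_DEBUG=INFO           prints which interface/transport NCCL chose on each rank
#   NCCL_SOCKET_IFNAME=eth0   forces NCCL onto the interface that carries the 172.xx.1.x network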

Looking through the issue history I found similar reports attributing the problem to synchronization deadlocks, but I don't think that is the root cause here, since I am using the official example.
