My setup: 2 machines with different IPs, each with 2 available GPUs.
I am running the multigpu_torchrun.py example and launch it with these two commands, one per machine:
torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=172.xx.1.150:29603 multi_node_torchrun.py 50 10
and
torchrun --nproc_per_node=2 --nnodes=2 --node_rank=1 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=172.xx.1.150:29603 multi_node_torchrun.py 50 10
After launch, the program hangs at
self.model = DDP(self.model, device_ids=[self.local_rank])
and makes no further progress. nvidia-smi shows that the processes on both machines have been created and are already occupying GPU memory. Why does this happen?
Looking through the issue history, I found similar reports that attributed the hang to synchronization deadlocks, but I don't think that is the root cause here, since I am using the official example.
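To narrow down where the hang occurs, here is a minimal sketch of a diagnostic that could be printed in each worker just before the DDP constructor. It relies only on environment variables that torchrun exports to every worker process (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT). The helper name describe_rank and its placement are hypothetical and not part of the official example.

```python
import os
import socket


def describe_rank(env=None):
    """Summarise the rendezvous variables torchrun exports to a worker."""
    # Accept an explicit mapping so the helper can be exercised without torchrun.
    env = os.environ if env is None else env
    keys = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")
    fields = [f"{key}={env.get(key, '<unset>')}" for key in keys]
    return f"host={socket.gethostname()} " + " ".join(fields)


if __name__ == "__main__":
    # Assumed placement: call this right before the DDP(...) line in the trainer.
    print(describe_rank(), flush=True)
```

If all four workers print a line with the expected WORLD_SIZE and distinct RANK values, the rendezvous itself likely completed and the hang is somewhere inside the DDP constructor's own communication, which would help rule the launch commands in or out.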