About the problem of multi-node running stuck #1185

Open
AntyRia opened this issue Aug 24, 2023 · 0 comments
AntyRia commented Aug 24, 2023

My setup: 2 machines with different IPs and 2 available GPUs on each machine.

When I use the multigpu_torchrun.py example and launch it with these two commands:
torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=172.xx.1.150:29603 multi_node_torchrun.py 50 10
and
torchrun --nproc_per_node=2 --nnodes=2 --node_rank=1 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=172.xx.1.150:29603 multi_node_torchrun.py 50 10
After starting, the program gets stuck at self.model = DDP(self.model, device_ids=[self.local_rank]) and stops making progress. However, nvidia-smi shows that the processes on both machines have been created and are already occupying GPU memory. I wonder why.
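For reference, below is a minimal sketch of the initialization pattern the torchrun examples follow, assuming the environment variables torchrun sets (LOCAL_RANK etc.) and the NCCL backend; it is not the exact code from the repo. The DDP constructor broadcasts the model parameters from rank 0 to all other ranks, so a hang at that line usually means the ranks cannot reach each other over the network interface NCCL picked. The interface name in the comments is only an illustrative assumption.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_and_wrap(model: torch.nn.Module) -> DDP:
    # torchrun sets LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)         # pin this process to one GPU
    dist.init_process_group(backend="nccl")   # reads rank/world size from the env
    model = model.to(local_rank)
    # The DDP constructor is a collective operation (parameter broadcast),
    # so every rank on every node must be reachable before this returns.
    return DDP(model, device_ids=[local_rank])

# Diagnostics that can be exported before torchrun (values here are assumptions):
#   NCCL_DEBUG=INFO           prints which interface/transport NCCL chose on each rank
#   NCCL_SOCKET_IFNAME=eth0   forces NCCL onto the interface that carries the 172.xx.1.x network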

Looking through the issue history I found similar reports attributing the problem to synchronization deadlocks, but I don't think that is the root cause here, since I am using the official example.
