Potential issue of “errno: 98- Address already in use” error in DDP (with torchrun) #126510

Open
wakaka9526 opened this issue May 17, 2024 · 0 comments
Labels: module: ddp, oncall: distributed, triaged

wakaka9526 commented May 17, 2024

🐛 Describe the bug

When launching DDP training with torchrun, the error "errno: 98 - Address already in use" occurs intermittently. For example:

[W socket.cpp:436] [c10d] The server socket has failed to bind to [::]:29400 (errno: 98 - Address already in use).
[W socket.cpp:436] [c10d] The server socket has failed to bind to 0.0.0.0:29400 (errno: 98 - Address already in use).
[E socket.cpp:472] [c10d] The server socket has failed to listen on any local network address.

Details are in the forum thread:
https://discuss.pytorch.org/t/potential-issue-of-errno-98-address-already-in-use-error-in-ddp-with-torchrun/202922
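
For readers unfamiliar with the underlying condition, here is a minimal sketch of how the operating system produces this error, using only Python's standard `socket` module. It is not the c10d code path and says nothing about why torchrun hits it randomly; the helper name `try_bind` is hypothetical and used only for illustration. A second listener on a port that already has one fails with `EADDRINUSE`, which is errno 98 on Linux.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Try to bind and listen on (host, port).

    Returns (socket, None) on success or (None, errno_code) on failure.
    Hypothetical helper for illustration; not part of torch.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return sock, None
    except OSError as exc:
        sock.close()
        return None, exc.errno


if __name__ == "__main__":
    # Port 0 asks the OS for any free port, so the first bind succeeds.
    first, _ = try_bind(0)
    port = first.getsockname()[1]

    # A second listener on the same port is rejected by the OS.
    second, err = try_bind(port)
    print("second bind failed with EADDRINUSE:", err == errno.EADDRINUSE)

    first.close()
```

In the log above, the port in question is 29400, so a quick check on the affected machine is whether another process (for example an earlier torchrun job) is still listening on that port when the new job starts.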

Versions

Tested on torch 1.10 through torch 2.1.0; all versions show the problem.
(The newest release, torch 2.3, has not been tested, but I read the source code and the relevant code is unchanged, so I think the problem still exists in torch 2.3.)

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

@mikaylagawarecki mikaylagawarecki added oncall: distributed Add this issue/PR to distributed oncall triage queue module: ddp Issues/PRs related distributed data parallel training labels May 20, 2024
@wconstab wconstab added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label May 20, 2024