Potential issue of “errno: 98- Address already in use” error in DDP (with torchrun) #126510
Labels
module: ddp
Issues/PRs related distributed data parallel training
oncall: distributed
Add this issue/PR to distributed oncall triage queue
triaged
This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
🐛 Describe the bug
When using torchrun (with DDP), the error "errno: 98 - Address already in use" can occur intermittently.
Further details are described in this forum thread:
https://discuss.pytorch.org/t/potential-issue-of-errno-98-address-already-in-use-error-in-ddp-with-torchrun/202922
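For context, here is a minimal sketch of how errno 98 arises at the OS level. This uses only the Python standard library and is not the actual torchrun code path: the OS returns EADDRINUSE (errno 98 on Linux) when a socket tries to bind a port that another socket already holds, which is what can happen when the master/rendezvous port of a previous or concurrent run is still occupied.

```python
import errno
import socket

# Minimal sketch (not the actual torchrun code path): errno 98
# (EADDRINUSE on Linux) is what the OS returns when a socket tries
# to bind a port that another socket already holds -- e.g. when the
# master/rendezvous port of a previous run is still occupied.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
port = first.getsockname()[1]
first.listen(1)

second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
caught = None
try:
    second.bind(("127.0.0.1", port))  # same port while it is still in use
except OSError as exc:
    caught = exc.errno
finally:
    second.close()
    first.close()

print(caught == errno.EADDRINUSE)    # True: "Address already in use"
```

The intermittent nature of the report is consistent with this: whether the bind fails depends on whether some other process (or a lingering socket from an earlier run) happens to hold the port at that moment.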
Versions
Tested on torch 1.10 through torch 2.1.0; the problem occurs in all of them.
(The newest torch 2.3 has not been tested, but I read its source code and the relevant code is unchanged, so I believe the problem still exists in torch 2.3.)
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k