DistributedHelper CI is failing #1501

Open
AntonioCarta opened this issue Sep 18, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@AntonioCarta
Collaborator

@lrzpellegrini check this error.

AntonioCarta added the bug label Sep 18, 2023
@lrzpellegrini
Collaborator

Hi Antonio, this looks like a spurious error. Alas, when a process that used the PyTorch distributed training API together with unittest shuts down, its teardown procedures may raise an error. This is not a bug in the distributed training helper.

One easy way to work around this is to re-run the unit tests that raised an error. I think I can adapt https://github.com/ContinualAI/avalanche/blob/master/tests/run_dist_tests.py to achieve this.
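
For illustration, the retry idea could look roughly like the following. This is only a minimal sketch and not the contents of run_dist_tests.py: the function name `run_with_retries` and the `max_attempts` parameter are hypothetical, and it assumes each test case can be launched as its own command.

```python
import subprocess
import sys
from typing import Sequence


def run_with_retries(cmd: Sequence[str], max_attempts: int = 3) -> int:
    """Run `cmd` up to `max_attempts` times.

    Returns the exit code of the last attempt (0 as soon as one attempt succeeds).
    `run_with_retries` is a hypothetical helper, not part of Avalanche.
    """
    returncode = 1  # reported if max_attempts < 1 and nothing was run
    for attempt in range(1, max_attempts + 1):
        returncode = subprocess.run(list(cmd)).returncode
        if returncode == 0:
            break
        print(
            f"Attempt {attempt}/{max_attempts} failed with exit code {returncode}",
            file=sys.stderr,
        )
    return returncode
```

A test launched with `python -m unittest <test id>` could then be wrapped as `run_with_retries([sys.executable, "-m", "unittest", test_id])`, where `test_id` is whichever case failed. One caveat: blind retries can also mask genuinely flaky tests, so the number of attempts should probably stay small.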

@AntonioCarta
Collaborator Author

Maybe we should run those tests without unittest?
