
Training will get stuck and stop without reporting an error #37

Open
YooWang opened this issue Apr 27, 2023 · 3 comments

Comments


YooWang commented Apr 27, 2023

I set deterministic to False and training starts successfully. But at about 68% of epoch 1, training gets stuck: it stops making progress without reporting any error. How can I solve this?

RetroCirce (Owner) commented May 1, 2023

Did you try training and testing on a single GPU first? Setting deterministic does not cause the hang. I ran into this problem once before, and updating PyTorch Lightning fixed it. A likely cause is the multi-GPU training stage, where the GPUs deadlock while waiting for each other to synchronize.
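
For example, here is a minimal, self-contained sketch (PyTorch Lightning 1.5.x API) for checking whether the hang is specific to multi-GPU training. The tiny model and random data below are placeholders, not the model or dataset from this repository:

```python
# Minimal single-GPU run to isolate the hang (PyTorch Lightning 1.5.x API).
# TinyModule and the random TensorDataset are placeholders, not this repo's code.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TinyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


train_loader = DataLoader(
    TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,))),
    batch_size=16,
)

# Single GPU, no DDP: if this completes but training still hangs with
# gpus=2 and strategy="ddp", the problem is in the multi-GPU sync.
trainer = pl.Trainer(gpus=1, max_epochs=1, deterministic=False)
trainer.fit(TinyModule(), train_loader)
```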

YooWang (Author) commented May 2, 2023

Thank you for your reply. I will try testing with a single GPU. By the way, which versions of CUDA, PyTorch, and PyTorch Lightning did you end up using?

RetroCirce (Owner) commented

I used pytorch_lightning==1.5.9 and CUDA 10.1. But I think newer versions of PyTorch Lightning also work; they just need a bit of tweaking.
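
For reference, a quick way to print the versions installed in the current environment and compare them against that combination (this only reports versions, it does not pin them):

```python
# Report the installed versions of the relevant packages so they can be
# compared against the combination above (pytorch_lightning 1.5.9, CUDA 10.1).
import torch
import pytorch_lightning as pl

print("torch:", torch.__version__)
print("pytorch_lightning:", pl.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
```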
