
Training will get stuck and stop without reporting an error #37

Open
YooWang opened this issue Apr 27, 2023 · 3 comments

Comments


YooWang commented Apr 27, 2023

I set deterministic to False and training starts successfully. But at about 68% of epoch 1, training gets stuck: it stops making progress without reporting any error. How can I solve this?

RetroCirce (Owner) commented May 1, 2023

Did you try training and testing on a single GPU first? Setting deterministic does not cause the hang. I ran into this problem once before, and updating PyTorch Lightning fixed it. A likely cause is the multi-GPU training stage, where the GPUs deadlock while waiting for each other to synchronize.
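
For example, here is a minimal, self-contained sketch (PyTorch Lightning 1.5.x API) for checking whether the hang is specific to multi-GPU training. The tiny model and random data below are placeholders, not the model or dataset from this repository:

```python
# Minimal single-GPU run to isolate the hang (PyTorch Lightning 1.5.x API).
# TinyModule and the random TensorDataset are placeholders, not this repo's code.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TinyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


train_loader = DataLoader(
    TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,))),
    batch_size=16,
)

# Single GPU, no DDP: if this completes but training still hangs with
# gpus=2 and strategy="ddp", the problem is in the multi-GPU sync.
trainer = pl.Trainer(gpus=1, max_epochs=1, deterministic=False)
trainer.fit(TinyModule(), train_loader)
```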

YooWang (Author) commented May 2, 2023

Thank you for your reply. I will try testing with a single GPU. By the way, which versions of CUDA, PyTorch, and PyTorch Lightning did you end up using?

RetroCirce (Owner) commented

I used pytorch_lightning==1.5.9 and CUDA 10.1. But I think newer versions of PyTorch Lightning also work; they just need a bit of tweaking.
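
For reference, a quick way to print the versions installed in the current environment and compare them against that combination (this only reports versions, it does not pin them):

```python
# Report the installed versions of the relevant packages so they can be
# compared against the combination above (pytorch_lightning 1.5.9, CUDA 10.1).
import torch
import pytorch_lightning as pl

print("torch:", torch.__version__)
print("pytorch_lightning:", pl.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
```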
