An NCCL timeout and the error "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate more than 1EB memory" #1284

Can an NCCL timeout lead to the error "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate more than 1EB memory"?

Comments
It could of course be that a timeout error from NCCL results in cascading errors from the application (the error message you quoted does not appear to be generated by NCCL). Are you actually seeing an NCCL timeout, or are you merely speculating as to the cause? Running with NCCL_DEBUG=WARN would show whether NCCL itself reported anything.
These are the logs of a real training task, and I found a similar issue: https://discuss.pytorch.org/t/saving-state-dict-with-optimizer-state-sharding/170324
Thank you for the additional info. So it looks like PyTorch times out on an NCCL broadcast operation (after 1800s), and that appears to be followed by an out-of-memory message. My recommendation is to ignore the out-of-memory message for now, as it is most likely a bug in the application or the framework, triggered by the timeout. So the timeout is the one to focus on. But we can't make any progress on the timeout without seeing the actual error messages from NCCL. For that, the application will need to be run with the NCCL_DEBUG=WARN environment variable. Also, we don't know anything about the machine you are running on, the NCCL version used, any NCCL tweaks being applied, etc.
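For readers landing here: below is a minimal sketch of how that debug output, and the 1800s timeout itself, can be configured on the PyTorch side. NCCL_DEBUG is a standard NCCL environment variable and the timeout argument is part of torch.distributed; the specific values chosen are illustrative only.

```python
import os
from datetime import timedelta

import torch.distributed as dist

# Standard NCCL debug setting: make NCCL print its own warnings/errors,
# so a real NCCL failure shows up instead of only the framework's timeout.
os.environ["NCCL_DEBUG"] = "WARN"   # "INFO" gives full initialization detail

# torch.distributed's default collective timeout for the NCCL backend is
# 30 minutes -- the 1800s seen in the log above. A shorter value makes a
# hang surface faster while debugging.
dist.init_process_group(
    backend="nccl",                 # assumes the usual env:// rendezvous
    timeout=timedelta(minutes=10),  # variables set by torchrun or the launcher
)
```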
Thanks for your quick response. In this case we found the cause of the timeout; we just wanted to understand the relationship between the NCCL timeout and the out-of-memory error. You have actually already answered my questions. Thank you again.
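For context on the thread linked above: a broadcast timeout like this commonly arises when a collective is entered by only some ranks, e.g. consolidating sharded optimizer state on rank 0 only while the other ranks proceed to the next collective. A hypothetical sketch of that mismatch, assuming a torch.distributed.optim.ZeroRedundancyOptimizer setup (as the linked thread's title suggests); all names and the failure mode are illustrative, not taken from this issue's logs.

```python
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer

def save_checkpoint_buggy(model, optimizer: ZeroRedundancyOptimizer, path):
    # Anti-pattern: consolidate_state_dict() is itself a collective, so
    # guarding it with a rank check leaves rank 0 blocked inside it while
    # the other ranks wait in a *different* collective (e.g. the training
    # loop's next broadcast) until the 1800s NCCL timeout fires.
    if dist.get_rank() == 0:
        optimizer.consolidate_state_dict(to=0)  # hangs: other ranks never join
        torch.save({"model": model.state_dict(),
                    "optim": optimizer.state_dict()}, path)

def save_checkpoint_fixed(model, optimizer: ZeroRedundancyOptimizer, path):
    # Every rank participates in the collective; only rank 0 writes the file.
    optimizer.consolidate_state_dict(to=0)
    if dist.get_rank() == 0:
        torch.save({"model": model.state_dict(),
                    "optim": optimizer.state_dict()}, path)
```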