
an NCCL timeout and the error "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate more than 1EB memory" #1284

Closed
jessiewy opened this issue May 13, 2024 · 5 comments

Comments

@jessiewy

Can an NCCL timeout lead to the error "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate more than 1EB memory"?

@kiskra-nvidia
Member

It could be, of course, that a timeout error from NCCL results in cascading errors from the application (the error message you quoted does not appear to be generated by NCCL). Are you actually seeing an NCCL timeout, or are you merely speculating as to the cause? Running with NCCL_DEBUG=WARN might shed some light if there is an NCCL issue...

@jessiewy
Author

These are the logs of a real training task, and I found a similar issue: https://discuss.pytorch.org/t/saving-state-dict-with-optimizer-state-sharding/170324
[screenshot of the training log]

@kiskra-nvidia
Member

Thank you for the additional info.

So it looks like PyTorch times out on an NCCL broadcast operation (after 1800 s), and that appears to be followed by an out-of-memory message. My recommendation is to ignore the out-of-memory message for now, as it is most likely a bug in the application or the framework, triggered by the timeout. So the timeout is the one to focus on.
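For reference, the 1800 s figure matches PyTorch's default process-group timeout for the NCCL backend. A minimal sketch of where that value lives and how it can be adjusted (assuming the usual torchrun/env:// style launch; the snippet is illustrative, not taken from the log above):

```python
import datetime
import torch.distributed as dist

# PyTorch's default collective timeout for the NCCL backend is 30 minutes
# (1800 s). If a rank fails to join a collective (e.g. the broadcast in the
# log) within that window, the watchdog raises the timeout error quoted above.
# Raising the value only delays the symptom; the stuck rank still has to be found.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(minutes=30),  # the default; increase only for known-slow steps
)
```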

But we can't make any progress on the timeout without seeing the actual error messages from NCCL. For that, the application will need to be run with the NCCL_DEBUG=WARN environment variable set, which results in more verbose output when NCCL detects an error. We'll need to see any such messages. If there's nothing (i.e., NCCL does not detect an error), you'll need to run with NCCL_DEBUG=INFO, which prints additional debug information during initialization. Hopefully it won't be too much in your case (what scale was this run at?).
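If it helps, a minimal sketch of setting this from the training script itself (the variables must be set before the process group, and thus the NCCL communicator, is created; exporting them in the launcher's environment works equally well, and the log path below is only an example):

```python
import os

# NCCL reads these when the communicator is created, so set them before
# torch.distributed.init_process_group() runs -- or export them in the
# job launcher's environment instead.
os.environ["NCCL_DEBUG"] = "WARN"  # start with WARN; switch to INFO if nothing is reported
# Optional: write per-rank debug output to files instead of stdout
# (%h expands to the hostname, %p to the process id).
os.environ["NCCL_DEBUG_FILE"] = "/tmp/nccl-debug.%h.%p.log"
```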

Also, we don't know anything about the machine you are running on, the NCCL version used, any NCCL tweaks being applied, etc.

@jessiewy
Author

Thank you for your quick response. In this case we found the cause of the timeout; I just wanted to understand the relationship between an NCCL timeout and the out-of-memory error. You have essentially answered my question. Thank you again.
