
an NCCL timeout and the error "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate more than 1EB memory" #1284

Closed
jessiewy opened this issue May 13, 2024 · 5 comments

Comments

@jessiewy

Can an NCCL timeout lead to the error "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate more than 1EB memory"?

@kiskra-nvidia
Member

It could be, of course, that a timeout error from NCCL results in cascading errors from the application (the error message you quoted does not appear to be generated by NCCL). Are you actually seeing an NCCL timeout, or are you merely speculating as to the cause? Running with NCCL_DEBUG=WARN might shed some light if there is an NCCL issue...

@jessiewy
Author

These are the logs of a real training task, and I found a similar issue: https://discuss.pytorch.org/t/saving-state-dict-with-optimizer-state-sharding/170324
[screenshot of the training log]

@kiskra-nvidia
Member

Thank you for the additional info.

So it looks like PyTorch times out on an NCCL broadcast operation (after 1800 s), and that appears to be followed by an out-of-memory message. My recommendation is to ignore the out-of-memory message for now, as it is most likely a bug in the application or the framework, triggered by the timeout. So the timeout is the one to focus on.
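For reference, the 1800 s figure matches PyTorch's default process-group timeout for the NCCL backend. A minimal sketch of where that value lives and how it can be adjusted (assuming the usual torchrun/env:// style launch; the snippet is illustrative, not taken from the log above):

```python
import datetime
import torch.distributed as dist

# PyTorch's default collective timeout for the NCCL backend is 30 minutes
# (1800 s). If a rank fails to join a collective (e.g. the broadcast in the
# log) within that window, the watchdog raises the timeout error quoted above.
# Raising the value only delays the symptom; the stuck rank still has to be found.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(minutes=30),  # the default; increase only for known-slow steps
)
```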

But we can't make any progress on the timeout without seeing the actual error messages from NCCL. For that, the application will need to be run with the NCCL_DEBUG=WARN environment variable set, which results in more verbose output when NCCL detects an error. We'll need to see any such messages. If there's nothing (i.e., NCCL does not detect an error), you'll need to run with NCCL_DEBUG=INFO, which prints additional debug information during initialization. Hopefully it won't be too much in your case (what scale was this run at?).
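If it helps, a minimal sketch of setting this from the training script itself (the variables must be set before the process group, and thus the NCCL communicator, is created; exporting them in the launcher's environment works equally well, and the log path below is only an example):

```python
import os

# NCCL reads these when the communicator is created, so set them before
# torch.distributed.init_process_group() runs -- or export them in the
# job launcher's environment instead.
os.environ["NCCL_DEBUG"] = "WARN"  # start with WARN; switch to INFO if nothing is reported
# Optional: write per-rank debug output to files instead of stdout
# (%h expands to the hostname, %p to the process id).
os.environ["NCCL_DEBUG_FILE"] = "/tmp/nccl-debug.%h.%p.log"
```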

Also, we don't know anything about the machine you are running on, the NCCL version used, any NCCL tweaks being applied, etc.

@jessiewy
Author

Thank you for your quick response. In this case we found the cause of the timeout; I just wanted to understand the relationship between an NCCL timeout and the out-of-memory error. You have essentially answered my question. Thank you again.
