I noticed that training fails every time 3 epochs have completed.
Here you can see that the process is still running and memory is still allocated, but the GPUs are not actually doing any work. I'm sure memory usage stays around 60 GB the whole time, but I get this error:
RuntimeError: CUDA out of memory. Tried to allocate 3.12 GiB (GPU 0; 79.20 GiB total capacity; 10.85 GiB already allocated; 885.31 MiB free; 14.73 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
- Precision: On A100s, OpenFold training works best with bfloat16 precision (e.g. `--precision bf16-mixed`)
It looks like a GPU setup problem. Did you try the suggestion from the error message?
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
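For reference, here is a minimal sketch of one way to apply that suggestion from Python. The helper name `configure_cuda_allocator` and the value 512 are illustrative assumptions, not recommendations from the error message or from OpenFold; the variable has to be set before the process makes its first CUDA allocation, so this would go at the very top of the training entry point (or be exported in the shell before launching).

```python
import os


def configure_cuda_allocator(max_split_size_mb: int = 512) -> str:
    """Set PYTORCH_CUDA_ALLOC_CONF for the current process.

    Hypothetical helper: the 512 MiB default is only a starting point
    to experiment with, not a tuned value.
    """
    if max_split_size_mb <= 0:
        raise ValueError("max_split_size_mb must be a positive integer")
    value = f"max_split_size_mb:{max_split_size_mb}"
    # PyTorch reads this variable when its CUDA caching allocator starts up,
    # so it must be in the environment before any tensor is moved to the GPU.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = value
    return value


alloc_conf = configure_cuda_allocator(512)
print(f"PYTORCH_CUDA_ALLOC_CONF={alloc_conf}")

# Import torch and start training only after this point.
```

If the error persists with the setting in place, fragmentation is probably not the main cause, and it may be worth checking what else is holding memory on GPU 0.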
Here are my arguments: