Frequently failed in training. #429

C-de-Furina · 2024-04-19T00:14:02Z

I noticed that the training process fails every time 3 epochs of training have completed.

Here you can see that the process is still running and memory is still allocated, but the GPUs are not actually doing any work. I'm sure that memory usage stays around 60 GB the whole time, but it still gives me this error:

RuntimeError: CUDA out of memory. Tried to allocate 3.12 GiB (GPU 0; 79.20 GiB total capacity; 10.85 GiB already allocated; 885.31 MiB free; 14.73 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
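One way to read the numbers in this message is to separate memory cached by this process's PyTorch allocator from memory held elsewhere on the GPU. The sketch below is only a rough diagnostic: `parse_oom` is a hypothetical helper written for this message format, and the interpretation in the comments is an inference from the figures, not a confirmed diagnosis.

```python
import re

# The error message exactly as reported above.
MSG = (
    "RuntimeError: CUDA out of memory. Tried to allocate 3.12 GiB "
    "(GPU 0; 79.20 GiB total capacity; 10.85 GiB already allocated; "
    "885.31 MiB free; 14.73 GiB reserved in total by PyTorch)"
)


def parse_oom(msg: str) -> dict:
    """Hypothetical helper: extract the sizes from the message, in GiB."""
    to_gib = {"MiB": 1 / 1024, "GiB": 1.0}
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    sizes = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, msg)
        if match is None:
            raise ValueError(f"could not find '{key}' in message")
        sizes[key] = float(match.group(1)) * to_gib[match.group(2)]
    return sizes


sizes = parse_oom(MSG)

# Memory cached by this process's allocator but not backing live tensors.
cached_unused = sizes["reserved"] - sizes["allocated"]  # about 3.88 GiB

# Memory that is neither reserved by this process's PyTorch nor free.
held_elsewhere = sizes["total"] - sizes["reserved"] - sizes["free"]  # about 63.6 GiB

print(f"requested:      {sizes['requested']:.2f} GiB")
print(f"cached, unused: {cached_unused:.2f} GiB")
print(f"held elsewhere: {held_elsewhere:.2f} GiB")
```

Under this reading, about 3.88 GiB is cached but unused, which is slightly more than the 3.12 GiB request and fits the fragmentation hint in the message. Roughly 63.6 GiB is neither reserved by this process's PyTorch allocator nor free, which is close to the ~60 GB usage reported above. That memory might belong to another process on the same GPU or to allocations made outside the PyTorch allocator, so it may be worth checking which processes hold memory on GPU 0 when the failure occurs.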

Here are my arguments:

    --template_release_dates_cache_path=~~~/openfold/mmcif_cache.json \
    --precision=bf16 \
    --gpus=2 --replace_sampler_ddp=True \
    --seed=42 \
    --deepspeed_config_path=~~~/openfold/deepspeed_config.json \
    --checkpoint_every_epoch \
    --obsolete_pdbs_file_path=~~~/pdb_mmcif/obsolete.dat \
    --max_epochs=100 \
    --train_epoch_len=200 \
    --config_preset="model_5_multimer_v3" \
    --num_nodes=1 \
    --train_mmcif_data_cache_path=~~~/openfold/mmcif_cache.json \

eamag · 2024-05-19T10:12:37Z

I see that the README recommends using mixed precision:

openfold/docs/source/Training_OpenFold.md

Line 129 in 3c1fd31

    
           - Precision: On A100s, OpenFold training works best with bfloat 16 precision (e.g. `--precision bf16-mixed`)

It looks like a GPU setup problem. Did you try the suggestion from the error message?

If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
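As a minimal sketch of that suggestion, `max_split_size_mb` can be set through the `PYTORCH_CUDA_ALLOC_CONF` environment variable, which takes comma-separated `option:value` pairs. The helper name `set_max_split_size` and the value 512 are illustrative assumptions, not settings recommended by OpenFold. The variable has to be set before PyTorch makes its first CUDA allocation, so the safest place is before `import torch` or in the shell that launches training.

```python
import os


def set_max_split_size(mb: int) -> str:
    """Set max_split_size_mb in PYTORCH_CUDA_ALLOC_CONF, keeping other options.

    Hypothetical helper for illustration; call it before importing torch.
    """
    existing = os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
    # Drop any previous max_split_size_mb entry and keep everything else.
    options = [
        opt
        for opt in existing.split(",")
        if opt and not opt.startswith("max_split_size_mb:")
    ]
    options.append(f"max_split_size_mb:{mb}")
    value = ",".join(options)
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = value
    return value


# 512 is an arbitrary example value; a suitable size depends on the workload.
conf = set_max_split_size(512)
print(conf)
```

This only addresses fragmentation inside the PyTorch caching allocator. If most of the GPU memory is held by something other than this process's allocator, changing this setting is unlikely to help on its own.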
