Frequently fails in training #429

Open
C-de-Furina opened this issue Apr 19, 2024 · 0 comments

@C-de-Furina

[screenshot attachment]

I noticed that every time 3 epochs of training complete, the training process fails.

[screenshot: GPU status showing the process still running with memory allocated but no GPU utilization]

Here you can see the process is still running and memory is still allocated, but the GPUs are not actually doing any work. I'm sure memory usage stays around 60 GB during this period, yet it still reports:

RuntimeError: CUDA out of memory. Tried to allocate 3.12 GiB (GPU 0; 79.20 GiB total capacity; 10.85 GiB already allocated; 885.31 MiB free; 14.73 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
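
The error text itself points at PYTORCH_CUDA_ALLOC_CONF and max_split_size_mb. As a minimal sketch (not taken from this report, and using an arbitrary 128 MiB split size), the snippet below shows how that allocator option can be set before CUDA is initialized, and how the allocator's reserved-vs-allocated statistics can be printed to check for fragmentation:

    import os

    # Must be set before the first CUDA allocation; 128 MiB is an assumed example value.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

    import torch

    def report_gpu_memory(device: int = 0) -> None:
        """Print the caching allocator's statistics for one GPU."""
        print(torch.cuda.memory_summary(device=device, abbreviated=True))
        allocated_gib = torch.cuda.memory_allocated(device) / 2**30
        reserved_gib = torch.cuda.memory_reserved(device) / 2**30
        print(f"allocated: {allocated_gib:.2f} GiB, reserved: {reserved_gib:.2f} GiB")

    if torch.cuda.is_available():
        report_gpu_memory(0)

If reserved memory is far larger than allocated memory, fragmentation is the likely culprit, which is exactly the case the error message's max_split_size_mb hint is meant to address.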

Here are my arguments:

    --template_release_dates_cache_path=~~~/openfold/mmcif_cache.json \
    --precision=bf16 \
    --gpus=2 --replace_sampler_ddp=True \
    --seed=42 \
    --deepspeed_config_path=~~~/openfold/deepspeed_config.json \
    --checkpoint_every_epoch \
    --obsolete_pdbs_file_path=~~~/pdb_mmcif/obsolete.dat \
    --max_epochs=100 \
    --train_epoch_len=200 \
    --config_preset="model_5_multimer_v3" \
    --num_nodes=1 \
    --train_mmcif_data_cache_path=~~~/openfold/mmcif_cache.json \