Frequently fails in training #429

Open
C-de-Furina opened this issue Apr 19, 2024 · 0 comments

@C-de-Furina

[screenshot attachment]

I noticed that every time 3 epochs of training complete, the training process fails.

[screenshot: GPU status showing the process still running with memory allocated but no GPU utilization]

Here you can see the process is still running and memory is still allocated, but the GPUs are not actually doing any work. I'm sure memory usage stays around 60 GB during this period, yet it still reports:

RuntimeError: CUDA out of memory. Tried to allocate 3.12 GiB (GPU 0; 79.20 GiB total capacity; 10.85 GiB already allocated; 885.31 MiB free; 14.73 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
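
The error text itself points at PYTORCH_CUDA_ALLOC_CONF and max_split_size_mb. As a minimal sketch (not taken from this report, and using an arbitrary 128 MiB split size), the snippet below shows how that allocator option can be set before CUDA is initialized, and how the allocator's reserved-vs-allocated statistics can be printed to check for fragmentation:

    import os

    # Must be set before the first CUDA allocation; 128 MiB is an assumed example value.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

    import torch

    def report_gpu_memory(device: int = 0) -> None:
        """Print the caching allocator's statistics for one GPU."""
        print(torch.cuda.memory_summary(device=device, abbreviated=True))
        allocated_gib = torch.cuda.memory_allocated(device) / 2**30
        reserved_gib = torch.cuda.memory_reserved(device) / 2**30
        print(f"allocated: {allocated_gib:.2f} GiB, reserved: {reserved_gib:.2f} GiB")

    if torch.cuda.is_available():
        report_gpu_memory(0)

If reserved memory is far larger than allocated memory, fragmentation is the likely culprit, which is exactly the case the error message's max_split_size_mb hint is meant to address.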

Here are my arguments:

    --template_release_dates_cache_path=~~~/openfold/mmcif_cache.json \
    --precision=bf16 \
    --gpus=2 --replace_sampler_ddp=True \
    --seed=42 \
    --deepspeed_config_path=~~~/openfold/deepspeed_config.json \
    --checkpoint_every_epoch \
    --obsolete_pdbs_file_path=~~~/pdb_mmcif/obsolete.dat \
    --max_epochs=100 \
    --train_epoch_len=200 \
    --config_preset="model_5_multimer_v3" \
    --num_nodes=1 \
    --train_mmcif_data_cache_path=~~~/openfold/mmcif_cache.json \