Multi node training #126

Open
drhicks opened this issue Aug 18, 2023 · 3 comments

Comments

@drhicks

drhicks commented Aug 18, 2023

I was able to successfully train multimer on a single node with multiple GPUs, but I have been having trouble modifying the training example to train on multiple nodes. Would it be possible to provide an example for multi-node training?

I'm not sure how to properly modify the torchrun command or unicore arguments.
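
(For reference, a multi-node torchrun launch generally has the shape sketched below; the hostnames and script name are placeholders, not Uni-Fold defaults.)

```bash
# Minimal sketch of a two-node torchrun launch; run one command per node.
# --master_addr must point at the rank-0 host from *every* node.
# On node 0:
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 \
    --master_addr=node0.cluster --master_port=10087 train.py
# On node 1 (identical except --node_rank):
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=1 \
    --master_addr=node0.cluster --master_port=10087 train.py
```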

@ZiyaoLi
Member

ZiyaoLi commented Aug 25, 2023

Can you please provide more info on the code you are using and the failure message? The code is expected to handle multi-node training with torch.distributed, so to me it looks like a configuration problem in your distributed setup.
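
Roughly, the values that have to agree across nodes for torch.distributed to rendezvous look like this (illustrative values only; the names mirror the variables in the training script):

```bash
# Illustrative values; names mirror the Uni-Fold training script.
MASTER_IP=172.16.130.196   # address of the rank-0 node -- identical on every node
MASTER_PORT=10087          # identical on every node
OMPI_COMM_WORLD_SIZE=2     # number of nodes -- identical on every node
OMPI_COMM_WORLD_RANK=0     # the only per-node value: 0 on the master, 1 on the other node
```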

@drhicks
Author

drhicks commented Aug 26, 2023

Thanks for the help. I am sure I am just doing something stupid here.

I submit to slurm like this (everything works with single node):

sbatch -p gpu-train --gres=gpu:l40:8 -N 2 -c4 -n8 -t 99:00:00 --wrap='source activate unifold; cd /home/drhicks1/Uni-Fold; bash train_multimer.sh /databases/unifold/ multimer_unifold_ft params/multimer.unifold.pt multimer'

The training script is the same as the examples, except for these changes:
MASTER_IP=$(hostname -I | awk '{print $1}')
OMPI_COMM_WORLD_SIZE=$SLURM_NNODES
OMPI_COMM_WORLD_RANK=$SLURM_NODEID

code:

```bash
# Distributed setup: master address/port, GPUs per node, node count/rank from SLURM.
[ -z "${MASTER_PORT}" ] && MASTER_PORT=10087
[ -z "${MASTER_IP}" ] && MASTER_IP=$(hostname -I | awk '{print $1}')
[ -z "${n_gpu}" ] && n_gpu=$(nvidia-smi -L | wc -l)
# Training hyperparameter defaults.
[ -z "${update_freq}" ] && update_freq=1
[ -z "${total_step}" ] && total_step=10000
[ -z "${warmup_step}" ] && warmup_step=500
[ -z "${decay_step}" ] && decay_step=10000
[ -z "${decay_ratio}" ] && decay_ratio=1.0
[ -z "${sd_prob}" ] && sd_prob=0.5
[ -z "${lr}" ] && lr=5e-4
[ -z "${seed}" ] && seed=31
[ -z "${OMPI_COMM_WORLD_SIZE}" ] && OMPI_COMM_WORLD_SIZE=$SLURM_NNODES
[ -z "${OMPI_COMM_WORLD_RANK}" ] && OMPI_COMM_WORLD_RANK=$SLURM_NODEID

export NCCL_ASYNC_ERROR_HANDLING=1
export OMP_NUM_THREADS=1
echo "n_gpu per node" $n_gpu
echo "OMPI_COMM_WORLD_SIZE" $OMPI_COMM_WORLD_SIZE
echo "OMPI_COMM_WORLD_RANK" $OMPI_COMM_WORLD_RANK
echo "MASTER_IP" $MASTER_IP
echo "MASTER_PORT" $MASTER_PORT
echo "data" $1
echo "save_dir" $2
echo "decay_step" $decay_step
echo "warmup_step" $warmup_step
echo "decay_ratio" $decay_ratio
echo "lr" $lr
echo "total_step" $total_step
echo "update_freq" $update_freq
echo "seed" $seed
echo "data_folder:"
ls $1
echo "create folder for save"
mkdir -p $2
echo "start training"

# Resume from an existing checkpoint if present; otherwise finetune from the given model.
OPTION=""
if [ -f "$2/checkpoint_last.pt" ]; then
    echo "ckp exists."
else
    echo "finetuning from inital training..."
    OPTION=" --finetune-from-model $3 --load-from-ema "
fi
model_name=$4

tmp_dir=$(mktemp -d)

torchrun --nproc_per_node=$n_gpu --master_port $MASTER_PORT --nnodes=$OMPI_COMM_WORLD_SIZE --node_rank=$OMPI_COMM_WORLD_RANK --master_addr=$MASTER_IP \
    $(which unicore-train) $1 --user-dir unifold \
    --num-workers 4 --ddp-backend=no_c10d \
    --task af2 --loss afm --arch af2 --sd-prob $sd_prob \
    --optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-6 --clip-norm 0.0 --per-sample-clip-norm 0.1 --allreduce-fp32-grad \
    --lr-scheduler exponential_decay --lr $lr --warmup-updates $warmup_step --decay-ratio $decay_ratio --decay-steps $decay_step --stair-decay --batch-size 1 \
    --update-freq $update_freq --seed $seed --tensorboard-logdir $2/tsb/ \
    --max-update $total_step --max-epoch 1 --log-interval 10 --log-format simple \
    --save-interval-updates 500 --validate-interval-updates 500 --keep-interval-updates 40 --no-epoch-checkpoints \
    --save-dir $2 --tmp-save-dir $tmp_dir --required-batch-size-multiple 1 --bf16 --ema-decay 0.999 --data-buffer-size 32 --bf16-sr --model-name $model_name $OPTION

rm -rf $tmp_dir
```

Below is the log output. It just hangs forever after this:

n_gpu per node 8
OMPI_COMM_WORLD_SIZE 2
OMPI_COMM_WORLD_RANK 0
MASTER_IP 172.16.130.196
MASTER_PORT 10087
data /databases/openfold/unifold/
save_dir multimer_unifold_ft3
decay_step 10000
warmup_step 500
decay_ratio 1.0
lr 5e-4
total_step 10000
update_freq 1
seed 31
data_folder:
eval_multi_label.json
eval_sample_weight.json
pdb_assembly.json
pdb_features
pdb_labels
pdb_uniprots
sd_features
sd_labels
sd_train_sample_weight.json
train_multi_label.json
train_sample_weight.json
create folder for save
start training
finetuning from inital training...

@jozhang97

I think the master addr/IP is not set properly; each node sets itself as the master.

see https://discuss.pytorch.org/t/distributed-training-on-slurm-cluster/150417/8
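
A sketch of that fix for a SLURM job (untested here; sbatch directives abbreviated): derive the master address from the job's node list so every node agrees on it, and launch one copy of the script per node with srun instead of relying on the batch step alone.

```bash
#!/bin/bash
#SBATCH -N 2
#SBATCH --gres=gpu:l40:8
#SBATCH -c 4

# Every node resolves the same master: the first host in the job's node list.
export MASTER_IP=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=10087
export OMPI_COMM_WORLD_SIZE=$SLURM_NNODES
# Do NOT export OMPI_COMM_WORLD_RANK here: the training script falls back to
# SLURM_NODEID, which srun sets per task (0 on the first node, 1 on the second).

source activate unifold
cd /home/drhicks1/Uni-Fold

# One task per node; each task runs torchrun for its own node.
srun --ntasks=$SLURM_NNODES --ntasks-per-node=1 \
    bash train_multimer.sh /databases/unifold/ multimer_unifold_ft params/multimer.unifold.pt multimer
```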
