Multi node training #126

Open
drhicks opened this issue Aug 18, 2023 · 3 comments

Comments

@drhicks

drhicks commented Aug 18, 2023

I was able to successfully train multimer on a single node with multiple GPUs, but I have been having trouble modifying the training example to train on multiple nodes. Would it be possible to provide an example for multi-node training?

I'm not sure how to properly modify the torchrun command or unicore arguments.
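
(For reference, a multi-node torchrun launch generally has the shape sketched below; the hostnames and script name are placeholders, not Uni-Fold defaults.)

```bash
# Minimal sketch of a two-node torchrun launch; run one command per node.
# --master_addr must point at the rank-0 host from *every* node.
# On node 0:
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 \
    --master_addr=node0.cluster --master_port=10087 train.py
# On node 1 (identical except --node_rank):
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=1 \
    --master_addr=node0.cluster --master_port=10087 train.py
```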

@ZiyaoLi
Member

ZiyaoLi commented Aug 25, 2023

Can you please provide more info on the code you are using and the failure message? The code is expected to handle multi-node training with torch.distributed, so to me it looks like a configuration problem in your distributed setup.
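
Roughly, the values that have to agree across nodes for torch.distributed to rendezvous look like this (illustrative values only; the names mirror the variables in the training script):

```bash
# Illustrative values; names mirror the Uni-Fold training script.
MASTER_IP=172.16.130.196   # address of the rank-0 node -- identical on every node
MASTER_PORT=10087          # identical on every node
OMPI_COMM_WORLD_SIZE=2     # number of nodes -- identical on every node
OMPI_COMM_WORLD_RANK=0     # the only per-node value: 0 on the master, 1 on the other node
```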

@drhicks
Author

drhicks commented Aug 26, 2023

Thanks for the help. I am sure I am just doing something stupid here.

I submit to slurm like this (everything works with single node):

sbatch -p gpu-train --gres=gpu:l40:8 -N 2 -c4 -n8 -t 99:00:00 --wrap='source activate unifold; cd /home/drhicks1/Uni-Fold; bash train_multimer.sh /databases/unifold/ multimer_unifold_ft params/multimer.unifold.pt multimer'

The training script is the same as the examples, except for these changes:
MASTER_IP=$(hostname -I | awk '{print $1}')
OMPI_COMM_WORLD_SIZE=$SLURM_NNODES
OMPI_COMM_WORLD_RANK=$SLURM_NODEID

code:

```bash
# Distributed setup: master address/port, GPUs per node, node count/rank from SLURM.
[ -z "${MASTER_PORT}" ] && MASTER_PORT=10087
[ -z "${MASTER_IP}" ] && MASTER_IP=$(hostname -I | awk '{print $1}')
[ -z "${n_gpu}" ] && n_gpu=$(nvidia-smi -L | wc -l)
# Training hyperparameter defaults.
[ -z "${update_freq}" ] && update_freq=1
[ -z "${total_step}" ] && total_step=10000
[ -z "${warmup_step}" ] && warmup_step=500
[ -z "${decay_step}" ] && decay_step=10000
[ -z "${decay_ratio}" ] && decay_ratio=1.0
[ -z "${sd_prob}" ] && sd_prob=0.5
[ -z "${lr}" ] && lr=5e-4
[ -z "${seed}" ] && seed=31
[ -z "${OMPI_COMM_WORLD_SIZE}" ] && OMPI_COMM_WORLD_SIZE=$SLURM_NNODES
[ -z "${OMPI_COMM_WORLD_RANK}" ] && OMPI_COMM_WORLD_RANK=$SLURM_NODEID

export NCCL_ASYNC_ERROR_HANDLING=1
export OMP_NUM_THREADS=1
echo "n_gpu per node" $n_gpu
echo "OMPI_COMM_WORLD_SIZE" $OMPI_COMM_WORLD_SIZE
echo "OMPI_COMM_WORLD_RANK" $OMPI_COMM_WORLD_RANK
echo "MASTER_IP" $MASTER_IP
echo "MASTER_PORT" $MASTER_PORT
echo "data" $1
echo "save_dir" $2
echo "decay_step" $decay_step
echo "warmup_step" $warmup_step
echo "decay_ratio" $decay_ratio
echo "lr" $lr
echo "total_step" $total_step
echo "update_freq" $update_freq
echo "seed" $seed
echo "data_folder:"
ls $1
echo "create folder for save"
mkdir -p $2
echo "start training"

# Resume from an existing checkpoint if present; otherwise finetune from the given model.
OPTION=""
if [ -f "$2/checkpoint_last.pt" ]; then
    echo "ckp exists."
else
    echo "finetuning from inital training..."
    OPTION=" --finetune-from-model $3 --load-from-ema "
fi
model_name=$4

tmp_dir=$(mktemp -d)

torchrun --nproc_per_node=$n_gpu --master_port $MASTER_PORT --nnodes=$OMPI_COMM_WORLD_SIZE --node_rank=$OMPI_COMM_WORLD_RANK --master_addr=$MASTER_IP \
    $(which unicore-train) $1 --user-dir unifold \
    --num-workers 4 --ddp-backend=no_c10d \
    --task af2 --loss afm --arch af2 --sd-prob $sd_prob \
    --optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-6 --clip-norm 0.0 --per-sample-clip-norm 0.1 --allreduce-fp32-grad \
    --lr-scheduler exponential_decay --lr $lr --warmup-updates $warmup_step --decay-ratio $decay_ratio --decay-steps $decay_step --stair-decay --batch-size 1 \
    --update-freq $update_freq --seed $seed --tensorboard-logdir $2/tsb/ \
    --max-update $total_step --max-epoch 1 --log-interval 10 --log-format simple \
    --save-interval-updates 500 --validate-interval-updates 500 --keep-interval-updates 40 --no-epoch-checkpoints \
    --save-dir $2 --tmp-save-dir $tmp_dir --required-batch-size-multiple 1 --bf16 --ema-decay 0.999 --data-buffer-size 32 --bf16-sr --model-name $model_name $OPTION

rm -rf $tmp_dir
```

Below is the log output. It just hangs forever after this:

n_gpu per node 8
OMPI_COMM_WORLD_SIZE 2
OMPI_COMM_WORLD_RANK 0
MASTER_IP 172.16.130.196
MASTER_PORT 10087
data /databases/openfold/unifold/
save_dir multimer_unifold_ft3
decay_step 10000
warmup_step 500
decay_ratio 1.0
lr 5e-4
total_step 10000
update_freq 1
seed 31
data_folder:
eval_multi_label.json
eval_sample_weight.json
pdb_assembly.json
pdb_features
pdb_labels
pdb_uniprots
sd_features
sd_labels
sd_train_sample_weight.json
train_multi_label.json
train_sample_weight.json
create folder for save
start training
finetuning from inital training...

@jozhang97

I think the master addr/IP is not set properly; each node sets itself as the master.

see https://discuss.pytorch.org/t/distributed-training-on-slurm-cluster/150417/8
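
A sketch of that fix for a SLURM job (untested here; sbatch directives abbreviated): derive the master address from the job's node list so every node agrees on it, and launch one copy of the script per node with srun instead of relying on the batch step alone.

```bash
#!/bin/bash
#SBATCH -N 2
#SBATCH --gres=gpu:l40:8
#SBATCH -c 4

# Every node resolves the same master: the first host in the job's node list.
export MASTER_IP=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=10087
export OMPI_COMM_WORLD_SIZE=$SLURM_NNODES
# Do NOT export OMPI_COMM_WORLD_RANK here: the training script falls back to
# SLURM_NODEID, which srun sets per task (0 on the first node, 1 on the second).

source activate unifold
cd /home/drhicks1/Uni-Fold

# One task per node; each task runs torchrun for its own node.
srun --ntasks=$SLURM_NNODES --ntasks-per-node=1 \
    bash train_multimer.sh /databases/unifold/ multimer_unifold_ft params/multimer.unifold.pt multimer
```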
