
[Question about multi-gpu training] #170

Open
FindTheTruth opened this issue Apr 10, 2024 · 1 comment
@FindTheTruth

When I try to train the NLG model on multiple GPUs, I use this command:

python -m torch.distributed.launch --nproc_per_node=2  --use_env src/gpt2_ft.py \
    --train_data ./data/e2e/train.jsonl \
    --valid_data ./data/e2e/valid.jsonl \
    --train_batch_size 8 \
    --grad_acc 1 \
    --valid_batch_size 4 \
    --seq_len 512 \
    --model_card gpt2.md \
    --init_checkpoint ./pretrained_checkpoints/gpt2-medium-pytorch_model.bin \
    --platform local \
    --clip 0.0 \
    --lr 0.0002 \
    --weight_decay 0.01 \
    --correct_bias \
    --adam_beta2 0.999 \
    --scheduler linear \
    --warmup_step 500 \
    --max_epoch 5 \
    --save_interval 1000 \
    --lora_dim 4 \
    --lora_alpha 32 \
    --lora_dropout 0.1 \
    --label_smooth 0.1 \
    --work_dir ./trained_models/GPT2_M/e2e \
    --random_seed 110

but torch reports an error: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
Is there any way to solve this problem?

@RayCyder

Change lm_net = lm_net.cuda() to lm_net = lm_net.to(args.device) in gpt2_ft.py.
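
For context, here is a minimal sketch of the device placement this fix implies, assuming --use_env (which makes torch.distributed.launch export LOCAL_RANK to each worker) and using nn.Linear as a stand-in for the actual lm_net model:

    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    # --use_env makes torch.distributed.launch export LOCAL_RANK per worker
    local_rank = int(os.environ["LOCAL_RANK"])

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)  # roughly what args.device resolves to

    model = nn.Linear(512, 512)  # stand-in for lm_net in gpt2_ft.py
    # .to(device) pins the model to this rank's GPU; a bare .cuda() resolves to the
    # current device (cuda:0 unless set_device was called first), so one rank can
    # end up mixing cuda:0 and cuda:1 tensors, which is exactly the error above
    model = model.to(device)
    model = DDP(model, device_ids=[local_rank], output_device=local_rank)

Moving the model to the per-rank device keeps every parameter on the same GPU as that rank's input batch, which is what the error message is complaining about.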
