
[Question about multi-gpu training] #170

Open
FindTheTruth opened this issue Apr 10, 2024 · 1 comment
@FindTheTruth

When I try to train the NLG model on multiple GPUs, I use this command:

python -m torch.distributed.launch --nproc_per_node=2  --use_env src/gpt2_ft.py \
    --train_data ./data/e2e/train.jsonl \
    --valid_data ./data/e2e/valid.jsonl \
    --train_batch_size 8 \
    --grad_acc 1 \
    --valid_batch_size 4 \
    --seq_len 512 \
    --model_card gpt2.md \
    --init_checkpoint ./pretrained_checkpoints/gpt2-medium-pytorch_model.bin \
    --platform local \
    --clip 0.0 \
    --lr 0.0002 \
    --weight_decay 0.01 \
    --correct_bias \
    --adam_beta2 0.999 \
    --scheduler linear \
    --warmup_step 500 \
    --max_epoch 5 \
    --save_interval 1000 \
    --lora_dim 4 \
    --lora_alpha 32 \
    --lora_dropout 0.1 \
    --label_smooth 0.1 \
    --work_dir ./trained_models/GPT2_M/e2e \
    --random_seed 110

but torch reports an error: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
Is there any way to solve this problem?

@RayCyder

Change lm_net = lm_net.cuda() to lm_net = lm_net.to(args.device) in gpt2_ft.py.
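
For context, here is a minimal sketch of the device placement this fix implies, assuming --use_env (which makes torch.distributed.launch export LOCAL_RANK to each worker) and using nn.Linear as a stand-in for the actual lm_net model:

    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    # --use_env makes torch.distributed.launch export LOCAL_RANK per worker
    local_rank = int(os.environ["LOCAL_RANK"])

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)  # roughly what args.device resolves to

    model = nn.Linear(512, 512)  # stand-in for lm_net in gpt2_ft.py
    # .to(device) pins the model to this rank's GPU; a bare .cuda() resolves to the
    # current device (cuda:0 unless set_device was called first), so one rank can
    # end up mixing cuda:0 and cuda:1 tensors, which is exactly the error above
    model = model.to(device)
    model = DDP(model, device_ids=[local_rank], output_device=local_rank)

Moving the model to the per-rank device keeps every parameter on the same GPU as that rank's input batch, which is what the error message is complaining about.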
