multiple GPU training issue #8

Open
Xiaoyang-Rebecca opened this issue Apr 22, 2022 · 1 comment

Xiaoyang-Rebecca commented Apr 22, 2022

Hi, thanks for open-sourcing the code. I tried multi-GPU training with the command below:


python train.py \
	--batchSize 8 \
	--nThreads 8 \
	--name "$exp_name" \
	--load_pretrained_g_ema "$pretrain_weight" \
	--train_image_dir "$dataset_root"/"img_512" \
	--train_image_list "$dataset_root"/"train_img_list.txt" \
	--train_image_postfix ".png" \
	--val_image_dir "$dataset_root""/img_512" \
	--val_image_list "$dataset_root"/"val_mask_list.txt" \
	--val_mask_dir "$dataset_root"/"mask_512" \
	--val_image_postfix ".png" \
	--load_size 512 \
	--crop_size 512 \
	--z_dim 512 \
	--validation_freq 10000 \
	--niter 50 \
	--dataset_mode trainimage \
	--trainer stylegan2 \
	--dataset_mode_train trainimage \
	--dataset_mode_val valimage \
	--model comod \
	--netG comodgan \
	--netD comodgan \
	--no_l1_loss \
	--no_vgg_loss \
	--preprocess_mode scale_shortside_and_crop \
	--save_epoch_freq 10 \
	--gpu_id 0,1,2,3 \
	$EXTRA

and received the following error (it does not occur with single-GPU training):

(epoch: 1, iters: 9904, time: 0.171) GAN: 1.7399 path: 0.0003 D_real: 0.4633 D_Fake: 0.6500 r1: 0.2954
(epoch: 1, iters: 10000, time: 0.215) GAN: 1.4925 path: 0.0003 D_real: 0.3935 D_Fake: 0.9652 r1: 0.2954
saving the latest model (epoch 1, total_steps 10000)
Saved current iteration count at ./checkpoints/comod-ffhq-512-4gpus/iter.txt.
doing validation
warnings.warn('Was asked to gather along dimension 0, but all '
Traceback (most recent call last):
  File "train.py", line 138, in <module>
    generated,_ = model(data_ii, mode='inference')
  File "/binaries/anaconda3/envs/torch_py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/binaries/anaconda3/envs/torch_py36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/binaries/anaconda3/envs/torch_py36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/binaries/anaconda3/envs/torch_py36/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/binaries/anaconda3/envs/torch_py36/lib/python3.6/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
TypeError: Caught TypeError in replica 3 on device 3.
Original Traceback (most recent call last):
  File "/binaries/anaconda3/envs/torch_py36/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/binaries/anaconda3/envs/torch_py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'data'

Do you know what could be the problem?

Owner

zengxianyu commented May 7, 2022

I had no problem the last time I trained on multiple GPUs. I don't currently have access to a multi-GPU machine, so I'll look into this issue later.
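
A possible explanation, not confirmed anywhere in this thread: `torch.nn.DataParallel` splits the positional input along the batch dimension with `torch.chunk`-style chunking, while a non-tensor keyword argument such as `mode='inference'` is copied to every device. If a validation batch yields fewer chunks than there are GPUs, the leftover replicas would be called with the keyword argument only, which would match `forward() missing 1 required positional argument: 'data'` on replica 3. The sketch below only reproduces the chunk arithmetic in plain Python; the helper names `num_chunks` and `replicas_without_data` are hypothetical and not part of this repository or of PyTorch, and the batch sizes are illustrative.

```python
import math


def num_chunks(batch_size: int, num_gpus: int) -> int:
    """Number of non-empty chunks when a batch of `batch_size` samples is
    split into at most `num_gpus` pieces the way torch.chunk does it."""
    if batch_size < 1 or num_gpus < 1:
        raise ValueError("batch_size and num_gpus must be positive")
    # torch.chunk uses a fixed chunk size of ceil(batch_size / num_gpus) ...
    chunk_size = math.ceil(batch_size / num_gpus)
    # ... so the number of chunks can end up smaller than num_gpus.
    return math.ceil(batch_size / chunk_size)


def replicas_without_data(batch_size: int, num_gpus: int) -> list:
    """Indices of replicas that would receive no slice of the batch
    (assumption: these are the ones called with kwargs only)."""
    return list(range(num_chunks(batch_size, num_gpus), num_gpus))


if __name__ == "__main__":
    # Illustrative batch sizes on 4 GPUs, as in --gpu_id 0,1,2,3.
    for batch_size in (8, 6, 5, 3):
        print(batch_size, replicas_without_data(batch_size, 4))
```

If this is indeed the cause, the validation batches that fail would be the ones that cannot be split four ways, for example an incomplete last batch. Untested workarounds would be to drop incomplete batches in the validation loader (`drop_last=True` in `torch.utils.data.DataLoader`) or to run validation on a single GPU.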
