How to train the model with double gpu? #780

kl402401 · 2024-04-12T01:49:21Z

I am trying to train the model on two GPUs, but the run fails with the error below. Why?
! CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --master_port=21 realesrgan/train.py -opt options/finetune_realesrgan_x4plus_pairdata.yml --auto_resume

train.py: error: unrecognized arguments: --local-rank=1
[2024-04-12 09:48:38,075] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 69084) of binary: /data/envs/geo_real_esrgan/bin/python
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/data/envs/geo_real_esrgan/lib/python3.11/site-packages/torch/distributed/launch.py", line 198, in
main()
File "/data/envs/geo_real_esrgan/lib/python3.11/site-packages/torch/distributed/launch.py", line 194, in main
launch(args)
File "/data/envs/geo_real_esrgan/lib/python3.11/site-packages/torch/distributed/launch.py", line 179, in launch
run(args)
File "/data/envs/geo_real_esrgan/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/data/envs/geo_real_esrgan/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/envs/geo_real_esrgan/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

realesrgan/train.py FAILED

Failures:
[1]:
time : 2024-04-12_09:48:38
host : geo517
rank : 1 (local_rank: 1)
exitcode : 2 (pid: 69085)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-04-12_09:48:38
host : geo517
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 69084)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

kl402401 · 2024-04-12T01:51:48Z

By the way, I have two GPU cards.

kl402401 · 2024-04-12T07:09:12Z

Solved. Change the launch command from

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --master_port=4321 realesrgan/train.py -opt options/finetune_realesrgan_x4plus_pairdata.yml --launcher pytorch --auto_resume

to

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=4321 realesrgan/train.py -opt options/finetune_realesrgan_x4plus_pairdata.yml --launcher pytorch --auto_resume

In other words, replace python -m torch.distributed.launch with torchrun.
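Some background on why this likely helps: the log shows `torch.distributed.launch` appending `--local-rank=1` to the script's arguments, which the `train.py` argument parser rejects. `torchrun` does not pass such a flag; it exports the rank through the `LOCAL_RANK` environment variable instead. If a training script has to work under both launchers, its parser can be made tolerant of both. The snippet below is only a minimal sketch of that idea, not the parser Real-ESRGAN actually uses; the helper name `parse_local_rank` and its `env` parameter are hypothetical.

```python
import argparse
import os


def parse_local_rank(argv=None, env=None):
    """Return the local rank from a CLI flag or the LOCAL_RANK env variable.

    Hypothetical helper for illustration; not part of Real-ESRGAN.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    # Accept both spellings of the flag: --local_rank and --local-rank.
    parser.add_argument(
        "--local_rank", "--local-rank", dest="local_rank", type=int, default=None
    )
    # parse_known_args ignores the script's other options instead of failing.
    args, _unknown = parser.parse_known_args(argv)
    if args.local_rank is not None:
        return args.local_rank
    # torchrun passes no flag; it sets LOCAL_RANK in the environment.
    return int(env.get("LOCAL_RANK", 0))
```

With this pattern a worker started by either launcher resolves its rank the same way, and a plain single-process run falls back to rank 0.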
