I am training the model on two GPUs, but the launch fails with the error below. Why?
! CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --master_port=21 realesrgan/train.py -opt options/finetune_realesrgan_x4plus_pairdata.yml --auto_resume
train.py: error: unrecognized arguments: --local-rank=1
[2024-04-12 09:48:38,075] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 69084) of binary: /data/envs/geo_real_esrgan/bin/python
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/data/envs/geo_real_esrgan/lib/python3.11/site-packages/torch/distributed/launch.py", line 198, in <module>
main()
File "/data/envs/geo_real_esrgan/lib/python3.11/site-packages/torch/distributed/launch.py", line 194, in main
launch(args)
File "/data/envs/geo_real_esrgan/lib/python3.11/site-packages/torch/distributed/launch.py", line 179, in launch
run(args)
File "/data/envs/geo_real_esrgan/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/data/envs/geo_real_esrgan/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/envs/geo_real_esrgan/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
realesrgan/train.py FAILED
Failures:
[1]:
time : 2024-04-12_09:48:38
host : geo517
rank : 1 (local_rank: 1)
exitcode : 2 (pid: 69085)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2024-04-12_09:48:38
host : geo517
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 69084)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
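For context, the first line of the log (`unrecognized arguments: --local-rank=1`) suggests the launcher passes the rank with a hyphen, while the script's argument parser may only declare the underscore form `--local_rank`. Recent PyTorch releases use the hyphenated spelling in `torch.distributed.launch`. Below is a minimal sketch of a parser that accepts both spellings. The function names `build_parser` and `parse_args` are hypothetical, and the option set is assumed from the command above rather than copied from the actual train.py:

```python
import argparse
import os


def build_parser():
    """Hypothetical parser mirroring the options used in the command above."""
    parser = argparse.ArgumentParser()
    parser.add_argument('-opt', type=str, required=True,
                        help='Path to the option YAML file.')
    parser.add_argument('--auto_resume', action='store_true')
    # Accept both spellings: older launchers pass --local_rank,
    # newer ones pass --local-rank. Both are stored in args.local_rank.
    # Fall back to the LOCAL_RANK environment variable if neither is given.
    parser.add_argument('--local_rank', '--local-rank', dest='local_rank',
                        type=int,
                        default=int(os.environ.get('LOCAL_RANK', 0)))
    return parser


def parse_args(argv=None):
    return build_parser().parse_args(argv)


if __name__ == '__main__':
    # Simulates what the launcher appends for the second process.
    demo = parse_args(['-opt', 'options/finetune_realesrgan_x4plus_pairdata.yml',
                       '--auto_resume', '--local-rank=1'])
    print(demo.local_rank)
```

Whether this applies here depends on how the installed training code defines its arguments, so it is a guess to check against, not a confirmed diagnosis.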