
RunTimeError related with NCCL when training librimix recipe #683

Open
garcesote opened this issue Nov 6, 2023 · 3 comments

Labels
question Further information is requested

Comments

@garcesote

Hi,

I'm trying to train the librimix recipe, and every time I try to use my GPU for training I get the following error:

RuntimeError("Distributed package doesn't have NCCL " "built in")

torch.cuda.current_device() returns GPU 0 in my Python session, but when I launch training on that GPU like this:

./run.sh --stage 2 --id 0

it raises the RuntimeError above.

Is it necessary to have NCCL on my system to train the example, or am I simply making a mistake somewhere in the training process?
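
For what it's worth, PyTorch has helpers to report which distributed backends a given build was compiled with; a minimal check like this (just a sketch, nothing recipe-specific) should show whether NCCL is actually built into my install:

import torch
import torch.distributed as dist

# Report which distributed backends this PyTorch build supports.
# Windows wheels typically ship without NCCL but with gloo.
print("CUDA available:", torch.cuda.is_available())
print("dist available:", dist.is_available())
print("NCCL built in: ", dist.is_nccl_available())
print("gloo built in: ", dist.is_gloo_available())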

Here is my complete output, in case it helps anyone diagnose the problem:

Results from the following experiment will be stored in exp/train_convtasnet_4a19572d
Stage 2: Training
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Trainer(limit_train_batches=1.0) was configured so 100% of the batches per epoch will be used.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [cie-dpt-71969.dyc.a.unavarra.es]:53168 (system error: 10049 - La dirección solicitada no es válida en este contexto.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [cie-dpt-71969.dyc.a.unavarra.es]:53168 (system error: 10049 - La dirección solicitada no es válida en este contexto.).
{'data': {'n_src': 2,
          'sample_rate': 8000,
          'segment': 3,
          'task': 'sep_clean',
          'train_dir': 'data/wav8k/min/metadata/train-360',
          'valid_dir': 'data/wav8k/min/metadata/dev'},
 'filterbank': {'kernel_size': 16, 'n_filters': 512, 'stride': 8},
 'main_args': {'exp_dir': 'exp/train_convtasnet_4a19572d', 'help': None},
 'masknet': {'bn_chan': 128,
             'hid_chan': 512,
             'mask_act': 'relu',
             'n_blocks': 8,
             'n_repeats': 3,
             'skip_chan': 128},
 'optim': {'lr': 0.001, 'optimizer': 'adam', 'weight_decay': 0.0},
 'positional arguments': {},
 'training': {'batch_size': 24,
              'early_stop': True,
              'epochs': 200,
              'half_lr': True,
              'num_workers': 4}}
Drop 0 utterances from 50800 (shorter than 3 seconds)
Drop 0 utterances from 3000 (shorter than 3 seconds)
Traceback (most recent call last):
  File "C:\Users\jaulab\Desktop\SourceSeparation\asteroid\egs\librimix\ConvTasNet\train.py", line 143, in <module>
    main(arg_dic)
  File "C:\Users\jaulab\Desktop\SourceSeparation\asteroid\egs\librimix\ConvTasNet\train.py", line 109, in main
    trainer.fit(system)
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\trainer\trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\trainer\call.py", line 42, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\strategies\launchers\subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\trainer\trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\trainer\trainer.py", line 938, in _run
    self.strategy.setup_environment()
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\strategies\ddp.py", line 143, in setup_environment
    self.setup_distributed()
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\strategies\ddp.py", line 191, in setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\lightning_fabric\utilities\distributed.py", line 258, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\torch\distributed\distributed_c10d.py", line 907, in init_process_group
    default_pg = _new_process_group_helper(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1013, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in

Thank you in advance.

garcesote added the question label on Nov 6, 2023
@mpariente (Collaborator)

Have you looked for this bug somewhere else? It doesn't seem to be related to Asteroid.

@garcesote (Author)

https://discuss.pytorch.org/t/runtimeerror-distributed-package-doesnt-have-nccl-built-in/176744

Reading that thread, it seems that the recipe tries to use NCCL when training on the GPU. However, I'm training the model on Windows, where NCCL is not available. Any ideas on how I can solve this? Do I have to try another OS, or is there a way to train without NCCL?
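
One workaround that might be worth trying (untested on my side, and assuming the recipe's train.py is where the Trainer is constructed) is to tell Lightning's DDP strategy to initialize torch.distributed with the gloo backend, which is built into the Windows wheels, instead of NCCL:

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

# Force torch.distributed to use the gloo backend (available on Windows)
# instead of the default NCCL backend. The other Trainer arguments here
# are illustrative, not the recipe's actual settings.
trainer = Trainer(
    accelerator="gpu",
    devices=1,
    strategy=DDPStrategy(process_group_backend="gloo"),
)

Alternatively, since only a single GPU is involved, not requesting a distributed strategy at all should sidestep the torch.distributed initialization entirely.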

@mpariente (Collaborator)

I'm sorry but I have no idea.
