NCCL error #41

Open
johnnytam100 opened this issue Mar 28, 2023 · 0 comments

Hi DiffuSeq! I am trying to run `bash train.sh` but got stuck on an NCCL error:

############################## size of vocab 30522
### Creating model and diffusion...
### The parameter count is 91225274
### Saving the hyperparameters to diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20230328-22:28:17/training_args.json
### Training...
(the parameter-count, hyperparameter-saving and "### Training..." lines are printed once by each of the four workers)
wandb: Tracking run with wandb version 0.14.0
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
Each worker then fails with the same traceback:

Traceback (most recent call last):
  File "train.py", line 115, in <module>
    main()
  File "train.py", line 92, in main
    TrainLoop(
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/train_util.py", line 88, in __init__
    self._load_and_sync_parameters()
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/train_util.py", line 141, in _load_and_sync_parameters
    dist_util.sync_params(self.model.parameters())
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/diffuseq/utils/dist_util.py", line 70, in sync_params
    dist.broadcast(p, 0)
  File "/home/cltam/anaconda3/envs/diffuseq/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1076, in broadcast
    work = default_pg.broadcast([tensor], opts)
**RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).**
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/cltam/test/test_DiffuSeq/DiffuSeq/wandb/offline-run-20230328_222828-w9r0x0wf
wandb: Find logs at: ./wandb/offline-run-20230328_222828-w9r0x0wf/logs
INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
/home/cltam/anaconda3/envs/diffuseq/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.00020599365234375 seconds
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "92557", "role": "default", "hostname": "cltam-System-Product-Name", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 1, "group_rank": 0, "worker_id": "92558", "role": "default", "hostname": "cltam-System-Product-Name", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [1], \"role_rank\": [1], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 2, "group_rank": 0, "worker_id": "92559", "role": "default", "hostname": "cltam-System-Product-Name", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [2], \"role_rank\": [2], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 3, "group_rank": 0, "worker_id": "92564", "role": "default", "hostname": "cltam-System-Product-Name", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [3], \"role_rank\": [3], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "cltam-System-Product-Name", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}}

Any idea how to troubleshoot this? Thank you!
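
In case it helps, one common cause of ncclInvalidUsage on the very first collective is several ranks ending up on the same CUDA device, e.g. when the launcher starts more workers than there are visible GPUs; that is only my guess, the log does not confirm it. This is the quick environment check I plan to run next in the same conda env (a minimal sketch, nothing DiffuSeq-specific):

```python
# Quick environment check: the torchelastic output above shows four workers,
# so at least four GPUs should be visible if each rank gets its own device.
import os
import torch

print("torch version        :", torch.__version__)
print("CUDA available       :", torch.cuda.is_available())
print("visible GPU count    :", torch.cuda.device_count())
print("CUDA_VISIBLE_DEVICES :", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))
if torch.cuda.is_available():
    print("NCCL version         :", torch.cuda.nccl.version())
```

Re-running the same command with `NCCL_DEBUG=INFO bash train.sh` should also make NCCL print more detail about which call it rejects.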
