NCCL error #41

Open
johnnytam100 opened this issue Mar 28, 2023 · 0 comments

Hi DiffuSeq! I am trying to run `bash train.sh` but got stuck on an NCCL error:

############################## size of vocab 30522
### Creating model and diffusion...
### The parameter count is 91225274
### Saving the hyperparameters to diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20230328-22:28:17/training_args.json
### Training...
(the parameter-count, hyperparameter-saving and "### Training..." lines are printed once by each of the four workers)
wandb: Tracking run with wandb version 0.14.0
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
Each worker then fails with the same traceback:

Traceback (most recent call last):
  File "train.py", line 115, in <module>
    main()
  File "train.py", line 92, in main
    TrainLoop(
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/train_util.py", line 88, in __init__
    self._load_and_sync_parameters()
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/train_util.py", line 141, in _load_and_sync_parameters
    dist_util.sync_params(self.model.parameters())
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/diffuseq/utils/dist_util.py", line 70, in sync_params
    dist.broadcast(p, 0)
  File "/home/cltam/anaconda3/envs/diffuseq/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1076, in broadcast
    work = default_pg.broadcast([tensor], opts)
**RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).**
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/cltam/test/test_DiffuSeq/DiffuSeq/wandb/offline-run-20230328_222828-w9r0x0wf
wandb: Find logs at: ./wandb/offline-run-20230328_222828-w9r0x0wf/logs
INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
/home/cltam/anaconda3/envs/diffuseq/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.00020599365234375 seconds
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "92557", "role": "default", "hostname": "cltam-System-Product-Name", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 1, "group_rank": 0, "worker_id": "92558", "role": "default", "hostname": "cltam-System-Product-Name", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [1], \"role_rank\": [1], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 2, "group_rank": 0, "worker_id": "92559", "role": "default", "hostname": "cltam-System-Product-Name", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [2], \"role_rank\": [2], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 3, "group_rank": 0, "worker_id": "92564", "role": "default", "hostname": "cltam-System-Product-Name", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [3], \"role_rank\": [3], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "cltam-System-Product-Name", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}}

Any idea how to troubleshoot this? Thank you!
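
In case it helps, one common cause of ncclInvalidUsage on the very first collective is several ranks ending up on the same CUDA device, e.g. when the launcher starts more workers than there are visible GPUs; that is only my guess, the log does not confirm it. This is the quick environment check I plan to run next in the same conda env (a minimal sketch, nothing DiffuSeq-specific):

```python
# Quick environment check: the torchelastic output above shows four workers,
# so at least four GPUs should be visible if each rank gets its own device.
import os
import torch

print("torch version        :", torch.__version__)
print("CUDA available       :", torch.cuda.is_available())
print("visible GPU count    :", torch.cuda.device_count())
print("CUDA_VISIBLE_DEVICES :", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))
if torch.cuda.is_available():
    print("NCCL version         :", torch.cuda.nccl.version())
```

Re-running the same command with `NCCL_DEBUG=INFO bash train.sh` should also make NCCL print more detail about which call it rejects.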
