Hi DiffuSeq! I am trying to run `bash train.sh` but got stuck with an NCCL error:
(Each of the four ranks prints the same startup lines and the same traceback; duplicates are collapsed below.)

```
############################## size of vocab 30522
### Creating model and diffusion...
### The parameter count is 91225274
### Saving the hyperparameters to diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20230328-22:28:17/training_args.json
### Training...
wandb: Tracking run with wandb version 0.14.0
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
Traceback (most recent call last):
  File "train.py", line 115, in <module>
    main()
  File "train.py", line 92, in main
    TrainLoop(
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/train_util.py", line 88, in __init__
    self._load_and_sync_parameters()
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/train_util.py", line 141, in _load_and_sync_parameters
    dist_util.sync_params(self.model.parameters())
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/diffuseq/utils/dist_util.py", line 70, in sync_params
    dist.broadcast(p, 0)
  File "/home/cltam/anaconda3/envs/diffuseq/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1076, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/cltam/test/test_DiffuSeq/DiffuSeq/wandb/offline-run-20230328_222828-w9r0x0wf
wandb: Find logs at: ./wandb/offline-run-20230328_222828-w9r0x0wf/logs
INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
/home/cltam/anaconda3/envs/diffuseq/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.00020599365234375 seconds
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "92557", "role": "default", "hostname": "cltam-System-Product-Name", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 1, "group_rank": 0, "worker_id": "92558", "role": "default", "hostname": "cltam-System-Product-Name", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [1], \"role_rank\": [1], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 2, "group_rank": 0, "worker_id": "92559", "role": "default", "hostname": "cltam-System-Product-Name", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [2], \"role_rank\": [2], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 3, "group_rank": 0, "worker_id": "92564", "role": "default", "hostname": "cltam-System-Product-Name", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [3], \"role_rank\": [3], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "cltam-System-Product-Name", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}}
```
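To isolate whether `torch.distributed` itself works outside DiffuSeq, here is a minimal single-process broadcast sanity check (a sketch: the function name and port are mine; `gloo` is used so it also runs on a CPU-only box, and on the failing machine you would pass `backend="nccl"` instead):

```python
import os
import torch
import torch.distributed as dist

def broadcast_sanity_check(backend="gloo", port="29500"):
    """Init a world-size-1 process group and broadcast one tensor.

    If this fails with backend="nccl" on the training machine, the
    problem is in the NCCL/driver setup, not in DiffuSeq's code.
    """
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", port)
    dist.init_process_group(backend, rank=0, world_size=1)
    t = torch.arange(4, dtype=torch.float32)
    dist.broadcast(t, src=0)  # trivial with one rank, but exercises the group
    dist.destroy_process_group()
    return t.tolist()
```

Running `broadcast_sanity_check(backend="nccl")` on each GPU machine narrows down whether the failure reproduces without any DiffuSeq code involved.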
Any idea how to troubleshoot this? Thank you!
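For context, one common cause of `ncclInvalidUsage` on the very first `dist.broadcast` is that every rank ends up driving the same GPU (NCCL then reports a duplicate-GPU invalid usage). A hedged sketch of the usual guard; the helper name `init_distributed` is mine, not DiffuSeq's, and `LOCAL_RANK` is the variable exported by `torchrun`/`torch.distributed.launch`:

```python
import os
import torch
import torch.distributed as dist

def init_distributed():
    """Pin this worker to its own GPU *before* the first collective.

    If all four ranks use cuda:0, the first dist.broadcast can fail
    with ncclInvalidUsage. Falls back to gloo on CPU-only machines so
    the sketch stays runnable anywhere.
    """
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)  # one GPU per rank
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    # env:// init reads MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE
    dist.init_process_group(backend)
    return local_rank
```

It may be worth checking whether DiffuSeq's `dist_util.py` performs an equivalent `torch.cuda.set_device` per rank before `sync_params` is called.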