Multi-node SFT hangs at this point when fine-tuning Llama 3 8B #3534

Open
1 task done
gongye19 opened this issue May 1, 2024 · 1 comment
Labels
pending This problem is yet to be addressed.

Comments

gongye19 commented May 1, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

[default0]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default1]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default4]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default4]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default0]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default1]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default2]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default7]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default3]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default6]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default5]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default5]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default7]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default2]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default3]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default6]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default0]:[2024-05-01 10:27:12,374] [INFO] [comm.py:637:init_distributed] cdb=None
[default1]:[2024-05-01 10:27:12,371] [INFO] [comm.py:637:init_distributed] cdb=None
[default4]:[2024-05-01 10:27:12,373] [INFO] [comm.py:637:init_distributed] cdb=None
[default4]:[2024-05-01 10:27:12,373] [INFO] [comm.py:637:init_distributed] cdb=None
[default0]:[2024-05-01 10:27:12,372] [INFO] [comm.py:637:init_distributed] cdb=None
[default0]:[2024-05-01 10:27:12,372] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[default2]:[2024-05-01 10:27:12,372] [INFO] [comm.py:637:init_distributed] cdb=None
[default1]:[2024-05-01 10:27:12,372] [INFO] [comm.py:637:init_distributed] cdb=None
[default5]:[2024-05-01 10:27:12,373] [INFO] [comm.py:637:init_distributed] cdb=None
[default3]:[2024-05-01 10:27:12,372] [INFO] [comm.py:637:init_distributed] cdb=None
[default7]:[2024-05-01 10:27:12,373] [INFO] [comm.py:637:init_distributed] cdb=None
[default6]:[2024-05-01 10:27:12,372] [INFO] [comm.py:637:init_distributed] cdb=None
[default7]:[2024-05-01 10:27:12,372] [INFO] [comm.py:637:init_distributed] cdb=None
[default5]:[2024-05-01 10:27:12,372] [INFO] [comm.py:637:init_distributed] cdb=None
[default2]:[2024-05-01 10:27:12,372] [INFO] [comm.py:637:init_distributed] cdb=None
[default3]:[2024-05-01 10:27:12,371] [INFO] [comm.py:637:init_distributed] cdb=None
[default6]:[2024-05-01 10:27:12,371] [INFO] [comm.py:637:init_distributed] cdb=None
[default3]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default5]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 5, device: cuda:5, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default7]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 7, device: cuda:7, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default0]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default1]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default4]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 4, device: cuda:4, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default0]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default1]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default2]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default6]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 6, device: cuda:6, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default0]:[INFO|tokenization_utils_base.py:2044] 2024-05-01 10:27:12,770 >> loading file tokenizer.json
[default0]:[INFO|tokenization_utils_base.py:2044] 2024-05-01 10:27:12,770 >> loading file added_tokens.json
[default0]:[INFO|tokenization_utils_base.py:2044] 2024-05-01 10:27:12,770 >> loading file special_tokens_map.json
[default0]:[INFO|tokenization_utils_base.py:2044] 2024-05-01 10:27:12,770 >> loading file tokenizer_config.json
[default7]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 7, device: cuda:7, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default5]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 5, device: cuda:5, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default0]:[INFO|tokenization_utils_base.py:2044] 2024-05-01 10:27:12,810 >> loading file tokenizer.json
[default0]:[INFO|tokenization_utils_base.py:2044] 2024-05-01 10:27:12,810 >> loading file added_tokens.json
[default0]:[INFO|tokenization_utils_base.py:2044] 2024-05-01 10:27:12,810 >> loading file special_tokens_map.json
[default0]:[INFO|tokenization_utils_base.py:2044] 2024-05-01 10:27:12,810 >> loading file tokenizer_config.json
[default2]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default3]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default4]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 4, device: cuda:4, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default6]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 6, device: cuda:6, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default0]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default0]:05/01/2024 10:27:13 - INFO - llmtuner.data.loader - Loading dataset /platform_tech/zhuhan/Datasets/stage2/stg2_train_uni_format_4w.json...
[default0]:05/01/2024 10:27:13 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
[default3]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default5]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default3]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default5]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default7]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default7]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default0]:[WARNING|logging.py:314] 2024-05-01 10:27:13,048 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default0]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default0]:05/01/2024 10:27:13 - INFO - llmtuner.data.loader - Loading dataset /platform_tech/zhuhan/Datasets/stage2/stg2_train_uni_format_4w.json...
[default0]:05/01/2024 10:27:13 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
[default4]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default1]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default1]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default4]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default1]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default2]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default6]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default4]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default6]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default5]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default2]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default1]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default7]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default2]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default4]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default7]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default5]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default0]:[WARNING|logging.py:314] 2024-05-01 10:27:13,127 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default2]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default3]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default6]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default3]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default6]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default0]:dgx046:1564834:1564834 [0] NCCL INFO cudaDriverVersion 12020
[default4]:dgx046:1564838:1564838 [4] NCCL INFO cudaDriverVersion 12020
[default1]:dgx046:1564835:1564835 [1] NCCL INFO cudaDriverVersion 12020
[default4]:dgx045:217074:217074 [4] NCCL INFO cudaDriverVersion 12020
[default0]:dgx045:217070:217070 [0] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.45<0>
[default0]:dgx045:217070:217070 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default0]:dgx045:217070:217070 [0] NCCL INFO cudaDriverVersion 12020
[default0]:NCCL version 2.14.3+cuda11.8
[default1]:dgx045:217071:217071 [1] NCCL INFO cudaDriverVersion 12020
[default2]:dgx045:217072:217072 [2] NCCL INFO cudaDriverVersion 12020
[default5]:dgx045:217075:217075 [5] NCCL INFO cudaDriverVersion 12020
[default3]:dgx045:217073:217073 [3] NCCL INFO cudaDriverVersion 12020
[default6]:dgx045:217076:217076 [6] NCCL INFO cudaDriverVersion 12020
[default7]:dgx045:217077:217077 [7] NCCL INFO cudaDriverVersion 12020
[default7]:dgx046:1564845:1564845 [7] NCCL INFO cudaDriverVersion 12020
[default5]:dgx046:1564840:1564840 [5] NCCL INFO cudaDriverVersion 12020
[default2]:dgx046:1564836:1564836 [2] NCCL INFO cudaDriverVersion 12020
[default3]:dgx046:1564837:1564837 [3] NCCL INFO cudaDriverVersion 12020
[default6]:dgx046:1564841:1564841 [6] NCCL INFO cudaDriverVersion 12020
[default4]:dgx045:217074:217074 [4] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.45<0>
[default4]:dgx045:217074:217074 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default0]:dgx045:217070:217667 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.45<0>
[default0]:dgx045:217070:217667 [0] NCCL INFO Using network IB
[default1]:dgx045:217071:217071 [1] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.45<0>
[default1]:dgx045:217071:217071 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default2]:dgx045:217072:217072 [2] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.45<0>
[default2]:dgx045:217072:217072 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default3]:dgx045:217073:217073 [3] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.45<0>
[default5]:dgx045:217075:217075 [5] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.45<0>
[default3]:dgx045:217073:217073 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default5]:dgx045:217075:217075 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default7]:dgx045:217077:217077 [7] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.45<0>
[default6]:dgx045:217076:217076 [6] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.45<0>
[default7]:dgx045:217077:217077 [7] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default6]:dgx045:217076:217076 [6] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default0]:dgx046:1564834:1564834 [0] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.46<0>
[default0]:dgx046:1564834:1564834 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default1]:dgx046:1564835:1564835 [1] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.46<0>
[default4]:dgx046:1564838:1564838 [4] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.46<0>
[default1]:dgx046:1564835:1564835 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default4]:dgx046:1564838:1564838 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default4]:dgx045:217074:217684 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.45<0>
[default4]:dgx045:217074:217684 [4] NCCL INFO Using network IB
[default1]:dgx045:217071:217690 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.45<0>
[default1]:dgx045:217071:217690 [1] NCCL INFO Using network IB
[default2]:dgx045:217072:217681 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.45<0>
[default2]:dgx045:217072:217681 [2] NCCL INFO Using network IB
[default3]:dgx045:217073:217689 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.45<0>
[default3]:dgx045:217073:217689 [3] NCCL INFO Using network IB
[default5]:dgx045:217075:217682 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.45<0>
[default5]:dgx045:217075:217682 [5] NCCL INFO Using network IB
[default7]:dgx045:217077:217680 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.45<0>
[default7]:dgx045:217077:217680 [7] NCCL INFO Using network IB
[default6]:dgx045:217076:217685 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.45<0>
[default6]:dgx045:217076:217685 [6] NCCL INFO Using network IB
[default7]:dgx046:1564845:1564845 [7] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.46<0>
[default7]:dgx046:1564845:1564845 [7] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default7]:dgx046:1564845:1565723 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.46<0>
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Using network IB
[default5]:dgx046:1564840:1564840 [5] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.46<0>
[default5]:dgx046:1564840:1564840 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default5]:dgx046:1564840:1565724 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.46<0>
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Using network IB
[default2]:dgx046:1564836:1564836 [2] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.46<0>
[default2]:dgx046:1564836:1564836 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default2]:dgx046:1564836:1565727 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.46<0>
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Using network IB
[default3]:dgx046:1564837:1564837 [3] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.46<0>
[default3]:dgx046:1564837:1564837 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default3]:dgx046:1564837:1565729 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.46<0>
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Using network IB
[default6]:dgx046:1564841:1564841 [6] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.46<0>
[default6]:dgx046:1564841:1564841 [6] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default0]:dgx046:1564834:1565726 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.46<0>
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Using network IB
[default1]:dgx046:1564835:1565728 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.46<0>
[default4]:dgx046:1564838:1565725 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.46<0>
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Using network IB
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Using network IB
[default6]:dgx046:1564841:1565722 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.46<0>
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Using network IB
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Setting affinity for GPU 0 to ffff0000,00000000,00000000,00000000,ffff0000,00000000
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Setting affinity for GPU 1 to ffff0000,00000000,00000000,00000000,ffff0000,00000000
[default0]:dgx045:217070:217667 [0] NCCL INFO Setting affinity for GPU 0 to ffff0000,00000000,00000000,00000000,ffff0000,00000000
[default1]:dgx045:217071:217690 [1] NCCL INFO Setting affinity for GPU 1 to ffff0000,00000000,00000000,00000000,ffff0000,00000000
[default2]:dgx045:217072:217681 [2] NCCL INFO Setting affinity for GPU 2 to ffff0000,00000000,00000000,00000000,ffff0000
[default3]:dgx045:217073:217689 [3] NCCL INFO Setting affinity for GPU 3 to ffff0000,00000000,00000000,00000000,ffff0000
[default7]:dgx045:217077:217680 [7] NCCL INFO Setting affinity for GPU 7 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000
[default6]:dgx045:217076:217685 [6] NCCL INFO Setting affinity for GPU 6 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Setting affinity for GPU 7 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Setting affinity for GPU 2 to ffff0000,00000000,00000000,00000000,ffff0000
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Setting affinity for GPU 3 to ffff0000,00000000,00000000,00000000,ffff0000
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Setting affinity for GPU 6 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Trees [0] 9/-1/-1->8->0 [1] 9/-1/-1->8->15 [2] 9/0/-1->8->-1 [3] 9/-1/-1->8->15
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 00/0 : 7[bd000] -> 8[7000] [receive] via NET/IB/0
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 02/0 : 7[bd000] -> 8[7000] [receive] via NET/IB/0
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 00/0 : 8[7000] -> 9[f000] via P2P/IPC/read
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 02/0 : 8[7000] -> 9[f000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Trees [0] 13/-1/-1->12->11 [1] 13/-1/-1->12->11 [2] 13/-1/-1->12->11 [3] 13/-1/-1->12->11
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Trees [0] 10/-1/-1->9->8 [1] -1/-1/-1->9->8 [2] 10/-1/-1->9->8 [3] -1/-1/-1->9->8
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 00/0 : 12[87000] -> 13[90000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 02/0 : 12[87000] -> 13[90000] via P2P/IPC/read
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Channel 00/0 : 9[f000] -> 10[47000] via P2P/IPC/read
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Channel 02/0 : 9[f000] -> 10[47000] via P2P/IPC/read
[default4]:dgx045:217074:217684 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 00/0 : 4[87000] -> 5[90000] via P2P/IPC/read
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 02/0 : 4[87000] -> 5[90000] via P2P/IPC/read
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 00/04 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 01/04 : 0 7 6 5 1 3 4 10 8 15 14 13 9 11 12 2
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 02/04 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 03/04 : 0 7 6 5 1 3 4 10 8 15 14 13 9 11 12 2
[default0]:dgx045:217070:217667 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] 1/-1/-1->0->7 [2] 1/-1/-1->0->8 [3] 1/-1/-1->0->7
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 00/0 : 15[bd000] -> 0[7000] [receive] via NET/IB/0
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 02/0 : 15[bd000] -> 0[7000] [receive] via NET/IB/0
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 00/0 : 0[7000] -> 1[f000] via P2P/IPC/read
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 02/0 : 0[7000] -> 1[f000] via P2P/IPC/read
[default1]:dgx045:217071:217690 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] -1/-1/-1->1->0
[default1]:dgx045:217071:217690 [1] NCCL INFO Channel 00/0 : 1[f000] -> 2[47000] via P2P/IPC/read
[default1]:dgx045:217071:217690 [1] NCCL INFO Channel 02/0 : 1[f000] -> 2[47000] via P2P/IPC/read
[default2]:dgx045:217072:217681 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/10/-1->2->-1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->10
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 00/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 02/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
[default3]:dgx045:217073:217689 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2
[default3]:dgx045:217073:217689 [3] NCCL INFO Channel 00/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
[default3]:dgx045:217073:217689 [3] NCCL INFO Channel 01/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
[default3]:dgx045:217073:217689 [3] NCCL INFO Channel 02/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
[default3]:dgx045:217073:217689 [3] NCCL INFO Channel 03/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 00/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 02/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
[default7]:dgx045:217077:217680 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] 0/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] 0/-1/-1->7->6
[default7]:dgx045:217077:217680 [7] NCCL INFO Channel 00/0 : 7[bd000] -> 8[7000] [send] via NET/IB/8
[default7]:dgx045:217077:217680 [7] NCCL INFO Channel 02/0 : 7[bd000] -> 8[7000] [send] via NET/IB/8
[default6]:dgx045:217076:217685 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5
[default6]:dgx045:217076:217685 [6] NCCL INFO Channel 00/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
[default6]:dgx045:217076:217685 [6] NCCL INFO Channel 02/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Trees [0] 14/-1/-1->13->12 [1] 14/-1/-1->13->12 [2] 14/-1/-1->13->12 [3] 14/-1/-1->13->12
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 00/0 : 13[90000] -> 14[b7000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 02/0 : 13[90000] -> 14[b7000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 01/0 : 13[90000] -> 9[f000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 03/0 : 13[90000] -> 9[f000] via P2P/IPC/read
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Trees [0] -1/-1/-1->15->14 [1] 8/-1/-1->15->14 [2] -1/-1/-1->15->14 [3] 8/-1/-1->15->14
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Channel 00/0 : 15[bd000] -> 0[7000] [send] via NET/IB/8
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Channel 02/0 : 15[bd000] -> 0[7000] [send] via NET/IB/8
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Trees [0] 11/-1/-1->10->9 [1] 11/-1/-1->10->2 [2] 11/-1/-1->10->9 [3] 11/2/-1->10->-1
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 00/0 : 10[47000] -> 11[4e000] via P2P/IPC/read
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 02/0 : 10[47000] -> 11[4e000] via P2P/IPC/read
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Trees [0] 12/-1/-1->11->10 [1] 12/-1/-1->11->10 [2] 12/-1/-1->11->10 [3] 12/-1/-1->11->10
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Channel 00/0 : 11[4e000] -> 12[87000] via P2P/IPC/read
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Channel 01/0 : 11[4e000] -> 12[87000] via P2P/IPC/read
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Channel 02/0 : 11[4e000] -> 12[87000] via P2P/IPC/read
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Channel 03/0 : 11[4e000] -> 12[87000] via P2P/IPC/read
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Trees [0] 15/-1/-1->14->13 [1] 15/-1/-1->14->13 [2] 15/-1/-1->14->13 [3] 15/-1/-1->14->13
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Channel 00/0 : 14[b7000] -> 15[bd000] via P2P/IPC/read
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Channel 02/0 : 14[b7000] -> 15[bd000] via P2P/IPC/read
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 01/0 : 8[7000] -> 15[bd000] via P2P/IPC/read
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 03/0 : 8[7000] -> 15[bd000] via P2P/IPC/read
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Connected all rings
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 01/0 : 8[7000] -> 9[f000] via P2P/IPC/read
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 03/0 : 8[7000] -> 9[f000] via P2P/IPC/read
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 00/0 : 0[7000] -> 8[7000] [receive] via NET/IB/4
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 02/0 : 0[7000] -> 8[7000] [receive] via NET/IB/4
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 00/0 : 8[7000] -> 0[7000] [send] via NET/IB/4
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 02/0 : 8[7000] -> 0[7000] [send] via NET/IB/4
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 01/0 : 12[87000] -> 2[47000] [send] via NET/IB/5
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Channel 01/0 : 9[f000] -> 11[4e000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 03/0 : 12[87000] -> 2[47000] [send] via NET/IB/5
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Connected all rings
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Channel 03/0 : 9[f000] -> 11[4e000] via P2P/IPC/read
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Connected all rings
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Channel 00/0 : 9[f000] -> 8[7000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 01/0 : 12[87000] -> 13[90000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 03/0 : 12[87000] -> 13[90000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 00/0 : 12[87000] -> 11[4e000] via P2P/IPC/read
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Channel 01/0 : 9[f000] -> 8[7000] via P2P/IPC/read
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Channel 02/0 : 9[f000] -> 8[7000] via P2P/IPC/read
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Channel 03/0 : 9[f000] -> 8[7000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 01/0 : 12[87000] -> 11[4e000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 02/0 : 12[87000] -> 11[4e000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 03/0 : 12[87000] -> 11[4e000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Connected all trees
[default4]:dgx046:1564838:1565725 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default4]:dgx046:1564838:1565725 [4] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 01/0 : 4[87000] -> 10[47000] [send] via NET/IB/5
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 03/0 : 4[87000] -> 10[47000] [send] via NET/IB/5
[default4]:dgx045:217074:217684 [4] NCCL INFO Connected all rings
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 01/0 : 4[87000] -> 5[90000] via P2P/IPC/read
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 03/0 : 4[87000] -> 5[90000] via P2P/IPC/read
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 00/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 01/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 02/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 03/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
[default4]:dgx045:217074:217684 [4] NCCL INFO Connected all trees
[default4]:dgx045:217074:217684 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default4]:dgx045:217074:217684 [4] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 01/0 : 0[7000] -> 7[bd000] via P2P/IPC/read
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 03/0 : 0[7000] -> 7[bd000] via P2P/IPC/read
[default0]:dgx045:217070:217667 [0] NCCL INFO Connected all rings
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 01/0 : 0[7000] -> 1[f000] via P2P/IPC/read
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 03/0 : 0[7000] -> 1[f000] via P2P/IPC/read
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 00/0 : 8[7000] -> 0[7000] [receive] via NET/IB/4
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 02/0 : 8[7000] -> 0[7000] [receive] via NET/IB/4
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 00/0 : 0[7000] -> 8[7000] [send] via NET/IB/4
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 02/0 : 0[7000] -> 8[7000] [send] via NET/IB/4
[default1]:dgx045:217071:217690 [1] NCCL INFO Channel 01/0 : 1[f000] -> 3[4e000] via P2P/IPC/read
[default1]:dgx045:217071:217690 [1] NCCL INFO Channel 03/0 : 1[f000] -> 3[4e000] via P2P/IPC/read
[default1]:dgx045:217071:217690 [1] NCCL INFO Connected all rings
[default1]:dgx045:217071:217690 [1] NCCL INFO Channel 00/0 : 1[f000] -> 0[7000] via P2P/IPC/read
[default1]:dgx045:217071:217690 [1] NCCL INFO Channel 01/0 : 1[f000] -> 0[7000] via P2P/IPC/read
[default1]:dgx045:217071:217690 [1] NCCL INFO Channel 02/0 : 1[f000] -> 0[7000] via P2P/IPC/read
[default1]:dgx045:217071:217690 [1] NCCL INFO Channel 03/0 : 1[f000] -> 0[7000] via P2P/IPC/read
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 01/0 : 12[87000] -> 2[47000] [receive] via NET/IB/2
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 03/0 : 12[87000] -> 2[47000] [receive] via NET/IB/2
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 01/0 : 2[47000] -> 0[7000] via P2P/IPC/read
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 03/0 : 2[47000] -> 0[7000] via P2P/IPC/read
[default2]:dgx045:217072:217681 [2] NCCL INFO Connected all rings
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 01/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 03/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
[default3]:dgx045:217073:217689 [3] NCCL INFO Connected all rings
[default3]:dgx045:217073:217689 [3] NCCL INFO Channel 00/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
[default3]:dgx045:217073:217689 [3] NCCL INFO Channel 01/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
[default3]:dgx045:217073:217689 [3] NCCL INFO Channel 02/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
[default3]:dgx045:217073:217689 [3] NCCL INFO Channel 03/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
[default7]:dgx045:217077:217680 [7] NCCL INFO Channel 01/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
[default6]:dgx045:217076:217685 [6] NCCL INFO Channel 01/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
[default6]:dgx045:217076:217685 [6] NCCL INFO Channel 03/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
[default7]:dgx045:217077:217680 [7] NCCL INFO Channel 03/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
[default7]:dgx045:217077:217680 [7] NCCL INFO Connected all rings
[default7]:dgx045:217077:217680 [7] NCCL INFO Channel 01/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
[default6]:dgx045:217076:217685 [6] NCCL INFO Connected all rings
[default7]:dgx045:217077:217680 [7] NCCL INFO Channel 03/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
[default6]:dgx045:217076:217685 [6] NCCL INFO Channel 01/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 01/0 : 5[90000] -> 1[f000] via P2P/IPC/read
[default6]:dgx045:217076:217685 [6] NCCL INFO Channel 03/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
[default6]:dgx045:217076:217685 [6] NCCL INFO Channel 00/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 03/0 : 5[90000] -> 1[f000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Connected all rings
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 01/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
[default6]:dgx045:217076:217685 [6] NCCL INFO Channel 02/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 03/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 00/0 : 5[90000] -> 4[87000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 01/0 : 5[90000] -> 4[87000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 02/0 : 5[90000] -> 4[87000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 03/0 : 5[90000] -> 4[87000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Connected all trees
[default5]:dgx045:217075:217682 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default5]:dgx045:217075:217682 [5] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Connected all rings
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Channel 01/0 : 15[bd000] -> 14[b7000] via P2P/IPC/read
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Channel 03/0 : 15[bd000] -> 14[b7000] via P2P/IPC/read
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Connected all rings
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 01/0 : 13[90000] -> 14[b7000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 03/0 : 13[90000] -> 14[b7000] via P2P/IPC/read
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Channel 01/0 : 15[bd000] -> 8[7000] via P2P/IPC/read
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Channel 03/0 : 15[bd000] -> 8[7000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 00/0 : 13[90000] -> 12[87000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 01/0 : 13[90000] -> 12[87000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 02/0 : 13[90000] -> 12[87000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 03/0 : 13[90000] -> 12[87000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Connected all trees
[default5]:dgx046:1564840:1565724 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default5]:dgx046:1564840:1565724 [5] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 01/0 : 4[87000] -> 10[47000] [receive] via NET/IB/2
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 03/0 : 4[87000] -> 10[47000] [receive] via NET/IB/2
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 01/0 : 10[47000] -> 8[7000] via P2P/IPC/read
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 03/0 : 10[47000] -> 8[7000] via P2P/IPC/read
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Connected all rings
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 01/0 : 10[47000] -> 11[4e000] via P2P/IPC/read
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 03/0 : 10[47000] -> 11[4e000] via P2P/IPC/read
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 01/0 : 2[47000] -> 10[47000] [receive] via NET/IB/9
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 03/0 : 2[47000] -> 10[47000] [receive] via NET/IB/9
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 01/0 : 10[47000] -> 2[47000] [send] via NET/IB/9
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 03/0 : 10[47000] -> 2[47000] [send] via NET/IB/9
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Connected all rings
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Channel 00/0 : 11[4e000] -> 10[47000] via P2P/IPC/read
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Channel 01/0 : 11[4e000] -> 10[47000] via P2P/IPC/read
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Channel 02/0 : 11[4e000] -> 10[47000] via P2P/IPC/read
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Channel 03/0 : 11[4e000] -> 10[47000] via P2P/IPC/read
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Channel 01/0 : 14[b7000] -> 13[90000] via P2P/IPC/read
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Channel 03/0 : 14[b7000] -> 13[90000] via P2P/IPC/read
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Connected all rings
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Channel 01/0 : 14[b7000] -> 15[bd000] via P2P/IPC/read
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Channel 03/0 : 14[b7000] -> 15[bd000] via P2P/IPC/read
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Channel 00/0 : 14[b7000] -> 13[90000] via P2P/IPC/read
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Channel 02/0 : 14[b7000] -> 13[90000] via P2P/IPC/read
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Connected all trees
[default0]:dgx046:1564834:1565726 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default0]:dgx046:1564834:1565726 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Connected all trees
[default1]:dgx046:1564835:1565728 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default1]:dgx046:1564835:1565728 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default0]:dgx045:217070:217667 [0] NCCL INFO Connected all trees
[default0]:dgx045:217070:217667 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default0]:dgx045:217070:217667 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default1]:dgx045:217071:217690 [1] NCCL INFO Connected all trees
[default1]:dgx045:217071:217690 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default1]:dgx045:217071:217690 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 01/0 : 10[47000] -> 2[47000] [receive] via NET/IB/9
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 03/0 : 10[47000] -> 2[47000] [receive] via NET/IB/9
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 01/0 : 2[47000] -> 10[47000] [send] via NET/IB/9
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 03/0 : 2[47000] -> 10[47000] [send] via NET/IB/9
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 00/0 : 2[47000] -> 1[f000] via P2P/IPC/read
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 02/0 : 2[47000] -> 1[f000] via P2P/IPC/read
[default2]:dgx045:217072:217681 [2] NCCL INFO Connected all trees
[default2]:dgx045:217072:217681 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default2]:dgx045:217072:217681 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default3]:dgx045:217073:217689 [3] NCCL INFO Connected all trees
[default3]:dgx045:217073:217689 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default3]:dgx045:217073:217689 [3] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default7]:dgx045:217077:217680 [7] NCCL INFO Channel 00/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
[default7]:dgx045:217077:217680 [7] NCCL INFO Channel 02/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
[default7]:dgx045:217077:217680 [7] NCCL INFO Connected all trees
[default7]:dgx045:217077:217680 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default7]:dgx045:217077:217680 [7] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default6]:dgx045:217076:217685 [6] NCCL INFO Connected all trees
[default6]:dgx045:217076:217685 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default6]:dgx045:217076:217685 [6] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Channel 00/0 : 15[bd000] -> 14[b7000] via P2P/IPC/read
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Channel 02/0 : 15[bd000] -> 14[b7000] via P2P/IPC/read
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Connected all trees
[default7]:dgx046:1564845:1565723 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default7]:dgx046:1564845:1565723 [7] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default7]:dgx046:1564845:1565723 [7] NCCL INFO comm 0x1361d4f0 rank 15 nranks 16 cudaDev 7 busId bd000 - Init COMPLETE
[default5]:dgx046:1564840:1565724 [5] NCCL INFO comm 0x13626f30 rank 13 nranks 16 cudaDev 5 busId 90000 - Init COMPLETE
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 00/0 : 10[47000] -> 9[f000] via P2P/IPC/read
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 02/0 : 10[47000] -> 9[f000] via P2P/IPC/read
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Connected all trees
[default2]:dgx046:1564836:1565727 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default2]:dgx046:1564836:1565727 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default2]:dgx046:1564836:1565727 [2] NCCL INFO comm 0x12c889b0 rank 10 nranks 16 cudaDev 2 busId 47000 - Init COMPLETE
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Connected all trees
[default3]:dgx046:1564837:1565729 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default3]:dgx046:1564837:1565729 [3] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default3]:dgx046:1564837:1565729 [3] NCCL INFO comm 0x13bfe5e0 rank 11 nranks 16 cudaDev 3 busId 4e000 - Init COMPLETE
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Connected all trees
[default6]:dgx046:1564841:1565722 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default6]:dgx046:1564841:1565722 [6] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default6]:dgx046:1564841:1565722 [6] NCCL INFO comm 0x136ab020 rank 14 nranks 16 cudaDev 6 busId b7000 - Init COMPLETE
[default0]:dgx046:1564834:1565726 [0] NCCL INFO comm 0x13327b50 rank 8 nranks 16 cudaDev 0 busId 7000 - Init COMPLETE
[default1]:dgx046:1564835:1565728 [1] NCCL INFO comm 0x142c6490 rank 9 nranks 16 cudaDev 1 busId f000 - Init COMPLETE
[default4]:dgx046:1564838:1565725 [4] NCCL INFO comm 0x13a85850 rank 12 nranks 16 cudaDev 4 busId 87000 - Init COMPLETE
[default4]:dgx045:217074:217684 [4] NCCL INFO comm 0x129fd1b0 rank 4 nranks 16 cudaDev 4 busId 87000 - Init COMPLETE
[default0]:dgx045:217070:217667 [0] NCCL INFO comm 0x1300fd70 rank 0 nranks 16 cudaDev 0 busId 7000 - Init COMPLETE
[default1]:dgx045:217071:217690 [1] NCCL INFO comm 0x12854cb0 rank 1 nranks 16 cudaDev 1 busId f000 - Init COMPLETE
[default2]:dgx045:217072:217681 [2] NCCL INFO comm 0x12e8f850 rank 2 nranks 16 cudaDev 2 busId 47000 - Init COMPLETE
[default3]:dgx045:217073:217689 [3] NCCL INFO comm 0x1273c450 rank 3 nranks 16 cudaDev 3 busId 4e000 - Init COMPLETE
[default7]:dgx045:217077:217680 [7] NCCL INFO comm 0x144e2870 rank 7 nranks 16 cudaDev 7 busId bd000 - Init COMPLETE
[default5]:dgx045:217075:217682 [5] NCCL INFO comm 0x12ea1b00 rank 5 nranks 16 cudaDev 5 busId 90000 - Init COMPLETE
[default6]:dgx045:217076:217685 [6] NCCL INFO comm 0x1420d660 rank 6 nranks 16 cudaDev 6 busId b7000 - Init COMPLETE
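With a log this long, it is easy to miss whether every rank actually finished NCCL initialization. The following is a minimal sketch, assuming the log line format shown above; the helper name `missing_ranks` is hypothetical and not part of any library. It collects the ranks that logged `Init COMPLETE` and reports the ones that did not.

```python
import re

# Matches lines such as:
# [default7]:dgx046:1564845:1565723 [7] NCCL INFO comm 0x1361d4f0 rank 15 nranks 16 cudaDev 7 busId bd000 - Init COMPLETE
INIT_DONE = re.compile(r"NCCL INFO comm \S+ rank (\d+) nranks (\d+) .*Init COMPLETE")


def missing_ranks(log_text):
    """Return the sorted ranks that never logged 'Init COMPLETE' (hypothetical helper)."""
    done = set()
    nranks = 0
    for line in log_text.splitlines():
        m = INIT_DONE.search(line)
        if m:
            done.add(int(m.group(1)))
            # nranks is taken from the log itself; assumes a single communicator size
            nranks = max(nranks, int(m.group(2)))
    return sorted(set(range(nranks)) - done)
```

In the log above, ranks 0 through 15 all report `Init COMPLETE`, so this check should return an empty list. That would suggest the stall happens after NCCL communicator setup rather than during it.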
image

Expected behavior

No response

System Info

No response

Others

No response

hiyouga added the "pending" label (This problem is yet to be addressed.) on May 1, 2024

xujunrt commented May 7, 2024

I've run into a similar problem. Has this been resolved?
