[default0]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [default1]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [default4]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [default4]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [default0]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [default1]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [default2]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [default7]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [default3]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [default6]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [default5]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [default5]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [default7]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [default2]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [default3]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [default6]:[2024-05-01 
10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [default0]:[2024-05-01 10:27:12,374] [INFO] [comm.py:637:init_distributed] cdb=None [default1]:[2024-05-01 10:27:12,371] [INFO] [comm.py:637:init_distributed] cdb=None [default4]:[2024-05-01 10:27:12,373] [INFO] [comm.py:637:init_distributed] cdb=None [default4]:[2024-05-01 10:27:12,373] [INFO] [comm.py:637:init_distributed] cdb=None [default0]:[2024-05-01 10:27:12,372] [INFO] [comm.py:637:init_distributed] cdb=None [default0]:[2024-05-01 10:27:12,372] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl [default2]:[2024-05-01 10:27:12,372] [INFO] [comm.py:637:init_distributed] cdb=None [default1]:[2024-05-01 10:27:12,372] [INFO] [comm.py:637:init_distributed] cdb=None [default5]:[2024-05-01 10:27:12,373] [INFO] [comm.py:637:init_distributed] cdb=None [default3]:[2024-05-01 10:27:12,372] [INFO] [comm.py:637:init_distributed] cdb=None [default7]:[2024-05-01 10:27:12,373] [INFO] [comm.py:637:init_distributed] cdb=None [default6]:[2024-05-01 10:27:12,372] [INFO] [comm.py:637:init_distributed] cdb=None [default7]:[2024-05-01 10:27:12,372] [INFO] [comm.py:637:init_distributed] cdb=None [default5]:[2024-05-01 10:27:12,372] [INFO] [comm.py:637:init_distributed] cdb=None [default2]:[2024-05-01 10:27:12,372] [INFO] [comm.py:637:init_distributed] cdb=None [default3]:[2024-05-01 10:27:12,371] [INFO] [comm.py:637:init_distributed] cdb=None [default6]:[2024-05-01 10:27:12,371] [INFO] [comm.py:637:init_distributed] cdb=None [default3]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16 [default5]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 5, device: cuda:5, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16 [default7]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 
7, device: cuda:7, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16 [default0]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16 [default1]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16 [default4]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 4, device: cuda:4, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16 [default0]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16 [default1]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16 [default2]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16 [default6]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 6, device: cuda:6, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16 [default0]:[INFO|tokenization_utils_base.py:2044] 2024-05-01 10:27:12,770 >> loading file tokenizer.json [default0]:[INFO|tokenization_utils_base.py:2044] 2024-05-01 10:27:12,770 >> loading file added_tokens.json [default0]:[INFO|tokenization_utils_base.py:2044] 2024-05-01 10:27:12,770 >> loading file special_tokens_map.json [default0]:[INFO|tokenization_utils_base.py:2044] 2024-05-01 10:27:12,770 >> loading file tokenizer_config.json [default7]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 7, device: cuda:7, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16 [default5]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 5, device: cuda:5, n_gpu: 1, distributed training: True, 
compute dtype: torch.bfloat16 [default0]:[INFO|tokenization_utils_base.py:2044] 2024-05-01 10:27:12,810 >> loading file tokenizer.json [default0]:[INFO|tokenization_utils_base.py:2044] 2024-05-01 10:27:12,810 >> loading file added_tokens.json [default0]:[INFO|tokenization_utils_base.py:2044] 2024-05-01 10:27:12,810 >> loading file special_tokens_map.json [default0]:[INFO|tokenization_utils_base.py:2044] 2024-05-01 10:27:12,810 >> loading file tokenizer_config.json [default2]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16 [default3]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16 [default4]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 4, device: cuda:4, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16 [default6]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 6, device: cuda:6, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16 [default0]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|> [default0]:05/01/2024 10:27:13 - INFO - llmtuner.data.loader - Loading dataset /platform_tech/zhuhan/Datasets/stage2/stg2_train_uni_format_4w.json... [default0]:05/01/2024 10:27:13 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json. [default3]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|> [default5]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|> [default3]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [default5]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
[default7]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|> [default7]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [default0]:[WARNING|logging.py:314] 2024-05-01 10:27:13,048 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [default0]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|> [default0]:05/01/2024 10:27:13 - INFO - llmtuner.data.loader - Loading dataset /platform_tech/zhuhan/Datasets/stage2/stg2_train_uni_format_4w.json... [default0]:05/01/2024 10:27:13 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json. [default4]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|> [default1]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|> [default1]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [default4]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|> [default1]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|> [default2]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|> [default6]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [default4]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [default6]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|> [default5]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
[default2]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [default1]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [default7]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|> [default2]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [default4]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [default7]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [default5]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|> [default0]:[WARNING|logging.py:314] 2024-05-01 10:27:13,127 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [default2]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|> [default3]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [default6]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
[default3]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|> [default6]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|> [default0]:dgx046:1564834:1564834 [0] NCCL INFO cudaDriverVersion 12020 [default4]:dgx046:1564838:1564838 [4] NCCL INFO cudaDriverVersion 12020 [default1]:dgx046:1564835:1564835 [1] NCCL INFO cudaDriverVersion 12020 [default4]:dgx045:217074:217074 [4] NCCL INFO cudaDriverVersion 12020 [default0]:dgx045:217070:217070 [0] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.45<0> [default0]:dgx045:217070:217070 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation [default0]:dgx045:217070:217070 [0] NCCL INFO cudaDriverVersion 12020 [default0]:NCCL version 2.14.3+cuda11.8 [default1]:dgx045:217071:217071 [1] NCCL INFO cudaDriverVersion 12020 [default2]:dgx045:217072:217072 [2] NCCL INFO cudaDriverVersion 12020 [default5]:dgx045:217075:217075 [5] NCCL INFO cudaDriverVersion 12020 [default3]:dgx045:217073:217073 [3] NCCL INFO cudaDriverVersion 12020 [default6]:dgx045:217076:217076 [6] NCCL INFO cudaDriverVersion 12020 [default7]:dgx045:217077:217077 [7] NCCL INFO cudaDriverVersion 12020 [default7]:dgx046:1564845:1564845 [7] NCCL INFO cudaDriverVersion 12020 [default5]:dgx046:1564840:1564840 [5] NCCL INFO cudaDriverVersion 12020 [default2]:dgx046:1564836:1564836 [2] NCCL INFO cudaDriverVersion 12020 [default3]:dgx046:1564837:1564837 [3] NCCL INFO cudaDriverVersion 12020 [default6]:dgx046:1564841:1564841 [6] NCCL INFO cudaDriverVersion 12020 [default4]:dgx045:217074:217074 [4] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.45<0> [default4]:dgx045:217074:217074 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation [default0]:dgx045:217070:217667 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB 
[9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.45<0> [default0]:dgx045:217070:217667 [0] NCCL INFO Using network IB [default1]:dgx045:217071:217071 [1] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.45<0> [default1]:dgx045:217071:217071 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation [default2]:dgx045:217072:217072 [2] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.45<0> [default2]:dgx045:217072:217072 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation [default3]:dgx045:217073:217073 [3] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.45<0> [default5]:dgx045:217075:217075 [5] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.45<0> [default3]:dgx045:217073:217073 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation [default5]:dgx045:217075:217075 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation [default7]:dgx045:217077:217077 [7] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.45<0> [default6]:dgx045:217076:217076 [6] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.45<0> [default7]:dgx045:217077:217077 [7] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation [default6]:dgx045:217076:217076 [6] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation [default0]:dgx046:1564834:1564834 [0] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.46<0> [default0]:dgx046:1564834:1564834 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation [default1]:dgx046:1564835:1564835 [1] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.46<0> [default4]:dgx046:1564838:1564838 [4] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.46<0> [default1]:dgx046:1564835:1564835 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation [default4]:dgx046:1564838:1564838 [4] NCCL INFO NET/Plugin : No plugin 
found (libnccl-net.so), using internal implementation [default4]:dgx045:217074:217684 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.45<0> [default4]:dgx045:217074:217684 [4] NCCL INFO Using network IB [default1]:dgx045:217071:217690 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.45<0> [default1]:dgx045:217071:217690 [1] NCCL INFO Using network IB [default2]:dgx045:217072:217681 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.45<0> [default2]:dgx045:217072:217681 [2] NCCL INFO Using network IB [default3]:dgx045:217073:217689 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.45<0> [default3]:dgx045:217073:217689 [3] NCCL INFO Using network IB [default5]:dgx045:217075:217682 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.45<0> [default5]:dgx045:217075:217682 [5] NCCL INFO Using network IB [default7]:dgx045:217077:217680 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.45<0> [default7]:dgx045:217077:217680 
[7] NCCL INFO Using network IB [default6]:dgx045:217076:217685 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.45<0> [default6]:dgx045:217076:217685 [6] NCCL INFO Using network IB [default7]:dgx046:1564845:1564845 [7] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.46<0> [default7]:dgx046:1564845:1564845 [7] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation [default7]:dgx046:1564845:1565723 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.46<0> [default7]:dgx046:1564845:1565723 [7] NCCL INFO Using network IB [default5]:dgx046:1564840:1564840 [5] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.46<0> [default5]:dgx046:1564840:1564840 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation [default5]:dgx046:1564840:1565724 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.46<0> [default5]:dgx046:1564840:1565724 [5] NCCL INFO Using network IB [default2]:dgx046:1564836:1564836 [2] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.46<0> [default2]:dgx046:1564836:1564836 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation [default2]:dgx046:1564836:1565727 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.46<0> [default2]:dgx046:1564836:1565727 [2] NCCL INFO Using 
network IB [default3]:dgx046:1564837:1564837 [3] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.46<0> [default3]:dgx046:1564837:1564837 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation [default3]:dgx046:1564837:1565729 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.46<0> [default3]:dgx046:1564837:1565729 [3] NCCL INFO Using network IB [default6]:dgx046:1564841:1564841 [6] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.46<0> [default6]:dgx046:1564841:1564841 [6] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation [default0]:dgx046:1564834:1565726 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.46<0> [default0]:dgx046:1564834:1565726 [0] NCCL INFO Using network IB [default1]:dgx046:1564835:1565728 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.46<0> [default4]:dgx046:1564838:1565725 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.46<0> [default4]:dgx046:1564838:1565725 [4] NCCL INFO Using network IB [default1]:dgx046:1564835:1565728 [1] NCCL INFO Using network IB [default6]:dgx046:1564841:1565722 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB 
[10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.46<0> [default6]:dgx046:1564841:1565722 [6] NCCL INFO Using network IB [default0]:dgx046:1564834:1565726 [0] NCCL INFO Setting affinity for GPU 0 to ffff0000,00000000,00000000,00000000,ffff0000,00000000 [default1]:dgx046:1564835:1565728 [1] NCCL INFO Setting affinity for GPU 1 to ffff0000,00000000,00000000,00000000,ffff0000,00000000 [default0]:dgx045:217070:217667 [0] NCCL INFO Setting affinity for GPU 0 to ffff0000,00000000,00000000,00000000,ffff0000,00000000 [default1]:dgx045:217071:217690 [1] NCCL INFO Setting affinity for GPU 1 to ffff0000,00000000,00000000,00000000,ffff0000,00000000 [default2]:dgx045:217072:217681 [2] NCCL INFO Setting affinity for GPU 2 to ffff0000,00000000,00000000,00000000,ffff0000 [default3]:dgx045:217073:217689 [3] NCCL INFO Setting affinity for GPU 3 to ffff0000,00000000,00000000,00000000,ffff0000 [default7]:dgx045:217077:217680 [7] NCCL INFO Setting affinity for GPU 7 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000 [default6]:dgx045:217076:217685 [6] NCCL INFO Setting affinity for GPU 6 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000 [default7]:dgx046:1564845:1565723 [7] NCCL INFO Setting affinity for GPU 7 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000 [default2]:dgx046:1564836:1565727 [2] NCCL INFO Setting affinity for GPU 2 to ffff0000,00000000,00000000,00000000,ffff0000 [default3]:dgx046:1564837:1565729 [3] NCCL INFO Setting affinity for GPU 3 to ffff0000,00000000,00000000,00000000,ffff0000 [default6]:dgx046:1564841:1565722 [6] NCCL INFO Setting affinity for GPU 6 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000 [default0]:dgx046:1564834:1565726 [0] NCCL INFO Trees [0] 9/-1/-1->8->0 [1] 9/-1/-1->8->15 [2] 9/0/-1->8->-1 [3] 9/-1/-1->8->15 [default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 00/0 : 7[bd000] -> 8[7000] [receive] via NET/IB/0 [default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 02/0 : 7[bd000] -> 
8[7000] [receive] via NET/IB/0 [default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 00/0 : 8[7000] -> 9[f000] via P2P/IPC/read [default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 02/0 : 8[7000] -> 9[f000] via P2P/IPC/read [default4]:dgx046:1564838:1565725 [4] NCCL INFO Trees [0] 13/-1/-1->12->11 [1] 13/-1/-1->12->11 [2] 13/-1/-1->12->11 [3] 13/-1/-1->12->11 [default1]:dgx046:1564835:1565728 [1] NCCL INFO Trees [0] 10/-1/-1->9->8 [1] -1/-1/-1->9->8 [2] 10/-1/-1->9->8 [3] -1/-1/-1->9->8 [default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 00/0 : 12[87000] -> 13[90000] via P2P/IPC/read [default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 02/0 : 12[87000] -> 13[90000] via P2P/IPC/read [default1]:dgx046:1564835:1565728 [1] NCCL INFO Channel 00/0 : 9[f000] -> 10[47000] via P2P/IPC/read [default1]:dgx046:1564835:1565728 [1] NCCL INFO Channel 02/0 : 9[f000] -> 10[47000] via P2P/IPC/read [default4]:dgx045:217074:217684 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [default4]:dgx045:217074:217684 [4] NCCL INFO Channel 00/0 : 4[87000] -> 5[90000] via P2P/IPC/read [default4]:dgx045:217074:217684 [4] NCCL INFO Channel 02/0 : 4[87000] -> 5[90000] via P2P/IPC/read [default0]:dgx045:217070:217667 [0] NCCL INFO Channel 00/04 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 [default0]:dgx045:217070:217667 [0] NCCL INFO Channel 01/04 : 0 7 6 5 1 3 4 10 8 15 14 13 9 11 12 2 [default0]:dgx045:217070:217667 [0] NCCL INFO Channel 02/04 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 [default0]:dgx045:217070:217667 [0] NCCL INFO Channel 03/04 : 0 7 6 5 1 3 4 10 8 15 14 13 9 11 12 2 [default0]:dgx045:217070:217667 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] 1/-1/-1->0->7 [2] 1/-1/-1->0->8 [3] 1/-1/-1->0->7 [default0]:dgx045:217070:217667 [0] NCCL INFO Channel 00/0 : 15[bd000] -> 0[7000] [receive] via NET/IB/0 [default0]:dgx045:217070:217667 [0] NCCL INFO Channel 02/0 : 15[bd000] -> 0[7000] [receive] via NET/IB/0 [default0]:dgx045:217070:217667 [0] 
NCCL INFO Channel 00/0 : 0[7000] -> 1[f000] via P2P/IPC/read [default0]:dgx045:217070:217667 [0] NCCL INFO Channel 02/0 : 0[7000] -> 1[f000] via P2P/IPC/read [default1]:dgx045:217071:217690 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] -1/-1/-1->1->0 [default1]:dgx045:217071:217690 [1] NCCL INFO Channel 00/0 : 1[f000] -> 2[47000] via P2P/IPC/read [default1]:dgx045:217071:217690 [1] NCCL INFO Channel 02/0 : 1[f000] -> 2[47000] via P2P/IPC/read [default2]:dgx045:217072:217681 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/10/-1->2->-1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->10 [default2]:dgx045:217072:217681 [2] NCCL INFO Channel 00/0 : 2[47000] -> 3[4e000] via P2P/IPC/read [default2]:dgx045:217072:217681 [2] NCCL INFO Channel 02/0 : 2[47000] -> 3[4e000] via P2P/IPC/read [default3]:dgx045:217073:217689 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [default3]:dgx045:217073:217689 [3] NCCL INFO Channel 00/0 : 3[4e000] -> 4[87000] via P2P/IPC/read [default3]:dgx045:217073:217689 [3] NCCL INFO Channel 01/0 : 3[4e000] -> 4[87000] via P2P/IPC/read [default3]:dgx045:217073:217689 [3] NCCL INFO Channel 02/0 : 3[4e000] -> 4[87000] via P2P/IPC/read [default3]:dgx045:217073:217689 [3] NCCL INFO Channel 03/0 : 3[4e000] -> 4[87000] via P2P/IPC/read [default5]:dgx045:217075:217682 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [default5]:dgx045:217075:217682 [5] NCCL INFO Channel 00/0 : 5[90000] -> 6[b7000] via P2P/IPC/read [default5]:dgx045:217075:217682 [5] NCCL INFO Channel 02/0 : 5[90000] -> 6[b7000] via P2P/IPC/read [default7]:dgx045:217077:217680 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] 0/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] 0/-1/-1->7->6 [default7]:dgx045:217077:217680 [7] NCCL INFO Channel 00/0 : 7[bd000] -> 8[7000] [send] via NET/IB/8 [default7]:dgx045:217077:217680 [7] NCCL INFO Channel 02/0 : 7[bd000] -> 8[7000] [send] via NET/IB/8 
[default6]:dgx045:217076:217685 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [default6]:dgx045:217076:217685 [6] NCCL INFO Channel 00/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read [default6]:dgx045:217076:217685 [6] NCCL INFO Channel 02/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read [default5]:dgx046:1564840:1565724 [5] NCCL INFO Trees [0] 14/-1/-1->13->12 [1] 14/-1/-1->13->12 [2] 14/-1/-1->13->12 [3] 14/-1/-1->13->12 [default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 00/0 : 13[90000] -> 14[b7000] via P2P/IPC/read [default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 02/0 : 13[90000] -> 14[b7000] via P2P/IPC/read [default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 01/0 : 13[90000] -> 9[f000] via P2P/IPC/read [default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 03/0 : 13[90000] -> 9[f000] via P2P/IPC/read [default7]:dgx046:1564845:1565723 [7] NCCL INFO Trees [0] -1/-1/-1->15->14 [1] 8/-1/-1->15->14 [2] -1/-1/-1->15->14 [3] 8/-1/-1->15->14 [default7]:dgx046:1564845:1565723 [7] NCCL INFO Channel 00/0 : 15[bd000] -> 0[7000] [send] via NET/IB/8 [default7]:dgx046:1564845:1565723 [7] NCCL INFO Channel 02/0 : 15[bd000] -> 0[7000] [send] via NET/IB/8 [default2]:dgx046:1564836:1565727 [2] NCCL INFO Trees [0] 11/-1/-1->10->9 [1] 11/-1/-1->10->2 [2] 11/-1/-1->10->9 [3] 11/2/-1->10->-1 [default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 00/0 : 10[47000] -> 11[4e000] via P2P/IPC/read [default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 02/0 : 10[47000] -> 11[4e000] via P2P/IPC/read [default3]:dgx046:1564837:1565729 [3] NCCL INFO Trees [0] 12/-1/-1->11->10 [1] 12/-1/-1->11->10 [2] 12/-1/-1->11->10 [3] 12/-1/-1->11->10 [default3]:dgx046:1564837:1565729 [3] NCCL INFO Channel 00/0 : 11[4e000] -> 12[87000] via P2P/IPC/read [default3]:dgx046:1564837:1565729 [3] NCCL INFO Channel 01/0 : 11[4e000] -> 12[87000] via P2P/IPC/read [default3]:dgx046:1564837:1565729 [3] NCCL INFO Channel 02/0 : 11[4e000] -> 12[87000] via P2P/IPC/read 
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Channel 03/0 : 11[4e000] -> 12[87000] via P2P/IPC/read
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Trees [0] 15/-1/-1->14->13 [1] 15/-1/-1->14->13 [2] 15/-1/-1->14->13 [3] 15/-1/-1->14->13
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Channel 00/0 : 14[b7000] -> 15[bd000] via P2P/IPC/read
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Channel 02/0 : 14[b7000] -> 15[bd000] via P2P/IPC/read
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 01/0 : 8[7000] -> 15[bd000] via P2P/IPC/read
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 03/0 : 8[7000] -> 15[bd000] via P2P/IPC/read
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Connected all rings
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 01/0 : 8[7000] -> 9[f000] via P2P/IPC/read
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 03/0 : 8[7000] -> 9[f000] via P2P/IPC/read
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 00/0 : 0[7000] -> 8[7000] [receive] via NET/IB/4
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 02/0 : 0[7000] -> 8[7000] [receive] via NET/IB/4
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 00/0 : 8[7000] -> 0[7000] [send] via NET/IB/4
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 02/0 : 8[7000] -> 0[7000] [send] via NET/IB/4
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 01/0 : 12[87000] -> 2[47000] [send] via NET/IB/5
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Channel 01/0 : 9[f000] -> 11[4e000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 03/0 : 12[87000] -> 2[47000] [send] via NET/IB/5
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Connected all rings
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Channel 03/0 : 9[f000] -> 11[4e000] via P2P/IPC/read
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Connected all rings
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Channel 00/0 : 9[f000] -> 8[7000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 01/0 : 12[87000] -> 13[90000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 03/0 : 12[87000] -> 13[90000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 00/0 : 12[87000] -> 11[4e000] via P2P/IPC/read
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Channel 01/0 : 9[f000] -> 8[7000] via P2P/IPC/read
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Channel 02/0 : 9[f000] -> 8[7000] via P2P/IPC/read
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Channel 03/0 : 9[f000] -> 8[7000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 01/0 : 12[87000] -> 11[4e000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 02/0 : 12[87000] -> 11[4e000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 03/0 : 12[87000] -> 11[4e000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Connected all trees
[default4]:dgx046:1564838:1565725 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default4]:dgx046:1564838:1565725 [4] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 01/0 : 4[87000] -> 10[47000] [send] via NET/IB/5
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 03/0 : 4[87000] -> 10[47000] [send] via NET/IB/5
[default4]:dgx045:217074:217684 [4] NCCL INFO Connected all rings
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 01/0 : 4[87000] -> 5[90000] via P2P/IPC/read
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 03/0 : 4[87000] -> 5[90000] via P2P/IPC/read
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 00/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 01/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 02/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 03/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
[default4]:dgx045:217074:217684 [4] NCCL INFO Connected all trees
[default4]:dgx045:217074:217684 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default4]:dgx045:217074:217684 [4] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 01/0 : 0[7000] -> 7[bd000] via P2P/IPC/read
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 03/0 : 0[7000] -> 7[bd000] via P2P/IPC/read
[default0]:dgx045:217070:217667 [0] NCCL INFO Connected all rings
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 01/0 : 0[7000] -> 1[f000] via P2P/IPC/read
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 03/0 : 0[7000] -> 1[f000] via P2P/IPC/read
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 00/0 : 8[7000] -> 0[7000] [receive] via NET/IB/4
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 02/0 : 8[7000] -> 0[7000] [receive] via NET/IB/4
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 00/0 : 0[7000] -> 8[7000] [send] via NET/IB/4
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 02/0 : 0[7000] -> 8[7000] [send] via NET/IB/4
[default1]:dgx045:217071:217690 [1] NCCL INFO Channel 01/0 : 1[f000] -> 3[4e000] via P2P/IPC/read
[default1]:dgx045:217071:217690 [1] NCCL INFO Channel 03/0 : 1[f000] -> 3[4e000] via P2P/IPC/read
[default1]:dgx045:217071:217690 [1] NCCL INFO Connected all rings
[default1]:dgx045:217071:217690 [1] NCCL INFO Channel 00/0 : 1[f000] -> 0[7000] via P2P/IPC/read
[default1]:dgx045:217071:217690 [1] NCCL INFO Channel 01/0 : 1[f000] -> 0[7000] via P2P/IPC/read
[default1]:dgx045:217071:217690 [1] NCCL INFO Channel 02/0 : 1[f000] -> 0[7000] via P2P/IPC/read
[default1]:dgx045:217071:217690 [1] NCCL INFO Channel 03/0 : 1[f000] -> 0[7000] via P2P/IPC/read
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 01/0 : 12[87000] -> 2[47000] [receive] via NET/IB/2
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 03/0 : 12[87000] -> 2[47000] [receive] via NET/IB/2
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 01/0 : 2[47000] -> 0[7000] via P2P/IPC/read
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 03/0 : 2[47000] -> 0[7000] via P2P/IPC/read
[default2]:dgx045:217072:217681 [2] NCCL INFO Connected all rings
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 01/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 03/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
[default3]:dgx045:217073:217689 [3] NCCL INFO Connected all rings
[default3]:dgx045:217073:217689 [3] NCCL INFO Channel 00/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
[default3]:dgx045:217073:217689 [3] NCCL INFO Channel 01/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
[default3]:dgx045:217073:217689 [3] NCCL INFO Channel 02/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
[default3]:dgx045:217073:217689 [3] NCCL INFO Channel 03/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
[default7]:dgx045:217077:217680 [7] NCCL INFO Channel 01/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
[default6]:dgx045:217076:217685 [6] NCCL INFO Channel 01/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
[default6]:dgx045:217076:217685 [6] NCCL INFO Channel 03/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
[default7]:dgx045:217077:217680 [7] NCCL INFO Channel 03/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
[default7]:dgx045:217077:217680 [7] NCCL INFO Connected all rings
[default7]:dgx045:217077:217680 [7] NCCL INFO Channel 01/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
[default6]:dgx045:217076:217685 [6] NCCL INFO Connected all rings
[default7]:dgx045:217077:217680 [7] NCCL INFO Channel 03/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
[default6]:dgx045:217076:217685 [6] NCCL INFO Channel 01/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 01/0 : 5[90000] -> 1[f000] via P2P/IPC/read
[default6]:dgx045:217076:217685 [6] NCCL INFO Channel 03/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
[default6]:dgx045:217076:217685 [6] NCCL INFO Channel 00/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 03/0 : 5[90000] -> 1[f000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Connected all rings
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 01/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
[default6]:dgx045:217076:217685 [6] NCCL INFO Channel 02/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 03/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 00/0 : 5[90000] -> 4[87000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 01/0 : 5[90000] -> 4[87000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 02/0 : 5[90000] -> 4[87000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 03/0 : 5[90000] -> 4[87000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Connected all trees
[default5]:dgx045:217075:217682 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default5]:dgx045:217075:217682 [5] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Connected all rings
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Channel 01/0 : 15[bd000] -> 14[b7000] via P2P/IPC/read
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Channel 03/0 : 15[bd000] -> 14[b7000] via P2P/IPC/read
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Connected all rings
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 01/0 : 13[90000] -> 14[b7000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 03/0 : 13[90000] -> 14[b7000] via P2P/IPC/read
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Channel 01/0 : 15[bd000] -> 8[7000] via P2P/IPC/read
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Channel 03/0 : 15[bd000] -> 8[7000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 00/0 : 13[90000] -> 12[87000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 01/0 : 13[90000] -> 12[87000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 02/0 : 13[90000] -> 12[87000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 03/0 : 13[90000] -> 12[87000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Connected all trees
[default5]:dgx046:1564840:1565724 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default5]:dgx046:1564840:1565724 [5] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 01/0 : 4[87000] -> 10[47000] [receive] via NET/IB/2
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 03/0 : 4[87000] -> 10[47000] [receive] via NET/IB/2
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 01/0 : 10[47000] -> 8[7000] via P2P/IPC/read
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 03/0 : 10[47000] -> 8[7000] via P2P/IPC/read
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Connected all rings
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 01/0 : 10[47000] -> 11[4e000] via P2P/IPC/read
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 03/0 : 10[47000] -> 11[4e000] via P2P/IPC/read
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 01/0 : 2[47000] -> 10[47000] [receive] via NET/IB/9
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 03/0 : 2[47000] -> 10[47000] [receive] via NET/IB/9
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 01/0 : 10[47000] -> 2[47000] [send] via NET/IB/9
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 03/0 : 10[47000] -> 2[47000] [send] via NET/IB/9
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Connected all rings
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Channel 00/0 : 11[4e000] -> 10[47000] via P2P/IPC/read
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Channel 01/0 : 11[4e000] -> 10[47000] via P2P/IPC/read
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Channel 02/0 : 11[4e000] -> 10[47000] via P2P/IPC/read
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Channel 03/0 : 11[4e000] -> 10[47000] via P2P/IPC/read
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Channel 01/0 : 14[b7000] -> 13[90000] via P2P/IPC/read
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Channel 03/0 : 14[b7000] -> 13[90000] via P2P/IPC/read
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Connected all rings
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Channel 01/0 : 14[b7000] -> 15[bd000] via P2P/IPC/read
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Channel 03/0 : 14[b7000] -> 15[bd000] via P2P/IPC/read
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Channel 00/0 : 14[b7000] -> 13[90000] via P2P/IPC/read
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Channel 02/0 : 14[b7000] -> 13[90000] via P2P/IPC/read
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Connected all trees
[default0]:dgx046:1564834:1565726 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default0]:dgx046:1564834:1565726 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Connected all trees
[default1]:dgx046:1564835:1565728 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default1]:dgx046:1564835:1565728 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default0]:dgx045:217070:217667 [0] NCCL INFO Connected all trees
[default0]:dgx045:217070:217667 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default0]:dgx045:217070:217667 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default1]:dgx045:217071:217690 [1] NCCL INFO Connected all trees
[default1]:dgx045:217071:217690 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default1]:dgx045:217071:217690 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 01/0 : 10[47000] -> 2[47000] [receive] via NET/IB/9
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 03/0 : 10[47000] -> 2[47000] [receive] via NET/IB/9
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 01/0 : 2[47000] -> 10[47000] [send] via NET/IB/9
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 03/0 : 2[47000] -> 10[47000] [send] via NET/IB/9
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 00/0 : 2[47000] -> 1[f000] via P2P/IPC/read
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 02/0 : 2[47000] -> 1[f000] via P2P/IPC/read
[default2]:dgx045:217072:217681 [2] NCCL INFO Connected all trees
[default2]:dgx045:217072:217681 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default2]:dgx045:217072:217681 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default3]:dgx045:217073:217689 [3] NCCL INFO Connected all trees
[default3]:dgx045:217073:217689 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default3]:dgx045:217073:217689 [3] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default7]:dgx045:217077:217680 [7] NCCL INFO Channel 00/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
[default7]:dgx045:217077:217680 [7] NCCL INFO Channel 02/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
[default7]:dgx045:217077:217680 [7] NCCL INFO Connected all trees
[default7]:dgx045:217077:217680 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default7]:dgx045:217077:217680 [7] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default6]:dgx045:217076:217685 [6] NCCL INFO Connected all trees
[default6]:dgx045:217076:217685 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default6]:dgx045:217076:217685 [6] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Channel 00/0 : 15[bd000] -> 14[b7000] via P2P/IPC/read
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Channel 02/0 : 15[bd000] -> 14[b7000] via P2P/IPC/read
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Connected all trees
[default7]:dgx046:1564845:1565723 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default7]:dgx046:1564845:1565723 [7] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default7]:dgx046:1564845:1565723 [7] NCCL INFO comm 0x1361d4f0 rank 15 nranks 16 cudaDev 7 busId bd000 - Init COMPLETE
[default5]:dgx046:1564840:1565724 [5] NCCL INFO comm 0x13626f30 rank 13 nranks 16 cudaDev 5 busId 90000 - Init COMPLETE
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 00/0 : 10[47000] -> 9[f000] via P2P/IPC/read
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 02/0 : 10[47000] -> 9[f000] via P2P/IPC/read
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Connected all trees
[default2]:dgx046:1564836:1565727 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default2]:dgx046:1564836:1565727 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default2]:dgx046:1564836:1565727 [2] NCCL INFO comm 0x12c889b0 rank 10 nranks 16 cudaDev 2 busId 47000 - Init COMPLETE
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Connected all trees
[default3]:dgx046:1564837:1565729 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default3]:dgx046:1564837:1565729 [3] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default3]:dgx046:1564837:1565729 [3] NCCL INFO comm 0x13bfe5e0 rank 11 nranks 16 cudaDev 3 busId 4e000 - Init COMPLETE
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Connected all trees
[default6]:dgx046:1564841:1565722 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default6]:dgx046:1564841:1565722 [6] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default6]:dgx046:1564841:1565722 [6] NCCL INFO comm 0x136ab020 rank 14 nranks 16 cudaDev 6 busId b7000 - Init COMPLETE
[default0]:dgx046:1564834:1565726 [0] NCCL INFO comm 0x13327b50 rank 8 nranks 16 cudaDev 0 busId 7000 - Init COMPLETE
[default1]:dgx046:1564835:1565728 [1] NCCL INFO comm 0x142c6490 rank 9 nranks 16 cudaDev 1 busId f000 - Init COMPLETE
[default4]:dgx046:1564838:1565725 [4] NCCL INFO comm 0x13a85850 rank 12 nranks 16 cudaDev 4 busId 87000 - Init COMPLETE
[default4]:dgx045:217074:217684 [4] NCCL INFO comm 0x129fd1b0 rank 4 nranks 16 cudaDev 4 busId 87000 - Init COMPLETE
[default0]:dgx045:217070:217667 [0] NCCL INFO comm 0x1300fd70 rank 0 nranks 16 cudaDev 0 busId 7000 - Init COMPLETE
[default1]:dgx045:217071:217690 [1] NCCL INFO comm 0x12854cb0 rank 1 nranks 16 cudaDev 1 busId f000 - Init COMPLETE
[default2]:dgx045:217072:217681 [2] NCCL INFO comm 0x12e8f850 rank 2 nranks 16 cudaDev 2 busId 47000 - Init COMPLETE
[default3]:dgx045:217073:217689 [3] NCCL INFO comm 0x1273c450 rank 3 nranks 16 cudaDev 3 busId 4e000 - Init COMPLETE
[default7]:dgx045:217077:217680 [7] NCCL INFO comm 0x144e2870 rank 7 nranks 16 cudaDev 7 busId bd000 - Init COMPLETE
[default5]:dgx045:217075:217682 [5] NCCL INFO comm 0x12ea1b00 rank 5 nranks 16 cudaDev 5 busId 90000 - Init COMPLETE
[default6]:dgx045:217076:217685 [6] NCCL INFO comm 0x1420d660 rank 6 nranks 16 cudaDev 6 busId b7000 - Init COMPLETE
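In the excerpt above, ranks 0 through 15 each log `Init COMPLETE` with `nranks 16`, so this communicator finished initializing on every rank. For longer logs, that check can be automated. The following is a minimal sketch, assuming the log lines follow the `comm ... rank R nranks N ... - Init COMPLETE` shape shown above; `missing_ranks` is a hypothetical helper written for illustration, not part of NCCL, DeepSpeed, or llmtuner.

```python
import re

# Matches records such as:
#   NCCL INFO comm 0x1361d4f0 rank 15 nranks 16 cudaDev 7 busId bd000 - Init COMPLETE
# \s+ is used between tokens so a record wrapped across a line break still matches.
INIT_DONE = re.compile(
    r"NCCL\s+INFO\s+comm\s+0x[0-9a-f]+\s+rank\s+(\d+)\s+nranks\s+(\d+)"
    r"\s+cudaDev\s+\d+\s+busId\s+[0-9a-f]+\s+-\s+Init\s+COMPLETE"
)


def missing_ranks(log_text):
    """Return the sorted list of ranks that never logged 'Init COMPLETE'.

    The expected rank count is taken from the largest 'nranks' value seen,
    so an empty log yields an empty list rather than an error.
    """
    done = set()
    nranks = 0
    for match in INIT_DONE.finditer(log_text):
        done.add(int(match.group(1)))
        nranks = max(nranks, int(match.group(2)))
    return sorted(set(range(nranks)) - done)


if __name__ == "__main__":
    # Two records copied from the log above; a real run would read the full log file.
    sample = (
        "[default7]:dgx046:1564845:1565723 [7] NCCL INFO comm 0x1361d4f0 "
        "rank 15 nranks 16 cudaDev 7 busId bd000 - Init COMPLETE\n"
        "[default5]:dgx046:1564840:1565724 [5] NCCL INFO comm 0x13626f30 "
        "rank 13 nranks 16 cudaDev 5 busId 90000 - Init COMPLETE\n"
    )
    print(missing_ranks(sample))
```

An empty result means every rank reported completion for the communicator in question. If a job creates several communicators, the lines would need to be grouped per communicator first, which this sketch does not attempt.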
No response
I ran into a similar problem. Has it been resolved?
Reminder
Reproduction
[default0]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default1]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default4]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default4]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default0]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default1]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default2]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default7]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default3]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default6]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default5]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default5]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default7]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default2]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default3]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default6]:[2024-05-01 10:26:58,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[default0]:[2024-05-01 10:27:12,374] [INFO] [comm.py:637:init_distributed] cdb=None
[default1]:[2024-05-01 10:27:12,371] [INFO] [comm.py:637:init_distributed] cdb=None
[default4]:[2024-05-01 10:27:12,373] [INFO] [comm.py:637:init_distributed] cdb=None
[default4]:[2024-05-01 10:27:12,373] [INFO] [comm.py:637:init_distributed] cdb=None
[default0]:[2024-05-01 10:27:12,372] [INFO] [comm.py:637:init_distributed] cdb=None
[default0]:[2024-05-01 10:27:12,372] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[default2]:[2024-05-01 10:27:12,372] [INFO] [comm.py:637:init_distributed] cdb=None
[default1]:[2024-05-01 10:27:12,372] [INFO] [comm.py:637:init_distributed] cdb=None
[default5]:[2024-05-01 10:27:12,373] [INFO] [comm.py:637:init_distributed] cdb=None
[default3]:[2024-05-01 10:27:12,372] [INFO] [comm.py:637:init_distributed] cdb=None
[default7]:[2024-05-01 10:27:12,373] [INFO] [comm.py:637:init_distributed] cdb=None
[default6]:[2024-05-01 10:27:12,372] [INFO] [comm.py:637:init_distributed] cdb=None
[default7]:[2024-05-01 10:27:12,372] [INFO] [comm.py:637:init_distributed] cdb=None
[default5]:[2024-05-01 10:27:12,372] [INFO] [comm.py:637:init_distributed] cdb=None
[default2]:[2024-05-01 10:27:12,372] [INFO] [comm.py:637:init_distributed] cdb=None
[default3]:[2024-05-01 10:27:12,371] [INFO] [comm.py:637:init_distributed] cdb=None
[default6]:[2024-05-01 10:27:12,371] [INFO] [comm.py:637:init_distributed] cdb=None
[default3]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default5]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 5, device: cuda:5, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default7]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 7, device: cuda:7, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default0]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default1]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default4]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 4, device: cuda:4, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default0]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default1]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default2]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default6]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 6, device: cuda:6, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default0]:[INFO|tokenization_utils_base.py:2044] 2024-05-01 10:27:12,770 >> loading file tokenizer.json
[default0]:[INFO|tokenization_utils_base.py:2044] 2024-05-01 10:27:12,770 >> loading file added_tokens.json
[default0]:[INFO|tokenization_utils_base.py:2044] 2024-05-01 10:27:12,770 >> loading file special_tokens_map.json
[default0]:[INFO|tokenization_utils_base.py:2044] 2024-05-01 10:27:12,770 >> loading file tokenizer_config.json
[default7]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 7, device: cuda:7, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default5]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 5, device: cuda:5, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default0]:[INFO|tokenization_utils_base.py:2044] 2024-05-01 10:27:12,810 >> loading file tokenizer.json
[default0]:[INFO|tokenization_utils_base.py:2044] 2024-05-01 10:27:12,810 >> loading file added_tokens.json
[default0]:[INFO|tokenization_utils_base.py:2044] 2024-05-01 10:27:12,810 >> loading file special_tokens_map.json
[default0]:[INFO|tokenization_utils_base.py:2044] 2024-05-01 10:27:12,810 >> loading file tokenizer_config.json
[default2]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default3]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default4]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 4, device: cuda:4, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default6]:05/01/2024 10:27:12 - INFO - llmtuner.hparams.parser - Process rank: 6, device: cuda:6, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[default0]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default0]:05/01/2024 10:27:13 - INFO - llmtuner.data.loader - Loading dataset /platform_tech/zhuhan/Datasets/stage2/stg2_train_uni_format_4w.json...
[default0]:05/01/2024 10:27:13 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
[default3]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default5]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default3]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default5]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default7]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default7]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default0]:[WARNING|logging.py:314] 2024-05-01 10:27:13,048 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default0]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default0]:05/01/2024 10:27:13 - INFO - llmtuner.data.loader - Loading dataset /platform_tech/zhuhan/Datasets/stage2/stg2_train_uni_format_4w.json...
[default0]:05/01/2024 10:27:13 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
[default4]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default1]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default1]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default4]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default1]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default2]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default6]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default4]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default6]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default5]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default2]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default1]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default7]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default2]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default4]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default7]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default5]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default0]:[WARNING|logging.py:314] 2024-05-01 10:27:13,127 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default2]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default3]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default6]:Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[default3]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default6]:05/01/2024 10:27:13 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
[default0]:dgx046:1564834:1564834 [0] NCCL INFO cudaDriverVersion 12020
[default4]:dgx046:1564838:1564838 [4] NCCL INFO cudaDriverVersion 12020
[default1]:dgx046:1564835:1564835 [1] NCCL INFO cudaDriverVersion 12020
[default4]:dgx045:217074:217074 [4] NCCL INFO cudaDriverVersion 12020
[default0]:dgx045:217070:217070 [0] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.45<0>
[default0]:dgx045:217070:217070 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default0]:dgx045:217070:217070 [0] NCCL INFO cudaDriverVersion 12020
[default0]:NCCL version 2.14.3+cuda11.8
[default1]:dgx045:217071:217071 [1] NCCL INFO cudaDriverVersion 12020
[default2]:dgx045:217072:217072 [2] NCCL INFO cudaDriverVersion 12020
[default5]:dgx045:217075:217075 [5] NCCL INFO cudaDriverVersion 12020
[default3]:dgx045:217073:217073 [3] NCCL INFO cudaDriverVersion 12020
[default6]:dgx045:217076:217076 [6] NCCL INFO cudaDriverVersion 12020
[default7]:dgx045:217077:217077 [7] NCCL INFO cudaDriverVersion 12020
[default7]:dgx046:1564845:1564845 [7] NCCL INFO cudaDriverVersion 12020
[default5]:dgx046:1564840:1564840 [5] NCCL INFO cudaDriverVersion 12020
[default2]:dgx046:1564836:1564836 [2] NCCL INFO cudaDriverVersion 12020
[default3]:dgx046:1564837:1564837 [3] NCCL INFO cudaDriverVersion 12020
[default6]:dgx046:1564841:1564841 [6] NCCL INFO cudaDriverVersion 12020
[default4]:dgx045:217074:217074 [4] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.45<0>
[default4]:dgx045:217074:217074 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default0]:dgx045:217070:217667 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.45<0>
[default0]:dgx045:217070:217667 [0] NCCL INFO Using network IB
[default1]:dgx045:217071:217071 [1] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.45<0>
[default1]:dgx045:217071:217071 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default2]:dgx045:217072:217072 [2] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.45<0>
[default2]:dgx045:217072:217072 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default3]:dgx045:217073:217073 [3] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.45<0>
[default5]:dgx045:217075:217075 [5] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.45<0>
[default3]:dgx045:217073:217073 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default5]:dgx045:217075:217075 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default7]:dgx045:217077:217077 [7] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.45<0>
[default6]:dgx045:217076:217076 [6] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.45<0>
[default7]:dgx045:217077:217077 [7] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default6]:dgx045:217076:217076 [6] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default0]:dgx046:1564834:1564834 [0] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.46<0>
[default0]:dgx046:1564834:1564834 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default1]:dgx046:1564835:1564835 [1] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.46<0>
[default4]:dgx046:1564838:1564838 [4] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.46<0>
[default1]:dgx046:1564835:1564835 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default4]:dgx046:1564838:1564838 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default4]:dgx045:217074:217684 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.45<0>
[default4]:dgx045:217074:217684 [4] NCCL INFO Using network IB
[default1]:dgx045:217071:217690 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.45<0>
[default1]:dgx045:217071:217690 [1] NCCL INFO Using network IB
[default2]:dgx045:217072:217681 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.45<0>
[default2]:dgx045:217072:217681 [2] NCCL INFO Using network IB
[default3]:dgx045:217073:217689 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.45<0>
[default3]:dgx045:217073:217689 [3] NCCL INFO Using network IB
[default5]:dgx045:217075:217682 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.45<0>
[default5]:dgx045:217075:217682 [5] NCCL INFO Using network IB
[default7]:dgx045:217077:217680 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.45<0>
[default7]:dgx045:217077:217680 [7] NCCL INFO Using network IB
[default6]:dgx045:217076:217685 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.45<0>
[default6]:dgx045:217076:217685 [6] NCCL INFO Using network IB
[default7]:dgx046:1564845:1564845 [7] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.46<0>
[default7]:dgx046:1564845:1564845 [7] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default7]:dgx046:1564845:1565723 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.46<0>
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Using network IB
[default5]:dgx046:1564840:1564840 [5] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.46<0>
[default5]:dgx046:1564840:1564840 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default5]:dgx046:1564840:1565724 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.46<0>
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Using network IB
[default2]:dgx046:1564836:1564836 [2] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.46<0>
[default2]:dgx046:1564836:1564836 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default2]:dgx046:1564836:1565727 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.46<0>
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Using network IB
[default3]:dgx046:1564837:1564837 [3] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.46<0>
[default3]:dgx046:1564837:1564837 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default3]:dgx046:1564837:1565729 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.46<0>
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Using network IB
[default6]:dgx046:1564841:1564841 [6] NCCL INFO Bootstrap : Using ibp97s0f0:10.10.10.46<0>
[default6]:dgx046:1564841:1564841 [6] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[default0]:dgx046:1564834:1565726 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.46<0>
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Using network IB
[default1]:dgx046:1564835:1565728 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.46<0>
[default4]:dgx046:1564838:1565725 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.46<0>
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Using network IB
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Using network IB
[default6]:dgx046:1564841:1565722 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_6:1/IB [6]mlx5_7:1/IB [7]mlx5_8:1/IB [8]mlx5_9:1/IB [9]mlx5_10:1/IB [10]mlx5_11:1/RoCE [RO]; OOB ibp97s0f0:10.10.10.46<0>
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Using network IB
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Setting affinity for GPU 0 to ffff0000,00000000,00000000,00000000,ffff0000,00000000
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Setting affinity for GPU 1 to ffff0000,00000000,00000000,00000000,ffff0000,00000000
[default0]:dgx045:217070:217667 [0] NCCL INFO Setting affinity for GPU 0 to ffff0000,00000000,00000000,00000000,ffff0000,00000000
[default1]:dgx045:217071:217690 [1] NCCL INFO Setting affinity for GPU 1 to ffff0000,00000000,00000000,00000000,ffff0000,00000000
[default2]:dgx045:217072:217681 [2] NCCL INFO Setting affinity for GPU 2 to ffff0000,00000000,00000000,00000000,ffff0000
[default3]:dgx045:217073:217689 [3] NCCL INFO Setting affinity for GPU 3 to ffff0000,00000000,00000000,00000000,ffff0000
[default7]:dgx045:217077:217680 [7] NCCL INFO Setting affinity for GPU 7 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000
[default6]:dgx045:217076:217685 [6] NCCL INFO Setting affinity for GPU 6 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Setting affinity for GPU 7 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Setting affinity for GPU 2 to ffff0000,00000000,00000000,00000000,ffff0000
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Setting affinity for GPU 3 to ffff0000,00000000,00000000,00000000,ffff0000
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Setting affinity for GPU 6 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Trees [0] 9/-1/-1->8->0 [1] 9/-1/-1->8->15 [2] 9/0/-1->8->-1 [3] 9/-1/-1->8->15
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 00/0 : 7[bd000] -> 8[7000] [receive] via NET/IB/0
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 02/0 : 7[bd000] -> 8[7000] [receive] via NET/IB/0
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 00/0 : 8[7000] -> 9[f000] via P2P/IPC/read
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 02/0 : 8[7000] -> 9[f000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Trees [0] 13/-1/-1->12->11 [1] 13/-1/-1->12->11 [2] 13/-1/-1->12->11 [3] 13/-1/-1->12->11
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Trees [0] 10/-1/-1->9->8 [1] -1/-1/-1->9->8 [2] 10/-1/-1->9->8 [3] -1/-1/-1->9->8
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 00/0 : 12[87000] -> 13[90000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 02/0 : 12[87000] -> 13[90000] via P2P/IPC/read
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Channel 00/0 : 9[f000] -> 10[47000] via P2P/IPC/read
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Channel 02/0 : 9[f000] -> 10[47000] via P2P/IPC/read
[default4]:dgx045:217074:217684 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 00/0 : 4[87000] -> 5[90000] via P2P/IPC/read
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 02/0 : 4[87000] -> 5[90000] via P2P/IPC/read
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 00/04 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 01/04 : 0 7 6 5 1 3 4 10 8 15 14 13 9 11 12 2
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 02/04 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 03/04 : 0 7 6 5 1 3 4 10 8 15 14 13 9 11 12 2
[default0]:dgx045:217070:217667 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] 1/-1/-1->0->7 [2] 1/-1/-1->0->8 [3] 1/-1/-1->0->7
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 00/0 : 15[bd000] -> 0[7000] [receive] via NET/IB/0
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 02/0 : 15[bd000] -> 0[7000] [receive] via NET/IB/0
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 00/0 : 0[7000] -> 1[f000] via P2P/IPC/read
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 02/0 : 0[7000] -> 1[f000] via P2P/IPC/read
[default1]:dgx045:217071:217690 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] -1/-1/-1->1->0
[default1]:dgx045:217071:217690 [1] NCCL INFO Channel 00/0 : 1[f000] -> 2[47000] via P2P/IPC/read
[default1]:dgx045:217071:217690 [1] NCCL INFO Channel 02/0 : 1[f000] -> 2[47000] via P2P/IPC/read
[default2]:dgx045:217072:217681 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/10/-1->2->-1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->10
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 00/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 02/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
[default3]:dgx045:217073:217689 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2
[default3]:dgx045:217073:217689 [3] NCCL INFO Channel 00/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
[default3]:dgx045:217073:217689 [3] NCCL INFO Channel 01/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
[default3]:dgx045:217073:217689 [3] NCCL INFO Channel 02/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
[default3]:dgx045:217073:217689 [3] NCCL INFO Channel 03/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 00/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 02/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
[default7]:dgx045:217077:217680 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] 0/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] 0/-1/-1->7->6
[default7]:dgx045:217077:217680 [7] NCCL INFO Channel 00/0 : 7[bd000] -> 8[7000] [send] via NET/IB/8
[default7]:dgx045:217077:217680 [7] NCCL INFO Channel 02/0 : 7[bd000] -> 8[7000] [send] via NET/IB/8
[default6]:dgx045:217076:217685 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5
[default6]:dgx045:217076:217685 [6] NCCL INFO Channel 00/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
[default6]:dgx045:217076:217685 [6] NCCL INFO Channel 02/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Trees [0] 14/-1/-1->13->12 [1] 14/-1/-1->13->12 [2] 14/-1/-1->13->12 [3] 14/-1/-1->13->12
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 00/0 : 13[90000] -> 14[b7000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 02/0 : 13[90000] -> 14[b7000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 01/0 : 13[90000] -> 9[f000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 03/0 : 13[90000] -> 9[f000] via P2P/IPC/read
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Trees [0] -1/-1/-1->15->14 [1] 8/-1/-1->15->14 [2] -1/-1/-1->15->14 [3] 8/-1/-1->15->14
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Channel 00/0 : 15[bd000] -> 0[7000] [send] via NET/IB/8
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Channel 02/0 : 15[bd000] -> 0[7000] [send] via NET/IB/8
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Trees [0] 11/-1/-1->10->9 [1] 11/-1/-1->10->2 [2] 11/-1/-1->10->9 [3] 11/2/-1->10->-1
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 00/0 : 10[47000] -> 11[4e000] via P2P/IPC/read
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 02/0 : 10[47000] -> 11[4e000] via P2P/IPC/read
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Trees [0] 12/-1/-1->11->10 [1] 12/-1/-1->11->10 [2] 12/-1/-1->11->10 [3] 12/-1/-1->11->10
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Channel 00/0 : 11[4e000] -> 12[87000] via P2P/IPC/read
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Channel 01/0 : 11[4e000] -> 12[87000] via P2P/IPC/read
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Channel 02/0 : 11[4e000] -> 12[87000] via P2P/IPC/read
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Channel 03/0 : 11[4e000] -> 12[87000] via P2P/IPC/read
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Trees [0] 15/-1/-1->14->13 [1] 15/-1/-1->14->13 [2] 15/-1/-1->14->13 [3] 15/-1/-1->14->13
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Channel 00/0 : 14[b7000] -> 15[bd000] via P2P/IPC/read
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Channel 02/0 : 14[b7000] -> 15[bd000] via P2P/IPC/read
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 01/0 : 8[7000] -> 15[bd000] via P2P/IPC/read
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 03/0 : 8[7000] -> 15[bd000] via P2P/IPC/read
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Connected all rings
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 01/0 : 8[7000] -> 9[f000] via P2P/IPC/read
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 03/0 : 8[7000] -> 9[f000] via P2P/IPC/read
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 00/0 : 0[7000] -> 8[7000] [receive] via NET/IB/4
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 02/0 : 0[7000] -> 8[7000] [receive] via NET/IB/4
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 00/0 : 8[7000] -> 0[7000] [send] via NET/IB/4
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Channel 02/0 : 8[7000] -> 0[7000] [send] via NET/IB/4
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 01/0 : 12[87000] -> 2[47000] [send] via NET/IB/5
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Channel 01/0 : 9[f000] -> 11[4e000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 03/0 : 12[87000] -> 2[47000] [send] via NET/IB/5
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Connected all rings
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Channel 03/0 : 9[f000] -> 11[4e000] via P2P/IPC/read
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Connected all rings
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Channel 00/0 : 9[f000] -> 8[7000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 01/0 : 12[87000] -> 13[90000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 03/0 : 12[87000] -> 13[90000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 00/0 : 12[87000] -> 11[4e000] via P2P/IPC/read
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Channel 01/0 : 9[f000] -> 8[7000] via P2P/IPC/read
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Channel 02/0 : 9[f000] -> 8[7000] via P2P/IPC/read
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Channel 03/0 : 9[f000] -> 8[7000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 01/0 : 12[87000] -> 11[4e000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 02/0 : 12[87000] -> 11[4e000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Channel 03/0 : 12[87000] -> 11[4e000] via P2P/IPC/read
[default4]:dgx046:1564838:1565725 [4] NCCL INFO Connected all trees
[default4]:dgx046:1564838:1565725 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default4]:dgx046:1564838:1565725 [4] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 01/0 : 4[87000] -> 10[47000] [send] via NET/IB/5
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 03/0 : 4[87000] -> 10[47000] [send] via NET/IB/5
[default4]:dgx045:217074:217684 [4] NCCL INFO Connected all rings
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 01/0 : 4[87000] -> 5[90000] via P2P/IPC/read
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 03/0 : 4[87000] -> 5[90000] via P2P/IPC/read
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 00/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 01/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 02/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
[default4]:dgx045:217074:217684 [4] NCCL INFO Channel 03/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
[default4]:dgx045:217074:217684 [4] NCCL INFO Connected all trees
[default4]:dgx045:217074:217684 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default4]:dgx045:217074:217684 [4] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 01/0 : 0[7000] -> 7[bd000] via P2P/IPC/read
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 03/0 : 0[7000] -> 7[bd000] via P2P/IPC/read
[default0]:dgx045:217070:217667 [0] NCCL INFO Connected all rings
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 01/0 : 0[7000] -> 1[f000] via P2P/IPC/read
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 03/0 : 0[7000] -> 1[f000] via P2P/IPC/read
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 00/0 : 8[7000] -> 0[7000] [receive] via NET/IB/4
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 02/0 : 8[7000] -> 0[7000] [receive] via NET/IB/4
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 00/0 : 0[7000] -> 8[7000] [send] via NET/IB/4
[default0]:dgx045:217070:217667 [0] NCCL INFO Channel 02/0 : 0[7000] -> 8[7000] [send] via NET/IB/4
[default1]:dgx045:217071:217690 [1] NCCL INFO Channel 01/0 : 1[f000] -> 3[4e000] via P2P/IPC/read
[default1]:dgx045:217071:217690 [1] NCCL INFO Channel 03/0 : 1[f000] -> 3[4e000] via P2P/IPC/read
[default1]:dgx045:217071:217690 [1] NCCL INFO Connected all rings
[default1]:dgx045:217071:217690 [1] NCCL INFO Channel 00/0 : 1[f000] -> 0[7000] via P2P/IPC/read
[default1]:dgx045:217071:217690 [1] NCCL INFO Channel 01/0 : 1[f000] -> 0[7000] via P2P/IPC/read
[default1]:dgx045:217071:217690 [1] NCCL INFO Channel 02/0 : 1[f000] -> 0[7000] via P2P/IPC/read
[default1]:dgx045:217071:217690 [1] NCCL INFO Channel 03/0 : 1[f000] -> 0[7000] via P2P/IPC/read
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 01/0 : 12[87000] -> 2[47000] [receive] via NET/IB/2
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 03/0 : 12[87000] -> 2[47000] [receive] via NET/IB/2
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 01/0 : 2[47000] -> 0[7000] via P2P/IPC/read
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 03/0 : 2[47000] -> 0[7000] via P2P/IPC/read
[default2]:dgx045:217072:217681 [2] NCCL INFO Connected all rings
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 01/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 03/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
[default3]:dgx045:217073:217689 [3] NCCL INFO Connected all rings
[default3]:dgx045:217073:217689 [3] NCCL INFO Channel 00/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
[default3]:dgx045:217073:217689 [3] NCCL INFO Channel 01/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
[default3]:dgx045:217073:217689 [3] NCCL INFO Channel 02/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
[default3]:dgx045:217073:217689 [3] NCCL INFO Channel 03/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
[default7]:dgx045:217077:217680 [7] NCCL INFO Channel 01/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
[default6]:dgx045:217076:217685 [6] NCCL INFO Channel 01/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
[default6]:dgx045:217076:217685 [6] NCCL INFO Channel 03/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
[default7]:dgx045:217077:217680 [7] NCCL INFO Channel 03/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
[default7]:dgx045:217077:217680 [7] NCCL INFO Connected all rings
[default7]:dgx045:217077:217680 [7] NCCL INFO Channel 01/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
[default6]:dgx045:217076:217685 [6] NCCL INFO Connected all rings
[default7]:dgx045:217077:217680 [7] NCCL INFO Channel 03/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
[default6]:dgx045:217076:217685 [6] NCCL INFO Channel 01/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 01/0 : 5[90000] -> 1[f000] via P2P/IPC/read
[default6]:dgx045:217076:217685 [6] NCCL INFO Channel 03/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
[default6]:dgx045:217076:217685 [6] NCCL INFO Channel 00/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 03/0 : 5[90000] -> 1[f000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Connected all rings
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 01/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
[default6]:dgx045:217076:217685 [6] NCCL INFO Channel 02/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 03/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 00/0 : 5[90000] -> 4[87000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 01/0 : 5[90000] -> 4[87000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 02/0 : 5[90000] -> 4[87000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Channel 03/0 : 5[90000] -> 4[87000] via P2P/IPC/read
[default5]:dgx045:217075:217682 [5] NCCL INFO Connected all trees
[default5]:dgx045:217075:217682 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default5]:dgx045:217075:217682 [5] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Connected all rings
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Channel 01/0 : 15[bd000] -> 14[b7000] via P2P/IPC/read
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Channel 03/0 : 15[bd000] -> 14[b7000] via P2P/IPC/read
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Connected all rings
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 01/0 : 13[90000] -> 14[b7000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 03/0 : 13[90000] -> 14[b7000] via P2P/IPC/read
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Channel 01/0 : 15[bd000] -> 8[7000] via P2P/IPC/read
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Channel 03/0 : 15[bd000] -> 8[7000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 00/0 : 13[90000] -> 12[87000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 01/0 : 13[90000] -> 12[87000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 02/0 : 13[90000] -> 12[87000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Channel 03/0 : 13[90000] -> 12[87000] via P2P/IPC/read
[default5]:dgx046:1564840:1565724 [5] NCCL INFO Connected all trees
[default5]:dgx046:1564840:1565724 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default5]:dgx046:1564840:1565724 [5] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 01/0 : 4[87000] -> 10[47000] [receive] via NET/IB/2
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 03/0 : 4[87000] -> 10[47000] [receive] via NET/IB/2
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 01/0 : 10[47000] -> 8[7000] via P2P/IPC/read
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 03/0 : 10[47000] -> 8[7000] via P2P/IPC/read
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Connected all rings
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 01/0 : 10[47000] -> 11[4e000] via P2P/IPC/read
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 03/0 : 10[47000] -> 11[4e000] via P2P/IPC/read
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 01/0 : 2[47000] -> 10[47000] [receive] via NET/IB/9
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 03/0 : 2[47000] -> 10[47000] [receive] via NET/IB/9
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 01/0 : 10[47000] -> 2[47000] [send] via NET/IB/9
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 03/0 : 10[47000] -> 2[47000] [send] via NET/IB/9
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Connected all rings
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Channel 00/0 : 11[4e000] -> 10[47000] via P2P/IPC/read
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Channel 01/0 : 11[4e000] -> 10[47000] via P2P/IPC/read
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Channel 02/0 : 11[4e000] -> 10[47000] via P2P/IPC/read
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Channel 03/0 : 11[4e000] -> 10[47000] via P2P/IPC/read
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Channel 01/0 : 14[b7000] -> 13[90000] via P2P/IPC/read
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Channel 03/0 : 14[b7000] -> 13[90000] via P2P/IPC/read
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Connected all rings
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Channel 01/0 : 14[b7000] -> 15[bd000] via P2P/IPC/read
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Channel 03/0 : 14[b7000] -> 15[bd000] via P2P/IPC/read
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Channel 00/0 : 14[b7000] -> 13[90000] via P2P/IPC/read
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Channel 02/0 : 14[b7000] -> 13[90000] via P2P/IPC/read
[default0]:dgx046:1564834:1565726 [0] NCCL INFO Connected all trees
[default0]:dgx046:1564834:1565726 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default0]:dgx046:1564834:1565726 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default1]:dgx046:1564835:1565728 [1] NCCL INFO Connected all trees
[default1]:dgx046:1564835:1565728 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default1]:dgx046:1564835:1565728 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default0]:dgx045:217070:217667 [0] NCCL INFO Connected all trees
[default0]:dgx045:217070:217667 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default0]:dgx045:217070:217667 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default1]:dgx045:217071:217690 [1] NCCL INFO Connected all trees
[default1]:dgx045:217071:217690 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default1]:dgx045:217071:217690 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 01/0 : 10[47000] -> 2[47000] [receive] via NET/IB/9
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 03/0 : 10[47000] -> 2[47000] [receive] via NET/IB/9
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 01/0 : 2[47000] -> 10[47000] [send] via NET/IB/9
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 03/0 : 2[47000] -> 10[47000] [send] via NET/IB/9
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 00/0 : 2[47000] -> 1[f000] via P2P/IPC/read
[default2]:dgx045:217072:217681 [2] NCCL INFO Channel 02/0 : 2[47000] -> 1[f000] via P2P/IPC/read
[default2]:dgx045:217072:217681 [2] NCCL INFO Connected all trees
[default2]:dgx045:217072:217681 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default2]:dgx045:217072:217681 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default3]:dgx045:217073:217689 [3] NCCL INFO Connected all trees
[default3]:dgx045:217073:217689 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default3]:dgx045:217073:217689 [3] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default7]:dgx045:217077:217680 [7] NCCL INFO Channel 00/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
[default7]:dgx045:217077:217680 [7] NCCL INFO Channel 02/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
[default7]:dgx045:217077:217680 [7] NCCL INFO Connected all trees
[default7]:dgx045:217077:217680 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default7]:dgx045:217077:217680 [7] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default6]:dgx045:217076:217685 [6] NCCL INFO Connected all trees
[default6]:dgx045:217076:217685 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default6]:dgx045:217076:217685 [6] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Channel 00/0 : 15[bd000] -> 14[b7000] via P2P/IPC/read
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Channel 02/0 : 15[bd000] -> 14[b7000] via P2P/IPC/read
[default7]:dgx046:1564845:1565723 [7] NCCL INFO Connected all trees
[default7]:dgx046:1564845:1565723 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default7]:dgx046:1564845:1565723 [7] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default7]:dgx046:1564845:1565723 [7] NCCL INFO comm 0x1361d4f0 rank 15 nranks 16 cudaDev 7 busId bd000 - Init COMPLETE
[default5]:dgx046:1564840:1565724 [5] NCCL INFO comm 0x13626f30 rank 13 nranks 16 cudaDev 5 busId 90000 - Init COMPLETE
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 00/0 : 10[47000] -> 9[f000] via P2P/IPC/read
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Channel 02/0 : 10[47000] -> 9[f000] via P2P/IPC/read
[default2]:dgx046:1564836:1565727 [2] NCCL INFO Connected all trees
[default2]:dgx046:1564836:1565727 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default2]:dgx046:1564836:1565727 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default2]:dgx046:1564836:1565727 [2] NCCL INFO comm 0x12c889b0 rank 10 nranks 16 cudaDev 2 busId 47000 - Init COMPLETE
[default3]:dgx046:1564837:1565729 [3] NCCL INFO Connected all trees
[default3]:dgx046:1564837:1565729 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default3]:dgx046:1564837:1565729 [3] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default3]:dgx046:1564837:1565729 [3] NCCL INFO comm 0x13bfe5e0 rank 11 nranks 16 cudaDev 3 busId 4e000 - Init COMPLETE
[default6]:dgx046:1564841:1565722 [6] NCCL INFO Connected all trees
[default6]:dgx046:1564841:1565722 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
[default6]:dgx046:1564841:1565722 [6] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
[default6]:dgx046:1564841:1565722 [6] NCCL INFO comm 0x136ab020 rank 14 nranks 16 cudaDev 6 busId b7000 - Init COMPLETE
[default0]:dgx046:1564834:1565726 [0] NCCL INFO comm 0x13327b50 rank 8 nranks 16 cudaDev 0 busId 7000 - Init COMPLETE
[default1]:dgx046:1564835:1565728 [1] NCCL INFO comm 0x142c6490 rank 9 nranks 16 cudaDev 1 busId f000 - Init COMPLETE
[default4]:dgx046:1564838:1565725 [4] NCCL INFO comm 0x13a85850 rank 12 nranks 16 cudaDev 4 busId 87000 - Init COMPLETE
[default4]:dgx045:217074:217684 [4] NCCL INFO comm 0x129fd1b0 rank 4 nranks 16 cudaDev 4 busId 87000 - Init COMPLETE
[default0]:dgx045:217070:217667 [0] NCCL INFO comm 0x1300fd70 rank 0 nranks 16 cudaDev 0 busId 7000 - Init COMPLETE
[default1]:dgx045:217071:217690 [1] NCCL INFO comm 0x12854cb0 rank 1 nranks 16 cudaDev 1 busId f000 - Init COMPLETE
[default2]:dgx045:217072:217681 [2] NCCL INFO comm 0x12e8f850 rank 2 nranks 16 cudaDev 2 busId 47000 - Init COMPLETE
[default3]:dgx045:217073:217689 [3] NCCL INFO comm 0x1273c450 rank 3 nranks 16 cudaDev 3 busId 4e000 - Init COMPLETE
[default7]:dgx045:217077:217680 [7] NCCL INFO comm 0x144e2870 rank 7 nranks 16 cudaDev 7 busId bd000 - Init COMPLETE
[default5]:dgx045:217075:217682 [5] NCCL INFO comm 0x12ea1b00 rank 5 nranks 16 cudaDev 5 busId 90000 - Init COMPLETE
[default6]:dgx045:217076:217685 [6] NCCL INFO comm 0x1420d660 rank 6 nranks 16 cudaDev 6 busId b7000 - Init COMPLETE
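
With interleaved output like the above it is hard to see by eye whether every rank finished NCCL initialization. In this excerpt all 16 ranks (0-7 on dgx045, 8-15 on dgx046) do print `Init COMPLETE`. The following is a minimal sketch for checking that mechanically. It assumes the log line layout shown above; the names `INIT_RE` and `summarize_init` are made up for illustration and are not part of NCCL or DeepSpeed.

```python
import re

# Matches lines such as:
# [default7]:dgx046:1564845:1565723 [7] NCCL INFO comm 0x1361d4f0 rank 15 nranks 16 cudaDev 7 busId bd000 - Init COMPLETE
# The layout is assumed from the log in this issue; other NCCL versions may differ.
INIT_RE = re.compile(
    r"\]:(?P<host>[^:\s]+):\d+:\d+ \[\d+\] NCCL INFO comm \S+ "
    r"rank (?P<rank>\d+) nranks (?P<nranks>\d+) cudaDev (?P<dev>\d+) "
    r"busId (?P<bus>\w+) - Init COMPLETE"
)


def summarize_init(log_text):
    """Return ({rank: (host, cuda_device)}, [ranks with no 'Init COMPLETE' line])."""
    completed = {}
    nranks = 0
    for m in INIT_RE.finditer(log_text):
        completed[int(m["rank"])] = (m["host"], int(m["dev"]))
        nranks = max(nranks, int(m["nranks"]))
    # Any rank below nranks that never logged completion is reported as missing.
    missing = [r for r in range(nranks) if r not in completed]
    return completed, missing


# Two lines copied from the log above, used as a small self-check.
sample = (
    "[default7]:dgx046:1564845:1565723 [7] NCCL INFO comm 0x1361d4f0 "
    "rank 15 nranks 16 cudaDev 7 busId bd000 - Init COMPLETE\n"
    "[default0]:dgx045:217070:217667 [0] NCCL INFO comm 0x1300fd70 "
    "rank 0 nranks 16 cudaDev 0 busId 7000 - Init COMPLETE\n"
)
completed, missing = summarize_init(sample)
print(completed)
print(missing)
```

Running `summarize_init` on the full log text should give an empty `missing` list if every rank initialized. A non-empty list would point to the ranks (and therefore the host) to look at first.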
Expected behavior
No response
System Info
No response
Others
No response