
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 20) of binary.... #13

Open
bozhenhhu opened this issue Dec 12, 2022 · 1 comment


@bozhenhhu

When I run

```bash
python -m torch.distributed.launch --nproc_per_node=4 script/downstream.py -c config/downstream/GO-BP/gearnet_edge.yaml --gpus [0,1,2,3] --ckpt
```

on a single worker with 4× Tesla-V100-SXM2-32GB GPUs and 47 CPUs, I get the following error:

```
[E ProcessGroupNCCL.cpp:587] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1804901 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805985 milliseconds before timing out.
Traceback (most recent call last):
  File "/hubozhen/GearNet/script/downstream.py", line 75, in <module>
    train_and_validate(cfg, solver, scheduler)
  File "/hubozhen/GearNet/script/downstream.py", line 30, in train_and_validate
    solver.train(**kwargs)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/core/engine.py", line 155, in train
    loss, metric = model(batch)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/tasks/property_prediction.py", line 279, in forward
    pred = self.predict(batch, all_loss, metric)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/tasks/property_prediction.py", line 300, in predict
    output = self.model(graph, graph.node_feature.float(), all_loss=all_loss, metric=metric)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/models/gearnet.py", line 99, in forward
    edge_hidden = self.edge_layers[i](line_graph, edge_input)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/layers/conv.py", line 92, in forward
    output = self.combine(input, update)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/layers/conv.py", line 438, in combine
    output = self.batch_norm(output)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 758, in forward
    world_size,
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/nn/modules/_functions.py", line 42, in forward
    dist._all_gather_base(combined_flat, combined, process_group, async_op=False)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2070, in _all_gather_base
    work = group._allgather_base(output_tensor, input_tensor)
RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1804901 milliseconds before timing out.
/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/layers/functional/functional.py:474: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  index1 = local_index // local_inner_size + offset1
/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/layers/functional/functional.py:474: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  index1 = local_index // local_inner_size + offset1
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1804901 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out.
/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torchdrug/data/graph.py:1667: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  edge_in_index = local_index // local_inner_size + edge_in_offset
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805985 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 21 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 20) of binary: /opt/anaconda3/envs/manifold/bin/python
Traceback (most recent call last):
  File "/opt/anaconda3/envs/manifold/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/anaconda3/envs/manifold/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
===================================================
/hubozhen/GearNet/script/downstream.py FAILED
---------------------------------------------------
Failures:
[1]:
  time      : 2022-12-12_09:41:02
  host      : pytorch-7c3c96f1-d9hcm
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 22)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 22
[2]:
  time      : 2022-12-12_09:41:02
  host      : pytorch-7c3c96f1-d9hcm
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 23)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 23
---------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-12-12_09:41:02
  host      : pytorch-7c3c96f1-d9hcm
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 20)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 20
===================================================
```

I have seen people say this happens when loading large data, and I found that utilization on all four GPUs was at 100%.
However, when I ran the same procedure on another V100 machine (worker*1: Tesla-V100-SXM-32GB, 4 GPUs, 48 CPUs), it worked fine.
This confuses me.
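
For reference, here is a minimal sketch of raising the NCCL collective timeout with plain `torch.distributed`. This is generic PyTorch usage rather than the torchdrug/GearNet launch path, and the helper name and the 60-minute value are only illustrative, so treat it as a debugging starting point rather than a fix:

```python
# Sketch only: assumes the script is still started through torch.distributed.launch,
# which exports LOCAL_RANK / RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT.
# Raising the timeout only buys time for a slow collective; it does not explain
# why one rank stalls past the default 30 minutes (the 1800000 ms in the log).
import os
from datetime import timedelta

import torch
import torch.distributed as dist


def init_distributed(timeout_minutes: int = 60) -> int:
    """Initialize NCCL with a longer watchdog timeout and return the local rank."""
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(minutes=timeout_minutes),
    )
    return local_rank
```

Exporting `NCCL_DEBUG=INFO` (and `NCCL_ASYNC_ERROR_HANDLING=1` on this PyTorch version) before launching can also make it easier to see which collective is stalling.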

@Oxer11
Collaborator

Oxer11 commented Dec 13, 2022

Hi! I'm sorry, but I'm not familiar with elastic and don't know the cause of your problem. Maybe you could try the typical DDP in torch instead of the elastic launcher?
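
If it helps, here is a rough sketch of what I mean by plain DDP, launched with `torch.multiprocessing.spawn` instead of the elastic launcher. The address, port, timeout, and the `train_one_process` body are placeholders; you would need to move the logic from `downstream.py` into it yourself:

```python
# Sketch of a plain-DDP launch without the elastic agent.
# Only the process-group setup/teardown is standard PyTorch; the training loop
# is a placeholder for whatever script/downstream.py actually does.
import os
from datetime import timedelta

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def train_one_process(rank: int, world_size: int):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size,
        timeout=timedelta(minutes=60),
    )
    torch.cuda.set_device(rank)

    # Build the model here, wrap it with
    # torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank]),
    # and run the usual training loop (placeholder).

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train_one_process, args=(world_size,), nprocs=world_size)
```

If the same watchdog timeout still shows up with this setup, the elastic launcher is probably not the cause, and I would check whether one rank gets a much larger batch or hangs in data loading.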
