CUDA error running on multiple GPUs: torch.cuda.nccl.NcclError: System Error (2) #12

Open

adrianalbert opened this issue Oct 26, 2017 · 0 comments

Hi,

I've been trying to run the example code (on the maps dataset):

python main.py --dataset=maps --num_gpu=4

I get the NCCL-related error below. I'm running this on 4 K80 GPUs.

Any suggestions on what could be causing this and how I might fix it?
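
Since the traceback shows the failure happening inside torch.cuda.comm.broadcast before any training step runs, a tiny standalone broadcast should tell whether NCCL itself, rather than the DiscoGAN code, is the problem. A minimal sketch (not from the repo; it assumes 4 visible GPUs to match --num_gpu=4):

# hypothetical standalone repro of the broadcast that DataParallel's replicate() performs
import torch
import torch.cuda.comm as comm

devices = list(range(4))               # assumes 4 visible GPUs
x = torch.randn(8).cuda(devices[0])    # any small tensor on the first GPU

# this goes through the same torch.cuda.nccl.broadcast path as in the traceback below
copies = comm.broadcast(x, devices)
print([c.get_device() for c in copies])

If this fails with the same NcclError, the problem is in the NCCL setup rather than in this repository. Full output from the failing run: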

pix2pix processing: 100%|#######################| 1096/1096 [00:00<00:00, 178591.97it/s]
pix2pix processing: 100%|#######################| 1096/1096 [00:00<00:00, 213732.43it/s]
[*] MODEL dir: logs/maps_2017-10-26_20-36-34
[*] PARAM path: logs/maps_2017-10-26_20-36-34/params.json
0%| | 0/500000 [00:00<?, ?it/s]

Traceback (most recent call last):
  File "main.py", line 41, in <module>
    main(config)
  File "main.py", line 33, in main
    trainer.train()
  File "/home/nbserver/DiscoGAN-pytorch/trainer.py", line 193, in train
    x_AB = self.G_AB(x_A).detach()
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 59, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 64, in replicate
    return replicate(module, device_ids)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/replicate.py", line 12, in replicate
    param_copies = Broadcast(devices)(*params)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/_functions.py", line 19, in forward
    outputs = comm.broadcast_coalesced(inputs, self.target_gpus)
  File "/usr/local/lib/python2.7/dist-packages/torch/cuda/comm.py", line 54, in broadcast_coalesced
    results = broadcast(_flatten_tensors(chunk), devices)
  File "/usr/local/lib/python2.7/dist-packages/torch/cuda/comm.py", line 24, in broadcast
    nccl.broadcast(tensors)
  File "/usr/local/lib/python2.7/dist-packages/torch/cuda/nccl.py", line 182, in broadcast
    comm = communicator(inputs)
  File "/usr/local/lib/python2.7/dist-packages/torch/cuda/nccl.py", line 133, in communicator
    _communicators[key] = NcclCommList(devices)
  File "/usr/local/lib/python2.7/dist-packages/torch/cuda/nccl.py", line 106, in __init__
    check_error(lib.ncclCommInitAll(self, len(devices), int_array(devices)))
  File "/usr/local/lib/python2.7/dist-packages/torch/cuda/nccl.py", line 118, in check_error
    raise NcclError(status)
torch.cuda.nccl.NcclError: System Error (2)
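
For what it's worth, the raise comes from ncclCommInitAll, i.e., NCCL fails while creating the communicator across the four devices, before any data moves, which suggests an environment issue (peer-to-peer access between the GPUs, or IPC/shared-memory limits, e.g., inside a container) rather than a bug in this code. Things I plan to try (NCCL_DEBUG, NCCL_P2P_DISABLE, and CUDA_VISIBLE_DEVICES are standard NCCL/CUDA environment variables, not options of this repo; NCCL_P2P_DISABLE may not be honored by very old NCCL builds):

NCCL_DEBUG=INFO python main.py --dataset=maps --num_gpu=4

NCCL_P2P_DISABLE=1 python main.py --dataset=maps --num_gpu=4

CUDA_VISIBLE_DEVICES=0,1 python main.py --dataset=maps --num_gpu=2

The first asks NCCL to log why communicator init fails, the second rules out peer-to-peer transport problems, and the third checks whether a subset of the GPUs works.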
