Training exit code 0 even if CUDA errors #146

Open
subdavis opened this issue Aug 31, 2021 · 0 comments

ERROR: an <class 'RuntimeError'> error occurred in the train loop: RuntimeError('CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 7.80 GiB total capacity; 1.28 GiB already allocated; 5.44 MiB free; 1.32 GiB reserved in total by PyTorch)',)
INFO: Traceback (most recent call last):
  File "/opt/noaa/viame/lib/python3.6/site-packages/netharn/fit_harn.py", line 1554, in run
    test_loader
  File "/opt/noaa/viame/lib/python3.6/site-packages/netharn/fit_harn.py", line 1768, in _run_tagged_epochs
    harn._run_epoch(train_loader, tag='train', learn=True)
  File "/opt/noaa/viame/lib/python3.6/site-packages/netharn/fit_harn.py", line 1981, in _run_epoch
    outputs, loss = harn.run_batch(batch)
  File "/opt/noaa/viame/lib/python3.6/site-packages/bioharn/detect_fit.py", line 293, in run_batch
    return_result=return_result)
  File "/opt/noaa/viame/lib/python3.6/site-packages/netharn/device.py", line 58, in forward
    return self.module.forward(*inputs, **kwargs)
  File "/opt/noaa/viame/lib/python3.6/site-packages/bioharn/models/mm_models.py", line 871, in forward
    return_loss=True, **trainkw)
  File "/opt/noaa/viame/lib/python3.6/site-packages/mmcv/runner/fp16_utils.py", line 84, in new_func
    return old_func(*args, **kwargs)
  File "/opt/noaa/viame/lib/python3.6/site-packages/mmdet/models/detectors/base.py", line 168, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/opt/noaa/viame/lib/python3.6/site-packages/mmdet/models/detectors/two_stage.py", line 142, in forward_train
    x = self.extract_feat(img)
  File "/opt/noaa/viame/lib/python3.6/site-packages/mmdet/models/detectors/two_stage.py", line 82, in extract_feat
    x = self.backbone(img)
  File "/opt/noaa/viame/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/noaa/viame/lib/python3.6/site-packages/mmdet/models/backbones/resnet.py", line 635, in forward
    x = res_layer(x)
  File "/opt/noaa/viame/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/noaa/viame/lib/python3.6/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/opt/noaa/viame/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/noaa/viame/lib/python3.6/site-packages/mmdet/models/backbones/resnet.py", line 296, in forward
    out = _inner_forward(x)
  File "/opt/noaa/viame/lib/python3.6/site-packages/mmdet/models/backbones/resnet.py", line 280, in _inner_forward
    out = self.conv3(out)
  File "/opt/noaa/viame/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/noaa/viame/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 423, in forward
    return self._conv_forward(input, self.weight)
  File "/opt/noaa/viame/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 7.80 GiB total capacity; 1.28 GiB already allocated; 5.44 MiB free; 1.32 GiB reserved in total by PyTorch)

The process exits with code 0, so the job looks like it succeeded when it didn't.
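One possible shape of a fix, as a minimal sketch only: the log above suggests the exception is caught and logged somewhere in the train loop, and if nothing re-raises it or sets a failure status, the interpreter exits with 0. The names `run_training`, `failing_train`, and `main` below are hypothetical stand-ins, not netharn or VIAME functions, and where the exception is actually swallowed in the real code has not been confirmed.

```python
import sys
import traceback


def run_training(train_fn):
    """Run train_fn and return an exit code: 0 on success, 1 on failure.

    Hypothetical wrapper: it still logs the error, but it also reports
    the failure to the caller instead of silently continuing.
    """
    try:
        train_fn()
    except Exception as ex:
        # Log roughly in the style of the output above (format is approximate).
        print("ERROR: an {} error occurred in the train loop: {!r}".format(type(ex), ex))
        print("INFO: " + traceback.format_exc())
        return 1
    return 0


def failing_train():
    # Stand-in for a training step that hits a CUDA out-of-memory error.
    raise RuntimeError("CUDA out of memory.")


def main():
    # Pass the status to sys.exit so the calling job sees a nonzero code.
    sys.exit(run_training(failing_train))
```

Calling `main()` from the training entry point would make the process exit with code 1 in the failure case, so a job runner checking the exit status could mark the run as failed. Re-raising the exception after logging it would have the same effect.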
