
Error encountered with single-machine, single-GPU training #83

Open

yang-stephen opened this issue Apr 4, 2022 · 3 comments

@yang-stephen

Hello, I would really like to know how to modify the code for single-machine, single-GPU training. I tried running it directly on one GPU and hit this error:
```
Traceback (most recent call last):
  File "train.py", line 429, in <module>
    train()
  File "train.py", line 292, in train
    out, out16, out32, detail8 = net(im)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 885, in forward
    inputs, kwargs = self.to_kwargs(inputs, kwargs, self.device_ids[0])
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 992, in to_kwargs
    inputs = self._recursive_to(inputs, device_id) if inputs else []
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 986, in _recursive_to
    res = to_map(inputs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 977, in to_map
    return list(zip(*map(to_map, obj)))
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 955, in to_map
    if obj.device == torch.device("cuda", target_gpu):
RuntimeError: Device index must not be negative
```
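
As far as I can tell, this happens because `local_rank` keeps its default of -1 when train.py is run directly instead of through `torch.distributed.launch`, so DDP ends up constructing `torch.device("cuda", -1)`. A minimal sketch of the pattern (argument names assumed from the traceback, not copied from train.py):

```python
import argparse

import torch

parse = argparse.ArgumentParser()
# torch.distributed.launch fills this in for each process; running
# `python train.py` directly leaves it at the default of -1.
parse.add_argument('--local_rank', dest='local_rank', type=int, default=-1)
args = parse.parse_args()

# Inside DistributedDataParallel.forward() inputs are moved to
# device_ids[0], which reproduces the error when local_rank is -1:
#   torch.device('cuda', -1)  ->  RuntimeError: Device index must not be negative
device = torch.device('cuda', max(args.local_rank, 0))  # guard for single-GPU runs
```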

After changing the default value of local_rank to 0, I got this error instead:
```
Traceback (most recent call last):
  File "train.py", line 429, in <module>
    train()
  File "train.py", line 292, in train
    out, out16, out32, detail8 = net(im)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/STDC-Seg/models/model_stages.py", line 272, in forward
    feat_res2, feat_res4, feat_res8, feat_res16, feat_cp8, feat_cp16 = self.cp(x)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/STDC-Seg/models/model_stages.py", line 141, in forward
    avg = self.conv_avg(avg)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/STDC-Seg/models/model_stages.py", line 31, in forward
    x = self.bn(x)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/STDC-Seg/modules/bn.py", line 118, in forward
    return inplace_abn_sync(x, self.weight, self.bias, self.running_mean, self.running_var,
RuntimeError: Some elements marked as dirty during the forward method were not returned as output. The inputs that are modified inplace must all be outputs of the Function.
```

I don't know how to solve this and would really appreciate any pointers, or guidance on how to modify train.py to strip out distributed training. Many thanks!
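
For anyone attempting the same thing, a minimal sketch of the kind of change that drops the DDP dependency from train.py (variable names are assumed, and the InplaceABNSync error above still needs the nn.BatchNorm2d swap suggested further down):

```python
import torch.nn as nn

def wrap_net(model: nn.Module, distributed: bool = False,
             local_rank: int = 0) -> nn.Module:
    """Sketch: pick between DDP and plain single-GPU training.

    The distributed branch mirrors what train.py presumably does; the
    single-GPU branch simply skips init_process_group and the DDP wrapper.
    """
    model = model.cuda()
    if distributed:
        # Only valid after torch.distributed.init_process_group(...) has run.
        model = nn.parallel.DistributedDataParallel(
            model, device_ids=[local_rank], output_device=local_rank)
    return model
```

The DataLoader side needs the matching change: drop the DistributedSampler and pass shuffle=True instead.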

@Lee6384 commented Oct 16, 2022

Hi, did you solve this problem in the end? Did you manage to train successfully on a single machine with a single GPU?

@870572761

There is an answer in English that solved this problem.

@LingsiDS commented Mar 9, 2023

> RuntimeError: Some elements marked as dirty during the forward method were not returned as output. The inputs that are modified inplace must all be outputs of the Function.

@yang-stephen @Lee6384

If you hit the error above, the BatchNorm2d used in model_stages.py is the wrong one. I suggest using PyTorch's official normalization layer instead, i.e. nn.BatchNorm2d (in model_stages.py).
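
For reference, a sketch of what that swap could look like for the conv blocks in model_stages.py (the actual block may pass extra arguments, e.g. an activation keyword, that nn.BatchNorm2d does not accept and which would need removing):

```python
import torch.nn as nn

class ConvBNReLU(nn.Module):
    """Conv + BN + ReLU block with the custom sync BN replaced.

    Before (sketch): self.bn = InPlaceABNSync(out_chan, activation='none')
    After:           stock nn.BatchNorm2d, which never modifies its input
                     in-place, so the "marked as dirty" autograd error goes away.
    """
    def __init__(self, in_chan, out_chan, ks=3, stride=1, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_chan, out_chan, kernel_size=ks,
                              stride=stride, padding=padding, bias=False)
        self.bn = nn.BatchNorm2d(out_chan)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))
```

InPlaceABNSync mainly saves memory and synchronises BN statistics across GPUs, so on a single GPU the stock layer is a safe replacement.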
