
RuntimeError: The shape of the mask [32, 8732] at index 0 does not match the shape of the indexed tensor [279424, 1] at index 0 #173

Open · 17764591637 opened this issue Jun 4, 2018 · 54 comments

@17764591637

rps@rps:~/桌面/ssd.pytorch$ python3 train.py
/home/rps/桌面/ssd.pytorch/ssd.py:34: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
self.priors = Variable(self.priorbox.forward(), volatile=True)
/home/rps/桌面/ssd.pytorch/layers/modules/l2norm.py:17: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
init.constant(self.weight,self.gamma)
Loading base network...
Initializing weights...
train.py:214: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
init.xavier_uniform(param)
Loading the dataset...
Training SSD on: VOC0712
Using the specified args:
Namespace(basenet='vgg16_reducedfc.pth', batch_size=32, cuda=True, dataset='VOC', dataset_root='/home/rps/data/VOCdevkit/', gamma=0.1, lr=0.001, momentum=0.9, num_workers=4, resume=None, save_folder='weights/', start_iter=0, visdom=False, weight_decay=0.0005)
train.py:169: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
targets = [Variable(ann.cuda(), volatile=True) for ann in targets]
Traceback (most recent call last):
File "train.py", line 255, in <module>
train()
File "train.py", line 178, in train
loss_l, loss_c = criterion(out, targets)
File "/home/rps/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/home/rps/桌面/ssd.pytorch/layers/modules/multibox_loss.py", line 97, in forward
loss_c[pos] = 0 # filter out pos boxes for now
RuntimeError: The shape of the mask [32, 8732] at index 0 does not match the shape of the indexed tensor [279424, 1] at index 0

Can anyone help, please?

@isaactalx

I have the same error, using PyTorch 0.4 + Python 3.5.

@bobo0810 commented Jun 7, 2018

Python 3.5 and PyTorch 0.3.0: no problem.

@xscjun commented Jun 7, 2018

I have the same error. If I swap lines 96 and 97 in multibox_loss.py:

loss_c = loss_c.view(num, -1)
loss_c[pos] = 0

this error disappears, but another one follows:

"File "/home/.../ssd.pytorch/layers/modules/multibox_loss.py", line 115, in forward
loss_l /= N
RuntimeError: Expected object of type torch.cuda.FloatTensor but found type torch.cuda.LongTensor for argument #3 'other'"

The tensor types don't match. How can I fix it?

@slomrafgrav commented Jun 8, 2018

@xscjun change the line:

N = num_pos.data.sum()

to:

N = num_pos.data.sum().double()
loss_l = loss_l.double()
loss_c = loss_c.double()

This should work.

@gtwell commented Jul 25, 2018

Has anyone solved this problem? Help me, thanks.

@Lin-Zhipeng commented Sep 27, 2018

> @xscjun: I have the same error; swapping lines 96 and 97 fixes it, but then loss_l /= N raises "Expected object of type torch.cuda.FloatTensor but found type torch.cuda.LongTensor". How can I fix it?

The "pos" -> torch.Size([32, 8732])
The "loss_c" -> torch.Size([279424, 1])

When I add one line:

    loss_c = loss_c.view(pos.size()[0], pos.size()[1])  # added line
    loss_c[pos] = 0  # filter out pos boxes for now
    loss_c = loss_c.view(num, -1)

then it worked.
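
For anyone skimming: here is a self-contained toy reproduction of the mismatch and the reshape fix, using only the shapes reported above (nothing here is verbatim repo code):

    import torch

    num, num_priors = 32, 8732
    loss_c = torch.rand(num * num_priors, 1)              # flattened, as after log_sum_exp
    pos = torch.zeros(num, num_priors, dtype=torch.bool)  # positive-anchor mask

    # loss_c[pos] = 0  # raises: mask [32, 8732] vs indexed tensor [279424, 1]

    loss_c = loss_c.view(num, -1)  # reshape to [batch, num_priors] first
    loss_c[pos] = 0                # mask and tensor shapes now agree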

@zxt-triumph

> @xscjun: [same error and dtype follow-up as quoted above]

I have the same error; how did you solve it in the end?


@matthewarthur

What file should be updated?

@queryor commented Nov 12, 2018

> @xscjun: [same error and dtype follow-up as quoted above]

Change the data type of N to FloatTensor.

@usherbob

> What file should be updated?

You can try updating /home/.../ssd.pytorch/layers/modules/multibox_loss.py, adding the one line @LZP4GitHub showed above.

@subicWang

@usherbob Python 3.6 + PyTorch 0.4.1: I added "loss_c = loss_c.view(pos.size()[0], pos.size()[1])", but I hit another issue: RuntimeError: copy_if failed to synchronize: device-side assert triggered.

@subicWang commented Nov 14, 2018

Finally, I succeeded.

Step 1: swap the two lines 97 and 98:
loss_c = loss_c.view(num, -1)
loss_c[pos] = 0  # filter out pos boxes for now

Step 2: change line 144's N = num_pos.data.sum() to:
N = num_pos.data.sum().double()
loss_l = loss_l.double()
loss_c = loss_c.double()
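
A self-contained sketch of what step 2 is working around: with toy values, the un-cast integer count is what made loss_l /= N raise the FloatTensor/LongTensor error on PyTorch 0.4.x (none of this is verbatim repo code):

    import torch

    loss_l = torch.rand(())            # localization loss, a float scalar
    num_pos = torch.tensor([3, 5, 2])  # positives per image, integer counts

    # N = num_pos.sum()    # LongTensor; loss_l /= N errored on torch 0.4
    N = num_pos.sum().double()         # the cast from step 2
    loss_l = loss_l.double()
    loss_l /= N                        # dtypes now agree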

@CJJ-717 commented Dec 14, 2018

> @subicWang: [steps 1 and 2 above]

I made these changes, but there is still a RuntimeError:
RuntimeError: device-side assert triggered
How can I fix it? Looking forward to your reply, thank you!

@wisdomk commented Mar 22, 2019

After swapping lines 97 and 98, it throws a new error for me:

Traceback (most recent call last):
  File "train.py", line 254, in <module>
    train()
  File "train.py", line 182, in train
    loc_loss += loss_l.data[0]
IndexError: invalid index of a 0-dim tensor. Use tensor.item() to convert a 0-dim tensor to a Python number

Any suggestions?

PS: I also tried converting the loss to double as mentioned above, and still got the same error!


### Solved
Apparently 'loss_l.data[0]' should be replaced with 'loss_l.item()'.
This replacement applies to every loss_x.data[0] in the file!
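
A runnable one-line illustration of why .item() is the right replacement (toy value; loss_l in train.py is a 0-dim tensor on PyTorch >= 0.4):

    import torch

    loss_l = torch.tensor(1.234)   # 0-dim tensor, like the losses in train.py
    loc_loss = 0.0

    # loc_loss += loss_l.data[0]   # IndexError: invalid index of a 0-dim tensor
    loc_loss += loss_l.item()      # portable replacement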

@leaf918 commented Mar 26, 2019

> @subicWang: [steps 1 and 2 above]

Nice, but there is a small bug: it is line 114, not line 144.

@TianSong1991

If your torch version is 0.4.1, you can make the following change.
Step 1: swap the two lines 97 and 98:
loss_c = loss_c.view(num, -1)
loss_c[pos] = 0  # filter out pos boxes for now
Step 2: change line 114's N = num_pos.data.sum() to:
N = num_pos.data.sum().double()
loss_l = loss_l.double()
loss_c = loss_c.double()
But if your torch version is 1.0.1, that change is not enough.

@TianSong1991

I solved the problem for torch version 1.0.1. The solution is the following three steps.
Steps 1 and 2 change multibox_loss.py:
Step 1: swap the two lines 97 and 98:
loss_c = loss_c.view(num, -1)
loss_c[pos] = 0  # filter out pos boxes for now
Step 2: change line 114's N = num_pos.data.sum() to:
N = num_pos.data.sum().double()
loss_l = loss_l.double()
loss_c = loss_c.double()
Step 3 changes train.py:
Step 3: change lines 188, 189, 193, 196:
loss_l.data[0] >> loss_l.data
loss_c.data[0] >> loss_c.data
loss.data[0] >> loss.data

@charan1561

Loss is increasing, as shown below:

timer: 2.2050 sec.
iter 0 || Loss: 153.4730 || timer: 1.8316 sec.
iter 10 || Loss: 48.9679 || timer: 1.8920 sec.
iter 20 || Loss: 191.8098 || timer: 2.0969 sec.
iter 30 || Loss: 110.8081 || timer: 1.8849 sec.
iter 40 || Loss: 106.9749 || timer: 1.9373 sec.
iter 50 || Loss: 134.3674 || timer: 2.0012 sec.
.
.

Please help me solve this issue.

@litianciucas

> @TianSong1991: [three-step fix for torch 1.0.1, quoted above]

Thanks, that was useful for me, but step 3 is lines 183, 184, 188, 191 (five occurrences): loss_x.data[0] >> loss_x.data or loss.data[0] >> loss.data.

@blueardour

Wouldn't loss_x.data[0] >> loss_x.item() be better?

@espectre

@TianSong1991 Thanks a lot. PyTorch 1.0 + Python 3.5: success!

@zz10001 commented May 9, 2019

> PS: I also tried converting the loss to double as mentioned above, and still got the same error!

Much obliged!

@mk123qwe

> @TianSong1991: [three-step fix for torch 1.0.1, quoted above]

But the loss is nan.

@mk123qwe

> @espectre: @TianSong1991 Thanks a lot. PyTorch 1.0 + Python 3.5: success!

But the loss is nan.

@xafarranxera

> @TianSong1991: [three-step fix for torch 1.0.1, quoted above]
> @mk123qwe: But the loss is nan.

I have the same problem. Why is the loss nan?

@OberstWB

> @TianSong1991: [0.4.1 fix quoted above: swap lines 97/98, cast N and the losses to double]

Hi, why isn't loss_l divided by N?

@HaoWu1993

Pytorch version:

>>> import torch
>>> print(torch.__version__)
1.1.0

Python version:

Python 3.6.7 (default, Oct 22 2018, 11:32:17)
[GCC 8.2.0] on linux

multibox_loss.py:

Switch the two lines 97,98:
loss_c = loss_c.view(num, -1)
loss_c[pos] = 0 # filter out pos boxes for now
Change line 114
N = num_pos.data.sum() -> N = num_pos.data.sum().double()
and change the following two lines to: 
loss_l = loss_l.double()
loss_c = loss_c.double()

train.py

loss_l.data[0] >> loss_l.data 
loss_c.data[0] >> loss_c.data 
loss.data[0] >> loss.data

And here is my output:

timer: 11.9583 sec.
iter 0 || Loss: 11728.9388 || timer: 0.2955 sec.
iter 10 || Loss: nan || timer: 0.2843 sec.
iter 20 || Loss: nan || timer: 0.2890 sec.
iter 30 || Loss: nan || timer: 0.2934 sec.
iter 40 || Loss: nan || timer: 0.2865 sec.
iter 50 || Loss: nan || timer: 0.2855 sec.
iter 60 || Loss: nan || timer: 0.2889 sec.
iter 70 || Loss: nan || timer: 0.2857 sec.
iter 80 || Loss: nan || timer: 0.2843 sec.
iter 90 || Loss: nan || timer: 0.2835 sec.
iter 100 || Loss: nan || timer: 0.2846 sec.
iter 110 || Loss: nan || timer: 0.2946 sec.
iter 120 || Loss: nan || timer: 0.2860 sec.
iter 130 || Loss: nan || timer: 0.2846 sec.
iter 140 || Loss: nan || timer: 0.2962 sec.
iter 150 || Loss: nan || timer: 0.2989 sec.
iter 160 || Loss: nan || timer: 0.2857 sec.

I've encountered the same one here; have you solved this problem?


@Billnut commented Oct 5, 2019

> @HaoWu1993: [applied the fix above; iter 0 loss 11728.9388, then nan from iter 10 on]

I think the loss is enormous; you should add two lines:
loss_l /= N
loss_c /= N

@mengxingkong

> @HaoWu1993: [applied the fix above; loss goes to nan from iter 10 on — have you solved this problem?]

I didn't change line 114, and the nan loss disappeared.


@haibochina

These loss values (loc_loss, conf_loss) go far out of range. You can use the following code:

N = num_pos.data.sum().double()
loss_l = loss_l.double()
loss_c = loss_c.double()
loss_l /= N
loss_c /= N

And in train.py, you should replace the two accumulation lines with:

loc_loss += loss_l.item()
conf_loss += loss_c.item()

Good luck!

Good! It works very well! Thank you!

@SalahAdDin

@haibochina What?

@haibochina

> @SalahAdDin: @haibochina What?

It means that the losses loc_loss and conf_loss go out of range. So you can change the source code as follows: N = num_pos.data.sum().double(), loss_l /= N, loss_c /= N, and then loc_loss += loss_l.item(), conf_loss += loss_c.item().

@SalahAdDin

I think PRs are welcome.

@up2m commented Nov 24, 2019

Thank you @haibochina; for the loss=nan issue, your method works very well!

@J0hannB commented Dec 13, 2019

I also had a nan loss issue after fixing multibox_loss.py

In my case it was because I was trying to use custom annotations and loading them as [x_center, y_center, width, height]

If anyone else is trying to do the same thing, the correct format is [x1, y1, x2, y2]

Training works now
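
If anyone needs to convert center-format annotations, a small hypothetical helper like this does it before the targets reach the loss (cxcywh_to_xyxy is an illustrative name, not something in the repo):

    import torch

    def cxcywh_to_xyxy(boxes):
        # [x_center, y_center, w, h] rows -> [x1, y1, x2, y2] corners.
        cx, cy, w, h = boxes.unbind(-1)
        return torch.stack((cx - w / 2, cy - h / 2,
                            cx + w / 2, cy + h / 2), dim=-1)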

@Json0926

> @HaoWu1993: [applied the fix above; loss goes to nan from iter 10 on]

Because the loss was too big, I changed line 115 to:

   N = num_pos.data.sum().double()
   loss_l = loss_l.double()
   loss_c = loss_c.double()
   loss_l /= N
   loss_c /= N

This solved the issue.
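
(For scale: N here is the total number of matched positive anchors in the batch, so the division turns both raw sums into per-positive averages. Compare the iter-0 loss of 11728 quoted above with the ~30 reported in later comments once the division is in place.)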

@ynjiun commented May 26, 2020

@TianSong1991, I followed your solution and got it running normally... but after a while (after iter 90) the loss exploded to nan. Did you experience the same thing?
timer: 6.1760 sec.
iter 0 || Loss: 31.7677 || timer: 0.3297 sec.
iter 10 || Loss: 24.6710 || timer: 0.3164 sec.
iter 20 || Loss: 24.0278 || timer: 0.3214 sec.
iter 30 || Loss: 25.0901 || timer: 0.3184 sec.
iter 40 || Loss: 16.9485 || timer: 0.3358 sec.
iter 50 || Loss: 17.5748 || timer: 0.3850 sec.
iter 60 || Loss: 26.2674 || timer: 0.3207 sec.
iter 70 || Loss: 20.7441 || timer: 0.3213 sec.
iter 80 || Loss: 16.5515 || timer: 0.3206 sec.
iter 90 || Loss: 25808.9131 || timer: 0.3171 sec.
iter 100 || Loss: nan || timer: 0.3274 sec.
iter 110 || Loss: nan || timer: 0.3548 sec.
iter 120 || Loss: nan || timer: 0.3141 sec.
iter 130 || Loss: nan || timer: 0.3231 sec.
iter 140 || Loss: nan || timer: 0.3254 sec.
iter 150 || Loss: nan || timer: 0.3174 sec.
iter 160 || Loss: nan || timer: 0.3144 sec.
iter 170 || Loss: nan || timer: 0.3679 sec.
iter 180 || Loss: nan || timer: 0.3631 sec.
iter 190 || Loss: nan || timer: 0.3516 sec.
iter 200 || Loss: nan || timer: 0.3692 sec.
iter 210 || Loss: nan || timer: 0.3523 sec.
iter 220 || Loss: nan || timer: 0.3204 sec.
iter 230 || Loss: nan || timer: 0.3151 sec.
iter 240 || Loss: nan || timer: 0.3210 sec.
iter 250 || Loss: nan || timer: 0.3241 sec.
iter 260 || Loss: nan || timer: 0.3217 sec.
iter 270 || Loss: nan || timer: 0.3156 sec.
iter 280 || Loss: nan || timer: 0.3125 sec.
iter 290 || Loss: nan || timer: 0.3196 sec.
iter 300 || Loss: nan || timer: 0.3172 sec.

@ynjiun commented May 26, 2020

With @TianSong1991's solution, except step 3 changed to the following.
Step 3: change train.py lines 183, 184, 188, 191:
loss_l.data[0] >> loss_l.item()
loss_c.data[0] >> loss_c.item()
loss.data[0] >> loss.item()
# now the loss is converging...
timer: 6.1581 sec.
iter 0 || Loss: 32.3338 || timer: 0.3283 sec.
iter 10 || Loss: 24.8091 || timer: 0.3328 sec.
iter 20 || Loss: 24.4980 || timer: 0.3275 sec.
iter 30 || Loss: 21.3105 || timer: 0.3167 sec.
iter 40 || Loss: 14.5682 || timer: 0.3223 sec.
iter 50 || Loss: 13.0729 || timer: 0.3221 sec.
iter 60 || Loss: 12.3032 || timer: 0.3383 sec.
iter 70 || Loss: 10.5260 || timer: 0.3246 sec.
iter 80 || Loss: 11.2028 || timer: 0.3380 sec.
iter 90 || Loss: 10.1715 || timer: 0.3244 sec.
iter 100 || Loss: 10.1702 || timer: 0.3342 sec.
iter 110 || Loss: 9.8668 || timer: 0.3384 sec.
iter 120 || Loss: 9.5938 || timer: 0.3676 sec.
iter 130 || Loss: 10.0942 || timer: 0.3210 sec.
iter 140 || Loss: 9.7601 || timer: 0.3246 sec.
iter 150 || Loss: 10.1564 || timer: 0.3202 sec.
iter 160 || Loss: 9.8361 || timer: 0.3215 sec.
iter 170 || Loss: 9.3565 || timer: 0.3290 sec.
iter 180 || Loss: 9.2069 || timer: 0.3481 sec.
iter 190 || Loss: 9.0822 || timer: 0.3374 sec.
iter 200 || Loss: 9.3702 || timer: 0.3333 sec.
iter 210 || Loss: 9.6193 || timer: 0.3437 sec.
iter 220 || Loss: 9.1466 || timer: 0.3590 sec.
iter 230 || Loss: 8.8923 || timer: 0.3211 sec.
iter 240 || Loss: 9.2617 || timer: 0.3526 sec.
iter 250 || Loss: 9.1713 || timer: 0.3263 sec.
iter 260 || Loss: 9.4524 || timer: 0.3262 sec.
iter 270 || Loss: 9.4929 || timer: 0.3581 sec.
iter 280 || Loss: 8.7274 || timer: 0.3345 sec.
iter 290 || Loss: 9.6723 || timer: 0.3701 sec.
......

@yingjun-zhang

> @ynjiun: [step 3 with .item(); loss converging, as above]

What are your torch and Python versions?

@He-zl8 commented Aug 1, 2020

When you encounter:
timer: 10.2599 sec.
iter 0 || Loss: 30.8010 || timer: 0.4961 sec.
iter 10 || Loss: 19.9977 || timer: 1.1120 sec.
iter 20 || Loss: 19.2539 || timer: 1.8164 sec.
iter 30 || Loss: 16.7701 || timer: 0.9436 sec.
iter 40 || Loss: 18.0430 || timer: 0.7898 sec.
iter 50 || Loss: 25.5106 || timer: 1.0395 sec.
iter 60 || Loss: 23.7020 || timer: 0.8617 sec.
iter 70 || Loss: nan || timer: 1.0497 sec.
iter 80 || Loss: nan || timer: 1.2802 sec.

maybe you can change lr to 1e-4. When I changed it, then:

timer: 10.1423 sec.
iter 0 || Loss: 29.5713 || timer: 0.4259 sec.
iter 10 || Loss: 22.9357 || timer: 1.2987 sec.
iter 20 || Loss: 20.2871 || timer: 1.1511 sec.
iter 30 || Loss: 20.0152 || timer: 0.9707 sec.
iter 40 || Loss: 19.3170 || timer: 0.9684 sec.
iter 50 || Loss: 19.0578 || timer: 1.0160 sec.
iter 60 || Loss: 19.2979 || timer: 1.2673 sec.
iter 70 || Loss: 18.9950 || timer: 1.1985 sec.
iter 80 || Loss: 16.6445 || timer: 1.2570 sec.

@cotyyang commented Sep 22, 2020

> @TianSong1991: [three-step fix for torch 1.0.1, quoted above]

Thanks, this answer solved my problem.

@hjlee9182

> @TianSong1991: [three-step fix for torch 1.0.1, quoted above]

This answer solved my problem too.
More precisely:
loss_l = loss_l.double()/N
loss_c = loss_c.double()/N
:)


@Certseeds commented Mar 25, 2021

If the loss is nan, maybe the learning_rate is too large.

@knotgrass

> If the loss is nan, maybe the learning_rate is too large.

Or batch_size is too small, or both.
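
Both knobs are command-line flags in train.py, judging by the Namespace printed in the original report (the exact flag spellings below are assumed from those argparse names):

    python3 train.py --lr 1e-4 --batch_size 32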

ashiks-qb added a commit to ashiks-qb/ssd.pytorch that referenced this issue Oct 30, 2021
@EsakaK commented Mar 14, 2022

There is still a problem. In step 2, it should be changed like this:

N = num_pos.data.sum().double()
loss_l = loss_l.double()/N
loss_c = loss_c.double()/N

Otherwise the loss will be nan.

yodhcn added a commit to yodhcn/ssd.pytorch that referenced this issue May 17, 2022
@sonukiller commented Jul 7, 2023

If you are using PyTorch 2, please follow this:

  1. In multibox_loss.py,
    Swap line no. 97 and 98

  2. In train.py,
    Line no. ~183: replace loc_loss += loss_l.data[0] with loc_loss += loss_l.item()
    Line no. ~184: replace conf_loss += loss_c.data[0] with conf_loss += loss_c.item()
    Line no. ~188 in print, replace loss.data[0] with loss.item()

This solved my problem!
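
(Note: this recipe drops the old step 2 entirely; on recent PyTorch versions, dividing the float losses by an integer count appears to be handled by type promotion, so the .double() casts are no longer needed.)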

@zuliani99

@sonukiller I'm still getting nan loss even with your suggestion and the previous one.

Do you suggest removing all the .data attributes and substituting Variable with plain torch tensors?
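
For what it's worth, a sketch of that modernization for the two volatile warnings in the original log, assuming PyTorch >= 1.0 where Variable is a no-op wrapper around Tensor (the tensors below are toy stand-ins for the per-image annotations):

    import torch

    targets = [torch.rand(3, 5) for _ in range(2)]  # toy annotation tensors

    # Old: targets = [Variable(ann.cuda(), volatile=True) for ann in targets]
    # New: plain tensors inside no_grad(); .cuda() only if available.
    with torch.no_grad():
        targets = [ann.cuda() if torch.cuda.is_available() else ann
                   for ann in targets]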
