
Something went wrong when running "Training with multiple seen domains" #21

Open
7749546 opened this issue Jul 28, 2020 · 10 comments

Comments

@7749546

7749546 commented Jul 28, 2020

[image]
The loss is NaN. I tried reducing the learning rate, but it didn't help. Could you please give me some advice?
[image]
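A minimal sketch for catching the first non-finite loss as soon as it appears, instead of discovering NaN at the end of an epoch. It is plain Python; `check_loss` and `NonFiniteLossError` are hypothetical names, not part of this repository. In a PyTorch loop you would pass it `loss.item()` right after computing the loss; `torch.autograd.set_detect_anomaly(True)` can additionally point at the backward operation that produced the NaN, at a noticeable speed cost.

```python
import math


class NonFiniteLossError(RuntimeError):
    """Raised as soon as the training loss stops being a finite number."""


def check_loss(value: float, epoch: int, step: int) -> float:
    """Return `value` unchanged if it is finite; otherwise stop with context.

    Hypothetical helper: call it with `loss.item()` so the first bad step is
    reported instead of training on with NaN weights.
    """
    if not math.isfinite(value):  # catches both NaN and +/-inf
        raise NonFiniteLossError(
            f"loss became {value} at epoch {epoch}, step {step}"
        )
    return value


if __name__ == "__main__":
    # Simulated loss values; in real training these come from loss.item().
    losses = [2.31, 1.87, 1.52, float("nan"), 1.40]
    try:
        for step, value in enumerate(losses):
            check_loss(value, epoch=0, step=step)
    except NonFiniteLossError as err:
        print(err)  # loss became nan at epoch 0, step 3
```

Failing fast also tells you *when* the NaN starts (first iteration vs. after several epochs), which helps separate data or learning-rate problems from environment problems.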

@hytseng0509
Copy link
Owner

Which PyTorch version are you using?
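When answering a question like this, it helps to report the CUDA build and GPU as well as the torch version. PyTorch ships `python -m torch.utils.collect_env` for exactly this purpose; the sketch below is a smaller, hypothetical alternative (`collect_env` here is our own function name) that only uses long-standing attributes and degrades gracefully when torch or a GPU is missing.

```python
def collect_env() -> dict:
    """Gather the version details worth pasting into an issue (sketch)."""
    try:
        import torch
    except ImportError:
        return {"torch": None}

    info = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,          # None for CPU-only builds
        "cudnn": torch.backends.cudnn.version(),   # None if cuDNN is unavailable
        "gpu": None,
        "capability": None,
    }
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
        # Compute capability as (major, minor), e.g. (8, 6) on RTX 30-series cards.
        info["capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```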

@7749546
Copy link
Author

7749546 commented Jul 29, 2020

Thank you for your reply. My torch version is 1.2.0, but this command runs fine:
python3 train_baseline.py --method relationnet_softmax --dataset multi --testset cars --name multi_cars_ori_relationnet_softmax --warmup baseline --train_aug

The result looks OK; it is shown below:
[image]

@7749546
Author

7749546 commented Jul 29, 2020

So, judging from the result above, should I upgrade torch to 1.3.0 or later? Given that the original problem only appears when I switch to the other model, should I consider upgrading torch?
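Whether an upgrade is needed can be checked programmatically, but not by comparing version strings directly: as strings, `"1.10.0" < "1.3.0"` is true. A minimal sketch, assuming plain `major.minor.patch` versions with an optional local build suffix such as `+cu111`; `parse_version` and `at_least` are hypothetical helpers, and the thread itself never establishes 1.3.0 as the required minimum.

```python
import re


def parse_version(text: str) -> tuple:
    """'1.8.1+cu111' -> (1, 8, 1). Anything after '+' (local build tag) is ignored."""
    public = text.split("+", 1)[0]
    parts = []
    for piece in public.split(".")[:3]:
        # Keep only leading digits, so '0a0' parses as 0 (pre-release order is ignored).
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def at_least(installed: str, required: str) -> bool:
    """Numeric comparison: tuples compare element by element."""
    return parse_version(installed) >= parse_version(required)


if __name__ == "__main__":
    print(at_least("1.2.0", "1.3.0"))        # False
    print(at_least("1.8.1+cu111", "1.3.0"))  # True
```

For full PEP 440 handling (pre-releases, dev builds), the `packaging.version.Version` class is the more complete choice.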

@ContestantsD

Did you resolve this problem? I am facing it too.

@04556338896

Hi, did you solve the problem? I am training the model on my own dataset, and my loss is NaN too.

@ContestantsD

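I solved this problem. In my case it only appeared because I deliberately used CUDA 10.2, but the RTX 30-series GPU I use doesn't support CUDA 10, hence the NaN. If I remember correctly, I fixed it by changing the order in which the model parameters are loaded; the code is not written very carefully.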
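This kind of mismatch can be checked up front. RTX 30-series cards report compute capability 8.6, which CUDA 10.x toolkits cannot target, so a CUDA 10.2 build of PyTorch cannot contain native kernels for them. The sketch below applies the CUDA compatibility rules to the architecture list a build was compiled for: a cubin for `sm_XY` runs on devices with the same major version X and minor version of at least Y, and PTX (`compute_XY`) can be JIT-compiled for any equal or newer capability. `build_supports` and `check_current_gpu` are hypothetical helpers, the arch list in the usage test is illustrative only, and `torch.cuda.get_arch_list()` exists only in newer PyTorch releases (not 1.2.0), hence the guard.

```python
import re


def build_supports(capability: tuple, arch_list: list) -> bool:
    """Heuristic: can a build compiled for `arch_list` run on a `capability` GPU?

    * 'sm_XY' cubins run on devices with major == X and minor >= Y.
    * 'compute_XY' PTX can be JIT-compiled for any device >= (X, Y).
    """
    major, minor = capability
    for arch in arch_list:
        # The last digit is the minor version ('sm_86' -> 8, 6; 'sm_100' -> 10, 0).
        match = re.fullmatch(r"(sm|compute)_(\d+)(\d)", arch)
        if not match:
            continue  # skip entries we do not understand, e.g. 'sm_90a'
        kind = match.group(1)
        arch_major, arch_minor = int(match.group(2)), int(match.group(3))
        if kind == "sm" and arch_major == major and arch_minor <= minor:
            return True
        if kind == "compute" and (arch_major, arch_minor) <= (major, minor):
            return True
    return False


def check_current_gpu():
    """Return True/False for GPU 0, or None when the check cannot be run."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available() or not hasattr(torch.cuda, "get_arch_list"):
        return None
    return build_supports(torch.cuda.get_device_capability(0),
                          torch.cuda.get_arch_list())


if __name__ == "__main__":
    print(check_current_gpu())
```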

@04556338896

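I'm also using an RTX 30-series GPU with CUDA 10.2. The loss is fine for the first few epochs, but after that it's all NaN; this happens with both protonet and relationnet. Do you remember where you changed the model-parameter loading? Thanks!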

@ContestantsD

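Switch to a torch build for CUDA 11, run it, and fix things based on the error messages. I forget exactly where, but it eventually gets stuck at a spot where a model is loaded; just move the loading code to a different position there.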
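The thread does not record which load was moved, so the following is only one plausible ordering, not this repository's actual fix. The `torch.optim` documentation advises moving a model to the GPU before constructing optimizers for it; combined with loading checkpoints onto the CPU first (`map_location="cpu"`), that gives the order below. The `prepare_model` name and the `'state'` checkpoint key are assumptions for illustration.

```python
def prepare_model(model, checkpoint_path=None, lr=1e-3):
    """Hypothetical loader: restore weights, move to device, then build the optimizer."""
    import torch  # imported lazily so the helper can be defined without torch installed

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # 1. Load the checkpoint onto the CPU, regardless of the device it was saved from.
    if checkpoint_path is not None:
        checkpoint = torch.load(checkpoint_path, map_location="cpu")
        # Assumption: weights may be nested under a 'state' key; else a bare state_dict.
        state_dict = checkpoint.get("state", checkpoint)
        model.load_state_dict(state_dict)

    # 2. Move the model to its device before any optimizer exists,
    #    as the torch.optim documentation recommends.
    model.to(device)

    # 3. Only now construct the optimizer over the model's parameters.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    return model, optimizer, device
```

Usage would look like `model, optimizer, device = prepare_model(MyNet(), "checkpoint.tar")`, where `MyNet` and the file name are placeholders.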

@04556338896

Thanks!

@sx1999

sx1999 commented Mar 30, 2023

Hello, the dataset links in the code no longer work. Could you please share the datasets you used? Thanks!

5 participants