
`RuntimeError: Magnitude of gradient is bad: -nan` when trying to train frameid #38

Open
DiamondRock opened this issue Dec 27, 2019 · 4 comments



DiamondRock commented Dec 27, 2019

I am trying to train the frameid model, but I get this error at the very beginning of training. I am using the latest version of dynet (2.1). I have ported open-sesame to Python 3 and am using the Python 3 version for training, but even with the Python 2.7 version I still get the same error.

Traceback (most recent call last):
  File "/home/anaconda3/envs/pytorch_dynet_copy/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/anaconda3/envs/pytorch_dynet_copy/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/testframenet/open-sesame/sesame/frameid.py", line 295, in <module>
    trainer.update()
  File "_dynet.pyx", line 6198, in _dynet.Trainer.update
  File "_dynet.pyx", line 6203, in _dynet.Trainer.update
RuntimeError: Magnitude of gradient is bad: -nan

@clingergab

I am getting the same issue at the very first epoch, about 11% of the way through (2107/19391), with best_val_f1 = 0.6202:
Traceback (most recent call last):
  File "/home/gabriel/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/gabriel/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/gabriel/open-sesame/sesame/frameid.py", line 330, in <module>
    trainer.update()
  File "_dynet.pyx", line 6198, in _dynet.Trainer.update
  File "_dynet.pyx", line 6203, in _dynet.Trainer.update
RuntimeError: Magnitude of gradient is bad: -nan

I tried decreasing the learning rate but it hasn't helped.
Any suggestions?


cjcourt commented Jun 14, 2021

I am also observing this exact issue when training the frameid model with Python 3.7, Ubuntu 18.04, and dynet 2.1. I have tried several different trainers (SGDTrainer, AdagradTrainer, and AdamTrainer), each with many different learning rates from 0.1 down to 1e-6.

Any suggestions would be greatly appreciated.


ravy101 commented Dec 30, 2021

I had the same issue, but after some trial and error I found that some loss values were not 'None' yet evaluated to NaN when .scalar_value() was called. I added a NaN check to the frameid.py training code.
Just import math and replace
if trexloss is not None:

with:
if trexloss is not None and not math.isnan(trexloss.scalar_value()):
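For reference, a minimal sketch of how that guarded update could look; safe_update is a hypothetical helper (not part of open-sesame), and the actual training loop in frameid.py may be structured differently, so treat this as an illustration of the check rather than the exact patch.

    import math

    # Hypothetical helper around the dynet update step. `trexloss` is the loss
    # expression for the current training example and `trainer` is the dynet
    # Trainer instance, as in the snippets above.
    def safe_update(trexloss, trainer):
        # Only backpropagate and update when the loss is a real (non-NaN) number,
        # so a bad gradient never reaches trainer.update().
        if trexloss is not None and not math.isnan(trexloss.scalar_value()):
            trexloss.backward()
            trainer.update()
            return trexloss.scalar_value()
        # Otherwise skip this example entirely.
        return 0.0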

Hope this helps.

@JerrisonChang

Thank you @ravy101. I ran into the same issue and the proposed solution helped me.
