NaN error after running to 660900 #36

Open
fjibj opened this issue Aug 6, 2019 · 3 comments
Labels
bug, help wanted

Comments

@fjibj

fjibj commented Aug 6, 2019

2019-08-06 00:52:38.476421: E tensorflow/core/kernels/check_numerics_op.cc:185] abnormal_detected_host @0x7fb3e960d900 = {0, 1} Found Inf or NaN global norm.
Traceback (most recent call last):
  File "/root/anaconda3/envs/fjpy36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/root/anaconda3/envs/fjpy36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/root/anaconda3/envs/fjpy36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had Inf values
	 [[{{node VerifyFinite/CheckNumerics}} = CheckNumericsT=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
	 [[{{node clip_by_global_norm/mul_1/_301}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2818_clip_by_global_norm/mul_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

@milk-bottle-liyu

Same question. Has this been solved?

@HavenTong

I ran into the same problem after running to 696100. Does anyone know how to fix it?

2020-02-11 16:21:58.843161: E tensorflow/core/kernels/check_numerics_op.cc:185] abnormal_detected_host @0x7f2456e15a00 = {0, 1} Found Inf or NaN global norm.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had Inf values
	 [[{{node VerifyFinite/CheckNumerics}} = CheckNumerics[T=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"](global_norm/global_norm)]]
	 [[{{node clip_by_global_norm/mul_1/_159}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2818_clip_by_global_norm/mul_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

wb14123 added the bug and help wanted labels May 5, 2022
@wb14123
Owner

wb14123 commented May 5, 2022

I have occasionally run into the same problem too. The usual fix is to resume training from a checkpoint.
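
For readers who want to see the "resume from checkpoint" idea in code, here is a minimal, framework-free sketch. It is not this repository's training loop: `train`, `global_norm`, `flaky_grads`, and every parameter name are hypothetical, and the checkpoint is just an in-memory dict. It only illustrates the pattern of rolling back to the last saved state when the gradient global norm comes out as Inf or NaN, which is the condition the traceback above reports.

```python
import math


def global_norm(grads):
    """L2 norm over all gradient values; Inf or NaN inputs propagate to the result."""
    return math.sqrt(sum(g * g for g in grads))


def train(steps, grad_fn, lr=0.01, checkpoint_every=100, max_restores=3):
    """Toy loop (hypothetical): checkpoint periodically, roll back on a non-finite norm."""
    params = [0.0, 0.0]
    checkpoint = {"step": 0, "params": list(params)}
    restores = 0
    step = 0
    while step < steps:
        grads = grad_fn(step, params, restores)
        if not math.isfinite(global_norm(grads)):
            # Same condition as "Found Inf or NaN global norm": do not apply
            # the update, go back to the last saved state instead.
            restores += 1
            if restores > max_restores:
                raise RuntimeError("gradients keep diverging after restoring")
            step = checkpoint["step"]
            params = list(checkpoint["params"])
            continue
        params = [p - lr * g for p, g in zip(params, grads)]
        step += 1
        if step % checkpoint_every == 0:
            checkpoint = {"step": step, "params": list(params)}
    return params, restores


def flaky_grads(step, params, restores):
    """Assumed failure mode: one transient Inf gradient at step 250, first pass only."""
    if step == 250 and restores == 0:
        return [float("inf"), 1.0]
    return [1.0, 1.0]


if __name__ == "__main__":
    final_params, restore_count = train(300, flaky_grads)
    print(final_params, restore_count)
```

With TensorFlow 1.x, the equivalent step is usually just restarting the training script so that it restores the latest saved checkpoint (for example via `tf.train.latest_checkpoint` and `tf.train.Saver.restore`), assuming the script already saves checkpoints regularly. If the error comes back at the same step every time, resuming alone will not help, which is why the sketch caps the number of restores.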
