
Frequently getting NAN for losses? Halts the training process #189

Open
JT1316 opened this issue Jun 3, 2020 · 14 comments

@JT1316 commented Jun 3, 2020

During training I am getting NAN for the training losses, sometimes in the first epoch and sometimes way later. Example:

progress epoch 5 step 357 image/sec 10.4 remaining 391m
discrim_loss nan
gen_loss_GAN 1.5034107
gen_loss_L1 nan

The training process looks to be working perfectly until this happens, and then it halts. Any ideas?

Thank you
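A minimal sketch of a NaN guard for a TF 1.x training loop. The loss names come from the log above; the check_losses helper, the results dict, and the step counter are illustrative and not part of pix2pix-tensorflow.

```python
import math

def check_losses(results, step):
    # results: the dict returned by sess.run(fetches) in a TF 1.x loop (assumed);
    # the loss names match the log output above.
    for name in ("discrim_loss", "gen_loss_GAN", "gen_loss_L1"):
        if math.isnan(results.get(name, 0.0)):
            raise RuntimeError("%s became NaN at step %d; restart from the last "
                               "good checkpoint" % (name, step))

# Example with finite values; a NaN in any of the three would raise instead.
check_losses({"discrim_loss": 0.69, "gen_loss_GAN": 1.5034107, "gen_loss_L1": 0.12},
             step=357)
```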

@JT1316 (Author) commented Jun 3, 2020

(0) Invalid argument: Nan in summary histogram for: generator/encoder_5/conv2d/kernel/values
[[node generator/encoder_5/conv2d/kernel/values (defined at /tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py:1748) ]]
[[batch/_779]]
(1) Invalid argument: Nan in summary histogram for: generator/encoder_5/conv2d/kernel/values
[[node generator/encoder_5/conv2d/kernel/values (defined at /tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.
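For what it's worth, the histogram summary is only where the NaN gets noticed, not where it is produced. A small TF 1.x sketch (not code from this repo) showing how tf.debugging.check_numerics can flag the first tensor that goes non-finite:

```python
import tensorflow as tf  # TF 1.15, matching the traceback above

x = tf.compat.v1.placeholder(tf.float32, shape=[None], name="some_activation")
# Raises InvalidArgumentError as soon as x contains NaN/Inf, naming this op
# instead of failing later inside the summary histogram.
checked = tf.debugging.check_numerics(x, message="activation went non-finite")

with tf.compat.v1.Session() as sess:
    print(sess.run(checked, feed_dict={x: [1.0, 2.0]}))   # fine: [1. 2.]
    # sess.run(checked, feed_dict={x: [float("nan")]})    # would raise here
```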

@antonio0372

I'm having the same issue and it's driving me crazy.

  • it only happens with CUDA when using my own images; the facades dataset trained successfully on CUDA
  • it works perfectly fine with the CPU backend
  • it works perfectly fine with the DirectML backend

@antonio0372

Batch Normalization is the culprit
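If batch norm is indeed where things blow up, one common mitigation is to enlarge the epsilon added to the batch variance, so 1/sqrt(variance + epsilon) stays finite on near-constant batches. A generic TF 1.x sketch, not the repo's own batchnorm:

```python
import tensorflow as tf

def batchnorm(inputs, epsilon=1e-3):
    # Generic TF 1.x batch normalization over NHWC feature maps.
    # epsilon is added to the batch variance before the rsqrt, so a batch
    # whose variance collapses to ~0 does not produce Inf/NaN.
    return tf.compat.v1.layers.batch_normalization(
        inputs, axis=3, momentum=0.1, epsilon=epsilon, training=True)
```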

@simantaturja

@antonio0372 did you fix it?

@antonio0372 commented Aug 5, 2020 via email

@aaaaaaaaargh commented Aug 13, 2020

Hi, can you please elaborate on that solution a little bit?

By the way, for me this issue also happened when using the CPU backend.

@antonio0372 commented Aug 13, 2020 via email

@aaaaaaaaargh commented Aug 13, 2020

Antonio, thanks for your quick answer! Torch... oh well, I don't know anything about that, but I don't know anything about TF either, so I guess I'm giving it a shot then :)

@skabbit commented Oct 4, 2020

Got the same problem on a generator decoder (not an encoder):
Nan in summary histogram for: generator/decoder_5/conv2d_transpose/kernel/values

And it only starts when I use a batch size > 1, and only on my own dataset.
I suspect it happens because of duplicated images in the dataset. @antonio0372, might you also have duplicates in yours?
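Exact duplicates are easy to check for before training. A small standalone sketch; the directory path is a placeholder:

```python
import collections
import hashlib
import os

def find_duplicates(image_dir):
    # Group files by the MD5 of their bytes; any group with more than one
    # entry is an exact duplicate.
    by_hash = collections.defaultdict(list)
    for name in sorted(os.listdir(image_dir)):
        with open(os.path.join(image_dir, name), "rb") as f:
            by_hash[hashlib.md5(f.read()).hexdigest()].append(name)
    return {h: names for h, names in by_hash.items() if len(names) > 1}

print(find_duplicates("my_dataset/train"))  # hypothetical path
```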

@antonio0372 commented Oct 4, 2020 via email

@skabbit commented Oct 4, 2020

Thanks for the fast reply, @antonio0372!
I now have 6 different changes that may resolve this issue; I'll check them all and write down the results here.

@skabbit commented Oct 4, 2020

Downgrading TensorFlow to 1.14.0 resolves this issue.
It has been working fine for 150 epochs with a batch size of 100.
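A quick sanity check after the downgrade, just to confirm the interpreter picked up the 1.14.0 build and still sees the GPU (both calls exist in TF 1.14):

```python
import tensorflow as tf

print(tf.__version__)              # expect: 1.14.0
print(tf.test.is_gpu_available())  # expect: True on a CUDA machine
```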

@skabbit commented Oct 20, 2020

But I strongly advise NOT using such a huge batch size, as it generalizes MUCH worse. Batch sizes of 4-10 gave me much better results in a comparable amount of time.
I hope somebody finds this useful.

@burhr2 commented Nov 2, 2020

As mentioned by @skabbit, using TensorFlow 1.14.0 (pip install tensorflow-gpu==1.14.0) seems to work fine for now. I am using Anaconda on a Windows 10 machine.
