
Invalid Loss error when training on the GPU #51

Open
dongishan opened this issue Sep 28, 2018 · 6 comments

dongishan commented Sep 28, 2018

What could cause the Invalid Loss error to appear during GPU training but not during CPU training?

I've successfully trained the WTTE-RNN model with a GRU on the C-MAPSS dataset on a CPU. However, when doing the same on an Nvidia GPU with CuDNNGRU, I get the Invalid Loss error at around epoch 20 of 100.

I am using Keras with the TensorFlow backend, and WTTE-RNN version 1.1.1.

ragulpr (Owner) commented Oct 1, 2018

Thanks for the update. I don't have access to a GPU right now, so I haven't run the CUDA unit tests in a while.

One initial theory: the CUDA batchnorm must differ from the Keras CPU batchnorm, since the latter accepts a mask and, as far as I know, the CUDA batchnorm unfortunately doesn't. I'm not sure whether Keras calls the CUDA batchnorm primitives, though. Are you using batchnorm? Otherwise, the same goes for masking in general: I would be surprised if CuDNNGRU accepted a mask the way the Keras CPU version does. A rough sketch of that difference is below.
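
A minimal sketch of the CPU vs. CuDNN masking difference, assuming Keras 2.x with the TensorFlow backend (n_timesteps, n_features, and the layer sizes are placeholders):

from keras.models import Sequential
from keras.layers import Masking, GRU, CuDNNGRU, TimeDistributed, Dense

n_timesteps, n_features = 100, 24   # placeholder shapes

# CPU variant: the Masking layer makes the GRU (and downstream loss) skip padded timesteps.
cpu_model = Sequential()
cpu_model.add(Masking(mask_value=0., input_shape=(n_timesteps, n_features)))
cpu_model.add(GRU(20, return_sequences=True))
cpu_model.add(TimeDistributed(Dense(2)))

# GPU variant: CuDNNGRU does not support masking, so padded timesteps flow
# through to the loss unless they are down-weighted there (e.g. via sample weights).
gpu_model = Sequential()
gpu_model.add(CuDNNGRU(20, return_sequences=True, input_shape=(n_timesteps, n_features)))
gpu_model.add(TimeDistributed(Dense(2)))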

Another theory is that the machine epsilon differs between CPU and GPU. I recommend setting keras.backend.set_epsilon(1e-07), but I'm not sure whether the GPU respects it.

As a general remedy, I recommend clipping the log-likelihood:

loss_fun = wtte.loss(kind='discrete', reduce_loss=False, clip_prob=1e-5).loss_function
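
For completeness, a minimal sketch of wiring this in (assuming the usual wtte-rnn imports and an already-built sequence model; the optimizer choice is illustrative):

import keras.backend as K
import wtte.wtte as wtte

K.set_epsilon(1e-7)  # make the backend epsilon explicit, as suggested above

# clip_prob bounds each step's probability away from 0 and 1 so the log cannot overflow
loss_fun = wtte.loss(kind='discrete', reduce_loss=False, clip_prob=1e-5).loss_function

# reduce_loss=False gives a per-timestep loss, which the wtte-rnn examples pair
# with temporal sample weights
model.compile(loss=loss_fun, optimizer='rmsprop', sample_weight_mode='temporal')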

dongishan (Author) commented

@ragulpr Thanks for your reply. I am not using batchnorm, and you are correct that CuDNNGRU does not accept masking.

I will try the epsilon setting and the log-likelihood clipping and let you know how it goes.

ragulpr (Owner) commented Oct 2, 2018

If you find anything inside WTTE that doesn't work properly on the GPU, it would be very good to know; thanks a lot for raising the issue. For general NaN avoidance there are many other Git issues with recommendations. Some top-of-the-list remedies, for further reference:

ragulpr (Owner) commented Oct 3, 2018

Another idea I forgot: I've had problems getting the GPU to respect the random seed I set for it, but that might be a PyTorch problem. If you repeat the experiment using different seeds on the CPU, maybe you get the same NaN failures? A sketch of seeding everything is below.
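
A minimal sketch of fixing the seeds for such repeats, assuming the TF 1.x-era API (the seed value is hypothetical, and cuDNN kernels may remain non-deterministic regardless):

import random
import numpy as np
import tensorflow as tf

seed = 42                 # hypothetical value; rerun with several different seeds
random.seed(seed)
np.random.seed(seed)
tf.set_random_seed(seed)  # TF 1.x; cuDNN ops can still introduce non-determinism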

as2636 commented Oct 3, 2018

Hi,

@dongishan brought this issue to my attention recently, and it came to mind again today while working on the GPU. I have also observed numerical instabilities in the loss function when using the GPU (we use the same cluster). I have seen this with WTTE-RNN, but also with an extension of it that I wrote for a Gaussian-based loss function.

I had not commented until now because my main hypothesis was that those instabilities were due to my data being contaminated or badly pre-processed (I use real industrial data). But today I started comparing the GPU and the CPU, and initial results show that the loss is much more stable on the CPU.

My architecture is quite simple: a large batch size, two stacked 50-unit LSTMs with regularisation, and a TimeDistributed 100-unit dense layer. I use tanh everywhere as the activation function.

With regard to numerical instability in the WTTE-RNN case, I have usually been able to avoid it by normalizing the times-to-event and by using the continuous log-likelihood. For some reason, on my datasets the discrete mode was more prone to numerical instability. I prefer that to clipping.

Edit: I have now run 4 experiments (10,000 epochs each) and observed some loss instabilities in the CPU case as well, but to a much lesser extent than on the GPU.
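
A minimal sketch of the normalization plus continuous-likelihood setup described above (the array y and its layout are illustrative; the exact preprocessing depends on the pipeline):

import numpy as np
import wtte.wtte as wtte

# Hypothetical target array y of shape (n_sequences, n_timesteps, 2):
# y[..., 0] = time-to-event, y[..., 1] = 1 if the event was observed, 0 if censored.
y[..., 0] = y[..., 0] / np.nanmean(y[..., 0])   # bring times-to-event to roughly unit scale

# continuous log-likelihood instead of the discrete one
loss_fun = wtte.loss(kind='continuous', reduce_loss=False).loss_function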

ragulpr (Owner) commented Oct 4, 2018

There may be a lot of reasons for numerical instability, as pointed out, so it would be very helpful if we could find an example that reproduces the GPU/CPU difference. Could it have anything to do with the contents of your keras.json file? Maybe the GPU is running in float32 and the CPU in float64, or similar? One way to check and force this from code is sketched below.
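
A minimal sketch using the Keras backend API (set it before the model is built; float64 is typically much slower on GPU):

import keras.backend as K

print(K.floatx(), K.epsilon())   # what the backend is actually using

K.set_floatx('float64')          # or edit "floatx"/"epsilon" in ~/.keras/keras.json
K.set_epsilon(1e-7)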

ps.
Since your final layer has dimension 100, the output after dense(2) will be approximately Normal(0, 100), so the variance is high. I usually scale this as below:

import numpy as np
from keras.layers import Lambda
import wtte.wtte as wtte

# init_alpha: initial Weibull alpha estimated from the training data (as in the wtte-rnn examples)
model.add(Lambda(wtte.output_lambda, arguments={"init_alpha": init_alpha,
                                                "max_beta_value": 2.0,
                                                # Stability heuristic: scale by the log of the number of pre-output layer inputs
                                                "scalefactor": 1 / np.log(100),
                                                }))
