
Encountered NaN under the ConditionalDecoderVISEM setting #1

Open
canerozer opened this issue Jan 3, 2021 · 3 comments

canerozer commented Jan 3, 2021
Hello,

Thanks for open-sourcing this beautiful project. I am currently trying to replicate the results in Table 1 of the paper. However, while training the Conditional model, I ran into an error stating that a NaN loss value was encountered after 222 epochs of training. For context, I performed a clean installation of the necessary libraries and used the Morpho-MNIST data creation script you provided. Could there be something wrong with the calculation of the ELBO term for p(intensity), since that is the only metric that went to NaN?

Steps to reproduce the behavior:

python -m deepscm.experiments.morphomnist.trainer -e SVIExperiment -m ConditionalDecoderVISEM --data_dir data/morphomnist/ --default_root_dir checkpoints/ --decoder_type fixed_var --gpus 0

Full error message:

Epoch 222:   6%|▋ | 15/251 [00:01<00:28, 8.31it/s, loss=952243.375, v_num=2]
/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pyro/infer/tracegraph_elbo.py:261: UserWarning: Encountered NaN: loss
  warn_if_nan(loss, "loss")
Traceback (most recent call last):
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ilkay/Documents/caner/deepscm/deepscm/experiments/morphomnist/trainer.py", line 62, in
trainer.fit(experiment)
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 859, in fit
self.single_gpu_train(model)
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 503, in single_gpu_train
self.run_pretrain_routine(model)
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1015, in run_pretrain_routine
self.train()
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 347, in train
self.run_training_epoch()
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 419, in run_training_epoch
_outputs = self.run_training_batch(batch, batch_idx)
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 597, in run_training_batch
loss, batch_output = optimizer_closure()
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 561, in optimizer_closure
output_dict = self.training_forward(split_batch, batch_idx, opt_idx, self.hiddens)
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 727, in training_forward
output = self.model.training_step(*args)
File "/home/ilkay/Documents/caner/deepscm/deepscm/experiments/morphomnist/sem_vi/base_sem_experiment.py", line 385, in training_step
raise ValueError('loss went to nan with metrics:\n{}'.format(metrics))
ValueError: loss went to nan with metrics:
{'log p(x)': tensor(-3502.9570, device='cuda:0', grad_fn=), 'log p(intensity)': tensor(nan, device='cuda:0', grad_fn=), 'log p(thickness)': tensor(-0.9457, device='cuda:0', grad_fn=), 'p(z)': tensor(-22.2051, device='cuda:0', grad_fn=), 'q(z)': tensor(54.3670, device='cuda:0', grad_fn=), 'log p(z) - log q(z)': tensor(-76.5721, device='cuda:0', grad_fn=)}
Exception ignored in: <function tqdm.__del__ at 0x7ffb7e9d2320>
Traceback (most recent call last):
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/tqdm/std.py", line 1135, in __del__
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/tqdm/std.py", line 1282, in close
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/tqdm/std.py", line 1467, in display
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/tqdm/std.py", line 1138, in __repr__
File "/home/ilkay/miniconda3/envs/torch/lib/python3.7/site-packages/tqdm/std.py", line 1425, in format_dict
TypeError: cannot unpack non-iterable NoneType object
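For context, the ValueError above is raised by a NaN guard inside training_step in base_sem_experiment.py; a minimal sketch of that kind of check (not the repository's exact code) would be:

import torch

def check_loss(loss: torch.Tensor, metrics: dict) -> torch.Tensor:
    # Abort training as soon as the loss becomes NaN, reporting the
    # individual ELBO terms so the offending one (here 'log p(intensity)')
    # is visible in the error message.
    if torch.isnan(loss):
        raise ValueError('loss went to nan with metrics:\n{}'.format(metrics))
    return loss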

Environment

PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Quadro RTX 6000
Nvidia driver version: 440.100
cuDNN version: /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] pytorch-lightning==0.7.6
[pip3] torch==1.7.1
[pip3] torchvision==0.6.0a0+35d732a
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] mkl 2020.1 217
[conda] mkl-service 2.3.0 py37he904b0f_0
[conda] mkl_fft 1.1.0 py37h23d657b_0
[conda] mkl_random 1.1.1 py37h0573a6f_0
[conda] numpy 1.19.4 pypi_0 pypi
[conda] pytorch-lightning 0.7.6 pypi_0 pypi
[conda] torch 1.7.1 pypi_0 pypi
[conda] torchvision 0.6.1 py37_cu102 pytorch

pawni (Member) commented Jan 4, 2021

Thanks for your interest in our paper and the code. I've just gotten around to updating the dependencies, as it seems you're using a newer PyTorch version than the one we were using.

I'll get back to you once I've updated everything and have been able to run the experiments with the new versions.

In the meantime, are you using the fixed pyro version? You can also experiment with different (lower) learning rates via --pgm_lr 1e-3 and --lr 5e-5, or get more detailed error messages by passing the --validate flag.

That said, it seems that only the 'log p(intensity)' term went to NaN. That is something we occasionally observed as well, and lowering the pgm_lr usually helped :)
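For example, the reproduction command from above with the lower learning rates and validation enabled would look roughly like this:

python -m deepscm.experiments.morphomnist.trainer -e SVIExperiment -m ConditionalDecoderVISEM --data_dir data/morphomnist/ --default_root_dir checkpoints/ --decoder_type fixed_var --gpus 0 --pgm_lr 1e-3 --lr 5e-5 --validate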

canerozer (Author) commented Jan 10, 2021

Hello,

Thanks for your suggestion; the model trained successfully after reducing the PGM learning rate. Just to make sure I understand correctly: the pgm_lr hyperparameter affects only the spline layer of the thickness flow and the affine transformation layer of the intensity flow, right? I will take a look at that paper as soon as possible.

Meanwhile, I have tried to train the normalizing-flow models (all 3 settings), but they still tend to go to NaN after only a couple of epochs with a learning rate of 10^-4. I am now trying a learning rate of 10^-5, but I don't know whether that will solve the problem.

By the way, I was already using the pyro version you suggested (1.3.1+4b2752f8), but I will update the repository right after completing that training attempt.

Edit: Still failing for normalizing flow experiments.

pawni (Member) commented Jan 11, 2021

Oh wait, so you're training a flow-only model?

It's only included in the code here for completeness; we also ran into NaNs when training with flows only, which is why we settled on the VI solution.

As for pgm_lr: yes, it only acts on the flows for the covariates, not on the image components.
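In case it helps, such a split is typically implemented with separate optimizer parameter groups; a minimal sketch of that pattern (pgm_flows and image_vae are placeholder modules, not the actual attribute names in the repository) would be:

import torch
from torch import nn, optim

# Placeholder modules standing in for the covariate flows (thickness/intensity)
# and the image encoder/decoder; names and shapes are illustrative only.
pgm_flows = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 2))
image_vae = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 32))

# Two parameter groups with independent learning rates, mirroring the roles
# of --pgm_lr (covariate flows) and --lr (image components).
optimizer = optim.Adam([
    {'params': pgm_flows.parameters(), 'lr': 1e-3},   # --pgm_lr
    {'params': image_vae.parameters(), 'lr': 5e-5},   # --lr
])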
