
Loss did not drop when training the blender dataset #75

Closed
ChangXu-Rick opened this issue May 12, 2021 · 4 comments
Labels
bug (Something isn't working), good first issue (Good for newcomers), wontfix (This will not be worked on)

Comments

ChangXu-Rick commented May 12, 2021

Hi,

I am using your provided Colab code to train my own data.
At first, I used LLFF to extract the camera poses of the images, and your code produced a wonderful result!
Then I tried to use Blender to generate the ground truth poses as transforms.json. I split my dataset into a train set (200 images) and a val set (100 images) with transforms_train.json and transforms_val.json. However, this time your Colab code did not work. I thought I might have generated a wrong transforms.json file, but when I tested with the mic dataset from nerf_synthetic, it still did not work.

Your Colab code only covers the 360 inward-facing scene and the forward-facing scene, so I added a new block of code to run the blender scene:

%cd /content/nerf_pl
import os
os.environ['ROOT_DIR'] = "/content/drive/My Drive/mic"
os.environ['EXP'] = "mic"
!python train.py \
   --dataset_name blender \
   --root_dir "$ROOT_DIR" \
   --N_importance 64 --img_wh 200 200 --noise_std 0 \
   --num_epochs 20 --batch_size 1024 \
   --optimizer adam --lr 5e-4 --lr_scheduler cosine \
   --exp_name exp

Part of the mic scene training log is shown below:
Epoch 1: 100% 3907/3915 [11:59<00:01, 5.43it/s, loss=0.093, train_psnr=12.5, v_num=2]
Validating: 0it [00:00, ?it/s]

Epoch 1: 100% 3908/3915 [12:01<00:01, 5.42it/s, loss=0.093, train_psnr=12.5, v_num=2]
Epoch 1: 100% 3909/3915 [12:03<00:01, 5.40it/s, loss=0.093, train_psnr=12.5, v_num=2]
Epoch 1: 100% 3910/3915 [12:05<00:00, 5.39it/s, loss=0.093, train_psnr=12.5, v_num=2]
Epoch 1: 100% 3911/3915 [12:07<00:00, 5.37it/s, loss=0.093, train_psnr=12.5, v_num=2]
Epoch 1: 100% 3912/3915 [12:09<00:00, 5.36it/s, loss=0.093, train_psnr=12.5, v_num=2]
Epoch 1: 100% 3913/3915 [12:11<00:00, 5.35it/s, loss=0.093, train_psnr=12.5, v_num=2]
Epoch 1: 100% 3914/3915 [12:14<00:00, 5.33it/s, loss=0.093, train_psnr=12.5, v_num=2]
Epoch 1: 100% 3915/3915 [12:16<00:00, 5.32it/s, loss=0.093, train_psnr=12.5, v_num=2, val_loss=0.0931, val_psnr=13.3]
Epoch 2: 100% 3907/3915 [12:00<00:01, 5.42it/s, loss=0.095, train_psnr=12.4, v_num=2, val_loss=0.0931, val_psnr=13.3]
Validating: 0it [00:00, ?it/s]
Epoch 2: 100% 3908/3915 [12:02<00:01, 5.41it/s, loss=0.095, train_psnr=12.4, v_num=2, val_loss=0.0931, val_psnr=13.3]
Epoch 2: 100% 3909/3915 [12:04<00:01, 5.39it/s, loss=0.095, train_psnr=12.4, v_num=2, val_loss=0.0931, val_psnr=13.3]
Epoch 2: 100% 3910/3915 [12:06<00:00, 5.38it/s, loss=0.095, train_psnr=12.4, v_num=2, val_loss=0.0931, val_psnr=13.3]
Epoch 2: 100% 3911/3915 [12:08<00:00, 5.37it/s, loss=0.095, train_psnr=12.4, v_num=2, val_loss=0.0931, val_psnr=13.3]
Epoch 2: 100% 3912/3915 [12:11<00:00, 5.35it/s, loss=0.095, train_psnr=12.4, v_num=2, val_loss=0.0931, val_psnr=13.3]
Epoch 2: 100% 3913/3915 [12:13<00:00, 5.34it/s, loss=0.095, train_psnr=12.4, v_num=2, val_loss=0.0931, val_psnr=13.3]
Epoch 2: 100% 3914/3915 [12:15<00:00, 5.32it/s, loss=0.095, train_psnr=12.4, v_num=2, val_loss=0.0931, val_psnr=13.3]
Epoch 2: 100% 3915/3915 [12:17<00:00, 5.31it/s, loss=0.095, train_psnr=12.4, v_num=2, val_loss=0.0931, val_psnr=13.3]
Epoch 3: 100% 3907/3915 [12:01<00:01, 5.41it/s, loss=0.093, train_psnr=13.3, v_num=2, val_loss=0.0931, val_psnr=13.3]
Validating: 0it [00:00, ?it/s]
Epoch 3: 100% 3908/3915 [12:04<00:01, 5.40it/s, loss=0.093, train_psnr=13.3, v_num=2, val_loss=0.0931, val_psnr=13.3]
Epoch 3: 100% 3909/3915 [12:06<00:01, 5.38it/s, loss=0.093, train_psnr=13.3, v_num=2, val_loss=0.0931, val_psnr=13.3]
Epoch 3: 100% 3910/3915 [12:08<00:00, 5.37it/s, loss=0.093, train_psnr=13.3, v_num=2, val_loss=0.0931, val_psnr=13.3]
Epoch 3: 100% 3911/3915 [12:10<00:00, 5.35it/s, loss=0.093, train_psnr=13.3, v_num=2, val_loss=0.0931, val_psnr=13.3]
Epoch 3: 100% 3912/3915 [12:12<00:00, 5.34it/s, loss=0.093, train_psnr=13.3, v_num=2, val_loss=0.0931, val_psnr=13.3]
Epoch 3: 100% 3913/3915 [12:14<00:00, 5.33it/s, loss=0.093, train_psnr=13.3, v_num=2, val_loss=0.0931, val_psnr=13.3]
Epoch 3: 100% 3914/3915 [12:16<00:00, 5.31it/s, loss=0.093, train_psnr=13.3, v_num=2, val_loss=0.0931, val_psnr=13.3]
Epoch 3: 100% 3915/3915 [12:18<00:00, 5.30it/s, loss=0.093, train_psnr=13.3, v_num=2, val_loss=0.0931, val_psnr=13.3]
Epoch 4: 1% 31/3915 [00:06<13:02, 4.96it/s, loss=0.092, train_psnr=12.9, v_num=2, val_loss=0.0931, val_psnr=13.3]

From the log you can see that both the loss and val_loss are not decreasing, and I cannot extract any mesh from the model.
Since you said you managed to train on all the provided blender scenes, could you tell me where I went wrong?

Many thanks!

kwea123 (Owner) commented May 13, 2021

Some blender scenes struggle to converge from the beginning because of the issue bmild/nerf#29.
You can try cropping the image, or using the softplus activation as suggested in the comments of that issue.
Personally, I succeeded in training these scenes with these modifications.
I will pin this issue because so many people have asked the same question...
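
For reference, here is a minimal sketch of the softplus change, assuming the raw density is converted to a non-negative value outside the MLP as in the original NeRF; the function name sigma_activation and its arguments are placeholders for illustration, not the repository's actual API:

import torch
import torch.nn.functional as F

def sigma_activation(raw_sigma, noise_std=0.0, use_softplus=True):
    # raw_sigma: (N_rays, N_samples) raw density predicted by the MLP
    # noise_std: std of the regularizing noise added before the activation
    noise = torch.randn_like(raw_sigma) * noise_std
    if use_softplus:
        # softplus(x) = log(1 + exp(x)) is positive everywhere and keeps a nonzero
        # gradient for negative inputs, which helps a badly initialized scene escape
        # the "everything transparent" local minimum described in bmild/nerf#29
        return F.softplus(raw_sigma + noise)
    # original behaviour: ReLU zeroes the gradient whenever raw_sigma + noise < 0
    return F.relu(raw_sigma + noise)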

kwea123 closed this as completed May 13, 2021
kwea123 pinned this issue May 13, 2021
kwea123 added the bug, good first issue, and wontfix labels May 13, 2021
ChangXu-Rick (Author) commented:

Thank you! That is really helpful!

Could you provide your training settings for the blender scenes with a lot of white space, such as mic?

kwea123 (Owner) commented May 13, 2021

In my experiments, I managed to train after many trials, until I got a lucky initialization that converged, i.e. if the initialization was bad and the loss didn't decrease, I just stopped and trained again. Since I only wanted to test my code, I only trained on a few scenes, not all of them. More specifically, I only trained lego, chair, hotdog and material if I remember correctly, so I don't know how mic behaves.
More systematic ways are the ones I mentioned above: cropping or the softplus activation. Softplus is very quick to try, so I'd recommend doing that first. If you find it helpful, I'm happy to change my code to softplus. There was an issue #51 mentioning this, but the author never shared their results in the end, so the current code still uses ReLU.
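
For reference, here is a minimal sketch of the center-cropping trick discussed in bmild/nerf#29: during the first iterations, training rays are sampled only from the central region of the image so that the white background does not dominate the loss. The function sample_pixel_coords and the precrop_iters / precrop_frac defaults are illustrative assumptions, not part of nerf_pl:

import torch

def sample_pixel_coords(H, W, n_rays, step, precrop_iters=500, precrop_frac=0.5):
    # Returns (n_rays, 2) integer pixel coordinates (row, col) to sample rays from.
    if step < precrop_iters:
        # early training: restrict sampling to the central precrop_frac portion
        # of the image so the mostly-white background does not dominate the loss
        dh = int(H // 2 * precrop_frac)
        dw = int(W // 2 * precrop_frac)
        rows = torch.randint(H // 2 - dh, H // 2 + dh, (n_rays,))
        cols = torch.randint(W // 2 - dw, W // 2 + dw, (n_rays,))
    else:
        # later training: sample uniformly over the full image
        rows = torch.randint(0, H, (n_rays,))
        cols = torch.randint(0, W, (n_rays,))
    return torch.stack([rows, cols], dim=-1)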

tau-yihouxiang commented:

@kwea123 Yes, softplus makes training stable. However, several scenes, like drums, mic, and ficus, still cannot converge. I think the reason might be the small area of the foreground. Could you validate this observation? Thank you!
