Issue with training speed / loss #46

Open
greeneggsandyaml opened this issue Apr 2, 2024 · 6 comments

@greeneggsandyaml

Hello, I'm looking to replicate the results of this repo. I've loaded the Objaverse data (rendered in a similar manner to G-Objaverse) and I've verified that the images look right (see below). I believe that the cameras are also being loaded correctly, although it is always possible that I made an error there.

I'm finding that the network does not train successfully.

I'm asking anyone (the author or anyone else who has successfully trained a model) what the training process should look like. For instance, what (approximately) should the loss be at 500, 1000, and 5000 steps? Does the network simply take a long time to converge, or is something wrong with my setup?

For context, my renders look like:
[image: train_gt_images_0_0]

And after 1500 steps of training (with a single 80GB GPU), I have losses that look like:

[INFO] 0/17534 mem: 55.56/79.33G lr: 0.0000160 step_ratio: 0.0000 loss: 1.339101
[INFO] 100/17534 mem: 60.59/79.33G lr: 0.0000171 step_ratio: 0.0002 loss: 0.552416
[INFO] 200/17534 mem: 60.59/79.33G lr: 0.0000202 step_ratio: 0.0004 loss: 0.484728
[INFO] 300/17534 mem: 60.59/79.33G lr: 0.0000255 step_ratio: 0.0006 loss: 0.499739
[INFO] 400/17534 mem: 60.59/79.33G lr: 0.0000327 step_ratio: 0.0008 loss: 0.479734
[INFO] 500/17534 mem: 60.59/79.33G lr: 0.0000418 step_ratio: 0.0010 loss: 0.431447
[INFO] 600/17534 mem: 60.59/79.33G lr: 0.0000528 step_ratio: 0.0011 loss: 0.502410
[INFO] 700/17534 mem: 60.59/79.33G lr: 0.0000655 step_ratio: 0.0013 loss: 0.357406
[INFO] 800/17534 mem: 60.59/79.33G lr: 0.0000797 step_ratio: 0.0015 loss: 0.424057
[INFO] 900/17534 mem: 60.59/79.33G lr: 0.0000954 step_ratio: 0.0017 loss: 0.351256
[INFO] 1000/17534 mem: 60.59/79.33G lr: 0.0001122 step_ratio: 0.0019 loss: 0.433826
...

and predicted images that look like this:
[image: train_pred_images_0_1500]

These losses/images look very bad to me, but perhaps I just need to wait much longer.

Am I doing something wrong?

Thanks for all your help!

@YuxuanSnow

I'm facing a similar issue: it seems difficult to converge / generate reasonable results at the early stages.

I would suggest just overfitting to one object first (see the sketch below). In my toy experiment I start from the large checkpoint, and it still takes about 20k iterations to reach a good PSNR (around 27-28).

I haven't tried overfitting from scratch yet. I may try that as well and share the results here later.

One odd thing I observe:

  • The pretrained model already provides very decent results, yet at the beginning of my fine-tuning on a single object the performance even drops. It recovers in about 1k iterations.
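
For anyone trying the same sanity check, below is a minimal sketch of restricting training to a single object with plain PyTorch; the dataset class in the usage comment is a placeholder, not the repo's actual one.

```python
from torch.utils.data import Dataset, Subset, DataLoader

def make_overfit_loader(full_dataset: Dataset, index: int = 0,
                        repeats: int = 1000, batch_size: int = 4) -> DataLoader:
    """Wrap an existing dataset so every batch comes from a single object.

    Repeating one index keeps the usual iteration-based training loop
    unchanged while effectively overfitting to that one sample.
    """
    overfit_set = Subset(full_dataset, [index] * repeats)
    return DataLoader(overfit_set, batch_size=batch_size,
                      shuffle=False, num_workers=4, drop_last=True)

# Usage (dataset name is hypothetical, standing in for the repo's training dataset):
# loader = make_overfit_loader(ObjaverseDataset(split='train'), index=0)
```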

@YuxuanSnow

Now I'd like to provide more results on overfitting to one object:
The orange curve uses resume='./pretrained/model_fp16.safetensors', while the green curve uses resume=None.
I use a constant 5e-5 learning rate for both experiments; the objective functions are unchanged.

It seems that if I start from the pretrained checkpoint, it converges slowly but the loss still goes down. When I train from scratch, the loss doesn't go down at all -- I checked the splatted 3D Gaussians, which render as a white image and show a PSNR of around 15 w.r.t. the training image.
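
(For reference, the PSNR numbers here come from a standard computation along these lines; this is a generic sketch assuming float images in [0, 1], not the repo's own metric code.)

```python
import torch

def psnr(pred: torch.Tensor, gt: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio (dB) between two images valued in [0, max_val]."""
    mse = torch.mean((pred - gt) ** 2)
    if mse == 0:
        return float('inf')
    return (20.0 * torch.log10(torch.tensor(max_val)) - 10.0 * torch.log10(mse)).item()
```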

@greeneggsandyaml Do you have any updates on your training?

[image: training curves for the two runs]

@ashawkey Could you kindly share what your loss curve looks like? For me it seems difficult to converge when training from scratch.

@ashawkey
Collaborator

ashawkey commented Apr 5, 2024

@YuxuanSnow Hi, thanks for the information!
Actually the pretrained model was trained across multiple runs due to some infrastructure issues, so I cannot provide a complete loss curve. One observation is that the LPIPS loss may harm convergence at the early stage of training but does help increase fidelity at the late stage, so you can try disabling it first (see the sketch below).
You may also refer to this issue: #35
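
For anyone trying this, here is a minimal sketch of gating the LPIPS term by training step; it assumes the `lpips` PyPI package, predictions/targets as (B, 3, H, W) tensors in [0, 1], and an illustrative threshold and weight rather than the repo's actual settings.

```python
import torch.nn.functional as F
import lpips

lpips_fn = lpips.LPIPS(net='vgg')  # perceptual metric from the lpips package

def training_loss(pred, gt, step, lpips_start_step=10_000, lambda_lpips=1.0):
    """MSE for the whole run; add LPIPS only after an initial warm-up phase.

    LPIPS expects inputs in [-1, 1], hence the rescaling below. The start
    step and weight are placeholders, not the repo's defaults.
    """
    loss = F.mse_loss(pred, gt)
    if step >= lpips_start_step:
        loss = loss + lambda_lpips * lpips_fn(pred * 2 - 1, gt * 2 - 1).mean()
    return loss
```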

@yxymessi

yxymessi commented Apr 8, 2024

When overfitting on a single object, have you tried adapting the LR scheduler code?
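
In case it helps, here is a minimal sketch of what swapping in a simpler warm-up-then-constant schedule could look like in plain PyTorch; the stand-in model, base LR, and warm-up length mirror the experiments above but are otherwise assumptions, not the repo's configuration.

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the actual network
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def warmup_then_constant(step: int, warmup_steps: int = 500) -> float:
    """Linearly warm up to the base LR, then hold it constant."""
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_then_constant)

# Inside the training loop, after optimizer.step():
# scheduler.step()
```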

@YuxuanSnow

I didn't try to adapt the LR scheduler.

@ashawkey I have updated results after disabling LPIPS:
[image: PSNR curves with LPIPS disabled]

The PSNR reaches a higher value (purple curve), which suggests that disabling LPIPS early is an effective strategy. The image is still blurry, but I think adding LPIPS back and training further could resolve the problem.

@jeremy123z

You have really done a wonderful job! I have some questions about this paper and hope to get your help. Could I have your phone number or Instagram to discuss?
