Issue with training speed / loss #46

Open
greeneggsandyaml opened this issue Apr 2, 2024 · 6 comments

@greeneggsandyaml

Hello, I'm looking to replicate the results of this repo. I've loaded the Objaverse data (rendered in a similar manner to G-Objaverse) and I've verified that the images look right (see below). I believe that the cameras are also being loaded correctly, although it is always possible that I made an error there.

I'm finding that the network does not train successfully.

I'm asking anyone (the author or anyone else who has successfully trained a model) what the training process should look like. For instance, what (approximately) should the loss be at 500, 1000, and 5000 steps? Does the network simply take a long time to converge, or is something wrong with my setup?

For context, my renders look like:
[image: train_gt_images_0_0]

And after 1500 steps of training (with a single 80GB GPU), I have losses that look like:

[INFO] 0/17534 mem: 55.56/79.33G lr: 0.0000160 step_ratio: 0.0000 loss: 1.339101
[INFO] 100/17534 mem: 60.59/79.33G lr: 0.0000171 step_ratio: 0.0002 loss: 0.552416
[INFO] 200/17534 mem: 60.59/79.33G lr: 0.0000202 step_ratio: 0.0004 loss: 0.484728
[INFO] 300/17534 mem: 60.59/79.33G lr: 0.0000255 step_ratio: 0.0006 loss: 0.499739
[INFO] 400/17534 mem: 60.59/79.33G lr: 0.0000327 step_ratio: 0.0008 loss: 0.479734
[INFO] 500/17534 mem: 60.59/79.33G lr: 0.0000418 step_ratio: 0.0010 loss: 0.431447
[INFO] 600/17534 mem: 60.59/79.33G lr: 0.0000528 step_ratio: 0.0011 loss: 0.502410
[INFO] 700/17534 mem: 60.59/79.33G lr: 0.0000655 step_ratio: 0.0013 loss: 0.357406
[INFO] 800/17534 mem: 60.59/79.33G lr: 0.0000797 step_ratio: 0.0015 loss: 0.424057
[INFO] 900/17534 mem: 60.59/79.33G lr: 0.0000954 step_ratio: 0.0017 loss: 0.351256
[INFO] 1000/17534 mem: 60.59/79.33G lr: 0.0001122 step_ratio: 0.0019 loss: 0.433826
...

and predicted images that look like this:
[image: train_pred_images_0_1500]

These losses/images look very bad to me, but perhaps I just need to wait much longer.

Am I doing something wrong?

Thanks for all your help!

@YuxuanSnow

I'm facing a similar issue: it seems difficult to converge / generate reasonable results at the early stages.

I would suggest just overfitting to one object first (see the sketch below). In my toy experiment I start from the large checkpoint, and it still takes about 20k iterations to reach a good PSNR (around 27-28).

I haven't tried overfitting from scratch yet. I may try that as well and share the results here later.

One odd thing I observe:

  • The pretrained model already provides very decent results, yet at the beginning of my fine-tuning on a single object the performance even drops. It recovers in about 1k iterations.
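
For anyone trying the same sanity check, below is a minimal sketch of restricting training to a single object with plain PyTorch; the dataset class in the usage comment is a placeholder, not the repo's actual one.

```python
from torch.utils.data import Dataset, Subset, DataLoader

def make_overfit_loader(full_dataset: Dataset, index: int = 0,
                        repeats: int = 1000, batch_size: int = 4) -> DataLoader:
    """Wrap an existing dataset so every batch comes from a single object.

    Repeating one index keeps the usual iteration-based training loop
    unchanged while effectively overfitting to that one sample.
    """
    overfit_set = Subset(full_dataset, [index] * repeats)
    return DataLoader(overfit_set, batch_size=batch_size,
                      shuffle=False, num_workers=4, drop_last=True)

# Usage (dataset name is hypothetical, standing in for the repo's training dataset):
# loader = make_overfit_loader(ObjaverseDataset(split='train'), index=0)
```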

@YuxuanSnow

Now I'd like to provide more results on overfitting to one object:
The orange curve uses resume='./pretrained/model_fp16.safetensors', while the green curve uses resume=None.
I use a constant 5e-5 learning rate for both experiments; the objective functions are unchanged.

It seems that if I start from the pretrained checkpoint, it converges slowly but the loss still goes down. When I train from scratch, the loss doesn't go down at all -- I checked the splatted 3D Gaussians, which render as a white image and show a PSNR of around 15 w.r.t. the training image.
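
(For reference, the PSNR numbers here come from a standard computation along these lines; this is a generic sketch assuming float images in [0, 1], not the repo's own metric code.)

```python
import torch

def psnr(pred: torch.Tensor, gt: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio (dB) between two images valued in [0, max_val]."""
    mse = torch.mean((pred - gt) ** 2)
    if mse == 0:
        return float('inf')
    return (20.0 * torch.log10(torch.tensor(max_val)) - 10.0 * torch.log10(mse)).item()
```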

@greeneggsandyaml Do you have any updates on your training?

[image: training curves for the two runs]

@ashawkey Could you kindly share what your loss curve looks like? For me it seems difficult to converge when training from scratch.

@ashawkey
Collaborator

ashawkey commented Apr 5, 2024

@YuxuanSnow Hi, thanks for the information!
Actually the pretrained model was trained across multiple runs due to some infrastructure issues, so I cannot provide a complete loss curve. One observation is that the LPIPS loss may harm convergence at the early stage of training but does help increase fidelity at the late stage, so you can try disabling it first (see the sketch below).
You may also refer to this issue: #35
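
For anyone trying this, here is a minimal sketch of gating the LPIPS term by training step; it assumes the `lpips` PyPI package, predictions/targets as (B, 3, H, W) tensors in [0, 1], and an illustrative threshold and weight rather than the repo's actual settings.

```python
import torch.nn.functional as F
import lpips

lpips_fn = lpips.LPIPS(net='vgg')  # perceptual metric from the lpips package

def training_loss(pred, gt, step, lpips_start_step=10_000, lambda_lpips=1.0):
    """MSE for the whole run; add LPIPS only after an initial warm-up phase.

    LPIPS expects inputs in [-1, 1], hence the rescaling below. The start
    step and weight are placeholders, not the repo's defaults.
    """
    loss = F.mse_loss(pred, gt)
    if step >= lpips_start_step:
        loss = loss + lambda_lpips * lpips_fn(pred * 2 - 1, gt * 2 - 1).mean()
    return loss
```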

@yxymessi

yxymessi commented Apr 8, 2024

When overfitting on a single object, have you tried adapting the LR scheduler code?
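
In case it helps, here is a minimal sketch of what swapping in a simpler warm-up-then-constant schedule could look like in plain PyTorch; the stand-in model, base LR, and warm-up length mirror the experiments above but are otherwise assumptions, not the repo's configuration.

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the actual network
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def warmup_then_constant(step: int, warmup_steps: int = 500) -> float:
    """Linearly warm up to the base LR, then hold it constant."""
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_then_constant)

# Inside the training loop, after optimizer.step():
# scheduler.step()
```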

@YuxuanSnow

I didn't try to adapt the LR scheduler.

@ashawkey I have updated results after disabling LPIPS:
[image: PSNR curves with LPIPS disabled]

The PSNR reaches a higher value (purple curve), which suggests that disabling LPIPS early is an effective strategy. The image is still blurry, but I think adding LPIPS back and training further could resolve the problem.

@jeremy123z

You have really done a wonderful job! I have some questions about this paper and hope to get your help. Could I have your phone number or Instagram to discuss?
