vae bf16 training loss nan #265

Open
lbwang2006 opened this issue Apr 28, 2024 · 11 comments
@lbwang2006

When training the VAE in bf16 with PyTorch Lightning, the loss becomes NaN. How can I solve this?

@LinB203
Member

LinB203 commented Apr 28, 2024

Did you enable the GAN loss? We have run into this as well; it happens after roughly 30-50k steps. It is not a serious problem: just resume training.

@lbwang2006
Author

> Did you enable the GAN loss? We have run into this as well; it happens after roughly 30-50k steps. It is not a serious problem: just resume training.

Yes, I enabled the GAN loss. The loss is NaN and does not recover.
Is the only option to restart the training script from the latest good checkpoint?
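
As a rough illustration of the "resume from the latest good checkpoint" idea discussed above, here is a minimal sketch in plain Python. The function name `last_good_checkpoint` and the record layout `(step, loss, checkpoint_path)` are hypothetical and not part of Open-Sora-Plan; the sketch only assumes that you log the loss per step and know which steps wrote a checkpoint.

```python
import math
from typing import Iterable, Optional, Tuple

# Hypothetical record layout: (global step, logged loss, checkpoint path or None).
Record = Tuple[int, float, Optional[str]]


def last_good_checkpoint(records: Iterable[Record]) -> Optional[str]:
    """Return the most recent checkpoint saved before the first non-finite loss.

    Records are expected in increasing step order. Returns None if no
    checkpoint was written before the loss became NaN or inf.
    """
    best: Optional[str] = None
    for _step, loss, ckpt_path in records:
        if not math.isfinite(loss):
            # Everything from here on may already be corrupted by NaN/inf.
            break
        if ckpt_path is not None:
            best = ckpt_path
    return best


history = [
    (1000, 0.50, "step1000.ckpt"),
    (2000, 0.40, None),
    (3000, 0.45, "step3000.ckpt"),
    (4000, float("nan"), "step4000.ckpt"),  # saved after the loss diverged
]
print(last_good_checkpoint(history))  # step3000.ckpt
```

With PyTorch Lightning, the selected path could then be passed to `trainer.fit(model, ckpt_path=...)` in recent versions to resume training; whether resuming alone avoids the NaN is not established in this thread.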

@lbwang2006
Author

Also, is the GAN loss necessary if it so easily leads to a NaN loss?

@qqingzheng
Contributor

> Also, is the GAN loss necessary if it so easily leads to a NaN loss?

The GAN loss plays a crucial role in preserving high-frequency information and should not be omitted.

@LinB203
Member

LinB203 commented Apr 29, 2024

In v1.0.0 we didn't use the GAN loss. In v1.1.0 the VAE's capabilities will be vastly improved.

@lbwang2006
Author

> In v1.0.0 we didn't use the GAN loss. In v1.1.0 the VAE's capabilities will be vastly improved.

I found that in the config of the current CausalVAE the loss type is
opensora.models.ae.videobase.losses.LPIPSWithDiscriminator, so I think the GAN loss has already been used?

@qqingzheng
Contributor

> In v1.0.0 we didn't use the GAN loss. In v1.1.0 the VAE's capabilities will be vastly improved.
>
> I found that in the config of the current CausalVAE the loss type is
> opensora.models.ae.videobase.losses.LPIPSWithDiscriminator, so I think the GAN loss has already been used?

Sorry about that. Due to an earlier code refactoring, the config.json file was added after the released CausalVAE had been trained. The released model was definitely trained without a GAN.

@antonioo-c

Thanks for the great project. When will you release the new version of the training code?

@LinB203
Member

LinB203 commented May 1, 2024

> Thanks for the great project. When will you release the new version of the training code?

This month.

@awei-6

awei-6 commented May 17, 2024

https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/bec0e8523840f34cd7e687cb6fe6fb92ba3f991c/opensora/models/ae/videobase/losses/perceptual_loss.py#L95C1-L95C88

The nll_grads value can easily exceed the precision that bf16 can represent, so it is recommended to avoid AMP training and train in float32 instead.
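
To make the precision argument above concrete, here is a minimal sketch in plain Python that emulates bf16 rounding by keeping only the upper 16 bits of a float32. The helper names `to_bf16` and `bf16_running_sum` are hypothetical and not part of Open-Sora-Plan, and the sketch does not reproduce the linked nll_grads computation; it only shows how few significant bits bf16 keeps, under the assumption that limited precision (rather than something else) is what goes wrong.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even), returned as a float.

    bfloat16 is the upper 16 bits of an IEEE float32: 1 sign bit, 8 exponent
    bits and 7 stored mantissa bits (8 significant bits in total).
    Note: x must fit in float32 range, otherwise struct.pack raises OverflowError.
    """
    if not math.isfinite(x):
        return x  # NaN and inf pass through unchanged
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 bits that are about to be dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_running_sum(values) -> float:
    """Accumulate values while rounding the accumulator to bf16 after each add."""
    acc = 0.0
    for v in values:
        acc = to_bf16(acc + to_bf16(v))
    return acc


print(to_bf16(257.0))                    # 256.0: 257 needs 9 significant bits
print(bf16_running_sum([1.0] * 1000))    # 256.0, although the exact sum is 1000
print(sum([1.0] * 1000))                 # 1000.0 with a wider accumulator
```

The running sum stalls at 256 because adding 1 to 256 is no longer representable in 8 significant bits. Reductions such as gradient norms can lose information in a similar way when carried out in bf16, which is consistent with the suggestion to compute in float32; the thread itself does not confirm the exact failure mechanism.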

@lbwang2006
Author

> In v1.0.0 we didn't use the GAN loss. In v1.1.0 the VAE's capabilities will be vastly improved.

But I found loss.discrimator in the v1.1.0 VAE weights...
