vae bf16 training loss nan #265

lbwang2006 · 2024-04-28T01:43:22Z

When I train the VAE in bf16 with pytorch_lightning, the loss becomes NaN. How can I solve this?

LinB203 · 2024-04-28T05:06:48Z

Did you enable the GAN loss? We have run into this too; it happens after roughly 30-50k steps. It is not a serious problem, though: just resume training.
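A minimal sketch of what "just resume" could look like in practice, assuming checkpoints are saved with a `step=<N>` fragment in the filename (the default Lightning naming, but check your own callback configuration). The helper names `loss_is_finite` and `latest_good_checkpoint` are hypothetical and not part of this repository.

```python
import math
import re
from typing import Iterable, Optional

# Assumption: checkpoint filenames contain "step=<N>", e.g. "epoch=0-step=30000.ckpt".
STEP_PATTERN = re.compile(r"step=(\d+)")


def loss_is_finite(loss_value: float) -> bool:
    """Return False for NaN or +/-inf, which signals that training has diverged."""
    return math.isfinite(loss_value)


def latest_good_checkpoint(paths: Iterable[str], last_good_step: int) -> Optional[str]:
    """Pick the checkpoint with the highest step that is not past last_good_step."""
    best_step = -1
    best_path = None
    for path in paths:
        match = STEP_PATTERN.search(path)
        if match is None:
            continue  # skip files that do not follow the assumed naming scheme
        step = int(match.group(1))
        if best_step < step <= last_good_step:
            best_step, best_path = step, path
    return best_path


checkpoints = [
    "ckpt/epoch=0-step=10000.ckpt",
    "ckpt/epoch=0-step=30000.ckpt",
    "ckpt/epoch=1-step=40000.ckpt",
]
# Suppose the loss was still finite at step 35000 and turned NaN afterwards.
resume_from = latest_good_checkpoint(checkpoints, last_good_step=35000)
```

The selected path could then be handed to the trainer, for example through the `ckpt_path` argument of `Trainer.fit` in recent Lightning versions.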

lbwang2006 · 2024-04-28T10:05:17Z

> Did you enable the GAN loss? We have run into this too; it happens after roughly 30-50k steps. It is not a serious problem, though: just resume training.

Yes, I enabled the GAN loss. The loss becomes NaN and does not recover.
So the only option is to restart the training script from the latest good checkpoint?

lbwang2006 · 2024-04-28T10:09:55Z

Also, is the GAN loss necessary if it so easily leads to a NaN loss?

qqingzheng · 2024-04-29T00:52:22Z

> Also, is the GAN loss necessary if it so easily leads to a NaN loss?

The GAN loss plays a crucial role in preserving high-frequency information and should not be omitted.

LinB203 · 2024-04-29T01:26:45Z

In v1.0.0 we didn't use the GAN loss. In v1.1.0 the VAE's capabilities will be vastly improved.

lbwang2006 · 2024-04-29T05:37:05Z

> In v1.0.0 we didn't use the GAN loss. In v1.1.0 the VAE's capabilities will be vastly improved.

I looked at the config of the current causalvae, and the loss type is
opensora.models.ae.videobase.losses.LPIPSWithDiscriminator. Doesn't that mean the GAN loss has already been used?

qqingzheng · 2024-04-29T05:46:41Z

> > In v1.0.0 we didn't use the GAN loss. In v1.1.0 the VAE's capabilities will be vastly improved.
>
> I looked at the config of the current causalvae, and the loss type is
> opensora.models.ae.videobase.losses.LPIPSWithDiscriminator. Doesn't that mean the GAN loss has already been used?

Sorry about that. Because of an earlier code refactoring, the config.json file was added after the released causalvae had been trained. The released model was definitely trained without a GAN loss.

antonioo-c · 2024-04-30T06:43:33Z

Thanks for the great project. When will you release the new version of the training code?

LinB203 · 2024-05-01T02:30:40Z

> Thanks for the great project. When will you release the new version of the training code?

This month.

awei-6 · 2024-05-17T17:04:00Z

https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/bec0e8523840f34cd7e687cb6fe6fb92ba3f991c/opensora/models/ae/videobase/losses/perceptual_loss.py#L95C1-L95C88

nll_grads can easily exceed the precision that bf16 can represent. I recommend not using AMP training and training in float32 instead.
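To make the precision concern concrete, here is a small stdlib-only sketch that emulates bfloat16 rounding by keeping only the top 16 bits of a float32 (with round-to-nearest-even). It illustrates bf16's limited precision in general; whether this is the exact failure mode at the linked line is not confirmed in this thread, and the helper names `to_bf16` and `bf16_running_sum` are hypothetical.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 is the upper 16 bits of a float32; round the discarded lower half.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_running_sum(values):
    """Accumulate values while rounding the running total to bf16 after every add."""
    total = 0.0
    for value in values:
        total = to_bf16(total + value)
    return total


# With only 8 significant bits, 257 is not representable and rounds back to 256.
rounded = to_bf16(257.0)

# A running sum of one thousand 1.0s therefore stalls at 256 instead of reaching 1000.
stalled = bf16_running_sum([1.0] * 1000)
exact = sum([1.0] * 1000)
```

Reductions such as gradient norms are sensitive to this kind of rounding, which is one reason to compute them in float32 even when the rest of the model runs in bf16.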

lbwang2006 · 2024-05-24T02:09:12Z

> In v1.0.0 we didn't use the GAN loss. In v1.1.0 the VAE's capabilities will be vastly improved.

But I found loss.discriminator in the v1.1.0 VAE weights...

LinB203 assigned qqingzheng Apr 28, 2024
