What is the importance of DDIM? #96

Open
vedantroy opened this issue Oct 18, 2023 · 1 comment


@vedantroy

vedantroy commented Oct 18, 2023

I am working on integrating this repository with ComfyUI, so this project can get support for ControlNet, LoRA, etc. all for free from ComfyUI's infrastructure.

The main issue is, I need to understand exactly how the DDIM sampler was hacked.
From reading the paper, it looks like it was patched to support cross-frame attention.
This makes me guess that DDIM itself is not important here, i.e., the standard diffusion schedule could be used as well if it were patched the same way. Is this an incorrect assumption?

@vedantroy changed the title from "What is the importance of DDIM / where is the original DDIM source?" to "What is the importance of DDIM?" on Oct 18, 2023
@williamyang1991
Owner

Thank you for your interest.

DDIM is almost unimportant here.
We rely on it in two places.
The first is that it can predict a denoised version $\hat{x}_{t\rightarrow0}$ at each step, which we can warp.
The second is the noise-adding process. We rescale the noise level so that the fusion of two noisy latents has the same noise level as before fusing.
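
For reference, the standard DDIM step can be written in the notation of this thread, where $\alpha_t$ is the cumulative noise-schedule product. This is the textbook form rather than a transcription of the repository code; $\epsilon_\theta$, $\sigma_t$, and $\epsilon_t$ are the usual DDIM symbols and do not appear elsewhere in this thread.

$$\hat{x}_{t\rightarrow0} = \frac{x_t - \sqrt{1-\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\alpha_t}}$$

$$x_{t-1} = \sqrt{\alpha_{t-1}}\,\hat{x}_{t\rightarrow0} + \sqrt{1-\alpha_{t-1}-\sigma_t^2}\,\epsilon_\theta(x_t, t) + \sigma_t\,\epsilon_t$$

The first equation is the prediction that gets warped. The middle term of the second equation is what common DDIM implementations call dir_xt; with $\sigma_t = 0$ it carries all of the noise, with standard deviation $\sqrt{1-\alpha_{t-1}}$.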

```python
img = img_ref * weight + (1. - weight) * (
    img - dir_xt) + rescale * dir_xt
```

We have two latents, img_ref (the encoded warped image $\tilde{x}_{t-1}$) and img ($x_{t-1}$). Both have the same noise level, a standard deviation of $\sqrt{1-\alpha_{t-1}}$.
We would like to fuse them with a soft weight in the range 0 to 1 ($M$). The paper uses a hard binary mask; this repository uses a soft one to prevent error accumulation.
The resulting img_ref * weight + (1. - weight) * img ($(1-M)\tilde{x}_{t-1}+Mx_{t-1}$) then has a noise standard deviation of $\sqrt{(M^2+(1-M)^2)(1-\alpha_{t-1})}$ (treating the noise in the two latents as independent), which is smaller than $\sqrt{1-\alpha_{t-1}}$ whenever $M$ is not 0 or 1. In that case, the final image results will be blurry.
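
To make the shrinkage concrete, here is a small sketch that evaluates the standard-deviation formula above for a few mask values. The function name `fused_noise_std` and the value of `alpha_prev` are illustrative assumptions, not taken from the repository.

```python
import math


def fused_noise_std(m: float, alpha_prev: float) -> float:
    """Noise std of (1 - m) * x_ref + m * x when both inputs carry
    independent noise with std sqrt(1 - alpha_prev)."""
    return math.sqrt((m ** 2 + (1.0 - m) ** 2) * (1.0 - alpha_prev))


alpha_prev = 0.5  # hypothetical cumulative alpha at step t-1
target_std = math.sqrt(1.0 - alpha_prev)

for m in (0.0, 0.25, 0.5, 0.75, 1.0):
    # The fused std equals the target only at m = 0 or m = 1.
    print(f"M={m:.2f}  fused std={fused_noise_std(m, alpha_prev):.4f}  "
          f"target={target_std:.4f}")
```

The loss is largest at $M = 0.5$, where the fused standard deviation drops by a factor of $\sqrt{0.5}$.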

So I rescale dir_xt (the direction pointing to $x_{t-1}$) so that the final fused result still has a noise level of $\sqrt{1-\alpha_{t-1}}$.
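
One way to see how such a rescale can restore the noise level is the scalar simulation below. It is a sketch under stated assumptions, not the repository's implementation: it assumes deterministic DDIM ($\sigma_t = 0$), so that img - dir_xt is noise-free, and it assumes the noise in img_ref is independent of dir_xt. Under those assumptions, choosing rescale so that weight² + rescale² = 1 keeps the variance unchanged. The helper names `rescale_for` and `simulate_fused_std` are hypothetical, and the actual rescale used in the repository may differ.

```python
import math
import random
import statistics


def rescale_for(weight: float) -> float:
    # Hypothetical choice: solve weight**2 + rescale**2 = 1 for rescale.
    return math.sqrt(1.0 - weight ** 2)


def fuse(img_ref, img, dir_xt, weight, rescale):
    # Same expression as the fusion line quoted in the thread, on scalars.
    return img_ref * weight + (1.0 - weight) * (img - dir_xt) + rescale * dir_xt


def simulate_fused_std(weight, alpha_prev, n=50_000, seed=0):
    """Monte Carlo estimate of the noise std after fusion."""
    rng = random.Random(seed)
    noise_std = math.sqrt(1.0 - alpha_prev)
    clean = 0.3  # arbitrary noise-free signal shared by both latents
    rescale = rescale_for(weight)
    samples = []
    for _ in range(n):
        img_ref = clean + noise_std * rng.gauss(0.0, 1.0)
        dir_xt = noise_std * rng.gauss(0.0, 1.0)
        img = clean + dir_xt
        samples.append(fuse(img_ref, img, dir_xt, weight, rescale))
    return statistics.pstdev(samples)


alpha_prev = 0.5  # hypothetical cumulative alpha at step t-1
print("target std:", math.sqrt(1.0 - alpha_prev))
print("fused std :", simulate_fused_std(weight=0.5, alpha_prev=alpha_prev))
```

The two printed values should agree up to sampling error. With rescale fixed at 1. - weight instead, the fused standard deviation would fall below the target.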

These two points are where we use DDIM.
If you are using another schedule, you need to find its predicted $\hat{x}_{t \rightarrow 0}$ to warp, and don't forget to rescale the noise level in some way when fusing $\tilde{x}_{t-1}$ and $x_{t-1}$ (the definition of dir_xt may differ between schedules).

For the other parts, I think DDIM is not important.
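
As a starting point for porting to another sampler, the sketch below recovers the clean-image prediction from the two most common model outputs: noise prediction and v-prediction. These are the standard identities for a variance-preserving process $x_t = \sqrt{\alpha_t}\,x_0 + \sqrt{1-\alpha_t}\,\epsilon$ with cumulative $\alpha_t$. The function names are hypothetical, and whether a given sampler in ComfyUI exposes these quantities is an assumption to check. The code uses scalars for clarity; the same arithmetic applies elementwise to tensors.

```python
import math


def predict_x0_from_eps(x_t: float, eps: float, alpha_t: float) -> float:
    """Clean-image estimate when the model predicts the noise eps."""
    return (x_t - math.sqrt(1.0 - alpha_t) * eps) / math.sqrt(alpha_t)


def predict_x0_from_v(x_t: float, v: float, alpha_t: float) -> float:
    """Clean-image estimate when the model predicts
    v = sqrt(alpha_t) * eps - sqrt(1 - alpha_t) * x0."""
    return math.sqrt(alpha_t) * x_t - math.sqrt(1.0 - alpha_t) * v


# Round-trip check with made-up values.
x0, eps, alpha_t = 0.7, -1.2, 0.36
x_t = math.sqrt(alpha_t) * x0 + math.sqrt(1.0 - alpha_t) * eps
v = math.sqrt(alpha_t) * eps - math.sqrt(1.0 - alpha_t) * x0

print(predict_x0_from_eps(x_t, eps, alpha_t))
print(predict_x0_from_v(x_t, v, alpha_t))
```

Both calls should recover the original `x0` up to floating-point error.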
