Question about model training. #5
Comments
Hi, same question.
Hi, same question here too, and really appreciate your video!
Hello! Given my current priorities, I don't think I'll be coding the training script anytime soon. Feel free to contribute it yourself... that's the spirit of open source and pull requests.
I actually tried to set up a training script, but I seem to run out of RAM, LOL. Do y'all know what resources I should expect to need for training this? I tried to set up the model similar to this:

```python
def forward(self, images, captions, tokenizer, strength=0.8):
    batch_size = len(captions)
    latents_shape = (batch_size, 4, self.LATENTS_HEIGHT, self.LATENTS_WIDTH)
    generator = torch.Generator(device=self.device)
    tokens = tokenizer(captions, padding="max_length", max_length=77,
                       return_tensors="pt", truncation=True).input_ids.to(self.device)
    # tokens = torch.tensor(tokens, dtype=torch.long, device=self.device)
    context = self.clip(tokens)
    sampler = DDPMSampler(generator)
    sampler.set_inference_timesteps(self.n_inference_steps)
    if images is not None:
        encoder_noise = torch.randn(latents_shape, generator=generator, device=self.device)
        latents = self.encoder(images, encoder_noise)
        sampler.set_strength(strength)
        latents = sampler.add_noise(latents, sampler.timesteps[0])
    else:
        latents = torch.randn(latents_shape, generator=generator, device=self.device)
    for timestep in sampler.timesteps:
        time_embedding = self.get_time_embedding(timestep).to(self.device)
        model_output = self.diffusion(latents, context, time_embedding)
        latents = sampler.step(timestep, latents, model_output)
    return self.rescale(self.decoder(latents), (-1, 1), (0, 1), clamp=True), context
```

But even just feeding one image per batch [1, 3, 512, 512] to this forward, it ran out of memory.
Did you try resizing your images to something smaller? If you still get an out-of-memory error, maybe you can run it on Colab.
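For what it's worth, one likely cause of the OOM in the forward above is that it runs the full inference loop (every denoising step's UNet call) with autograd tracking all intermediates. A DDPM-style training step normally uses a single random timestep per batch, and keeps the frozen parts under `torch.no_grad`. A minimal, hedged sketch of that shape — `TinyEncoder`/`TinyUNet` are placeholder modules I made up so the snippet runs, not the repo's real networks, and the `latents + noise` line stands in for `sampler.add_noise`:

```python
import torch
import torch.nn as nn

# Stand-ins for the real VAE encoder and UNet, just to make the sketch runnable.
class TinyEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=1)
    def forward(self, x):
        return self.conv(x)

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(4, 4, kernel_size=3, padding=1)
    def forward(self, z, t):
        return self.conv(z)  # placeholder: ignores the timestep

encoder, unet = TinyEncoder(), TinyUNet()

images = torch.randn(1, 3, 64, 64)       # one small image batch
with torch.no_grad():                    # frozen VAE: no activations kept for backward
    latents = encoder(images)

t = torch.randint(0, 1000, (1,))         # ONE random timestep, not the whole loop
noise = torch.randn_like(latents)
noisy_latents = latents + noise          # placeholder for sampler.add_noise(latents, t)

pred = unet(noisy_latents, t)            # a single UNet call per training step
loss = torch.nn.functional.mse_loss(pred, noise)
loss.backward()                          # gradients flow only into the UNet
```

Because the encoder ran under `no_grad`, backward touches only the UNet's parameters, so the memory cost is roughly one UNet activation set instead of fifty.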
I watched all your videos and followed along; it took about 5 days 😀. It was very fun, and I appreciate you!
Now I wonder how to train this model.
I also watched another of your videos, "How diffusion models work - explanation and code!".
That one is also very useful and great, thank you again!!
It was about how to train the UNet (the diffusion model) for latent denoising.
But we have four major models here:
VAE encoder, VAE decoder, UNet, and CLIP.
If we want to train the UNet (the diffusion model) as in the diffusion-model-training YouTube video,
do we freeze the other models and train only the UNet?
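That is indeed the usual pattern for this objective: freeze the VAE and CLIP weights and hand the optimizer only the UNet's parameters. A hedged sketch, assuming that setup — the `vae`/`clip`/`unet` modules below are trivial placeholders, not this repo's actual classes:

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the real VAE, CLIP, and UNet.
vae = nn.Linear(8, 4)
clip = nn.Linear(16, 4)
unet = nn.Linear(4, 4)

# Freeze everything except the UNet.
for frozen in (vae, clip):
    frozen.requires_grad_(False)
    frozen.eval()  # also fixes dropout / norm-layer behavior

# The optimizer sees only the trainable (UNet) parameters.
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-4)

trainable = sum(p.numel() for p in unet.parameters() if p.requires_grad)
```

With this, frozen modules contribute no gradients and their optimizer state is never allocated, which also helps with the memory issue discussed above.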
However, I don't fully understand how the learning objective is defined.
For example, if we want to create image B with the specific style of image A (A image -> styled B image),
where should I feed image A (or random input) and styled image B (the target), respectively?
Inference looks like this, but I don't know what the training phase should look like:
A(or random) -> VAE-encode -> [ z, clip-emb, time-emb -> unet -> z] * loop -> VAE-decode -> B
It is also unclear whether the CLIP embedding should be left blank, random, or a specific text prompt,
or whether I should feed image A into the CLIP embedding.
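For reference, in the standard Stable Diffusion / DDPM objective (as opposed to a pix2pix-style A -> B setup, which is a different recipe), the training pair is not (A, B): you take the VAE latents of a training image, condition the UNet on that image's text prompt via CLIP, add noise at one random timestep t using the closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, and the UNet's regression target is the noise eps itself. A numpy sketch of just the noising and loss (the zero `eps_hat` is a placeholder for the UNet's prediction):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule, as in DDPM.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)   # cumulative product of (1 - beta_t)

x0 = rng.standard_normal((4, 8, 8))    # latents of the training image (stand-in)
t = int(rng.integers(0, T))            # one random timestep per sample
eps = rng.standard_normal(x0.shape)    # the noise the UNet must learn to predict

# Closed-form forward process q(x_t | x_0): noise x0 in one shot.
x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

# The UNet would take (x_t, t, text_context) and output eps_hat;
# the training loss is plain MSE against the true noise.
eps_hat = np.zeros_like(eps)           # placeholder prediction
loss = np.mean((eps_hat - eps) ** 2)
```

No sampling loop appears during training: each example uses one random t, and the full denoising loop only runs at inference time.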
I have searched YouTube for how people train Stable Diffusion models, and most videos use DreamBooth.
That again looks very high-level, like Hugging Face.
I would like to know the exact concepts and what happens under the hood.
Thanks to your video and code I could understand the Stable Diffusion DDPM model, but I want to extend that to the training side.
Thank you for your amazing work!
Happy new year!