Some code for SVD improvement, and can we please fix it so we can do this 'whole_batch' image input on a WAS node again for SVD with this code? #3428

Closed
brentjohnston opened this issue May 8, 2024 · 0 comments

brentjohnston commented May 8, 2024

Wall of text sort of required, sorry. Attached is some code if anyone wants to improve SVD results, but it still only works on commit 6c3fed70655b737dc9b59da1cadb3c373c08d8ed.

I would fork all of this, but I just had a baby so I don't really have time anymore. The attached code, made by a redditor, lets me feed an entire batch of 48 images into the latent input of a KSampler and process those images as a whole batch for better SVD video generation.

It greatly enhances the video. I create 48 images, pluck one out of the batch to use as the init_image, and use the rest for the whole_batch input, then combine it with the latent output of svd_img2vid_conditioning using latent blending or latent add / multiply.
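To make the idea concrete at the tensor level (the actual wiring is ComfyUI nodes; the names and shapes here are just illustrative, not code from the attached files):

```python
import torch

# Illustrative only: image_batch stands in for the (48, H, W, C) tensor
# that the whole_batch load produces, with pixel values in [0, 1].
image_batch = torch.rand(48, 576, 1024, 3)

# Pluck one frame out to drive svd_img2vid_conditioning as the init_image...
init_image = image_batch[0:1]        # shape (1, H, W, C)

# ...while the full stack stays together as a single batch for the KSampler's
# latent input (after VAE encoding), instead of 48 separate images.
print(init_image.shape, image_batch.shape)
```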

It does not work on newer commits though, and I really want to update my ComfyUI in the future if possible, so if anyone can tell me why it won't work, that would help.

When I test it now, the default WAS Load Image Batch node just outputs 48 separate images when basic batch is chosen, and whole_batch does the same on the new commit, so this change is needed.

I looked into this code and it adds a pil2tensor_stack function, which converts multiple PIL images to a tensor stack, a "4D tensor with a new batch dimension."

Each image is individually converted to a tensor, then all are stacked together along a new dimension. The whole_batch condition uses the new pil2tensor_stack function to process all images from a directory and combine them into one tensor. This ensures all of your images are processed simultaneously and maintains their relative structure within the unified batch.
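For reference, conceptually the function looks something like this (the real implementation is in the attached files; this is just my rough approximation, and it assumes all images in the directory share the same resolution):

```python
import numpy as np
import torch
from PIL import Image

def pil2tensor_stack(images):
    """Convert a list of PIL images into one 4D tensor of shape (B, H, W, C).

    Each image becomes an (H, W, C) float tensor scaled to [0, 1], and
    torch.stack adds the new batch dimension so the whole directory moves
    through the graph as a single batch instead of B separate images.
    """
    tensors = [
        torch.from_numpy(np.array(img.convert("RGB")).astype(np.float32) / 255.0)
        for img in images
    ]
    return torch.stack(tensors, dim=0)
```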

If anyone is interested in trying this out, here is the modified WAS_Node_Suite.py for the ComfyUI commit above that does this:
WAS_Node_Suite.zip

And here is the .diff showing which lines were changed, if you want to apply them manually instead:

Whole_batch_WAS_Node.zip

As I said, I did not make this whole_batch latent input modification, but I do have a version with a batch images input on the left side to make it more seamless. Some comments from the redditor who originally made this:

The node is in https://github.com/WASasquatch/was-node-suite-comfyui

I've put a diff to the current main commit 6c3fed70655b737dc9b59da1cadb3c373c08d8ed here https://pastebin.com/fVuZxExF

The workflow is here: https://pastebin.com/SrgYmtBX

His workflow did not work too well for me, to be honest, until I tried latent blend, latent add, and latent multiply with the main latent and a second refining PowerNoise KSampler. I will post my workflow and a sample video here when I am more satisfied with it.

In addition, I might as well post some other stuff that helped me. Attached below is a heavily modified Comfy-SVDTools I made that uses PyTorch attention instead of xformers, which also greatly enhances SVD results. It's not perfect, and some things may not be exactly right.

The default extension actually wouldn't work for me even with xformers enabled in ComfyUI, but this helps a ton when timestep_embedding_frames is set slightly lower than the number of video frames.

Comfy-SVDTools-modification-w-pytorch.zip
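The gist of the change is swapping the xformers attention call for PyTorch's built-in scaled_dot_product_attention. Something along these lines (the attached zip has the actual patch; this is only the general shape of the swap, the tensor layout handling is an assumption on my part, and it needs torch >= 2.0):

```python
import torch.nn.functional as F

def attention_pytorch(q, k, v, attn_mask=None):
    """Stand-in for xformers.ops.memory_efficient_attention.

    Assumes q/k/v arrive in the xformers layout (batch, tokens, heads, head_dim);
    scaled_dot_product_attention expects (batch, heads, tokens, head_dim),
    so we transpose on the way in and back out.
    """
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))
    out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
    return out.transpose(1, 2)
```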

I also recommend doing some latent blends, latent adds, and latent multiplies of the 48-image batch into a second PowerNoise KSampler with medium denoise settings for refinement; experiment first with the latent blend nodes.
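In tensor terms those nodes are just elementwise operations on the latents (roughly shape (B, 4, H/8, W/8)); these are hypothetical helpers of mine only meant to show what they compute:

```python
import torch

def latent_blend(a: torch.Tensor, b: torch.Tensor, blend_factor: float = 0.5) -> torch.Tensor:
    # Weighted mix of two latent batches; blend_factor=1.0 keeps only `a`.
    return a * blend_factor + b * (1.0 - blend_factor)

def latent_add(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Elementwise sum of two latent batches, e.g. to inject structure
    # from the image batch into the conditioning latent.
    return a + b

def latent_multiply(a: torch.Tensor, multiplier: float = 1.0) -> torch.Tensor:
    # Scale a latent batch by a constant before feeding the second KSampler.
    return a * multiplier
```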

For some reason, the WAS PowerNoise KSampler seems to help a lot here, while I use a regular KSampler for the first pass (haven't tried NVIDIA Align Your Steps quite yet). This can probably handle more than 48 frames, but when I tried 100 frames I ran out of VRAM.

Perturbed guidance (the advanced one with more options) chained 5 times and set to blocks 0-5 on the output also seems to help a lot and cleans up a lot of cloudy distortions, but it makes things a ton slower, so sometimes I only do blocks 0 and 1. None of this is Sora level, of course, but what I see is comparable to https://github.com/HVision-NKU/StoryDiffusion, just clearer and with longer video. (I get a lot of facial expressions and body movements somehow; I always thought SVD would remain static.) Also, merging SVD models reports a failure but still seems to merge at least some of the blocks, because I get better results with the merged SVD models I use, so that could make a difference.

Try a lower VideoLinearCFGGuidance value, like 0.9-2, and mess with the timestep_embedding_frames setting in the modified SVDTools patch I've provided; it seems to have a big impact. Setting video linear CFG as high as 4 while lowering timestep_embedding_frames can also give interesting results.

Basically I just want to get this info out there, and also to finally update my ComfyUI someday when I'm more available, but I can't right now because this WAS node whole_batch code makes a huge difference for me.

If this helps you, or anyone improves it, please let me know. I didn't really have time to fork these since, as I was saying, I just had a baby. Thanks!
