
[Bug]: Tensor size mismatch when trying to generate video of different size #177

Open
adhityaswami opened this issue Jun 16, 2023 · 2 comments
Labels: bug (Something isn't working), needs testing

@adhityaswami
Is there an existing issue for this?

  • I have searched the existing issues and checked the recent builds/commits of both this extension and the webui

Are you using the latest version of the extension?

  • I have the modelscope text2video extension updated to the latest version and I still have the issue.

What happened?

I tried generating a video at 384x216 (a 16:9 aspect ratio) with my custom-trained, converted model. However, I get the following error:

DDIM sampling: 0%| | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last): | 0/50 [00:00<?, ?it/s]
File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/t2v_helpers/render.py", line 27, in run
vids_pack = process_modelscope(args_dict)
File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/process_modelscope.py", line 209, in process_modelscope
samples, _ = pipe.infer(args.prompt, args.n_prompt, args.steps, args.frames, args.seed + batch if args.seed != -1 else -1, args.cfg_scale,
File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_pipeline.py", line 258, in infer
x0 = self.diffusion.ddim_sample_loop(
File "/home/ubuntu/text2vid/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_model.py", line 1485, in ddim_sample_loop
xt = self.ddim_sample(xt, t, model, model_kwargs, clamp,
File "/home/ubuntu/text2vid/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_model.py", line 1334, in ddim_sample
_, _, _, x0 = self.p_mean_variance(xt, t, model, model_kwargs, clamp,
File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_model.py", line 1275, in p_mean_variance
y_out = model(xt, self._scale_timesteps(t), **model_kwargs[0])
File "/home/ubuntu/text2vid/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_model.py", line 380, in forward
x = torch.cat([x, xs.pop()], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.
Exception occurred: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.

This occurs even when using the original model.
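
For context, the failure is torch.cat's generic shape check: in the UNet decoder, an upsampled feature map is concatenated with a stored skip connection along the channel dimension, and here their spatial sizes disagree. A minimal sketch reproducing the same RuntimeError (the shapes below are illustrative, not the model's actual ones):

```python
import torch

# Channel counts (dim=1) may differ under torch.cat, but every other
# dimension must match. Here the height axis disagrees: 8 vs 7.
a = torch.zeros(1, 320, 16, 8, 48)  # upsampled decoder activation (hypothetical shape)
b = torch.zeros(1, 320, 16, 7, 48)  # skip tensor popped from the encoder
x = torch.cat([a, b], dim=1)
# RuntimeError: Sizes of tensors must match except in dimension 1.
# Expected size 8 but got size 7 for tensor number 1 in the list.
```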

Steps to reproduce the problem

  1. Go to the UI
  2. Try generating a video with width = 384 and height = 216

What should have happened?

It should generate a video of the requested dimensions.

WebUI and Deforum extension Commit IDs

webui commit id - baf6946e06249c5af9851c60171692c44ef633e0
txt2vid commit id - a44078d

Torch version

2.0.1+cu118

What GPU were you using for launching?

NVIDIA A10G - 24GB

On which platform are you launching the webui backend with the extension?

Cloud server (Linux)

Settings

(screenshot of the generation settings, not recoverable as text)

Console logs

################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye)
################################################################

################################################################
Running on ubuntu user
################################################################

################################################################
Repo already cloned, using it as install directory
################################################################

################################################################
python venv already activate: /home/ubuntu/text2vid/stable-diffusion-webui/venv
################################################################

################################################################
Launching launch.py...
################################################################
Using TCMalloc: libtcmalloc.so.4
Python 3.10.9 (main, Mar  1 2023, 18:23:06) [GCC 11.2.0]
Version: v1.3.2
Commit hash: baf6946e06249c5af9851c60171692c44ef633e0
Installing requirements

Launching Web UI with arguments: --listen
No module 'xformers'. Proceeding without it.
Loading weights [6ce0161689] from /home/ubuntu/text2vid/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors
Creating model from config: /home/ubuntu/text2vid/stable-diffusion-webui/configs/v1-inference.yaml
LatentDiffusion: Running in eps-prediction mode
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 4.4s (import torch: 0.9s, import gradio: 0.9s, import ldm: 0.4s, other imports: 0.8s, load scripts: 0.5s, create ui: 0.6s, gradio launch: 0.1s).
DiffusionWrapper has 859.52 M params.
Applying optimization: Doggettx... done.
Textual inversion embeddings loaded(0):
Model loaded in 1.7s (load weights from disk: 0.2s, create model: 0.9s, apply weights to model: 0.2s, apply half(): 0.1s, move model to device: 0.2s).
text2video — The model selected is:  ModelScope
 text2video extension for auto1111 webui
Git commit: a44078d1
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
  0%|                                                                  | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': 'Blonde woman walking in a forest, dense foliage, pink leaves', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 40, 'seed': 3586594887, 'scale': 17, 'width': 384, 'height': 216, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 0}
latents torch.Size([1, 4, 40, 27, 48]) tensor(-0.0010, device='cuda:0') tensor(0.9960, device='cuda:0')
DDIM sampling:   0%|                                                  | 0/31 [00:00<?, ?it/s]
Traceback (most recent call last):                                    | 0/31 [00:00<?, ?it/s]
  File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/t2v_helpers/render.py", line 27, in run
    vids_pack = process_modelscope(args_dict)
  File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/process_modelscope.py", line 209, in process_modelscope
    samples, _ = pipe.infer(args.prompt, args.n_prompt, args.steps, args.frames, args.seed + batch if args.seed != -1 else -1, args.cfg_scale,
  File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_pipeline.py", line 258, in infer
    x0 = self.diffusion.ddim_sample_loop(
  File "/home/ubuntu/text2vid/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_model.py", line 1485, in ddim_sample_loop
    xt = self.ddim_sample(xt, t, model, model_kwargs, clamp,
  File "/home/ubuntu/text2vid/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_model.py", line 1334, in ddim_sample
    _, _, _, x0 = self.p_mean_variance(xt, t, model, model_kwargs, clamp,
  File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_model.py", line 1275, in p_mean_variance
    y_out = model(xt, self._scale_timesteps(t), **model_kwargs[0])
  File "/home/ubuntu/text2vid/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/text2vid/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_model.py", line 380, in forward
    x = torch.cat([x, xs.pop()], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.
Exception occurred: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.

Additional information

No response

@B34STW4RS

I don't think this is a bug; this is how SD worked before. The problem is that it sets one of the latent dimensions (torch.Size) to an odd number, in this instance 27, which isn't divisible by 4. Best to use the slider to choose a resolution close to what you need and either crop it or squeeze it. I'm not sure what was changed in SD to support odd sizes, or exactly when the change was implemented.

e.g. try to make a 720-wide video:

Working in txt2vid mode
0%| | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': '', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 24, 'seed': 2563507479, 'scale': 17, 'width': 720, 'height': 256, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 0}
latents torch.Size([1, 4, 24, 32, 90]) tensor(-0.0008, device='cuda:0') tensor(0.9997, device='cuda:0')
DDIM sampling: 0%| | 0/31 [00:00<?, ?it/s]
Traceback (most recent call last): | 0/31 [00:00<?, ?it/s]
File "D:\NasD\stable-diffusion-webui/extensions/sd-webui-modelscope-text2video/scripts\t2v_helpers\render.py", line 24, in run
vids_pack = process_modelscope(args_dict)
File "D:\NasD\stable-diffusion-webui/extensions/sd-webui-modelscope-text2video/scripts\modelscope\process_modelscope.py", line 205, in process_modelscope
samples, _ = pipe.infer(args.prompt, args.n_prompt, args.steps, args.frames, args.seed + batch if args.seed != -1 else -1, args.cfg_scale,
File "D:\NasD\stable-diffusion-webui/extensions/sd-webui-modelscope-text2video/scripts\modelscope\t2v_pipeline.py", line 253, in infer
x0 = self.diffusion.ddim_sample_loop(
File "D:\NasD\stable-diffusion-webui\venv\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "D:\NasD\stable-diffusion-webui\extensions\sd-webui-modelscope-text2video\scripts\modelscope\t2v_model.py", line 1475, in ddim_sample_loop
xt = self.ddim_sample(xt, t, model, model_kwargs, clamp,
File "D:\NasD\stable-diffusion-webui\venv\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "D:\NasD\stable-diffusion-webui\extensions\sd-webui-modelscope-text2video\scripts\modelscope\t2v_model.py", line 1324, in ddim_sample
_, _, _, x0 = self.p_mean_variance(xt, t, model, model_kwargs, clamp,
File "D:\NasD\stable-diffusion-webui\extensions\sd-webui-modelscope-text2video\scripts\modelscope\t2v_model.py", line 1265, in p_mean_variance
y_out = model(xt, self._scale_timesteps(t), **model_kwargs[0])
File "D:\NasD\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\NasD\stable-diffusion-webui\extensions\sd-webui-modelscope-text2video\scripts\modelscope\t2v_model.py", line 380, in forward
x = torch.cat([x, xs.pop()], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 24 but got size 23 for tensor number 1 in the list.
Exception occurred: Sizes of tensors must match except in dimension 1. Expected size 24 but got size 23 for tensor number 1 in the list.

The latent width is now 90, which is no good either.
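
For the curious, the arithmetic checks out under a simple assumption: stride-2, kernel-3, padding-1 downsampling convs map a size n to ceil(n/2), while the decoder doubles sizes on the way up, so an odd size can never round-trip. A rough sketch (the stage count is illustrative, but the numbers match both tracebacks above):

```python
def down(n: int) -> int:
    # stride-2 / kernel-3 / padding-1 conv: out = (n + 2 - 3) // 2 + 1 = ceil(n / 2)
    return (n + 1) // 2

for pixels in (216, 720):
    h = pixels // 8             # VAE latent size: 216 -> 27, 720 -> 90
    skips = []
    for _ in range(3):          # three downsampling stages (illustrative count)
        skips.append(h)
        h = down(h)             # 27 -> 14 -> 7 -> 4;  90 -> 45 -> 23 -> 12
    print(pixels, h * 2, skips[-1])   # first 2x upsample vs. stored skip
# 216 -> 8 vs 7   ("Expected size 8 but got size 7")
# 720 -> 24 vs 23 ("Expected size 24 but got size 23")
```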

@adhityaswami (Author)

Hey, looks like you were right. It does work in regular SD though, so I'll check out what the change was and try to implement it in the extension as well.

tl;dr for anyone facing this issue: Make sure your resolutions are divisible by 32
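
A tiny workaround sketch until the extension clamps this itself (snap_to_multiple is a hypothetical helper, not part of the extension or the webui):

```python
def snap_to_multiple(x: int, base: int = 32) -> int:
    """Round a requested dimension to the nearest multiple of `base`."""
    return max(base, round(x / base) * base)

print(snap_to_multiple(384), snap_to_multiple(216))  # 384 224 -> generate at 384x224
```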
