soraw-ai/Awesome-Text-to-Video-Generation

Awesome-Text-to-Video-Generation

A curated, continually updated list of Text-to-Video studies, based on our survey paper: From Sora What We Can See: A Survey of Text-to-Video Generation. In the survey, we explore existing work in the Text-to-Video field, using OpenAI’s Sora as a clue, and summarize 24 datasets and 9 evaluation metrics. We also discuss open problems in this research area and in Sora itself, and draw on Sora's strengths and the characteristics of related fields to suggest future research directions. If our work inspires you, feel free to cite our paper and star our repo.

This project is curated and maintained by Rui Sun and Yumin Zhang.

@article{sun2024sora,
  title={From Sora What We Can See: A Survey of Text-to-Video Generation},
  author={Sun, Rui and Zhang, Yumin and Shah, Tejal and Sun, Jiahao and Zhang, Shuoying and Li, Wenqi and Duan, Haoran and Wei, Bo and Ranjan, Rajiv},
  journal={arXiv preprint arXiv:2405.10674},
  year={2024}
}

Topics of this repo cover:
Text-to-Seq-Image, Text-to-Video

Table of Contents

Text-to-Seq-Image

  • LivePhoto: Real Image Animation with Text-guided Motion Control
    Team: HKU, Alibaba Group, Ant Group.
    Xi Chen, Zhiheng Liu, Mengting Chen, et al., Hengshuang Zhao
    arXiv, 2023.12 [Paper], [PDF], [Code], [Demo (Video)], [Home Page]
  • Scalable Diffusion Models with Transformers [Sequential Images]
    Team: UC Berkeley, NYU.
    William Peebles, Saining Xie
    ICCV'23(Oral), arXiv, 2022.12 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]

Text-to-Video

  • Video generation models as world simulators
    Team: Sora, OpenAI.
    Tim Brooks, Bill Peebles, Connor Holmes, et al., Aditya Ramesh
    online page, 2024.02 [Paper], [Home Page]
  • ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation
    Team: University of Waterloo.
    Weiming Ren, Harry Yang, Ge Zhang, et al., Wenhu Chen
    arXiv, 2024.02 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
  • World Model on Million-Length Video And Language With RingAttention [Long Video]
    Team: UC Berkeley.
    Hao Liu, Wilson Yan, Matei Zaharia, Pieter Abbeel
    arXiv, 2024.02 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
  • 360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model
    Team: Peking University.
    Qian Wang, Weiqi Li, Chong Mou, et al., Jian Zhang
    arXiv, 2024.01 [Paper], [PDF], [Code], [Home Page]
  • MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation
    Team: Bytedance Inc.
    Weimin Wang, Jiawei Liu, Zhijie Lin, et al., Jiashi Feng
    arXiv, 2024.01 [Paper], [PDF], [Home Page]
  • UniVG: Towards UNIfied-modal Video Generation
    Team: Baidu Inc.
    Ludan Ruan, Lei Tian, Chuanwei Huang, et al., Xinyan Xiao
    arXiv, 2024.01 [Paper], [PDF], [Home Page]
  • VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM
    Team: HiDream.ai Inc.
    Fuchen Long, Zhaofan Qiu, Ting Yao, Tao Mei
    arXiv, 2024.01 [Paper], [PDF], [Home Page]
  • VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
    Team: Tencent AI Lab.
    Haoxin Chen, Yong Zhang, Xiaodong Cun, et al., Ying Shan
    arXiv, 2024.01 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
  • Lumiere: A Space-Time Diffusion Model for Video Generation
    Team: Google Research, Weizmann Institute, Tel-Aviv University, Technion.
    Omer Bar-Tal, Hila Chefer, Omer Tov, et al., Inbar Mosseri
    arXiv, 2024.01 [Paper], [PDF], [Home Page]
  • DreamVideo: Composing Your Dream Videos with Customized Subject and Motion
    Team: Fudan University, Alibaba Group, HUST, Zhejiang University.
    Yujie Wei, Shiwei Zhang, Zhiwu Qing, et al., Hongming Shan
    arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page]
  • VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation
    Team: Peking University, Microsoft Research.
    Wenjing Wang, Huan Yang, Zixi Tuo, et al., Jiaying Liu
    arXiv, 2023.12 [Paper], [PDF]
  • TrailBlazer: Trajectory Control for Diffusion-Based Video Generation [Training-free]
    Team: Victoria University of Wellington, NVIDIA
    Wan-Duo Kurt Ma, J.P. Lewis, W. Bastiaan Kleijn
    arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page], [Demo(video)]
  • FreeInit: Bridging Initialization Gap in Video Diffusion Models [Training-free]
    Team: Nanyang Technological University
    Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, Ziwei Liu
    arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page], [Demo(live)], [Demo(video)]
  • MTVG: Multi-text Video Generation with Text-to-Video Models [Training-free]
    Team: Korea University, NVIDIA
    Gyeongrok Oh, Jaehwan Jeong, Sieun Kim, et al., Sangpil Kim
    arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page], [Demo(video)]
  • A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
    Team: HUST, Alibaba Group, Zhejiang University, Ant Group
    Xiang Wang, Shiwei Zhang, Hangjie Yuan, et al., Nong Sang
    arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page]
  • InstructVideo: Instructing Video Diffusion Models with Human Feedback
    Team: Zhejiang University, Alibaba Group, Tsinghua University
    Hangjie Yuan, Shiwei Zhang, Xiang Wang, et al., Dong Ni
    arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page]
  • VideoLCM: Video Latent Consistency Model
    Team: HUST, Alibaba Group, SJTU
    Xiang Wang, Shiwei Zhang, Han Zhang, et al., Nong Sang
    arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page]
  • Photorealistic Video Generation with Diffusion Models
    Team: Stanford University, Google.
    Agrim Gupta, Lijun Yu, Kihyuk Sohn, et al., José Lezama
    arXiv, 2023.12 [Paper], [PDF], [Home Page]
  • Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation
    Team: HUST, Alibaba Group, Fudan University.
    Zhiwu Qing, Shiwei Zhang, Jiayu Wang, et al., Nong Sang
    arXiv, 2023.12 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
  • GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation
    Team: HKU, Meta.
    Shoufa Chen, Mengmeng Xu, Jiawei Ren, et al., Juan-Manuel Perez-Rua
    arXiv, 2023.12 [Paper], [PDF], [Home Page]
  • StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter
    Team: Tsinghua University, Tencent AI Lab, CUHK.
    Gongye Liu, Menghan Xia, Yong Zhang, et al., Ying Shan
    arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page], [Demo(live)]
  • GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation [Multimodal]
    Team: Tencent.
    Zhanyu Wang, Longyue Wang, Zhen Zhao, et al., Zhaopeng Tu
    arXiv, 2023.11 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
  • F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis [Training-free]
    Team: University of Electronic Science and Technology of China.
    Sitong Su, Jianzhi Liu, Lianli Gao, Jingkuan Song
    arXiv, 2023.11 [Paper], [PDF]
  • AdaDiff: Adaptive Step Selection for Fast Diffusion [Training-free]
    Team: Fudan University.
    Hui Zhang, Zuxuan Wu, Zhen Xing, Jie Shao, Yu-Gang Jiang
    arXiv, 2023.11 [Paper], [PDF]
  • FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax [Training-free]
    Team: University of Technology Sydney.
    Yu Lu, Linchao Zhu, Hehe Fan, Yi Yang
    arXiv, 2023.11 [Paper], [PDF], [Code(coming)], [Home Page]
  • GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning [Training-free]
    Team: Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences.
    Jiaxi Lv, Yi Huang, Mingfu Yan, Jiancheng Huang, et al., Shifeng Chen
    arXiv, 2023.11 [Paper], [PDF], [Code(coming)], [Home Page]
  • MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation
    Team: University of Science and Technology of China, MSRA, Xi'an Jiaotong University.
    Yanhui Wang, Jianmin Bao, Wenming Weng, et al., Baining Guo
    arXiv, 2023.11 [Paper], [PDF], [Home Page], [Demo(video)]
  • FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation
    Team: Peking University, Huawei Noah's Ark Lab.
    Yuanxin Liu, Lei Li, Shuhuai Ren, et al., Lu Hou
    arXiv, 2023.11 [Paper], [PDF], [Code], [Dataset]
  • ART⋅V: Auto-Regressive Text-to-Video Generation with Diffusion Models
    Team: University of Science and Technology of China, Microsoft.
    Wenming Weng, Ruoyu Feng, Yanhui Wang, et al., Zhiwei Xiong
    arXiv, 2023.11 [Paper], [PDF], [Code(coming)], [Home Page], [Demo(video)]
  • Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
    Team: Stability AI.
    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, et al., Robin Rombach
    arXiv, 2023.11 [Paper], [PDF], [Code]
  • FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline
    Team: Sber AI.
    Vladimir Arkhipkin, Zein Shaheen, Viacheslav Vasilev, et al., Denis Dimitrov
    arXiv, 2023.11 [Paper], [PDF], [Code], [Home Page], [Demo(live)]
  • MoVideo: Motion-Aware Video Generation with Diffusion Models
    Team: ETH, Meta.
    Jingyun Liang, Yuchen Fan, Kai Zhang, et al., Rakesh Ranjan
    arXiv, 2023.11 [Paper], [PDF], [Home Page]
  • Optimal Noise pursuit for Augmenting Text-to-Video Generation
    Team: Zhejiang Lab.
    Shijie Ma, Huayi Xu, Mengjian Li, et al., Yaxiong Wang
    arXiv, 2023.11 [Paper], [PDF]
  • Make Pixels Dance: High-Dynamic Video Generation
    Team: ByteDance.
    Yan Zeng, Guoqiang Wei, Jiani Zheng, et al., Hang Li
    arXiv, 2023.11 [Paper], [PDF], [Home Page], [Demo(video)]
  • Learning Universal Policies via Text-Guided Video Generation
    Team: MIT, Google DeepMind, UC Berkeley.
    Yilun Du, Mengjiao Yang, Bo Dai, et al., Pieter Abbeel
    NeurIPS'23 (Spotlight), arXiv, 2023.11 [Paper], [PDF], [Code], [Home Page]
  • Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning
    Team: Meta.
    Rohit Girdhar, Mannat Singh, Andrew Brown, et al., Ishan Misra
    arXiv, 2023.11 [Paper], [PDF], [Home Page], [Demo(live)]
  • FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling [Training-free]
    Team: Nanyang Technological University.
    Haonan Qiu, Menghan Xia, Yong Zhang, et al., Ziwei Liu
    ICLR'24, arXiv, 2023.10 [Paper], [PDF], [Code], [Home Page]
  • ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation [Training-free]
    Team: Shanghai Artificial Intelligence Laboratory.
    Bo Peng, Xinyuan Chen, Yaohui Wang, Chaochao Lu, Yu Qiao
    arXiv, 2023.10 [Paper], [PDF], [Code], [Home Page]
  • VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
    Team: Tencent AI Lab.
    Haoxin Chen, Menghan Xia, Yingqing He, et al., Ying Shan
    arXiv, 2023.10 [Paper], [PDF], [Code], [Home Page]
  • SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction
    Team: Shanghai Artificial Intelligence Laboratory.
    Xinyuan Chen, Yaohui Wang, Lingjun Zhang, et al., Ziwei Liu
    arXiv, 2023.10 [Paper], [PDF], [Code], [Home Page]
  • DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
    Team: The Chinese University of Hong Kong.
    Jinbo Xing, Menghan Xia, Yong Zhang, et al., Ying Shan
    arXiv, 2023.10 [Paper], [PDF], [Code], [Pretrained Model], [Home Page], [Demo(live)], [Demo(video)]
  • LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation
    Team: Nankai University, MEGVII Technology.
    Ruiqi Wu, Liangyu Chen, Tong Yang, et al., Xiangyu Zhang
    arXiv, 2023.10 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
  • LLM-grounded Video Diffusion Models [Training-free]
    Team: UC Berkeley.
    Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, Boyi Li
    arXiv, 2023.09 [Paper], [PDF], [Code(coming)], [Home Page]
  • VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning
    Team: UNC Chapel Hill.
    Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal
    arXiv, 2023.09 [Paper], [PDF], [Code]
  • VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation
    Team: Baidu Inc.
    Xin Li, Wenqing Chu, Ye Wu, et al., Jingdong Wang
    arXiv, 2023.09 [Paper], [PDF], [Home Page]
  • LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models
    Team: Shanghai Artificial Intelligence Laboratory.
    Yaohui Wang, Xinyuan Chen, Xin Ma, et al., Ziwei Liu
    arXiv, 2023.09 [Paper], [PDF], [Code], [Home Page]
  • Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation
    Team: Huawei.
    Jiaxi Gu, Shicong Wang, Haoyu Zhao, et al., Hang Xu
    arXiv, 2023.09 [Paper], [PDF], [Code], [Home Page]
  • Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator [Training-free]
    Team: School of Information Science and Technology, ShanghaiTech University.
    Hanzhuo Huang, Yufan Feng, Cheng Shi, et al., Sibei Yang
    NeurIPS'23, arXiv, 2023.09 [Paper], [PDF], [Home Page]
  • Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
    Team: Show Lab, National University of Singapore.
    David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, et al., Mike Zheng Shou
    arXiv, 2023.09 [Paper], [PDF], [Home Page], [Code], [Pretrained Model]
  • GLOBER: Coherent Non-autoregressive Video Generation via GLOBal Guided Video DecodER
    Team: Institute of Automation, Chinese Academy of Sciences (CASIA).
    Mingzhen Sun, Weining Wang, Zihan Qin, et al., Jing Liu
    NeurIPS'23, arXiv, 2023.09 [Paper], [PDF], [Code], [Home Page], [Demo(video)]
  • DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis [Training-free]
    Team: East China Normal University.
    Zhongjie Duan, Lizhou You, Chengyu Wang, et al., Jun Huang
    arXiv, 2023.08 [Paper], [PDF], [Home Page]
  • SimDA: Simple Diffusion Adapter for Efficient Video Generation
    Team: Fudan University, Microsoft.
    Zhen Xing, Qi Dai, Han Hu, Zuxuan Wu, Yu-Gang Jiang
    arXiv, 2023.08 [Paper], [PDF], [Code(coming)], [Home Page]
  • Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models
    Team: National University of Singapore.
    Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Tat-Seng Chua
    arXiv, 2023.08 [Paper], [PDF], [Code]
  • ModelScope Text-to-Video Technical Report
    Team: Alibaba Group.
    Jiuniu Wang, Hangjie Yuan, Dayou Chen, et al., Shiwei Zhang
    arXiv, 2023.08 [Paper], [PDF], [Code], [Home Page], [Demo(live)]
  • Dual-Stream Diffusion Net for Text-to-Video Generation
    Team: Nanjing University of Science and Technology.
    Binhui Liu, Xin Liu, Anbo Dai, et al., Jian Yang
    arXiv, 2023.08 [Paper], [PDF]
  • AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
    Team: The Chinese University of Hong Kong.
    Yuwei Guo, Ceyuan Yang, Anyi Rao, et al., Bo Dai
    ICLR'24 (spotlight), arXiv, 2023.07 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
  • Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation
    Team: HKUST.
    Yingqing He, Menghan Xia, Haoxin Chen, et al., Qifeng Chen
    arXiv, 2023.07 [Paper], [PDF], [Code], [Home Page], [Demo(video)]
  • Probabilistic Adaptation of Text-to-Video Models
    Team: Google, UC Berkeley.
    Mengjiao Yang, Yilun Du, Bo Dai, et al., Pieter Abbeel
    arXiv, 2023.06 [Paper], [PDF], [Home Page]
  • ED-T2V: An Efficient Training Framework for Diffusion-based Text-to-Video Generation
    Team: School of Artificial Intelligence, University of Chinese Academy of Sciences.
    Jiawei Liu, Weining Wang, Wei Liu, Qian He, Jing Liu
    IJCNN'23, 2023.06 [Paper], [PDF]
  • Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance
    Team: CUHK.
    Jinbo Xing, Menghan Xia, Yuxin Liu, et al., Tien-Tsin Wong
    arXiv, 2023.06 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
  • VideoComposer: Compositional Video Synthesis with Motion Controllability
    Team: Alibaba Group.
    Xiang Wang, Hangjie Yuan, Shiwei Zhang, et al., Jingren Zhou
    NeurIPS'23, arXiv, 2023.06 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
  • VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation
    Team: University of Chinese Academy of Sciences (UCAS), Alibaba Group.
    Zhengxiong Luo, Dayou Chen, Yingya Zhang, et al., Tieniu Tan
    CVPR'23, arXiv, 2023.06 [Paper], [PDF]
  • DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation [Training-free]
    Team: Korea University.
    Susung Hong, Junyoung Seo, Heeseong Shin, Sunghwan Hong, Seungryong Kim
    arXiv, 2023.05 [Paper], [PDF]
  • Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models
    Team: Carnegie Mellon University.
    Rohan Dhesikan, Vignesh Rajmohan
    arXiv, 2023.05 [Paper], [PDF], [Code(coming)]
  • Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models
    Team: University of Maryland.
    Songwei Ge, Seungjun Nah, Guilin Liu, et al., Yogesh Balaji
    ICCV'23, arXiv, 2023.05 [Paper], [PDF], [Home Page]
  • Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity
    Team: NUS, CUHK.
    Zijiao Chen, Jiaxin Qing, Juan Helen Zhou
    NeurIPS'23, arXiv, 2023.05 [Paper], [PDF], [Code], [Home Page]
  • VideoPoet: A Large Language Model for Zero-Shot Video Generation
    Team: Google Research
    Dan Kondratyuk, Lijun Yu, Xiuye Gu, et al., Lu Jiang
    arXiv, 2023.05 [Paper], [PDF], [Home Page], [Blog]
  • VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning
    Team: Tsinghua University, Beijing Film Academy
    Hong Chen, Xin Wang, Guanning Zeng, et al., Wenwu Zhu
    arXiv, 2023.05 [Paper], [PDF], [Code], [Home Page]
  • Text2Performer: Text-Driven Human Video Generation
    Team: Nanyang Technological University
    Yuming Jiang, Shuai Yang, Tong Liang Koh, et al., Ziwei Liu
    arXiv, 2023.04 [Paper], [PDF], [Code], [Home Page], [Demo(video)]
  • Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation
    Team: University of Rochester, Meta.
    Jie An, Songyang Zhang, Harry Yang, et al., Xi Yin
    arXiv, 2023.04 [Paper], [PDF], [Home Page]
  • Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos
    Team: Tsinghua University, HKUST.
    Yue Ma, Yingqing He, Xiaodong Cun, et al., Qifeng Chen
    AAAI'24, arXiv, 2023.04 [Paper], [PDF], [Home Page], [Code]
  • Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
    Team: NVIDIA.
    Andreas Blattmann, Robin Rombach, Huan Ling, et al., Karsten Kreis
    CVPR'23, arXiv, 2023.04 [Paper], [PDF], [Home Page]
  • NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation
    Team: University of Science and Technology of China, Microsoft.
    Shengming Yin, Chenfei Wu, Huan Yang, et al., Nan Duan
    arXiv, 2023.03 [Paper], [PDF], [Home Page]
  • Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
    Team: Picsart AI Research (PAIR).
    Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, et al., Humphrey Shi
    arXiv, 2023.03 [Paper], [PDF], [Code], [Home Page], [Demo(live)], [Demo(video)]
  • Structure and Content-Guided Video Synthesis with Diffusion Models
    Team: Runway
    Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, Anastasis Germanidis
    ICCV'23, arXiv, 2023.02 [Paper], [PDF], [Home Page]
  • SceneScape: Text-Driven Consistent Scene Generation
    Team: Weizmann Institute of Science, NVIDIA Research
    Rafail Fridman, Amit Abecasis, Yoni Kasten, Tali Dekel
    NeurIPS'23, arXiv, 2023.02 [Paper], [PDF], [Code], [Home Page]
  • MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
    Team: Renmin University of China, Peking University, Microsoft Research
    Ludan Ruan, Yiyang Ma, Huan Yang, et al., Baining Guo
    CVPR'23, arXiv, 2022.12 [Paper], [PDF], [Code]
  • Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
    Team: Show Lab, National University of Singapore.
    Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, et al., Mike Zheng Shou
    ICCV'23, arXiv, 2022.12 [Paper], [PDF], [Code], [Pretrained Model]
  • MagicVideo: Efficient Video Generation With Latent Diffusion Models
    Team: ByteDance Inc.
    Daquan Zhou, Weimin Wang, Hanshu Yan, et al., Jiashi Feng
    arXiv, 2022.11 [Paper], [PDF], [Home Page]
  • Latent Video Diffusion Models for High-Fidelity Long Video Generation [Long Video]
    Team: HKUST, Tencent AI Lab.
    Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, Qifeng Chen
    arXiv, 2022.10 [Paper], [PDF], [Code], [Home Page]
  • Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation
    Team: UC Santa Barbara, Meta.
    Tsu-Jui Fu, Licheng Yu, Ning Zhang, et al., Sean Bell
    CVPR'23, arXiv, 2022.11 [Paper], [PDF]
  • Phenaki: Variable Length Video Generation From Open Domain Textual Description
    Team: Google.
    Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, et al., Dumitru Erhan
    ICLR'23, arXiv, 2022.10 [Paper], [PDF], [Home Page]
  • Imagen Video: High Definition Video Generation with Diffusion Models
    Team: Google.
    Jonathan Ho, William Chan, Chitwan Saharia, et al., Tim Salimans
    arXiv, 2022.10 [Paper], [PDF], [Home Page]
  • StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation [Story Visualization]
    Team: UNC Chapel Hill.
    Adyasha Maharana, Darryl Hannan, Mohit Bansal
    ECCV'22, arXiv, 2022.09 [Paper], [PDF], [Code], [Demo(live)]
  • Make-A-Video: Text-to-Video Generation without Text-Video Data
    Team: Meta AI.
    Uriel Singer, Adam Polyak, Thomas Hayes, et al., Yaniv Taigman
    ICLR'23, arXiv, 2022.09 [Paper], [PDF], [Code]
  • MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model
    Team: S-Lab, SenseTime.
    Mingyuan Zhang, Zhongang Cai, Liang Pan, et al., Ziwei Liu
    TPAMI'24, arXiv, 2022.08 [Paper], [PDF], [Code], [Home Page], [Demo]
  • Word-Level Fine-Grained Story Visualization [Story Visualization]
    Team: University of Oxford.
    Bowen Li, Thomas Lukasiewicz
    ECCV'22, arXiv, 2022.08 [Paper], [PDF], [Code], [Pretrained Model]
  • CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
    Team: Tsinghua University.
    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, Jie Tang
    ICLR'23, arXiv, 2022.05 [Paper], [PDF], [Code], [Home Page], [Demo(video)]
  • CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers
    Team: Tsinghua University.
    Ming Ding, Wendi Zheng, Wenyi Hong, Jie Tang
    NeurIPS'22, arXiv, 2022.04 [Paper], [PDF], [Code], [Home Page]
  • Long video generation with time-agnostic vqgan and time-sensitive transformer
    Team: Meta AI.
    Songwei Ge, Thomas Hayes, Harry Yang, et al., Devi Parikh
    ECCV'22, arXiv, 2022.04 [Paper], [PDF], [Home Page], [Code]
  • Video Diffusion Models [text-conditioned]
    Team: Google.
    Jonathan Ho, Tim Salimans, Alexey Gritsenko, et al., David J. Fleet
    arXiv, 2022.04 [Paper], [PDF], [Home Page]
  • NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis [Long Video]
    Team: Microsoft.
    Chenfei Wu, Jian Liang, Xiaowei Hu, et al., Nan Duan
    NeurIPS'22, arXiv, 2022.02 [Paper], [PDF], [Code], [Home Page]
  • NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion
    Team: Microsoft.
    Chenfei Wu, Jian Liang, Lei Ji, et al., Nan Duan
    ECCV'22, arXiv, 2021.11 [Paper], [PDF], [Code]
  • GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions
    Team: Microsoft, Duke University.
    Chenfei Wu, Lun Huang, Qianxi Zhang, et al., Nan Duan
    arXiv, 2021.04 [Paper], [PDF]
  • Cross-Modal Dual Learning for Sentence-to-Video Generation
    Team: Tsinghua University.
    Yue Liu, Xin Wang, Yitian Yuan, Wenwu Zhu
    ACM MM'19 [Paper], [PDF]
  • IRC-GAN: introspective recurrent convolutional GAN for text-to-video generation
    Team: Peking University.
    Kangle Deng, Tianyi Fei, Xin Huang, Yuxin Peng
    IJCAI'19 [Paper], [PDF]
  • Imagine this! scripts to compositions to videos
    Team: University of Illinois Urbana-Champaign, AI2, University of Washington.
    Tanmay Gupta, Dustin Schwenk, Ali Farhadi, et al., Aniruddha Kembhavi
    ECCV'18, arXiv, 2018.04 [Paper], [PDF]
  • To Create What You Tell: Generating Videos from Captions
    Team: USTC, Microsoft Research.
    Yingwei Pan, Zhaofan Qiu, Ting Yao, et al., Tao Mei
    ACM MM'17, arXiv, 2018.04 [Paper], [PDF]
  • Neural Discrete Representation Learning
    Team: DeepMind.
    Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu
    NeurIPS'17, arXiv, 2017.11 [Paper], [PDF]
  • Video Generation From Text
    Team: Duke University, NEC Labs America.
    Yitong Li, Martin Renqiang Min, Dinghan Shen, et al., Lawrence Carin
    AAAI'18, arXiv, 2017.10 [Paper], [PDF]
  • Attentive semantic video generation using captions
    Team: IIT Hyderabad.
    Tanya Marwah, Gaurav Mittal, Vineeth N. Balasubramanian
    ICCV'17, arXiv, 2017.08 [Paper], [PDF]
  • Sync-DRAW: Automatic Video Generation using Deep Recurrent Attentive Architectures [VAE]
    Team: IIT Hyderabad.
    Gaurav Mittal, Tanya Marwah, Vineeth N. Balasubramanian
    ACM MM'17, arXiv, 2016.11 [Paper], [PDF]

Datasets & Metrics

Datasets are grouped by the domain they were collected from: Face, Open, Movie, Action, Instruct, Cooking.
Metrics are grouped into image-level and video-level.

| Dataset | Domain | Annotated | #Clips | #Sent | Len_C(s) | Len_S | #Videos | Resolution | FPS | Dur(h) | Year | Source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CV-Text | Face | Generated | 70K | 1400K | - | 67.2 | - | 480P | - | - | 2023 | Online |
| MSR-VTT | Open | Manual | 10K | 200K | 15.0 | 9.3 | 7.2K | 240P | 30 | 40 | 2016 | YouTube |
| DideMo | Open | Manual | 27K | 41K | 6.9 | 8.0 | 10.5K | - | - | 87 | 2017 | Flickr |
| Y-T-180M | Open | ASR | 180M | - | - | - | 6M | - | - | - | 2021 | YouTube |
| WVid2M | Open | Alt-text | 2.5M | 2.5M | 18.0 | 12.0 | 2.5M | 360P | - | 13K | 2021 | Web |
| H-100M | Open | ASR | 103M | - | 13.4 | 32.5 | 3.3M | 720P | - | 371.5K | 2022 | YouTube |
| InternVid | Open | Generated | 234M | - | 11.7 | 17.6 | 7.1M | *720P | - | 760.3K | 2023 | YouTube |
| H-130M | Open | Generated | 130M | 130M | - | 10.0 | - | 720P | - | - | 2023 | YouTube |
| Y-mP | Open | Manual | 10M | 10M | 54.2 | - | - | - | - | 150K | 2023 | Youku |
| V-27M | Open | Generated | 27M | 135M | 12.5 | - | - | - | - | - | 2024 | YouTube |
| P-70M | Open | Generated | - | 70.8M | 8.5 | 13.2 | 70.8M | 720P | - | 166.8K | 2024 | YouTube |
| LSMDC | Movie | Manual | 118K | 118K | 4.8 | 7.0 | 200 | 1080P | - | 158 | 2017 | Movie |
| MAD | Movie | Manual | - | 384K | - | 12.7 | 650 | - | - | 1.2K | 2022 | Movie |
| UCF-101 | Action | Manual | 13K | - | 7.2 | - | - | 240P | 25 | 27 | 2012 | YouTube |
| ANet-200 | Action | Manual | 100K | - | - | 13.5 | 2K | *720P | 30 | 849 | 2015 | YouTube |
| Charades | Action | Manual | 10K | 16K | - | - | 10K | - | - | 82 | 2016 | Home |
| Kinetics | Action | Manual | 306K | - | 10.0 | - | 306K | - | - | - | 2017 | YouTube |
| ActNet | Action | Manual | 100K | 100K | 36.0 | 13.5 | 20K | - | - | 849 | 2017 | YouTube |
| C-Ego | Action | Manual | - | - | - | - | 8K | 240P | - | 69 | 2018 | Home |
| SS-V2 | Action | Manual | - | - | - | - | 220.1K | - | 12 | - | 2018 | Daily |
| How2 | Instruct | Manual | 80K | 80K | 90.0 | 20.0 | 13.1K | - | - | 2000 | 2018 | YouTube |
| HT100M | Instruct | ASR | 136M | 136M | 3.6 | 4.0 | 1.2M | 240P | - | 134.5K | 2019 | YouTube |
| YCook2 | Cooking | Manual | 14K | 14K | 19.6 | 8.8 | 2K | - | - | 176 | 2018 | YouTube |
| E-Kit | Cooking | Manual | 40K | 40K | - | - | 432 | *1080P | 60 | 55 | 2018 | Home |
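
As a rough aid for reading the table, the sketch below checks how three of its columns relate. It assumes (this is our reading, not stated in the list) that Len_C(s) is the mean clip length in seconds and Dur(h) is the total duration in hours, so that #Clips × Len_C / 3600 should land near Dur(h). The function names are illustrative only; the numbers are copied from five rows of the table.

```python
# Rows copied from the table: (name, #Clips, Len_C in seconds, Dur in hours).
# Assumption: Len_C(s) is the mean clip length and Dur(h) the total duration.
ROWS = [
    ("MSR-VTT", 10_000, 15.0, 40),
    ("WVid2M", 2_500_000, 18.0, 13_000),
    ("LSMDC", 118_000, 4.8, 158),
    ("UCF-101", 13_000, 7.2, 27),
    ("HT100M", 136_000_000, 3.6, 134_500),
]


def estimated_hours(num_clips: int, mean_clip_seconds: float) -> float:
    """Total duration in hours implied by clip count and mean clip length."""
    return num_clips * mean_clip_seconds / 3600.0


def relative_gap(estimate: float, reported: float) -> float:
    """Absolute difference between estimate and reported value, as a fraction."""
    return abs(estimate - reported) / reported


for name, clips, len_c, dur_h in ROWS:
    est = estimated_hours(clips, len_c)
    gap = relative_gap(est, dur_h)
    print(f"{name:8s} estimated {est:10.1f} h, reported {dur_h:10.1f} h, gap {gap:.1%}")
```

For these five rows the estimate stays within about 5% of the reported duration. Other rows may not follow the same relation, since the columns are rounded and may come from different sources.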
  • (VidProM) VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models Dataset (adding)
    Team: ReLER Lab.
    Wenhao Wang, Yi Yang
    arXiv, 2024.03 [Paper], [PDF], [Code], [Hugging Face]

  • (ECTV) EvalCrafter: Benchmarking and Evaluating Large Video Generation Models Dataset (adding)
    Team: Tencent AI Lab, CUHK.
    Yaofang Liu, Xiaodong Cun, Xuebo Liu, et al., Ying Shan
    CVPR'24, arXiv, 2023.10 [Paper], [PDF], [Code], [Dataset], [Home Page]

  • (CV-Text) Celebv-text: A large-scale facial text-video dataset Dataset (Domain:Face)
    Team: University of Sydney, SenseTime Research.
    Jianhui Yu, Hao Zhu, Liming Jiang, et al., Wayne Wu
    CVPR'23, arXiv, 2023.03 [Paper], [PDF], [Code], [Demo], [Home Page]

  • (MSR-VTT) Msr-vtt: A large video description dataset for bridging video and language Dataset (Domain:Open)
    Team: Microsoft Research.
    Jun Xu, Tao Mei, Ting Yao, Yong Rui
    CVPR'16 [Paper], [PDF]

  • (DideMo) Localizing moments in video with natural language Dataset (Domain:Open)
    Team: UC Berkeley, Adobe
    Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, et al., Bryan Russell
    ICCV'17, arXiv, 2017.08 [Paper], [PDF]

  • (YT-Tem-180M) Merlot: Multimodal neural script knowledge models Dataset (Domain:Open)
    Team: University of Washington
    Rowan Zellers, Ximing Lu, Jack Hessel, et al., Yejin Choi
    NeurIPS'21, arXiv, 2021.06 [Paper], [PDF], [Code], [Home Page]

  • (WebVid2M) Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval Dataset (Domain:Open)
    Team: University of Oxford, CNRS.
    Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman
    ICCV'21, arXiv, 2021.04 [Paper], [PDF], [Dataset], [Code], [Demo], [Home Page]

  • (HD-VILA-100M) Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions Dataset (Domain:Open)
    Team: Microsoft Research Asia.
    Hongwei Xue, Tiankai Hang, Yanhong Zeng, et al., Baining Guo
    CVPR'22, arXiv, 2021.11 [Paper], [PDF], [Code]

  • (InternVid) Internvid: A large-scale video-text dataset for multimodal understanding and generation Dataset (Domain:Open)
    Team: Shanghai AI Laboratory.
    Yi Wang, Yinan He, Yizhuo Li, et al., Yu Qiao
    arXiv, 2023.07 [Paper], [PDF], [Code]

  • (HD-VG-130M) VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation Dataset (Domain:Open)
    Team: Peking University, Microsoft Research.
    Wenjing Wang, Huan Yang, Zixi Tuo, et al., Jiaying Liu
    arXiv, 2023.05 [Paper], [PDF]

  • (Youku-mPLUG) Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks Dataset (Domain:Open)
    Team: DAMO Academy, Alibaba Group.
    Haiyang Xu, Qinghao Ye, Xuan Wu, et al., Fei Huang
    arXiv, 2023.06 [Paper], [PDF]

  • (VAST-27M) Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset Dataset (Domain:Open)
    Team: UCAS, CAS
    Sihan Chen, Handong Li, Qunbo Wang, et al., Jing Liu
    NeurIPS'23, arXiv, 2023.05 [Paper], [PDF]

  • (Panda-70M) Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers Dataset (Domain:Open)
    Team: Snap Inc., University of California, University of Trento.
    Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Sergey Tulyakov
    arXiv, 2024.02 [Paper], [PDF], [Code], [Home Page]

  • (LSMDC) Movie description Dataset (Domain:Movie)
    Team: Max Planck Institute for Informatics.
    Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, et al., Bernt Schiele
    IJCV'17, arXiv, 2016.05 [Paper], [PDF], [Home Page]

  • (MAD) Mad: A scalable dataset for language grounding in videos from movie audio descriptions Dataset (Domain:Movie)
    Team: KAUST, Adobe Research.
    Mattia Soldan, Alejandro Pardo, Juan León Alcázar, et al., Bernard Ghanem
    CVPR'22, arXiv, 2021.12 [Paper], [PDF], [Code]

  • (UCF-101) UCF101: A dataset of 101 human actions classes from videos in the wild Dataset (Domain:Action)
    Team: University of Central Florida.
    Khurram Soomro, Amir Roshan Zamir, Mubarak Shah
    arXiv, 2012.12 [Paper], [PDF], [Data]

  • (ActNet-200) Activitynet: A large-scale video benchmark for human activity understanding Dataset (Domain:Action)
    Team: Universidad del Norte, KAUST
    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, Juan Carlos Niebles
    CVPR'15 [Paper], [PDF], [Home Page]

  • (Charades) Hollywood in homes: Crowdsourcing data collection for activity understanding Dataset (Domain:Action)
    Team: Carnegie Mellon University
    Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, et al., Abhinav Gupta
    ECCV'16, arXiv, 2016.04 [Paper], [PDF], [Home Page]

  • (Kinetics) The kinetics human action video dataset Dataset (Domain:Action)
    Team: Google
    Will Kay, Joao Carreira, Karen Simonyan, et al., Andrew Zisserman
    arXiv, 2017.05 [Paper], [PDF], [Home Page]

  • (ActivityNet) Dense-captioning events in videos Dataset (Domain:Action)
    Team: Stanford University
    Ranjay Krishna, Kenji Hata, Frederic Ren, et al., Juan Carlos Niebles
    ICCV'17, arXiv, 2017.05 [Paper], [PDF], [Home Page]

  • (Charades-Ego) Charades-ego: A large-scale dataset of paired third and first person videos Dataset (Domain:Action)
    Team: Carnegie Mellon University
    Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, et al., Karteek Alahari
    arXiv, 2018.04 [Paper], [PDF], [Home Page]

  • (SS-V2) The "something something" video database for learning and evaluating visual common sense Dataset (Domain:Action)
    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, et al., Roland Memisevic
    ICCV'17, arXiv, 2017.06 [Paper], [PDF], [Home Page]

  • (How2) How2: a large-scale dataset for multimodal language understanding Dataset (Domain:Instruct)
    Team: Carnegie Mellon University.
    Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, et al., Florian Metze
    arXiv, 2018.11 [Paper], [PDF]

  • (HowTo100M) HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips Dataset (Domain:Instruct)
    Team: Ecole Normale Superieure, Inria, CIIRC.
    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, et al., Josef Sivic
    arXiv, 2019.06 [Paper], [PDF], [Home Page]

  • (YouCook2) Towards automatic learning of procedures from web instructional video Dataset (Domain:Cooking)
    Team: University of Michigan, University of Rochester
    Luowei Zhou, Chenliang Xu, Jason J. Corso
    AAAI'18, arXiv, 2017.03 [Paper], [PDF], [Home Page]

  • (Epic-Kitchens) Scaling egocentric vision: The epic-kitchens dataset Dataset (Domain:Cooking)
    Team: Uni. of Bristol, Uni. of Catania, Uni. of Toronto.
    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, et al., Michael Wray
    ECCV'18, arXiv, 2018.04 [Paper], [PDF], [Home Page]

  • (PSNR/SSIM) Image quality assessment: from error visibility to structural similarity Metric (image-level)
    Team: New York University.
    Zhou Wang, Alan Conrad Bovik, Hamid Rahim Sheikh, Eero P. Simoncelli
    IEEE TIP, 2004.04 [Paper], [PDF]

  • (IS) Improved techniques for training gans Metric (image-level)
    Team: OpenAI
    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, et al., Xi Chen
    NeurIPS'16, arXiv, 2016.06 [Paper], [PDF], [Code]

  • (FID) Gans trained by a two time-scale update rule converge to a local nash equilibrium Metric (image-level)
    Team: Johannes Kepler University Linz
    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, et al., Sepp Hochreiter
    NeurIPS'17, arXiv, 2017.06 [Paper], [PDF]

  • (CLIP Score) Learning transferable visual models from natural language supervision Metric (image-level)
    Team: OpenAI.
    Alec Radford, Jong Wook Kim, Chris Hallacy, et al., Ilya Sutskever
    ICML'21, arXiv, 2021.02 [Paper], [PDF], [Code]

  • (Video IS) Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal gan Metric (video-level)
    Masaki Saito, Shunta Saito, Masanori Koyama, Sosuke Kobayashi
    IJCV'20, arXiv, 2018.11 [Paper], [PDF], [Code]

  • (FVD/KVD) FVD: A new metric for video generation Metric (video-level)
    Team: Johannes Kepler University, Google
    Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, et al., Sylvain Gelly
    ICLR'19, arXiv, 2018.12 [Paper], [PDF], [Code]

  • (FCS) Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation Metric (video-level)
    Team: Show Lab, National University of Singapore.
    Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, et al., Mike Zheng Shou
    ICCV'23, arXiv, 2022.12 [Paper], [PDF], [Code], [Pretrained Model]


Acknowledgement and References

Citation

If you find this repository useful, please consider citing our paper and this list:

@article{sun2024sora,
  title={From Sora What We Can See: A Survey of Text-to-Video Generation},
  author={Sun, Rui and Zhang, Yumin and Shah, Tejal and Sun, Jiahao and Zhang, Shuoying and Li, Wenqi and Duan, Haoran and Wei, Bo and Ranjan, Rajiv},
  journal={arXiv preprint arXiv:2405.10674},
  year={2024}
}

@misc{sun2024t2vgenerationlist,
  title={Awesome-Text-to-Video-Generation},
  author={Sun, Rui and Zhang, Yumin},
  year={2024},
  publisher={GitHub},
  howpublished={\url{https://github.com/soraw-ai/Awesome-Text-to-Video-Generation}},
}
