Awesome-Text-to-Video-Generation

A curated, continually updated list of Text-to-Video studies, based on our survey paper: From Sora What We Can See: A Survey of Text-to-Video Generation. The survey uses OpenAI’s Sora as a starting point for a comprehensive review of existing work in the Text-to-Video field, and it summarizes 24 datasets and 9 evaluation metrics. It also discusses open problems in this research area and in Sora itself, and draws on Sora's strengths and the characteristics of related fields to suggest future research directions. If our work inspires you, feel free to cite our paper and star our repo.

This project is curated and maintained by Rui Sun and Yumin Zhang.

@article{sun2024sora,
  title={From Sora What We Can See: A Survey of Text-to-Video Generation},
  author={Sun, Rui and Zhang, Yumin and Shah, Tejal and Sun, Jiahao and Zhang, Shuoying and Li, Wenqi and Duan, Haoran and Wei, Bo and Ranjan, Rajiv},
  journal={arXiv preprint arXiv:2405.10674},
  year={2024}
}

Topics of this repo cover:
Text-to-Seq-Image, Text-to-Video

Text-to-Seq-Image

LivePhoto: Real Image Animation with Text-guided Motion Control
Team: HKU, Alibaba Group, Ant Group.
Xi Chen, Zhiheng Liu, Mengting Chen, et al., Hengshuang Zhao
arXiv, 2023.12 [Paper], [PDF], [Code], [Demo(video)], [Home Page]
Scalable Diffusion Models with Transformers (Sequential Images)
Team: UC Berkeley, NYU.
William Peebles, Saining Xie
ICCV'23(Oral), arXiv, 2022.12 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]

Text-to-Video

Video generation models as world simulators
Team: Sora, OpenAI.
Tim Brooks, Bill Peebles, Connor Holmes, et al., Aditya Ramesh
online page, 2024.02 [Paper], [Home Page]
ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation
Team: University of Waterloo.
Weiming Ren, Harry Yang, Ge Zhang, et al., Wenhu Chen
arXiv, 2024.02 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
World Model on Million-Length Video And Language With RingAttention (Long Video)
Team: UC Berkeley.
Hao Liu, Wilson Yan, Matei Zaharia, Pieter Abbeel
arXiv, 2024.02 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model
Team: Peking University.
Qian Wang, Weiqi Li, Chong Mou, et al., Jian Zhang
arXiv, 2024.01 [Paper], [PDF], [Code], [Home Page]
MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation
Team: ByteDance Inc.
Weimin Wang, Jiawei Liu, Zhijie Lin, et al., Jiashi Feng
arXiv, 2024.01 [Paper], [PDF], [Home Page]
UniVG: Towards UNIfied-modal Video Generation
Team: Baidu Inc.
Ludan Ruan, Lei Tian, Chuanwei Huang, et al., Xinyan Xiao
arXiv, 2024.01 [Paper], [PDF], [Home Page]
VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM
Team: HiDream.ai Inc.
Fuchen Long, Zhaofan Qiu, Ting Yao, Tao Mei
arXiv, 2024.01 [Paper], [PDF], [Home Page]
VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
Team: Tencent AI Lab.
Haoxin Chen, Yong Zhang, Xiaodong Cun, et al., Ying Shan
arXiv, 2024.01 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
Lumiere: A Space-Time Diffusion Model for Video Generation
Team: Google Research, Weizmann Institute, Tel-Aviv University, Technion.
Omer Bar-Tal, Hila Chefer, Omer Tov, et al., Inbar Mosseri
arXiv, 2024.01 [Paper], [PDF], [Home Page]
DreamVideo: Composing Your Dream Videos with Customized Subject and Motion
Team: Fudan University, Alibaba Group, HUST, Zhejiang University.
Yujie Wei, Shiwei Zhang, Zhiwu Qing, et al., Hongming Shan
arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page]
VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation
Team: Peking University, Microsoft Research.
Wenjing Wang, Huan Yang, Zixi Tuo, et al., Jiaying Liu
arXiv, 2023.12 [Paper], [PDF]
TrailBlazer: Trajectory Control for Diffusion-Based Video Generation (Training-free)
Team: Victoria University of Wellington, NVIDIA
Wan-Duo Kurt Ma, J.P. Lewis, W. Bastiaan Kleijn
arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page], [Demo(video)]
FreeInit: Bridging Initialization Gap in Video Diffusion Models (Training-free)
Team: Nanyang Technological University
Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, Ziwei Liu
arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page], [Demo(live)], [Demo(video)]
MTVG: Multi-text Video Generation with Text-to-Video Models (Training-free)
Team: Korea University, NVIDIA
Gyeongrok Oh, Jaehwan Jeong, Sieun Kim, et al., Sangpil Kim
arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page], [Demo(video)]
A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
Team: HUST, Alibaba Group, Zhejiang University, Ant Group
Xiang Wang, Shiwei Zhang, Hangjie Yuan, et al., Nong Sang
arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page]
InstructVideo: Instructing Video Diffusion Models with Human Feedback
Team: Zhejiang University, Alibaba Group, Tsinghua University
Hangjie Yuan, Shiwei Zhang, Xiang Wang, et al., Dong Ni
arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page]
VideoLCM: Video Latent Consistency Model
Team: HUST, Alibaba Group, SJTU
Xiang Wang, Shiwei Zhang, Han Zhang, et al., Nong Sang
arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page]
Photorealistic Video Generation with Diffusion Models
Team: Stanford University Fei-Fei Li, Google.
Agrim Gupta, Lijun Yu, Kihyuk Sohn, et al., José Lezama
arXiv, 2023.12 [Paper], [PDF], [Home Page]
Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation
Team: HUST, Alibaba Group, Fudan University.
Zhiwu Qing, Shiwei Zhang, Jiayu Wang, et al., Nong Sang
arXiv, 2023.12 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation
Team: HKU, Meta.
Shoufa Chen, Mengmeng Xu, Jiawei Ren, et al., Juan-Manuel Perez-Rua
arXiv, 2023.12 [Paper], [PDF], [Home Page]
StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter
Team: Tsinghua University, Tencent AI Lab, CUHK.
Gongye Liu, Menghan Xia, Yong Zhang, et al., Ying Shan
arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page], [Demo(live)]
GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation (Multimodal)
Team: Tencent.
Zhanyu Wang, Longyue Wang, Zhen Zhao, et al., Zhaopeng Tu
arXiv, 2023.11 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis (Training-free)
Team: University of Electronic Science and Technology of China.
Sitong Su, Jianzhi Liu, Lianli Gao, Jingkuan Song
arXiv, 2023.11 [Paper], [PDF]
AdaDiff: Adaptive Step Selection for Fast Diffusion (Training-free)
Team: Fudan University.
Hui Zhang, Zuxuan Wu, Zhen Xing, Jie Shao, Yu-Gang Jiang
arXiv, 2023.11 [Paper], [PDF]
FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax (Training-free)
Team: University of Technology Sydney.
Yu Lu, Linchao Zhu, Hehe Fan, Yi Yang
arXiv, 2023.11 [Paper], [PDF], [Code(coming)], [Home Page]
GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning (Training-free)
Team: Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences.
Jiaxi Lv, Yi Huang, Mingfu Yan, Jiancheng Huang, et al., Shifeng Chen
arXiv, 2023.11 [Paper], [PDF], [Code(coming)], [Home Page]
MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation
Team: University of Science and Technology of China, MSRA, Xi'an Jiaotong University.
Yanhui Wang, Jianmin Bao, Wenming Weng, et al., Baining Guo
arXiv, 2023.11 [Paper], [PDF], [Home Page], [Demo(video)]
FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation
Team: Peking University, Huawei Noah's Ark Lab.
Yuanxin Liu, Lei Li, Shuhuai Ren, et al., Lu Hou
arXiv, 2023.11 [Paper], [PDF], [Code], [Dataset]
ART⋅V: Auto-Regressive Text-to-Video Generation with Diffusion Models
Team: University of Science and Technology of China, Microsoft.
Wenming Weng, Ruoyu Feng, Yanhui Wang, et al., Zhiwei Xiong
arXiv, 2023.11 [Paper], [PDF], [Code(coming)], [Home Page], [Demo(video)]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Team: Stability AI.
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, et al., Robin Rombach
arXiv, 2023.11 [Paper], [PDF], [Code]
FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline
Team: Sber AI.
Vladimir Arkhipkin, Zein Shaheen, Viacheslav Vasilev, et al., Denis Dimitrov
arXiv, 2023.11 [Paper], [PDF], [Code], [Home Page], [Demo(live)]
MoVideo: Motion-Aware Video Generation with Diffusion Models
Team: ETH, Meta.
Jingyun Liang, Yuchen Fan, Kai Zhang, et al., Rakesh Ranjan
arXiv, 2023.11 [Paper], [PDF], [Home Page]
Optimal Noise pursuit for Augmenting Text-to-Video Generation
Team: Zhejiang Lab.
Shijie Ma, Huayi Xu, Mengjian Li, et al., Yaxiong Wang
arXiv, 2023.11 [Paper], [PDF]
Make Pixels Dance: High-Dynamic Video Generation
Team: ByteDance.
Yan Zeng, Guoqiang Wei, Jiani Zheng, et al., Hang Li
arXiv, 2023.11 [Paper], [PDF], [Home Page], [Demo(video)]
Learning Universal Policies via Text-Guided Video Generation
Team: MIT, Google DeepMind, UC Berkeley.
Yilun Du, Mengjiao Yang, Bo Dai, et al., Pieter Abbeel
NeurIPS'23 (Spotlight), arXiv, 2023.11 [Paper], [PDF], [Code], [Home Page]
Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning
Team: Meta.
Rohit Girdhar, Mannat Singh, Andrew Brown, et al., Ishan Misra
arXiv, 2023.11 [Paper], [PDF], [Home Page], [Demo(live)]
FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling (Training-free)
Team: Nanyang Technological University.
Haonan Qiu, Menghan Xia, Yong Zhang, et al., Ziwei Liu
ICLR'24, arXiv, 2023.10 [Paper], [PDF], [Code], [Home Page]
ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation (Training-free)
Team: Shanghai Artificial Intelligence Laboratory.
Bo Peng, Xinyuan Chen, Yaohui Wang, Chaochao Lu, Yu Qiao
arXiv, 2023.10 [Paper], [PDF], [Code], [Home Page]
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
Team: Tencent AI Lab.
Haoxin Chen, Menghan Xia, Yingqing He, et al., Ying Shan
arXiv, 2023.10 [Paper], [PDF], [Code], [Home Page]
SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction
Team: Shanghai Artificial Intelligence Laboratory.
Xinyuan Chen, Yaohui Wang, Lingjun Zhang, et al., Ziwei Liu
arXiv, 2023.10 [Paper], [PDF], [Code], [Home Page]
DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
Team: The Chinese University of Hong Kong.
Jinbo Xing, Menghan Xia, Yong Zhang, et al., Ying Shan
arXiv, 2023.10 [Paper], [PDF], [Code], [Pretrained Model], [Home Page], [Demo(live)], [Demo(video)]
LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation
Team: Nankai University, MEGVII Technology.
Ruiqi Wu, Liangyu Chen, Tong Yang, et al., Xiangyu Zhang
arXiv, 2023.10 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
LLM-grounded Video Diffusion Models (Training-free)
Team: UC Berkeley.
Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, Boyi Li
arXiv, 2023.09 [Paper], [PDF], [Code(coming)], [Home Page]
VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning
Team: UNC Chapel Hill.
Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal
arXiv, 2023.09 [Paper], [PDF], [Code]
VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation
Team: Baidu Inc.
Xin Li, Wenqing Chu, Ye Wu, et al., Jingdong Wang
arXiv, 2023.09 [Paper], [PDF], [Home Page]
LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models
Team: Shanghai Artificial Intelligence Laboratory.
Yaohui Wang, Xinyuan Chen, Xin Ma, et al., Ziwei Liu
arXiv, 2023.09 [Paper], [PDF], [Code], [Home Page]
Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation
Team: Huawei.
Jiaxi Gu, Shicong Wang, Haoyu Zhao, et al., Hang Xu
arXiv, 2023.09 [Paper], [PDF], [Code], [Home Page]
Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator (Training-free)
Team: School of Information Science and Technology, ShanghaiTech University.
Hanzhuo Huang, Yufan Feng, Cheng Shi, et al., Sibei Yang
NeurIPS'23, arXiv, 2023.09 [Paper], [PDF], [Home Page]
Show-1: Marrying pixel and latent diffusion models for text-to-video generation
Team: Show Lab, National University of Singapore.
David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, et al., Mike Zheng Shou
arXiv, 2023.09 [Paper], [PDF], [Home Page], [Code], [Pretrained Model]
GLOBER: Coherent Non-autoregressive Video Generation via GLOBal Guided Video DecodER
Team: Institute of Automation, Chinese Academy of Sciences (CASIA).
Mingzhen Sun, Weining Wang, Zihan Qin, et al., Jing Liu
NeurIPS'23, arXiv, 2023.09 [Paper], [PDF], [Code], [Home Page], [Demo(video)]
DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis (Training-free)
Team: East China Normal University.
Zhongjie Duan, Lizhou You, Chengyu Wang, et al., Jun Huang
arXiv, 2023.08 [Paper], [PDF], [Home Page]
SimDA: Simple Diffusion Adapter for Efficient Video Generation
Team: Fudan University, Microsoft.
Zhen Xing, Qi Dai, Han Hu, Zuxuan Wu, Yu-Gang Jiang
arXiv, 2023.08 [Paper], [PDF], [Code(coming)], [Home Page]
Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models
Team: National University of Singapore.
Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Tat-Seng Chua
arXiv, 2023.08 [Paper], [PDF], [Code]
ModelScope Text-to-Video Technical Report
Team: Alibaba Group.
Jiuniu Wang, Hangjie Yuan, Dayou Chen, et al., Shiwei Zhang
arXiv, 2023.08 [Paper], [PDF], [Code], [Home Page], [Demo(live)]
Dual-Stream Diffusion Net for Text-to-Video Generation
Team: Nanjing University of Science and Technology.
Binhui Liu, Xin Liu, Anbo Dai, et al., Jian Yang
arXiv, 2023.08 [Paper], [PDF]
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Team: The Chinese University of Hong Kong.
Yuwei Guo, Ceyuan Yang, Anyi Rao, et al., Bo Dai
ICLR'24 (spotlight), arXiv, 2023.07 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation
Team: HKUST.
Yingqing He, Menghan Xia, Haoxin Chen, et al., Qifeng Chen
arXiv, 2023.07 [Paper], [PDF], [Code], [Home Page], [Demo(video)]
Probabilistic Adaptation of Text-to-Video Models
Team: Google, UC Berkeley.
Mengjiao Yang, Yilun Du, Bo Dai, et al., Pieter Abbeel
arXiv, 2023.06 [Paper], [PDF], [Home Page]
ED-T2V: An Efficient Training Framework for Diffusion-based Text-to-Video Generation
Team: School of Artificial Intelligence, University of Chinese Academy of Sciences.
Jiawei Liu, Weining Wang, Wei Liu, Qian He, Jing Liu
IJCNN'23, 2023.06 [Paper], [PDF]
Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance
Team: CUHK.
Jinbo Xing, Menghan Xia, Yuxin Liu, et al., Tien-Tsin Wong
arXiv, 2023.06 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
VideoComposer: Compositional Video Synthesis with Motion Controllability
Team: Alibaba Group.
Xiang Wang, Hangjie Yuan, Shiwei Zhang, et al., Jingren Zhou
NeurIPS'23, arXiv, 2023.06 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation
Team: University of Chinese Academy of Sciences (UCAS), Alibaba Group.
Zhengxiong Luo, Dayou Chen, Yingya Zhang, et al., Tieniu Tan
CVPR'23, arXiv, 2023.06 [Paper], [PDF]
DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation (Training-free)
Team: Korea University.
Susung Hong, Junyoung Seo, Heeseong Shin, Sunghwan Hong, Seungryong Kim
arXiv, 2023.05 [Paper], [PDF]
Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models
Team: Carnegie Mellon University.
Rohan Dhesikan, Vignesh Rajmohan
arXiv, 2023.05 [Paper], [PDF], [Code(coming)]
Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models
Team: University of Maryland.
Songwei Ge, Seungjun Nah, Guilin Liu, et al., Yogesh Balaji
ICCV'23, arXiv, 2023.05 [Paper], [PDF], [Home Page]
Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity
Team: NUS, CUHK.
Zijiao Chen, Jiaxin Qing, Juan Helen Zhou
NeurIPS'23, arXiv, 2023.05 [Paper], [PDF], [Code], [Home Page]
VideoPoet: A Large Language Model for Zero-Shot Video Generation
Team: Google Research
Dan Kondratyuk, Lijun Yu, Xiuye Gu, et al., Lu Jiang
arXiv, 2023.05 [Paper], [PDF], [Home Page], [Blog]
VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning
Team: Tsinghua University, Beijing Film Academy
Hong Chen, Xin Wang, Guanning Zeng, et al., Wenwu Zhu
arXiv, 2023.05 [Paper], [PDF], [Code], [Home Page]
Text2Performer: Text-Driven Human Video Generation
Team: Nanyang Technological University
Yuming Jiang, Shuai Yang, Tong Liang Koh, et al., Ziwei Liu
arXiv, 2023.04 [Paper], [PDF], [Code], [Home Page], [Demo(video)]
Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation
Team: University of Rochester, Meta.
Jie An, Songyang Zhang, Harry Yang, et al., Xi Yin
arXiv, 2023.04 [Paper], [PDF], [Home Page]
Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos
Team: Tsinghua University, HKUST.
Yue Ma, Yingqing He, Xiaodong Cun, et al., Qifeng Chen
AAAI'24, arXiv, 2023.04 [Paper], [PDF], [Home Page], [Code]
Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
Team: NVIDIA.
Andreas Blattmann, Robin Rombach, Huan Ling, et al., Karsten Kreis
CVPR'23, arXiv, 2023.04 [Paper], [PDF], [Home Page]
NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation
Team: University of Science and Technology of China, Microsoft.
Shengming Yin, Chenfei Wu, Huan Yang, et al., Nan Duan
arXiv, 2023.03 [Paper], [PDF], [Home Page]
Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
Team: Picsart AI Research (PAIR).
Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, et al., Humphrey Shi
arXiv, 2023.03 [Paper], [PDF], [Code], [Home Page], [Demo(live)], [Demo(video)]
Structure and Content-Guided Video Synthesis with Diffusion Models
Team: Runway
Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, Anastasis Germanidis
ICCV'23, arXiv, 2023.02 [Paper], [PDF], [Home Page]
SceneScape: Text-Driven Consistent Scene Generation
Team: Weizmann Institute of Science, NVIDIA Research
Rafail Fridman, Amit Abecasis, Yoni Kasten, Tali Dekel
NeurIPS'23, arXiv, 2023.02 [Paper], [PDF], [Code], [Home Page]
MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
Team: Renmin University of China, Peking University, Microsoft Research
Ludan Ruan, Yiyang Ma, Huan Yang, et al., Baining Guo
CVPR'23, arXiv, 2022.12 [Paper], [PDF], [Code]
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
Team: Show Lab, National University of Singapore.
Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, et al., Mike Zheng Shou
ICCV'23, arXiv, 2022.12 [Paper], [PDF], [Code], [Pretrained Model]
MagicVideo: Efficient Video Generation With Latent Diffusion Models
Team: ByteDance Inc.
Daquan Zhou, Weimin Wang, Hanshu Yan, et al., Jiashi Feng
arXiv, 2022.11 [Paper], [PDF], [Home Page]
Latent Video Diffusion Models for High-Fidelity Long Video Generation (Long Video)
Team: HKUST, Tencent AI Lab.
Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, Qifeng Chen
arXiv, 2022.10 [Paper], [PDF], [Code], [Home Page]
Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation
Team: UC Santa Barbara, Meta.
Tsu-Jui Fu, Licheng Yu, Ning Zhang, et al., Sean Bell
CVPR'23, arXiv, 2022.11 [Paper], [PDF]
Phenaki: Variable Length Video Generation From Open Domain Textual Description
Team: Google.
Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, et al., Dumitru Erhan
ICLR'23, arXiv, 2022.10 [Paper], [PDF], [Home Page]
Imagen Video: High Definition Video Generation with Diffusion Models
Team: Google.
Jonathan Ho, William Chan, Chitwan Saharia, et al., Tim Salimans
arXiv, 2022.10 [Paper], [PDF], [Home Page]
StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation (Story Visualization)
Team: UNC Chapel Hill.
Adyasha Maharana, Darryl Hannan, Mohit Bansal
ECCV'22, arXiv, 2022.09 [Paper], [PDF], [Code], [Demo(live)]
Make-A-Video: Text-to-Video Generation without Text-Video Data
Team: Meta AI.
Uriel Singer, Adam Polyak, Thomas Hayes, et al., Yaniv Taigman
ICLR'23, arXiv, 2022.09 [Paper], [PDF], [Code]
MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model
Team: S-Lab, SenseTime.
Mingyuan Zhang, Zhongang Cai, Liang Pan, et al., Ziwei Liu
TPAMI'24, arXiv, 2022.08 [Paper], [PDF], [Code], [Home Page], [Demo]
Word-Level Fine-Grained Story Visualization (Story Visualization)
Team: University of Oxford.
Bowen Li, Thomas Lukasiewicz
ECCV'22, arXiv, 2022.08 [Paper], [PDF], [Code], [Pretrained Model]
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Team: Tsinghua University.
Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, Jie Tang
ICLR'23, arXiv, 2022.05 [Paper], [PDF], [Code], [Home Page], [Demo(video)]
CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers
Team: Tsinghua University.
Ming Ding, Wendi Zheng, Wenyi Hong, Jie Tang
NeurIPS'22, arXiv, 2022.04 [Paper], [PDF], [Code], [Home Page]
Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer
Team: Meta AI.
Songwei Ge, Thomas Hayes, Harry Yang, et al., Devi Parikh
ECCV'22, arXiv, 2022.04 [Paper], [PDF], [Home Page], [Code]
Video Diffusion Models (text-conditioned)
Team: Google.
Jonathan Ho, Tim Salimans, Alexey Gritsenko, et al., David J. Fleet
arXiv, 2022.04 [Paper], [PDF], [Home Page]
NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis (Long Video)
Team: Microsoft.
Chenfei Wu, Jian Liang, Xiaowei Hu, et al., Nan Duan
NeurIPS'22, arXiv, 2022.02 [Paper], [PDF], [Code], [Home Page]
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion
Team: Microsoft.
Chenfei Wu, Jian Liang, Lei Ji, et al., Nan Duan
ECCV'22, arXiv, 2021.11 [Paper], [PDF], [Code]
GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions
Team: Microsoft, Duke University.
Chenfei Wu, Lun Huang, Qianxi Zhang, et al., Nan Duan
arXiv, 2021.04 [Paper], [PDF]
Cross-Modal Dual Learning for Sentence-to-Video Generation
Team: Tsinghua University.
Yue Liu, Xin Wang, Yitian Yuan, Wenwu Zhu
ACM MM'19 [Paper], [PDF]
IRC-GAN: introspective recurrent convolutional GAN for text-to-video generation
Team: Peking University.
Kangle Deng, Tianyi Fei, Xin Huang, Yuxin Peng
IJCAI'19 [Paper], [PDF]
Imagine this! scripts to compositions to videos
Team: University of Illinois Urbana-Champaign, AI2, University of Washington.
Tanmay Gupta, Dustin Schwenk, Ali Farhadi, et al., Aniruddha Kembhavi
ECCV'18, arXiv, 2018.04 [Paper], [PDF]
To Create What You Tell: Generating Videos from Captions
Team: USTC, Microsoft Research.
Yingwei Pan, Zhaofan Qiu, Ting Yao, et al., Tao Mei
ACM MM'17, arXiv, 2018.04 [Paper], [PDF]
Neural Discrete Representation Learning
Team: DeepMind.
Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu
NeurIPS'17, arXiv, 2017.11 [Paper], [PDF]
Video Generation From Text
Team: Duke University, NEC Labs America.
Yitong Li, Martin Renqiang Min, Dinghan Shen, et al., Lawrence Carin
AAAI'18, arXiv, 2017.10 [Paper], [PDF]
Attentive semantic video generation using captions
Team: IIT Hyderabad.
Tanya Marwah, Gaurav Mittal, Vineeth N. Balasubramanian
ICCV'17, arXiv, 2017.08 [Paper], [PDF]
Sync-DRAW: Automatic Video Generation using Deep Recurrent Attentive Architectures (VAE)
Team: IIT Hyderabad.
Gaurav Mittal, Tanya Marwah, Vineeth N. Balasubramanian
ACM MM'17, arXiv, 2016.11 [Paper], [PDF]

Datasets & Metrics

Datasets are grouped by the domain they were collected from: Face, Open, Movie, Action, Instruct, Cooking.
Metrics are grouped into image-level and video-level.

Dataset	Domain	Annotated	#Clips	#Sent	Len_C(s)	Len_S	#Videos	Resolution	FPS	Dur(h)	Year	Source
CV-Text	Face	Generated	70K	1400K	-	67.2	-	480P	-	-	2023	Online
MSR-VTT	Open	Manual	10K	200K	15.0	9.3	7.2K	240P	30	40	2016	YouTube
DideMo	Open	Manual	27K	41K	6.9	8.0	10.5K	-	-	87	2017	Flickr
Y-T-180M	Open	ASR	180M	-	-	-	6M	-	-	-	2021	YouTube
WVid2M	Open	Alt-text	2.5M	2.5M	18.0	12.0	2.5M	360P	-	13K	2021	Web
H-100M	Open	ASR	103M	-	13.4	32.5	3.3M	720P	-	371.5K	2022	YouTube
InternVid	Open	Generated	234M	-	11.7	17.6	7.1M	*720P	-	760.3K	2023	YouTube
H-130M	Open	Generated	130M	130M	-	10.0	-	720P	-	-	2023	YouTube
Y-mP	Open	Manual	10M	10M	54.2	-	-	-	-	150K	2023	Youku
V-27M	Open	Generated	27M	135M	12.5	-	-	-	-	-	2024	YouTube
P-70M	Open	Generated	-	70.8M	8.5	13.2	70.8M	720P	-	166.8K	2024	YouTube
LSMDC	Movie	Manual	118K	118K	4.8	7.0	200	1080P	-	158	2017	Movie
MAD	Movie	Manual	-	384K	-	12.7	650	-	-	1.2K	2022	Movie
UCF-101	Action	Manual	13K	-	7.2	-	-	240P	25	27	2012	YouTube
ANet-200	Action	Manual	100K	-	-	13.5	2K	*720P	30	849	2015	YouTube
Charades	Action	Manual	10K	16K	-	-	10K	-	-	82	2016	Home
Kinetics	Action	Manual	306K	-	10.0	-	306K	-	-	-	2017	YouTube
ActNet	Action	Manual	100K	100K	36.0	13.5	20K	-	-	849	2017	YouTube
C-Ego	Action	Manual	-	-	-	-	8K	240P	-	69	2018	Home
SS-V2	Action	Manual	-	-	-	-	220.1K	-	12	-	2018	Daily
How2	Instruct	Manual	80K	80K	90.0	20.0	13.1K	-	-	2000	2018	YouTube
HT100M	Instruct	ASR	136M	136M	3.6	4.0	1.2M	240P	-	134.5K	2019	YouTube
YCook2	Cooking	Manual	14K	14K	19.6	8.8	2K	-	-	176	2018	YouTube
E-Kit	Cooking	Manual	40K	40K	-	-	432	*1080P	60	55	2018	Home

(VidProM) VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models Dataset (adding)
Team: ReLER Lab.
Wenhao Wang, Yi Yang
arXiv, 2024.03 [Paper], [PDF], [Code], [Hugging Face]
(ECTV) EvalCrafter: Benchmarking and Evaluating Large Video Generation Models Dataset (adding)
Team: Tencent AI Lab, CUHK.
Yaofang Liu, Xiaodong Cun, Xuebo Liu, et al., Ying Shan
CVPR'24, arXiv, 2023.10 [Paper], [PDF], [Code], [Dataset], [Home Page]
(CV-Text) Celebv-text: A large-scale facial text-video dataset Dataset (Domain:Face)
Team: University of Sydney, SenseTime Research.
Jianhui Yu, Hao Zhu, Liming Jiang, et al., Wayne Wu
CVPR'23, arXiv, 2023.03 [Paper], [PDF], [Code], [Demo], [Home Page]
(MSR-VTT) Msr-vtt: A large video description dataset for bridging video and language Dataset (Domain:Open)
Team: Microsoft Research.
Jun Xu, Tao Mei, Ting Yao, Yong Rui
CVPR'16 [Paper], [PDF]
(DideMo) Localizing moments in video with natural language Dataset (Domain:Open)
Team: UC Berkeley, Adobe
Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, et al., Bryan Russell
ICCV'17, arXiv, 2017.08 [Paper], [PDF]
(YT-Tem-180M) Merlot: Multimodal neural script knowledge models Dataset (Domain:Open)
Team: University of Washington
Rowan Zellers, Ximing Lu, Jack Hessel, et al., Yejin Choi
NeurIPS'21, arXiv, 2021.06 [Paper], [PDF], [Code], [Home Page]
(WebVid2M) Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval Dataset (Domain:Open)
Team: University of Oxford, CNRS.
Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman
ICCV'21, arXiv, 2021.04 [Paper], [PDF], [Dataset], [Code], [Demo], [Home Page]
(HD-VILA-100M) Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions Dataset (Domain:Open)
Team: Microsoft Research Asia.
Hongwei Xue, Tiankai Hang, Yanhong Zeng, et al., Baining Guo
CVPR'22, arXiv, 2021.11 [Paper], [PDF], [Code]
(InternVid) Internvid: A large-scale video-text dataset for multimodal understanding and generation Dataset (Domain:Open)
Team: Shanghai AI Laboratory.
Yi Wang, Yinan He, Yizhuo Li, et al., Yu Qiao
arXiv, 2023.07 [Paper], [PDF], [Code]
(HD-VG-130M) VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation Dataset (Domain:Open)
Team: Peking University, Microsoft Research.
Wenjing Wang, Huan Yang, Zixi Tuo, et al., Jiaying Liu
arXiv, 2023.05 [Paper], [PDF]
(Youku-mPLUG) Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks Dataset (Domain:Open)
Team: DAMO Academy, Alibaba Group.
Haiyang Xu, Qinghao Ye, Xuan Wu, et al., Fei Huang
arXiv, 2023.06 [Paper], [PDF]
(VAST-27M) Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset Dataset (Domain:Open)
Team: UCAS, CAS
Sihan Chen, Handong Li, Qunbo Wang, et al., Jing Liu
NeurIPS'23, arXiv, 2023.05 [Paper], [PDF]
(Panda-70M) Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers Dataset (Domain:Open)
Team: Snap Inc., University of California, University of Trento.
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Sergey Tulyakov
arXiv, 2024.02 [Paper], [PDF], [Code], [Home Page]
(LSMDC) Movie description Dataset (Domain:Movie)
Team: Max Planck Institute for Informatics.
Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, et al., Bernt Schiele
IJCV'17, arXiv, 2016.05 [Paper], [PDF], [Home Page]
(MAD) Mad: A scalable dataset for language grounding in videos from movie audio descriptions Dataset (Domain:Movie)
Team: KAUST, Adobe Research.
Mattia Soldan, Alejandro Pardo, Juan León Alcázar, et al., Bernard Ghanem
CVPR'22, arXiv, 2021.12 [Paper], [PDF], [Code]
(UCF-101) UCF101: A dataset of 101 human actions classes from videos in the wild Dataset (Domain:Action)
Team: University of Central Florida.
Khurram Soomro, Amir Roshan Zamir, Mubarak Shah
arXiv, 2012.12 [Paper], [PDF], [Data]
(ActNet-200) Activitynet: A large-scale video benchmark for human activity understanding Dataset (Domain:Action)
Team: Universidad del Norte, KAUST
Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, Juan Carlos Niebles
CVPR'15 [Paper], [PDF], [Home Page]
(Charades) Hollywood in homes: Crowdsourcing data collection for activity understanding Dataset (Domain:Action)
Team: Carnegie Mellon University
Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, et al., Abhinav Gupta
ECCV'16, arXiv, 2016.04 [Paper], [PDF], [Home Page]
(Kinetics) The kinetics human action video dataset Dataset (Domain:Action)
Team: Google
Will Kay, Joao Carreira, Karen Simonyan, et al., Andrew Zisserman
arXiv, 2017.05 [Paper], [PDF], [Home Page]
(ActivityNet) Dense-captioning events in videos Dataset (Domain:Action)
Team: Stanford University
Ranjay Krishna, Kenji Hata, Frederic Ren, et al., Juan Carlos Niebles
ICCV'17, arXiv, 2017.05 [Paper], [PDF], [Home Page]
(Charades-Ego) Charades-ego: A large-scale dataset of paired third and first person videos Dataset (Domain:Action)
Team: Carnegie Mellon University
Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, et al., Karteek Alahari
arXiv, 2018.04 [Paper], [PDF], [Home Page]
(SS-V2) The "something something" video database for learning and evaluating visual common sense Dataset (Domain:Action)
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, et al., Roland Memisevic
ICCV'17, arXiv, 2017.06 [Paper], [PDF], [Home Page]
(How2) How2: a large-scale dataset for multimodal language understanding Dataset (Domain:Instruct)
Team: Carnegie Mellon University.
Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, et al., Florian Metze
arXiv, 2018.11 [Paper], [PDF]
(HowTo100M) HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips Dataset (Domain:Instruct)
Team: Ecole Normale Superieure, Inria, CIIRC.
Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, et al., Josef Sivic
arXiv, 2019.06 [Paper], [PDF], [Home Page]
(YouCook2) Towards automatic learning of procedures from web instructional video Dataset (Domain:Cooking)
Team: University of Michigan, University of Rochester
Luowei Zhou, Chenliang Xu, Jason J. Corso
AAAI'18, arXiv, 2017.03 [Paper], [PDF], [Home Page]
(Epic-Kitchens) Scaling egocentric vision: The epic-kitchens dataset Dataset (Domain:Cooking)
Team: Uni. of Bristol, Uni. of Catania, Uni. of Toronto.
Dima Damen, Hazel Doughty, Giovanni Maria Farinella, et al., Michael Wray
ECCV'18, arXiv, 2018.04 [Paper], [PDF], [Home Page]
(PSNR/SSIM) Image quality assessment: from error visibility to structural similarity Metric (image-level)
Team: New York University.
Zhou Wang, Alan Conrad Bovik, Hamid Rahim Sheikh, E.P. Simoncelli
IEEE TIP, 2004.04 [Paper], [PDF]
(IS) Improved techniques for training gans Metric (image-level)
Team: OpenAI
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, et al., Xi Chen
NeurIPS'16, arXiv, 2016.06 [Paper], [PDF], [Code]
(FID) Gans trained by a two time-scale update rule converge to a local nash equilibrium Metric (image-level)
Team: Johannes Kepler University Linz
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, et al., Sepp Hochreiter
NeurIPS'17, arXiv, 2017.06 [Paper], [PDF]
(CLIP Score) Learning transferable visual models from natural language supervision Metric (image-level)
Team: OpenAI.
Alec Radford, Jong Wook Kim, Chris Hallacy, et al., Ilya Sutskever
ICML'21, arXiv, 2021.02 [Paper], [PDF], [Code]
(Video IS) Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal gan Metric (video-level)
Masaki Saito, Shunta Saito, Masanori Koyama, Sosuke Kobayashi
IJCV'20, arXiv, 2018.11 [Paper], [PDF], [Code]
(FVD/KVD) FVD: A new metric for video generation Metric (video-level)
Team: Johannes Kepler University, Google
Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, et al., Sylvain Gelly
ICLR'19, arXiv, 2018.12 [Paper], [PDF], [Code]
(FCS) Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation Metric (video-level)
Team: Show Lab, National University of Singapore.
Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, et al., Mike Zheng Shou
ICCV'23, arXiv, 2022.12 [Paper], [PDF], [Code], [Pretrained Model]
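
As a rough illustration of what "image-level" means for the metrics above, the sketch below computes PSNR for a single frame and then averages it over the frames of a video. It is a minimal, standard-library-only example and not code from any listed paper: the function names `psnr` and `mean_frame_psnr`, the flat-list frame representation, and the default `max_value` of 255 (8-bit pixels) are all assumptions made for illustration.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """PSNR in dB between two frames given as equally sized flat lists of pixel values.

    max_value is the largest possible pixel value (255 assumed for 8-bit images).
    """
    if len(reference) != len(generated) or not reference:
        raise ValueError("frames must be non-empty and the same size")
    # Mean squared error over all pixels of the frame.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)


def mean_frame_psnr(reference_video, generated_video, max_value=255.0):
    """Apply the image-level metric frame by frame and average the scores."""
    if len(reference_video) != len(generated_video) or not reference_video:
        raise ValueError("videos must be non-empty and have the same number of frames")
    scores = [psnr(r, g, max_value) for r, g in zip(reference_video, generated_video)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical two-frame, two-pixel "videos" used only to exercise the functions.
    ref = [[0, 0], [0, 0]]
    gen = [[255, 255], [25.5, 25.5]]
    print(mean_frame_psnr(ref, gen))
```

Because each frame is scored independently, an average like this says nothing about temporal consistency between frames, which is the gap the video-level metrics in the list are meant to address.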

Acknowledgement and References

Citation

If you find this repository useful, please consider citing our paper and this list:

@article{sun2024sora,
  title={From Sora What We Can See: A Survey of Text-to-Video Generation},
  author={Sun, Rui and Zhang, Yumin and Shah, Tejal and Sun, Jiahao and Zhang, Shuoying and Li, Wenqi and Duan, Haoran and Wei, Bo and Ranjan, Rajiv},
  journal={arXiv preprint arXiv:2405.10674},
  year={2024}
}

@misc{sun2024t2vgenerationlist,
  title={Awesome-Text-to-Video-Generation},
  author={Sun, Rui and Zhang, Yumin},
  year={2024},
  publisher={GitHub},
  howpublished={\url{https://github.com/soraw-ai/Awesome-Text-to-Video-Generation}},
}
