AlonzoLeeeooo/awesome-video-generation
A Collection of Video Generation Studies

This GitHub repository summarizes papers and resources related to the video generation task.

If you have any suggestions about this repository, please feel free to open a new issue or pull request.

Recent news about this GitHub repo is listed below.

  • [Apr. 26th] Added a new direction: Personalized Video Generation.
  • [Mar. 28th] The official AAAI 2024 paper list is released! Official PDFs and BibTeX references are updated accordingly.

Contents

To-Do Lists

  • Latest Papers
    • Update CVPR 2024 Papers
      • Update PDFs and References of ⚠️ Papers
      • Update Published Versions of References
    • Update AAAI 2024 Papers
      • Update PDFs and References of ⚠️ Papers
      • Update Published Versions of References
    • Update ICLR 2024 Papers
    • Update NeurIPS 2023 Papers
  • Previously Published Papers
    • Update Previous CVPR papers
    • Update Previous ICCV papers
    • Update Previous ECCV papers
    • Update Previous NeurIPS papers
    • Update Previous ICLR papers
    • Update Previous AAAI papers
    • Update Previous ACM MM papers
  • Regular Maintenance of Preprint arXiv Papers and Missed Papers

<🎯Back to Top>

Products

| Name | Organization | Year | Research Paper | Website | Specialties |
|---|---|---|---|---|---|
| Sora | OpenAI | 2024 | link | link | - |
| Lumiere | Google | 2024 | link | link | - |
| VideoPoet | Google | 2023 | - | link | - |
| W.A.L.T | Google | 2023 | link | link | - |
| Gen-2 | Runway | 2023 | - | link | - |
| Gen-1 | Runway | 2023 | - | link | - |
| Animate Anyone | Alibaba | 2023 | link | link | - |
| Outfit Anyone | Alibaba | 2023 | - | link | - |
| Stable Video | StabilityAI | 2023 | link | link | - |
| Pixeling | HiDream.ai | 2023 | - | link | - |
| DomoAI | DomoAI | 2023 | - | link | - |
| Emu | Meta | 2023 | link | link | - |
| Genmo | Genmo | 2023 | - | link | - |
| NeverEnds | NeverEnds | 2023 | - | link | - |
| Moonvalley | Moonvalley | 2023 | - | link | - |
| Morph Studio | Morph | 2023 | - | link | - |
| Pika | Pika | 2023 | - | link | - |
| PixelDance | ByteDance | 2023 | link | link | - |

<🎯Back to Top>

Papers

Survey Papers

  • Year 2024
  • arXiv
    • Video Diffusion Models: A Survey [Paper]
  • Year 2023
  • arXiv
    • A Survey on Video Diffusion Models [Paper]

Text-to-Video Generation

  • Year 2024
    • CVPR
      • Vlogger: Make Your Dream A Vlog [Paper] [Code]
      • Make Pixels Dance: High-Dynamic Video Generation [Paper] [Project] [Demo]
      • VGen: Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation [Paper] [Code] [Project]
      • GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation [Paper] [Project]
      • SimDA: Simple Diffusion Adapter for Efficient Video Generation [Paper] [Code] [Project]
      • MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation [Paper] [Project] [Video]
      • Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models [Paper] [Project]
      • PEEKABOO: Interactive Video Generation via Masked-Diffusion [Paper] [Code] [Project] [Demo]
      • EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [Paper] [Code] [Project]
      • A Recipe for Scaling up Text-to-Video Generation with Text-free Videos [Paper] [Code] [Project]
      • BIVDiff: A Training-free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models [Paper] [Project]
      • Mind the Time: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis [Paper] [Project]
      • Animate Anyone: Consistent and Controllable Image-to-video Synthesis for Character Animation [Paper] [Code] [Project]
      • MotionDirector: Motion Customization of Text-to-Video Diffusion Models [Paper] [Code]
      • ⚠️ Simple but Effective Text-to-Video Generation with Grid Diffusion Models [Paper]
      • ⚠️ Hierarchical Patch-wise Diffusion Models for High-Resolution Video Generation [Paper]
      • ⚠️ DiffPerformer: Iterative Learning of Consistent Latent Guidance for Diffusion-based Human Video Generation [Paper]
    • ICLR
      • VDT: General-purpose Video Diffusion Transformers via Mask Modeling [Paper] [Code] [Project]
      • VersVideo: Leveraging Enhanced Temporal Diffusion Models for Versatile Video Generation [Paper]
    • AAAI
      • Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos [Paper] [Code] [Project]
      • E2HQV: High-Quality Video Generation from Event Camera via Theory-Inspired Model-Aided Deep Learning [Paper]
      • ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation [Paper] [Code] [Project]
      • F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis [Paper]
    • arXiv
    • Others
      • Sora: Video Generation Models as World Simulators [Paper]
  • Year 2023
  • Year 2022
  • Year 2021

Image-to-Video Generation

  • Year 2024

  • Year 2023

    • CVPR
      • Conditional Image-to-Video Generation with Latent Flow Diffusion Models [Paper] [Code]
    • arXiv
  • Year 2022

    • CVPR
      • Make It Move: Controllable Image-to-Video Generation with Text Descriptions [Paper] [Code]
  • Year 2021

    • ICCV
      • Click to Move: Controlling Video Generation with Sparse Motion [Paper] [Code]

<🎯Back to Top>

Audio-to-Video Generation

  • Year 2024
    • AAAI
      • Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [Paper] [Code]
  • Year 2023
    • CVPR
      • MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [Paper] [Code]

<🎯Back to Top>

Personalized Video Generation

  • Year 2024
  • Year 2023
    • arXiv
      • FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention [Paper] [Code] [Demo]

<🎯Back to Top>

Video Editing

  • Year 2024
  • Year 2023
    • arXiv
      • Style-A-Video: Agile Diffusion for Arbitrary Text-based Video Style Transfer [Paper]

<🎯Back to Top>

Datasets

  • [arXiv 2012] UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild [Paper] [Dataset]
  • [ICCV 2019] FaceForensics++: Learning to Detect Manipulated Facial Images [Paper] [Code]
  • [NeurIPS 2019] TaiChi-HD: First Order Motion Model for Image Animation [Paper] [Dataset]
  • [ECCV 2020] SkyTimeLapse: DTVNet: Dynamic Time-lapse Video Generation via Single Still Image [Paper] [Code]
  • [ICCV 2021] WebVid-10M: Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [Paper] [Dataset] [Code] [Project]
  • [ECCV 2022] ROS: Learning to Drive by Watching YouTube Videos: Action-Conditioned Contrastive Policy Pretraining [Paper] [Code] [Dataset]
  • [arXiv 2023] HD-VG-130M: VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-video Generation [Paper] [Dataset]
  • [ICLR 2024] InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [Paper] [Dataset]
  • [CVPR 2024] Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers [Paper] [Dataset] [Project]
  • [arXiv 2024] VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models [Paper] [Dataset]

<🎯Back to Top>

Q&A

  • Q: What is the conference ordering of this paper list?
    • This paper list is organized according to the following sequence:
      • CVPR
      • ICCV
      • ECCV
      • NeurIPS
      • ICLR
      • AAAI
      • ACM MM
      • SIGGRAPH
      • arXiv
      • Others
  • Q: What does Others refer to?
    • Some studies (e.g., Sora) do not publish their technical reports on arXiv; instead, they release blog posts on their official websites. The Others category refers to such studies.

<🎯Back to Top>

References

The reference.bib file summarizes BibTeX references of up-to-date video generation papers, widely used datasets, and toolkits. Based on the original references, I have made the following modifications so that they render cleanly in LaTeX manuscripts:

  • References are normally named in the form of author-etal-year-nickname. In particular, references of datasets and toolkits are named directly by nickname, e.g., imagenet.
  • In each reference, all conference/journal names are converted into abbreviations, e.g., Computer Vision and Pattern Recognition -> CVPR.
  • The url, doi, publisher, organization, editor, and series fields are removed from all references.
  • The pages field is added to references where it is missing.
  • All paper titles are in title case. In addition, an extra pair of braces {} is added so that the title case also survives in templates that lowercase titles.

If you need a different reference format, you can retrieve the original references by searching the paper titles on DBLP or Google Scholar.
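As a sketch of the conventions above (this entry is hypothetical and not copied from reference.bib; the author names, pages, and key are illustrative), a cleaned reference might look like:

```bibtex
% Hypothetical entry illustrating the conventions:
% - key in author-etal-year-nickname form
% - venue abbreviated (NeurIPS instead of the full name)
% - url/doi/publisher/editor/series fields removed
% - pages field present
% - extra braces around the title to preserve title case
@inproceedings{ho-etal-2022-video,
  author    = {Ho, Jonathan and Salimans, Tim and Gritsenko, Alexey and others},
  title     = {{Video Diffusion Models}},
  booktitle = {NeurIPS},
  year      = {2022},
  pages     = {1--10},
}
```

Without the extra braces, templates that lowercase titles would render "Video Diffusion Models" as "Video diffusion models"; the inner {} protects the original capitalization.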

<🎯Back to Top>

Star History

Star History Chart

<🎯Back to Top>