Multimodal learning (MML) is a branch of machine learning concerned with designing models that can learn from multiple modalities, such as vision, language, and robotic actions. MML is an active area of AI research. AI systems that learn from a single modality have advanced rapidly in recent years: we now have language models that understand text, image models that recognize images, and more. Although these systems are not yet perfect, they generalize reasonably well on the modalities they were trained on. A key challenge now is designing AI systems that can jointly learn and generalize across multiple modalities at scale: systems that can understand text, robotic actions, speech, and more.
This repository tracks progress in multimodal learning. It features lecture videos, papers, books, and blog posts. Contributions are welcome!
What's in here:
- Multimodal Machine Learning, Carnegie Mellon University: Lecture videos | webpage | whitepaper
- Multi-Modal Imaging with Deep Learning and Modeling, Institute for Pure & Applied Mathematics (IPAM): Lecture videos
- Topics in AI - Multimodal Learning with Vision, Language and Sound, University of British Columbia: Course webpage and readings
- Advanced Topics in MultiModal Machine Learning, Carnegie Mellon University: webpage
- Deep Learning for Multi-Modal Systems, Data Science Summer School 2022: Lecture video | Webpage
- Topics in Computer Vision (CSC2539) - Visual Recognition with Text, University of Toronto: webpage
- Multimodal Machine Learning, CVPR 2022 Tutorial: Videos
- Recent Advances in Vision and Language Pre-training: Slides and videos
- Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions: ArXiv | 2022
- Vision-Language Pre-training: Basics, Recent Advances, and Future Trends: ArXiv | 2022
- VLP: A Survey on Vision-Language Pre-training: ArXiv | 2022
- Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods: ArXiv | 2021
- Multimodal Machine Learning: A Survey and Taxonomy: ArXiv | 2017
The following are papers on multimodal representation learning and task-specific methods.
- CLIP - Learning Transferable Visual Models From Natural Language Supervision: ArXiv | Code | Blog | Colab | CLIP on HF | Feb 2021
- LLaVA - Visual Instruction Tuning: ArXiv | Page | Code | April 2023
- EVA-CLIP - Improved Training Techniques for CLIP at Scale: ArXiv | Code | March 2023
- Video LDM - Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models: ArXiv | Page | April 2023
- A Picture is Worth a Thousand Words: Language Models Plan from Pixels: ArXiv | 2023
- PaLM-E: An Embodied Multimodal Language Model: ArXiv | Page | Blog | 2023
- HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace: ArXiv | Code | March 2023
- Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models: ArXiv | Code | Colab | Spaces
- ViperGPT: Visual Inference via Python Execution for Reasoning: Paper | Code | Page | March 2023
- Generalized Visual Language Models by Lilian Weng: blog | 2022
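To get a feel for one of the models listed above, here is a minimal sketch of zero-shot image classification with CLIP via Hugging Face Transformers. It assumes `transformers`, `torch`, and `Pillow` are installed; `"openai/clip-vit-base-patch32"` is the public checkpoint name, and the blank image is a stand-in for a real photo.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load the pretrained CLIP model and its paired preprocessor.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder image; replace with Image.open("your_photo.jpg") in practice.
image = Image.new("RGB", (224, 224), color="white")
labels = ["a photo of a cat", "a photo of a dog"]

# Tokenize the candidate captions and preprocess the image together.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into a probability distribution over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

CLIP is trained to match images and captions in a shared embedding space, which is why zero-shot classification reduces to ranking caption similarities; many of the later papers above (EVA-CLIP, LLaVA) build on this idea.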