Multimodal learning (MML) is a branch of machine learning concerned with designing models that can learn from multiple modalities, such as vision, language, and robotic actions. MML is an active area of AI research. AI systems that learn from a single modality have advanced rapidly in recent years: we now have language models that understand text, image models that recognize images, and more. Although these systems are not yet perfect, they generalize reasonably well on the modalities they were trained on. A key challenge now is designing AI systems that can jointly learn and generalize across multiple modalities at scale: systems that can understand text, robotic actions, speech, and more.
This repository tracks progress in multimodal learning. It features lecture videos, papers, books, and blog posts. Contributions are welcome!
What's in here:
- Multimodal Machine Learning, Carnegie Mellon University: Lecture videos | webpage | whitepaper
- Multi-Modal Imaging with Deep Learning and Modeling, Institute for Pure & Applied Mathematics (IPAM): Lecture videos
- Topics in AI - Multimodal Learning with Vision, Language and Sound, University of British Columbia: Course webpage and readings
- Advanced Topics in MultiModal Machine Learning, Carnegie Mellon University: webpage
- Deep Learning for Multi-Modal Systems, Data Science Summer School 2022: Lecture video | Webpage
- Topics in Computer Vision (CSC2539) - Visual Recognition with Text, University of Toronto: webpage
- Multimodal Machine Learning, CVPR 2022 Tutorial: Videos
- Recent Advances in Vision and Language Pre-training: Slides and videos
- Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions: ArXiv | 2022
- Vision-Language Pre-training: Basics, Recent Advances, and Future Trends: ArXiv | 2022
- VLP: A Survey on Vision-Language Pre-training: ArXiv | 2022
- Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods: ArXiv | 2021
- Multimodal Machine Learning: A Survey and Taxonomy: ArXiv | 2017
The following are papers on multimodal representation learning and task-specific methods.
- CLIP - Learning Transferable Visual Models From Natural Language Supervision: ArXiv | Code | Blog | Colab | CLIP on HF | Feb 2021
- LLaVA - Visual Instruction Tuning: ArXiv | Page | Code | April 2023
- EVA-CLIP - Improved Training Techniques for CLIP at Scale: ArXiv | Code | March 2023
- Video LDM - Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models: ArXiv | Page | April 2023
- A Picture is Worth a Thousand Words: Language Models Plan from Pixels: ArXiv | 2023
- PaLM-E: An Embodied Multimodal Language Model: ArXiv | Page | Blog | 2023
- HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace: ArXiv | Code | March 2023
- Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models: ArXiv | Code | Colab | Spaces
- ViperGPT: Visual Inference via Python Execution for Reasoning: Paper | Code | Page | March 2023
- Generalized Visual Language Models by Lilian Weng: blog | 2022
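To get a feel for one of the models listed above, here is a minimal sketch of zero-shot image classification with CLIP via Hugging Face Transformers. It assumes `transformers`, `torch`, and `Pillow` are installed; `"openai/clip-vit-base-patch32"` is the public checkpoint name, and the blank image is a stand-in for a real photo.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load the pretrained CLIP model and its paired preprocessor.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder image; replace with Image.open("your_photo.jpg") in practice.
image = Image.new("RGB", (224, 224), color="white")
labels = ["a photo of a cat", "a photo of a dog"]

# Tokenize the candidate captions and preprocess the image together.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into a probability distribution over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

CLIP is trained to match images and captions in a shared embedding space, which is why zero-shot classification reduces to ranking caption similarities; many of the later papers above (EVA-CLIP, LLaVA) build on this idea.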