Multimodal Learning Research

Multimodal learning (MML) is a branch of machine learning concerned with designing models that can learn from multiple modalities, such as vision, language, and robotic actions. MML is a hot area in AI research. AI systems that learn from a single modality have advanced rapidly in recent years: we have witnessed language models that can understand text, image models that can recognize images, and more. Although those systems are not perfect yet, they generalize reasonably well on the modalities they were trained on. A key challenge now is how to design AI systems that can jointly learn and generalize across multiple modalities at a large scale: systems that can understand text, robotic actions, speech, and more.
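
To make the joint-learning idea concrete, one widely used recipe for pairing vision and language (popularized by CLIP, which appears in several entries below) is a contrastive objective that pulls matched image-text embedding pairs together and pushes mismatched pairs apart in a shared space. Here is a minimal PyTorch sketch of such a loss; the function name and temperature value are illustrative, and it assumes you already have batched embeddings from any pair of image and text encoders:

```python
import torch
import torch.nn.functional as F

def contrastive_image_text_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) tensors where row i of each is a matched pair.
    """
    # Project onto the unit sphere so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # logits[i, j] = similarity between image i and text j, scaled by temperature.
    logits = image_emb @ text_emb.t() / temperature

    # Matched pairs sit on the diagonal, so the target for row i is class i.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_img_to_txt = F.cross_entropy(logits, targets)
    loss_txt_to_img = F.cross_entropy(logits.t(), targets)
    return (loss_img_to_txt + loss_txt_to_img) / 2
```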

This repository is a collection of progress happening in multimodal learning. It features lecture videos, papers, books, and blog posts. Contributions are welcome!

What's in here:

Courses & Videos

  • Multimodal Machine Learning, Carnegie Mellon University: Lecture videos | Webpage | Whitepaper

  • Multi-Modal Imaging with Deep Learning and Modeling, Institute for Pure & Applied Mathematics (IPAM): Lecture videos

  • Topics in AI - Multimodal Learning with Vision, Language and Sound, University of British Columbia: Course webpage and readings

  • Advanced Topics in Multimodal Machine Learning, Carnegie Mellon University: Webpage

  • Deep Learning for Multi-Modal Systems, Data Science Summer School 2022: Lecture video | Webpage

  • Topics in Computer Vision (CSC2539) - Visual Recognition with Text, University of Toronto: Webpage


Relevant Workshops

  • Multimodal Machine Learning, CVPR 2022 Tutorial: Videos

  • Recent Advances in Vision and Language Pre-training: Slides and videos


Survey Papers

  • Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions: ArXiv | 2022

  • Vision-Language Pre-training: Basics, Recent Advances, and Future Trends: ArXiv | 2022

  • VLP: A Survey on Vision-Language Pre-training: ArXiv | 2022

  • Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods: ArXiv | 2021

  • Multimodal Machine Learning: A Survey and Taxonomy: ArXiv | 2017


Books

  • Multimodal Deep Learning: Web | ArXiv | 2023

Papers by Categories

This section collects papers on general multimodal representation learning, followed by papers on specific tasks.

General MML Representation Learning


  • LLaVA - Visual Instruction Tuning: ArXiv | Page | Code | April 2023

  • EVA-CLIP - Improved Training Techniques for CLIP at Scale: ArXiv | Code | March 2023

Task Specific

Text-Image Generation

  • Video LDM - Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models: ArXiv | Page | April 2023

Text-Image Retrieval

Image Captioning

Visual Question Answering

Video Learning

Robotic Learning

  • A Picture is Worth a Thousand Words: Language Models Plan from Pixels: ArXiv | 2023

  • PaLM-E: An Embodied Multimodal Language Model: ArXiv | Page | Blog | 2023

Applications Connecting Multimodal Models

  • HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face: ArXiv | Code | March 2023

  • Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models: ArXiv | Code | Colab | Spaces

  • ViperGPT: Visual Inference via Python Execution for Reasoning: Paper | Code | Page | March 2023


Blog Posts

  • Generalized Visual Language Models by Lilian Weng: Blog | 2022

Related Repositories
