
Awesome Multimodality 🎶📜


A collection of resources on multimodal learning research.

Content

1. Description

🐌 Markdown Format:

  • (Conference/Journal Year) Title, First Author et al. [Paper] [Code] [Project]
  • (Conference/Journal Year) [💬Topic] Title, First Author et al. [Paper] [Code] [Project]
    • (Optional) 🌱 or 📌
    • (Optional) 🚀 or 👑 or 📚
  • 🌱: Novel idea
  • 📌: The first...
  • 🚀: State-of-the-Art
  • 👑: Novel dataset/model
  • 📚: Downstream Tasks

2. Topic Order

  • Survey

    • (TPAMI 2023) Multimodal Image Synthesis and Editing: A Survey and Taxonomy, Fangneng Zhan et al. [v1](2021.12.27) ... [v5](2023.08.05)
    • (TPAMI 2023) [💬Transformer] Multimodal Learning with Transformers: A Survey, Peng Xu et al. [v1](2022.06.13) [v2](2023.05.11)
    • (Multimedia Tools and Applications) A comprehensive survey on generative adversarial networks used for synthesizing multimedia content, Lalit Kumar & Dushyant Kumar Singh [v1](2023.03.30)
    • ⭐⭐(arXiv preprint 2023) Multimodal Deep Learning, Cem Akkus et al. [v1](2023.01.12)
    • ⭐(arXiv preprint 2022) [💬Knowledge Enhanced] A survey on knowledge-enhanced multimodal learning, Maria Lymperaiou et al. [v1](2022.11.19)
    • ⭐⭐(arXiv preprint 2022) Vision-Language Pre-training: Basics, Recent Advances, and Future Trends, Zhe Gan et al. [v1](2022.10.17)
    • ⭐(arXiv preprint 2022) Vision+X: A Survey on Multimodal Learning in the Light of Data, Ye Zhu et al. [v1](2022.10.05)
    • (arXiv preprint 2022) Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions, Paul Pu Liang et al. [v1](2022.09.07)
    • (arXiv preprint 2022) [💬Cardiac Image Computing] Multi-Modality Cardiac Image Computing: A Survey, Lei Li et al. [v1](2022.08.26)
    • (arXiv preprint 2022) [💬Vision and language Pre-training (VLP)] Vision-and-Language Pretraining, Thong Nguyen et al. [v1](2022.07.05)
    • (arXiv preprint 2022) [💬Video Saliency Detection] A Comprehensive Survey on Video Saliency Detection with Auditory Information: the Audio-visual Consistency Perceptual is the Key!, Chenglizhao Chen et al. [v1](2022.06.20)
    • (arXiv preprint 2022) [💬Vision and language Pre-training (VLP)] Vision-and-Language Pretrained Models: A Survey, Siqu Long et al. [v1](2022.04.15)...[v5](2022.05.03)
    • (arXiv preprint 2022) [💬Vision and language Pre-training (VLP)] VLP: A Survey on Vision-Language Pre-training, Feilong Chen et al. [v1](2022.02.18) [v2](2022.02.21)
    • (arXiv preprint 2022) [💬Vision and language Pre-training (VLP)] A Survey of Vision-Language Pre-Trained Models, Yifan Du et al. [v1](2022.02.18)
    • (arXiv preprint 2022) [💬Multi-Modal Knowledge Graph] Multi-Modal Knowledge Graph Construction and Application: A Survey, Xiangru Zhu et al. [v1](2022.02.11)
    • (arXiv preprint 2022) [💬Auto Driving] Multi-modal Sensor Fusion for Auto Driving Perception: A Survey, Keli Huang et al. [v1](2022.02.06) [v2](2022.02.27)
    • (arXiv preprint 2021) A Survey on Multi-modal Summarization, Anubhav Jangra et al. [v1](2021.09.11)
    • (Information Fusion 2021) [💬Vision and language] Multimodal research in vision and language: A review of current and emerging trends, Shagun Uppal et al. [v1](2021.08.01)
  • 👑 Dataset

    • (arXiv preprint 2023) Sticker820K: Empowering Interactive Retrieval with Stickers, Sijie Zhao et al. [Paper] [Github]
    • (arXiv preprint 2023) Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration, Chenyang Lyu et al. [Paper] [Github]
    • (arXiv preprint 2022) Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework, Jiaxi Gu et al. [Paper] [Download]
      • The Noah-Wukong dataset is a large-scale multimodal Chinese dataset.
      • The dataset contains 100 million <image, text> pairs.
      • Images are filtered by size (> 200 px in both dimensions) and aspect ratio (between 1/3 and 3); an illustrative filter sketch follows this list.
      • Text is filtered by language, length, and frequency; privacy-related and sensitive words are also taken into consideration.
    • (arXiv preprint 2022) WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models, Sha Yuan et al. [Paper] [Download]
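The Wukong filtering criteria above can be made concrete with a small sketch. This is not the official Noah-Wukong pipeline: the text-length bounds, the banned-word set, and the helper names are placeholder assumptions, since the entry only states that language, length, frequency, and sensitive words are considered.

```python
# Illustrative sketch only (not the official Noah-Wukong pipeline):
# filtering <image, text> pairs by the criteria listed above.
# Text-length bounds and the banned-word set are placeholder assumptions.

def keep_image(width: int, height: int) -> bool:
    """Keep images with both sides > 200 px and aspect ratio within [1/3, 3]."""
    if width <= 200 or height <= 200:
        return False
    aspect = width / height
    return 1 / 3 <= aspect <= 3

def keep_text(text: str, min_len: int = 2, max_len: int = 32, banned=frozenset()) -> bool:
    """Keep captions within a length range that contain no banned (privacy/sensitive) words."""
    if not (min_len <= len(text) <= max_len):
        return False
    return not any(word in text for word in banned)

# Example: a 640x480 image with a short Chinese caption passes both filters.
print(keep_image(640, 480) and keep_text("一只猫在沙发上"))  # True
```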
  • 💬 Vision and language Pre-training (VLP)

    • (arXiv preprint 2023) mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video, Haiyang Xu et al. [Paper] [Code]

      • 📚 Downstream Tasks:
        • [Vision Only] Video Action Recognition, Image Classification, Object Detection and Segmentation
        • [Language Only] Natural Language Understanding, Natural Language Generation
        • [Video-Text] Text-to-Video Retrieval, Video Question Answering, Video Captioning
        • [Image-Text] Image-Text Retrieval, Visual Question Answering, Image Captioning, Visual Grounding
    • (EMNLP 2022) FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning, Suvir Mirchandani et al. [Paper]

      • 📚 Downstream Tasks: Image-to-Text Retrieval & Text-to-Image Retrieval, Image Retrieval with Text Feedback, Category Recognition & Subcategory Recognition, Image Captioning, Relative Image Captioning
    • (arXiv preprint 2022) PaLI: A Jointly-Scaled Multilingual Language-Image Model, Xi Chen et al. [Paper]

      • 📚 Downstream Tasks: Image Captioning, Visual Question Answering (VQA), Language-understanding Capabilities, Zero-shot Image Classification
    • ⭐⭐(arXiv preprint 2022) Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, Wenhui Wang et al. [Paper] [Code]

      • 📚 【Visual-Language】Visual Question Answering (VQA), Visual Reasoning, Image Captioning, Image-Text Retrieval
      • 📚 【Visual】Object Detection, Instance Segmentation, Semantic Segmentation, Image Classification
    • (ECCV 2022) Exploiting Unlabeled Data with Vision and Language Models for Object Detection, Shiyu Zhao et al. [Paper] [Code]

      • 📚 Downstream Tasks: Open-vocabulary object detection, Semi-supervised object detection, Pseudo label generation
    • ⭐⭐[CVPR 2022 Tutorial] Recent Advances in Vision-and-Language Pre-training [Project]

    • ⭐⭐(arXiv preprint 2022) [💬Data Augmentation] MixGen: A New Multi-Modal Data Augmentation, Xiaoshuai Hao et al. [Paper]

      • 📚 Downstream Tasks: Image-Text Retrieval, Visual Question Answering (VQA), Visual Grounding, Visual Reasoning, Visual Entailment
    • ⭐⭐(ICML 2022) Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts, Yan Zeng et al. [Paper] [Code]

      • 🚀 SOTA (2022/06/16): Cross-Modal Retrieval on COCO 2014 & Flickr30k, Visual Grounding on RefCOCO+ (val, testA, testB)
      • 📚 Downstream Tasks: Image-Text Retrieval, Visual Question Answering (VQA), Natural Language for Visual Reasoning (NLVR2), Visual Grounding, Image Captioning
    • ⭐⭐(arXiv preprint 2022) Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts, Basil Mustafa et al. [Paper] [Blog]

      • 📌 LIMoE: The first large-scale multimodal mixture-of-experts model.
    • (CVPR 2022) Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment, Mingyang Zhou et al. [Paper] [Code]

      • 📚 Downstream Tasks: Visual Question Answering (VQA), Natural Language for Visual Reasoning (NLVR2), Visual Entailment, Referring Expression (RefCOCO+)
    • ⭐(arXiv preprint 2022) One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code, Yong Dai et al. [Paper]

      • 📚 Downstream Tasks: Text Classification, Automatic Speech Recognition, Text-to-Image Retrieval, Text-to-Video Retrieval, Text-to-Code Retrieval
    • (arXiv preprint 2022) Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework, Chunyu Xie et al. [Paper]

      • 📚 Downstream Tasks: Image-text Retrieval, Chinese Image-text matching
    • (arXiv preprint 2022) Vision-Language Pre-Training with Triple Contrastive Learning, Jinyu Yang et al. [Paper] [Code]

      • 📚 Downstream Tasks: Image-text Retrieval, Visual Question Answering, Visual Entailment, Visual Reasoning
    • (arXiv preprint 2022) MVP: Multi-Stage Vision-Language Pre-Training via Multi-Level Semantic Alignment, Zejun Li et al. [Paper]

      • 📚 Downstream Tasks: Image-text Retrieval, Multi-Modal Classification, Visual Grounding
    • (arXiv preprint 2022) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, Junnan Li et al. [Paper] [Code]

      • 📚 Downstream Tasks: Image-text Retrieval, Image Captioning, Visual Question Answering, Visual Reasoning, Visual Dialog
    • (ICML 2021) ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, Wonjae Kim et al. [Paper]

      • 📚 Downstream Tasks: Image Text Matching, Masked Language Modeling

3. Chronological Order

  • 2023

    • (arXiv preprint 2023) Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation, Zhiwei Zhang et al. [Paper] [Project] [Code]
    • (arXiv preprint 2023) Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models, Gen Luo et al. [Paper] [Project] [Code]
    • ⭐⭐(arXiv preprint 2023) Any-to-Any Generation via Composable Diffusion, Zineng Tang et al. [Paper] [Project] [Code]
      • 📚[Single-to-Single Generation] Text → Image, Audio → Image, Image → Video, Image → Audio, Audio → Text, Image → Text
      • 📚[Multi-Outputs Joint Generation] Text → Video + Audio, Text → Text + Audio + Image, Text + Image → Text + Image
      • 📚[Multiple Conditioning] Text + Audio → Image, Text + Image → Image, Text + Audio + Image → Image, Text + Audio → Video, Text + Image → Video, Video + Audio → Text, Image + Audio → Audio, Text + Image → Audio
    • ⭐⭐(arXiv preprint 2023) mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality, Qinghao Ye et al. [Paper] [Demo] [Code]
    • (arXiv preprint 2023) Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models, Zhiqiu Lin et al. [Paper] [Project] [Code]
  • 2022

    • (arXiv preprint 2022) [💬Visual Metaphors] MetaCLUE: Towards Comprehensive Visual Metaphors Research, Arjun R. Akula et al. [Paper] [Project]
    • (arXiv preprint 2022) MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks, Letitia Parcalabescu et al. [Paper] [Code]
    • (arXiv preprint 2022) Versatile Diffusion: Text, Images and Variations All in One Diffusion Model, Xingqian Xu et al. [Paper] [Code] [Hugging Face]
      • 📚 Downstream Tasks: Text-to-Image, Image-Variation, Image-to-Text, Disentanglement, Text+Image-Guided Generation, Editable I2T2I
    • (Machine Intelligence Research) [💬Vision-language transformer] Masked Vision-Language Transformer in Fashion, Ge-Peng Ji et al. [Paper] [Code]
    • (arXiv 2022) [💬Multimodal Modeling] MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning, Zijia Zhao et al. [Paper]
    • (arXiv 2022) [💬Navigation] Iterative Vision-and-Language Navigation, Jacob Krantz et al. [Paper]
    • (arXiv 2022) [💬Video Chapter Generation] Multi-modal Video Chapter Generation, Xiao Cao et al. [Paper]
    • (arXiv 2022) [💬Visual Question Answering (VQA)] TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation, Jun Wang et al. [Paper] [Code]
    • (AI Ethics and Society 2022) [💬Multi-modal & Bias] American == White in Multimodal Language-and-Image AI, Robert Wolfe et al. [Paper]
    • (Interspeech 2022) [💬Audio-Visual Speech Separation] Multi-Modal Multi-Correlation Learning for Audio-Visual Speech Separation, Xiaoyu Wang et al. [Paper]
    • (arXiv preprint 2022) [💬Multi-modal for Recommendation] Personalized Showcases: Generating Multi-Modal Explanations for Recommendations, An Yan et al. [Paper]
    • (CVPR 2022) [💬Video Synthesis] Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning, Ligong Han et al. [Paper] [Code] [Project]
    • (NAACL 2022) [💬Dialogue State Tracking] Multimodal Dialogue State Tracking, Hung Le et al. [Paper]
    • (arXiv preprint 2022) [💬Multi-modal Multi-task] MultiMAE: Multi-modal Multi-task Masked Autoencoders, Roman Bachmann et al. [Paper] [Code] [Project]
    • (CVPR 2022) [💬Text-Video Retrieval] X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval, Satya Krishna Gorti et al. [Paper] [Code] [Project]
    • (NAACL 2022) [💬Visual Commonsense] Visual Commonsense in Pretrained Unimodal and Multimodal Models, Chenyu Zhang et al. [Paper] [Code]
    • (arXiv preprint 2022) [💬Pretraining framework] i-Code: An Integrative and Composable Multimodal Learning Framework, Ziyi Yang et al. [Paper]
    • (CVPR 2022) [💬Food Retrieval] Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval, Mustafa Shukor et al. [Paper] [Code]
    • (arXiv preprint 2022) [💬Image+Videos+3D Data Recognition] Omnivore: A Single Model for Many Visual Modalities, Rohit Girdhar et al. [Paper] [Code] [Project]
    • (arXiv preprint 2022) [💬Hyper-text Language-image Model] CM3: A Causal Masked Multimodal Model of the Internet, Armen Aghajanyan et al. [Paper]
  • 2021

    • (arXiv preprint 2021) [💬Visual Synthesis] NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion, Chenfei Wu et al. [Paper] [Code]
      • Figure from paper (source: https://github.com/microsoft/NUWA, 2021/11/30)

    • (ICCV 2021) [💬Video-Text Alignment] TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment, Jianwei Yang et al. [Paper]
    • (arXiv preprint 2021) [💬Class-agnostic Object Detection] Multi-modal Transformers Excel at Class-agnostic Object Detection, Muhammad Maaz et al. [Paper] [Code]
    • (ACMMM 2021) [💬Video-Text Retrieval] HANet: Hierarchical Alignment Networks for Video-Text Retrieval, Peng Wu et al. [Paper] [Code]
    • (ICCV 2021) [💬Video Recognition] AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition, Rameswar Panda et al. [Paper] [Project] [Code]
    • (ICCV 2021) [💬Video Representation] CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations, Mohammadreza Zolfaghari et al. [Paper]
    • (ICCV 2021 Oral) [💬Text-guided Image Manipulation] StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery, Or Patashnik et al. [Paper] [Code] [Play]
    • (ICCV 2021) [💬Facial Editing] Talk-to-Edit: Fine-Grained Facial Editing via Dialog, Yuming Jiang et al. [Paper] [Code] [Project] [Dataset Project] [Dataset(CelebA-Dialog Dataset)]
    • (arXiv preprint 2021) [💬Video Action Recognition] ActionCLIP: A New Paradigm for Video Action Recognition, Mengmeng Wang et al. [Paper]
  • 2020

    • (EMNLP 2020) [💬Video+Language Pre-training] HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training, Linjie Li et al. [Paper] [Code]

4. Courses

Contact Me
