Recent Advances in Vision-and-Language Pre-training (VLP)

Maintained by Feilong Chen. Last updated on 2023/03/04.

Survey

VLP: A Survey on Vision-Language Pre-training, arXiv 2022

Image-based VLP

Representation Learning

Learning Transferable Visual Models From Natural Language Supervision, CLIP, ICML 2021, [code]
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019, [code]
LXMERT: Learning Cross-Modality Encoder Representations from Transformers, EMNLP 2019, [code]
VL-BERT: Pre-training of Generic Visual-Linguistic Representations, ICLR 2020, [code]
VisualBERT: A Simple and Performant Baseline for Vision and Language, arXiv 2019/08, ACL 2020, [code]
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, AAAI 2020
Unified Vision-Language Pre-Training for Image Captioning and VQA, AAAI 2020, [code], (VLP)
UNITER: Learning Universal Image-text Representations, ECCV 2020, [code]
Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks, arXiv 2019/12
InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining, arXiv 2020/03
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, ECCV 2020, [code]
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers, arXiv 2020/04
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph, arXiv 2020/06
DeVLBert: Learning Deconfounded Visio-Linguistic Representations, ACM MM 2020, [code]
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers, EMNLP 2020
SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels, ICLR 2021 submission
CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations, arXiv 2020/10
Multimodal Pretraining Unmasked: Unifying the Vision and Language BERTs, arXiv 2020/11
LAMP: Label Augmented Multimodal Pretraining, arXiv 2020/12
Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network, AAAI 2021
VinVL: Revisiting Visual Representations in Vision-Language Models, CVPR 2021, [code]
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, ICML 2021, [code]
OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation, arXiv 2021
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning, ACL 2021, [code]
How Much Can CLIP Benefit Vision-and-Language Tasks?, arXiv 2021, [code]
Unifying Vision-and-Language Tasks via Text Generation, ICML 2021, [code]
Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs, ACL 2021, [code]
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision, arXiv 2021
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts, arXiv 2021, [code]
Kaleido-BERT: Vision-Language Pre-training on Fashion Domain, CVPR 2021, [code]
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts, ICML 2022, [code]
Vision-Language Pre-Training with Triple Contrastive Learning, CVPR 2022, [code]
Unpaired Vision-Language Pre-training via Cross-Modal CutMix, ICML 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, ICML 2022, [code]
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework, ICML 2022, [code]
GIT: A Generative Image-to-text Transformer for Vision and Language, arXiv 2022, [code]
CoCa: Contrastive Captioners are Image-Text Foundation Models, arXiv 2022, [code]
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, arXiv 2022, [code]
PaLI: A Jointly-Scaled Multilingual Language-Image Model, arXiv 2022
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, arXiv 2023
Language Is Not All You Need: Aligning Perception with Language Models, arXiv 2023, [code]
Unifying Vision-Language Representation Space with Single-tower Transformer, AAAI 2023

Task-specific

Other Analysis

Multi-task Learning, 12-in-1: Multi-Task Vision and Language Representation Learning, CVPR 2020, [code]
Multi-task Learning, Unifying Vision-and-Language Tasks via Text Generation, arXiv 2021/02
Social Bias in VL Embedding, Measuring Social Biases in Grounded Vision and Language Embeddings, arXiv 2020/02, [code]
In-depth Analysis, Are we pretraining it right? Digging deeper into visio-linguistic pretraining
In-depth Analysis, Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models, ECCV 2020 Spotlight
In-depth Analysis, A Closer Look at the Robustness of Vision-and-Language Pre-trained Models, arXiv 2020/12
Adversarial Training, Large-Scale Adversarial Training for Vision-and-Language Representation Learning, NeurIPS 2020 Spotlight
Adaptive Analysis, Adaptive Transformers for Learning Multimodal Representations, ACL SRW 2020
Neural Architecture Search, Deep Multimodal Neural Architecture Search, arXiv 2020/04
Dataset perspective, Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision, arXiv 2021/02

Video-based VLP

VideoBERT: A Joint Model for Video and Language Representation Learning, ICCV 2019
Learning Video Representations Using Contrastive Bidirectional Transformers, arXiv 2019/06, (CBT)
M-BERT: Injecting Multimodal Information in the BERT Structure, arXiv 2019/08
BERT for Large-scale Video Segment Classification with Test-time Augmentation, ICCV 2019 YouTube8M workshop, [code]
Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog, AAAI 2020 DSTC8 workshop
Learning Spatiotemporal Features via Video and Text Pair Discrimination, arXiv 2020/01, (CPD), [code]
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation, arXiv 2020/02
ActBERT: Learning Global-Local Video-Text Representations, CVPR 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training, EMNLP 2020
Video-Grounded Dialogues with Pretrained Generation Language Models, ACL 2020
Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training, arXiv 2020/07
Multimodal Pretraining for Dense Video Captioning, arXiv 2020/11
Parameter Efficient Multimodal Transformers for Video Representation Learning, arXiv 2020/12
Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling, CVPR 2021

Other Transformer-based multimodal networks

Multi-Modality Cross Attention Network for Image and Sentence Matching, ICCV 2020
MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning, ACL 2020
History for Visual Dialog: Do we really need it?, ACL 2020
Cross-Modality Relevance for Reasoning on Language and Vision, ACL 2020

Other Resources

Two recent surveys on pretrained language models
- Pre-trained Models for Natural Language Processing: A Survey, arXiv 2020/03
- A Survey on Contextual Embeddings, arXiv 2020/03
Other surveys about multimodal research
- Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods, arXiv 2019
- Deep Multimodal Representation Learning: A Survey, arXiv 2019
- Multimodal Machine Learning: A Survey and Taxonomy, TPAMI 2018
- A Comprehensive Survey of Deep Learning for Image Captioning, ACM Computing Surveys 2018
Other repositories with relevant reading lists
Simple Survey on VLP
- VLP Survey on Representation Learning, Feilong Chen, BaiduYun, password: bujb
- VLP Survey on Multimodal Retrieval, Duoduo Feng, BaiduYun, password: xobv

