MultimodalTransformers

CLIP

This is a simple implementation of natural language-based image search, inspired by the CLIP approach proposed in the paper Learning Transferable Visual Models From Natural Language Supervision by OpenAI, built with PyTorch Lightning. We also use Weights & Biases for experiment tracking, visualizing results, comparing the performance of different backbone models, hyperparameter optimization, and ensuring reproducibility.

python examples/train_clip.py

This command will initialize a CLIP model with a ResNet50 image backbone and a distilbert-base-uncased text backbone.
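For orientation, the sketch below shows the symmetric contrastive (InfoNCE-style) objective that CLIP-style training optimizes: embeddings of matched image-caption pairs are pulled together while mismatched pairs are pushed apart. It is a minimal illustration of the idea; the exact loss in this repository (temperature handling, projection heads) may differ.

```python
# Sketch of the symmetric contrastive objective used in CLIP-style training.
# `image_emb` and `text_emb` are assumed to be batches of embeddings with the
# same dimensionality, produced by the image and text encoders.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (N, N) similarity matrix; the diagonal holds the matched pairs.
    logits = image_emb @ text_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text over rows, text-to-image over columns.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2
```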

📚 CLIP: Connecting Text and Images

CLIP (Contrastive Language–Image Pre-training) builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning. CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in the training dataset. This behavior turns CLIP into a zero-shot classifier: all of a dataset's classes are converted into captions such as "a photo of a dog", and CLIP predicts the class whose caption it estimates best pairs with a given image.
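The following sketch illustrates that zero-shot classification step, assuming already-trained `image_encoder` and `text_encoder` callables (placeholders, not names from this repository) that return embeddings of the same dimensionality, plus a Hugging Face-style tokenizer.

```python
# Minimal sketch of CLIP-style zero-shot classification: encode one caption per
# class name, encode the image, and pick the caption with the highest cosine
# similarity. Encoder and tokenizer objects are assumed inputs.
import torch
import torch.nn.functional as F

def zero_shot_classify(image, class_names, image_encoder, text_encoder, tokenizer):
    # Turn every class label into a natural-language caption.
    prompts = [f"a photo of a {name}" for name in class_names]
    tokens = tokenizer(prompts, padding=True, return_tensors="pt")

    with torch.no_grad():
        image_emb = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)  # (1, d)
        text_emb = F.normalize(text_encoder(**tokens), dim=-1)              # (C, d)

    # Cosine similarity between the image and every caption; the best-matching
    # caption gives the predicted class.
    similarity = image_emb @ text_emb.T                                     # (1, C)
    return class_names[similarity.argmax(dim=-1).item()]
```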

You can read more about CLIP in the OpenAI blog post and the original paper.

💿 Dataset

This implementation of CLIP supports training on two datasets: Flickr8k, which contains ~8K images with 5 captions per image, and Flickr30k, which contains ~30K images with corresponding captions.
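As a rough illustration of how such image-caption pairs can be fed to the model, here is a minimal dataset sketch. It assumes a CSV file with `image` and `caption` columns (the layout of common Flickr8k/Flickr30k dumps); the actual data loading in this repository may be organized differently.

```python
# Hypothetical image-caption dataset for CLIP training, assuming a CSV with
# `image` and `caption` columns and an image directory.
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class ImageCaptionDataset(Dataset):
    def __init__(self, csv_path, image_dir, transform, tokenizer, max_length=64):
        self.df = pd.read_csv(csv_path)     # columns: image, caption
        self.image_dir = image_dir
        self.transform = transform          # torchvision-style image transform
        self.tokenizer = tokenizer          # Hugging Face tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image = Image.open(f"{self.image_dir}/{row['image']}").convert("RGB")
        tokens = self.tokenizer(
            row["caption"],
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt",
        )
        return {
            "image": self.transform(image),
            "input_ids": tokens["input_ids"].squeeze(0),
            "attention_mask": tokens["attention_mask"].squeeze(0),
        }
```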

🤖 Model

A CLIP model uses a text encoder and an image encoder. This repository supports pulling image models from PyTorch Image Models (timm) and text models from Hugging Face Transformers.
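The sketch below shows one way the two encoders could be assembled from those libraries; the class name, projection size, and pooling choice are illustrative assumptions, not necessarily what this repository uses.

```python
# Sketch of a dual-encoder built from timm and Hugging Face Transformers.
import timm
import torch.nn as nn
from transformers import AutoModel

class CLIPEncoders(nn.Module):
    def __init__(self, image_model="resnet50",
                 text_model="distilbert-base-uncased", embed_dim=256):
        super().__init__()
        # Image backbone from timm; num_classes=0 drops the classifier head
        # so the model returns pooled features.
        self.image_backbone = timm.create_model(image_model, pretrained=True, num_classes=0)
        self.text_backbone = AutoModel.from_pretrained(text_model)
        # Project both modalities into a shared embedding space.
        self.image_proj = nn.Linear(self.image_backbone.num_features, embed_dim)
        self.text_proj = nn.Linear(self.text_backbone.config.hidden_size, embed_dim)

    def encode_image(self, images):
        return self.image_proj(self.image_backbone(images))

    def encode_text(self, input_ids, attention_mask):
        hidden = self.text_backbone(input_ids=input_ids, attention_mask=attention_mask)
        # Use the first token's hidden state as the sentence embedding.
        return self.text_proj(hidden.last_hidden_state[:, 0])
```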
