Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP

This repository contains code for downloading the six datasets used in the paper Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP by Thao Nguyen, Gabriel Ilharco, Mitchell Wortsman, Sewoong Oh, Ludwig Schmidt.

TLDR: We investigate how pre-training on different web-crawled data sources affects CLIP's robustness to natural distribution shifts, and find that the robustness induced by each pre-training dataset varies widely. By analyzing the interactions between these datasets through both experiments and theoretical analysis, we also observe that simply combining multiple datasets dilutes the robustness of the best-performing one.

Abstract

Web-crawled datasets have enabled remarkable generalization capabilities in recent image-text models such as CLIP (Contrastive Language-Image pre-training) or Flamingo, but little is known about the dataset creation processes. In this work, we introduce a testbed of six publicly available data sources - YFCC, LAION, Conceptual Captions, WIT, RedCaps, Shutterstock - to investigate how pre-training distributions induce robustness in CLIP. We find that the performance of the pre-training data varies substantially across distribution shifts, with no single data source dominating. Moreover, we systematically study the interactions between these data sources and find that combining multiple sources does not necessarily yield better models, but rather dilutes the robustness of the best individual data source. We complement our empirical findings with theoretical insights from a simple setting, where combining the training data also results in diluted robustness. In addition, our theoretical model provides a candidate explanation for the success of the CLIP-based data filtering technique recently employed in the LAION dataset. Overall our results demonstrate that simply gathering a large amount of data from the web is not the most effective way to build a pre-training dataset for robust generalization, necessitating further study into dataset design.

Datasets

Each folder named after a dataset contains the code to download its data instances into WebDataset format given the corresponding metadata (a generic download sketch follows the list):

  • YFCC: we used the 15M subset of the YFCC100M dataset that the original CLIP paper used for its dataset ablation studies. The corresponding metadata can be found here.
  • LAION: our subset can be found here. We only included samples from the original LAION dataset whose 'NSFW' tag is marked as 'UNLIKELY'. Note that this subset contains slightly more than 15M samples, to account for the small fraction of bad URLs and throttled instances encountered during downloading.
  • CC12M: we obtained the training set from the official data release.
  • WIT: refer to the official data release for the metadata. Note that due to heavy throttling, we could not download WIT data in parallel on a single machine; users may consider using AWS instances to speed up this process.
  • RedCaps: the initial dataset release groups instances into topics, with one file per topic in the format <subreddit>_<year>.json. For our paper, we shuffled RedCaps instances across all of these JSON files before downloading, in order to randomize the training data (see shuffle_annotations.py). The updated JSON files can be found here.
  • Shutterstock: the metadata can be found here.
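
For illustration, the snippet below sketches how such a metadata file might be turned into WebDataset shards with the img2dataset library. The metadata path and the url/caption column names are placeholder assumptions; the per-dataset folders in this repository contain the actual download code for each source.

```python
# Minimal sketch: convert an image-text metadata file into WebDataset shards.
# The metadata path and column names are placeholders; the per-dataset folders
# in this repo contain the real download logic for each data source.
from img2dataset import download

download(
    url_list="metadata.parquet",   # placeholder metadata file with image URLs and captions
    input_format="parquet",
    url_col="url",                 # assumed column names
    caption_col="caption",
    output_folder="shards",
    output_format="webdataset",    # write .tar shards consumable by OpenCLIP
    image_size=256,
    processes_count=16,
    thread_count=64,
)
```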

Training

Our paper uses the OpenCLIP repository for training CLIP models.
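
As a rough illustration of what OpenCLIP-style training does, the sketch below shows a single contrastive training step using the open_clip Python package. The model name, optimizer settings, and batch source are illustrative assumptions; the actual runs in the paper use OpenCLIP's own training scripts and hyperparameters.

```python
# Minimal sketch of one CLIP contrastive training step with the open_clip package.
# The paper's runs use OpenCLIP's training scripts; the model name, optimizer
# settings, and batch source here are illustrative only.
import torch
import torch.nn.functional as F
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.2)

def train_step(images, captions):
    """images: preprocessed image batch; captions: list of strings."""
    texts = tokenizer(captions)
    image_features = F.normalize(model.encode_image(images), dim=-1)
    text_features = F.normalize(model.encode_text(texts), dim=-1)
    logits = model.logit_scale.exp() * image_features @ text_features.t()
    labels = torch.arange(len(images))
    # Symmetric InfoNCE loss over image-to-text and text-to-image directions.
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```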

Evaluation

Example code for evaluating OpenCLIP pre-trained models on a range of downstream settings can be found in the CLIP_benchmark repository.
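
For a self-contained example of the kind of evaluation CLIP_benchmark performs, the sketch below runs zero-shot classification with open_clip directly; the pretrained tag, prompt template, and class names are placeholder assumptions rather than the paper's exact setup.

```python
# Minimal zero-shot classification sketch with open_clip; CLIP_benchmark wraps
# this kind of evaluation for many downstream datasets. The pretrained tag,
# prompt template, and class names are placeholders.
import torch
import torch.nn.functional as F
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"  # placeholder pretrained checkpoint tag
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

class_names = ["cat", "dog"]  # placeholder class names for the target dataset
prompts = tokenizer([f"a photo of a {c}" for c in class_names])

@torch.no_grad()
def zero_shot_predict(images):
    """images: batch of preprocessed images; returns predicted class indices."""
    image_features = F.normalize(model.encode_image(images), dim=-1)
    text_features = F.normalize(model.encode_text(prompts), dim=-1)
    logits = 100.0 * image_features @ text_features.t()
    return logits.argmax(dim=-1)
```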
