Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP

This repository contains code for downloading the six datasets used in the paper Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP by Thao Nguyen, Gabriel Ilharco, Mitchell Wortsman, Sewoong Oh, Ludwig Schmidt.

TLDR: We investigate how pre-training on different web-crawled data sources affects CLIP's robustness to natural distribution shifts, and find that the robustness induced by each pre-training dataset varies widely. By analyzing the interactions between these datasets through both experiments and theoretical analysis, we also observe that simply combining multiple datasets dilutes the robustness of the best-performing one.

Abstract

Web-crawled datasets have enabled remarkable generalization capabilities in recent image-text models such as CLIP (Contrastive Language-Image pre-training) or Flamingo, but little is known about the dataset creation processes. In this work, we introduce a testbed of six publicly available data sources - YFCC, LAION, Conceptual Captions, WIT, RedCaps, Shutterstock - to investigate how pre-training distributions induce robustness in CLIP. We find that the performance of the pre-training data varies substantially across distribution shifts, with no single data source dominating. Moreover, we systematically study the interactions between these data sources and find that combining multiple sources does not necessarily yield better models, but rather dilutes the robustness of the best individual data source. We complement our empirical findings with theoretical insights from a simple setting, where combining the training data also results in diluted robustness. In addition, our theoretical model provides a candidate explanation for the success of the CLIP-based data filtering technique recently employed in the LAION dataset. Overall our results demonstrate that simply gathering a large amount of data from the web is not the most effective way to build a pre-training dataset for robust generalization, necessitating further study into dataset design.

Datasets

Each folder named after a dataset contains the code to download its data instances into WebDataset format given the corresponding metadata (a generic download sketch follows the list):

  • YFCC: we used the 15M subset of the YFCC100M dataset that the original CLIP paper used for its dataset ablation studies. The corresponding metadata can be found here.
  • LAION: our subset can be found here. We only included samples from the original LAION dataset whose 'NSFW' tag is marked as 'UNLIKELY'. Note that this subset contains slightly more than 15M samples, to account for the small fraction of bad URLs and throttled instances encountered during downloading.
  • CC12M: we obtained the training set from the official data release.
  • WIT: refer to the official data release for the metadata. Note that due to heavy throttling, we could not download WIT data in parallel on a single machine; users may consider using AWS instances to speed up this process.
  • RedCaps: the initial dataset release groups instances into topics, with one file per topic in the format <subreddit>_<year>.json. For our paper, we shuffled RedCaps instances across all of these JSON files before downloading, in order to randomize the training data (see shuffle_annotations.py). The updated JSON files can be found here.
  • Shutterstock: the metadata can be found here.
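
For illustration, the snippet below sketches how such a metadata file might be turned into WebDataset shards with the img2dataset library. The metadata path and the url/caption column names are placeholder assumptions; the per-dataset folders in this repository contain the actual download code for each source.

```python
# Minimal sketch: convert an image-text metadata file into WebDataset shards.
# The metadata path and column names are placeholders; the per-dataset folders
# in this repo contain the real download logic for each data source.
from img2dataset import download

download(
    url_list="metadata.parquet",   # placeholder metadata file with image URLs and captions
    input_format="parquet",
    url_col="url",                 # assumed column names
    caption_col="caption",
    output_folder="shards",
    output_format="webdataset",    # write .tar shards consumable by OpenCLIP
    image_size=256,
    processes_count=16,
    thread_count=64,
)
```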

Training

Our paper uses the OpenCLIP repository for training CLIP models.
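
As a rough illustration of what OpenCLIP-style training does, the sketch below shows a single contrastive training step using the open_clip Python package. The model name, optimizer settings, and batch source are illustrative assumptions; the actual runs in the paper use OpenCLIP's own training scripts and hyperparameters.

```python
# Minimal sketch of one CLIP contrastive training step with the open_clip package.
# The paper's runs use OpenCLIP's training scripts; the model name, optimizer
# settings, and batch source here are illustrative only.
import torch
import torch.nn.functional as F
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.2)

def train_step(images, captions):
    """images: preprocessed image batch; captions: list of strings."""
    texts = tokenizer(captions)
    image_features = F.normalize(model.encode_image(images), dim=-1)
    text_features = F.normalize(model.encode_text(texts), dim=-1)
    logits = model.logit_scale.exp() * image_features @ text_features.t()
    labels = torch.arange(len(images))
    # Symmetric InfoNCE loss over image-to-text and text-to-image directions.
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```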

Evaluation

Example code for evaluating OpenCLIP pre-trained models on a range of downstream settings can be found in the CLIP_benchmark repository.
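
For a self-contained example of the kind of evaluation CLIP_benchmark performs, the sketch below runs zero-shot classification with open_clip directly; the pretrained tag, prompt template, and class names are placeholder assumptions rather than the paper's exact setup.

```python
# Minimal zero-shot classification sketch with open_clip; CLIP_benchmark wraps
# this kind of evaluation for many downstream datasets. The pretrained tag,
# prompt template, and class names are placeholders.
import torch
import torch.nn.functional as F
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"  # placeholder pretrained checkpoint tag
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

class_names = ["cat", "dog"]  # placeholder class names for the target dataset
prompts = tokenizer([f"a photo of a {c}" for c in class_names])

@torch.no_grad()
def zero_shot_predict(images):
    """images: batch of preprocessed images; returns predicted class indices."""
    image_features = F.normalize(model.encode_image(images), dim=-1)
    text_features = F.normalize(model.encode_text(prompts), dim=-1)
    logits = 100.0 * image_features @ text_features.t()
    return logits.argmax(dim=-1)
```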
