DataComp-1B

DataComp-1B is a dataset of 1.4 billion image-text pairs collected from Common Crawl and subsequently filtered. It is derived from CommonPool as part of DataComp, a benchmark for designing multimodal datasets. Specifically, DataComp-1B is the best-performing subset of the xlarge version of CommonPool found by Gadre et al., 2023. See http://datacomp.ai/ and https://arxiv.org/abs/2304.14108 for details.

Downloading DataComp-1B

DataComp-1B can be downloaded with img2dataset by following the instructions at https://github.com/mlfoundations/datacomp/tree/main#downloading-datacomp-1b
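
For orientation, here is a minimal sketch of what an img2dataset call over the DataComp-1B metadata might look like. It is not the official procedure; the linked instructions remain authoritative. The sketch assumes the parquet metadata has already been fetched to a local directory, and that the URL and caption columns are named `url` and `text`. The helper names `build_download_kwargs` and `download_datacomp_1b`, the paths, and the image size, process, and thread counts are illustrative choices, not values taken from the DataComp repository.

```python
from typing import Any, Dict


def build_download_kwargs(
    metadata_dir: str,
    output_dir: str,
    image_size: int = 512,
    processes_count: int = 16,
    thread_count: int = 128,
) -> Dict[str, Any]:
    """Assemble keyword arguments for img2dataset.download().

    The column names below are assumptions about the metadata schema;
    check them against the parquet files you actually downloaded.
    """
    return {
        "url_list": metadata_dir,          # directory of parquet metadata files
        "input_format": "parquet",
        "url_col": "url",                  # assumed name of the image URL column
        "caption_col": "text",             # assumed name of the caption column
        "output_folder": output_dir,
        "output_format": "webdataset",     # write tar shards
        "image_size": image_size,          # illustrative value
        "resize_mode": "keep_ratio_largest",
        "resize_only_if_bigger": True,     # never upscale small images
        "processes_count": processes_count,
        "thread_count": thread_count,
        "number_sample_per_shard": 10000,
    }


def download_datacomp_1b(metadata_dir: str, output_dir: str) -> None:
    """Run the download; requires `pip install img2dataset`."""
    # Imported lazily so the kwargs helper can be inspected without the package.
    from img2dataset import download

    download(**build_download_kwargs(metadata_dir, output_dir))
```

Calling `download_datacomp_1b("path/to/metadata", "path/to/shards")` would start the download. At 1.4 billion pairs this is a multi-day, multi-terabyte job, so it is worth trying the call on a single metadata file first.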