Sieve is a data curation method for vision-language data that enables training highly performant CLIP models. Sieve assesses the alignment of web-scale image-text pairs using image-captioning models pretrained on small, diverse, and highly curated datasets. To bridge the gap between the limited diversity of the generated synthetic captions (i.e., those produced by the image-captioning models) and the original alt-text, we quantify the semantic alignment between the generated synthetic caption and the original alt-text using a sentence transformer pretrained on a large corpus of text data. This alignment is then used as a proxy for how well the image content is aligned with its corresponding text.
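As a rough sketch of the scoring idea, assuming the sentence-transformers library (the checkpoint name and the example strings below are illustrative, not necessarily what Sieve uses):

# Sketch of the Sieve alignment score: embed the synthetic captions produced by
# an image-captioning model (e.g., BLIP) and the original alt-text with a
# sentence encoder, and use their best cosine similarity as the alignment proxy.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative checkpoint

def sieve_score(synthetic_captions: list[str], alt_text: str) -> float:
    cap_emb = encoder.encode(synthetic_captions, convert_to_tensor=True)
    txt_emb = encoder.encode(alt_text, convert_to_tensor=True)
    # Keep the caption most similar to the original alt-text.
    return util.cos_sim(cap_emb, txt_emb).max().item()

# Captions would come from the pretrained captioning model; shown here as literals.
print(sieve_score(["a dog running on the beach"], "my dog at the seaside"))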
To create the conda environment:
conda env create --name Sieve --file=environment.yml
To activate the environment:
conda activate Sieve
The webdataset_inference.py script creates a webdataset dataloader and can be used to compute alignment scores using Sieve and, optionally, CLIPScore. The input shards are distributed among the available nodes, and the output is one .parquet file per GPU. Each parquet file contains information about every sample along with the computed alignment scores.
- To compute Sieve scores, use the --captioning option:
torchrun --nnodes <NUM_NODES> --nproc_per_node <NUM_GPUS> webdataset_inference.py \
--data_dir <PATH_TO_SHARD_TAR_FILES> \
--output_dir <OUTPUT_DIR> \
--captioning \
--model_url "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_14M.pth"
The --output_dir directory will contain the output parquet files. By default, we generate 8 captions per image and save only the caption with the highest sentence similarity in the parquet files. To save all generated captions, add the --save_all_captions option. This is useful for post-processing (e.g., masking medium phrases and then recomputing alignment scores, or recomputing alignment scores using different sentence encoders).
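For example, the resulting parquet files can be inspected with pandas; a minimal sketch, assuming the cap_scores column name used by baselines.py below (other column names may differ across versions):

# Sketch: load the per-GPU parquet outputs and inspect the Sieve scores.
import glob
import pandas as pd

files = glob.glob("<OUTPUT_DIR>/*.parquet")   # one parquet file per GPU
df = pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)

print(df.columns.tolist())           # per-sample metadata and alignment scores
print(df["cap_scores"].describe())   # Sieve scores (column name used by baselines.py)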
- To compute both Sieve and CLIPScore alignment scores, use the --clipcap option:
torchrun --nnodes <NUM_NODES> --nproc_per_node <NUM_GPUS> webdataset_inference.py \
--data_dir <PATH_TO_SHARD_TAR_FILES> \
--output_dir <OUTPUT_DIR> \
--clipcap \
--model_url "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_14M.pth"
Finally, you can also use --cliping to compute only CLIPScore.
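For reference, CLIPScore measures the cosine similarity between CLIP image and text embeddings; a rough sketch using open_clip (the model name and pretraining tag below are illustrative, not necessarily the checkpoint used here):

# Sketch: CLIPScore as cosine similarity between CLIP image and text embeddings.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")  # illustrative checkpoint
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def clip_score(image_path: str, alt_text: str) -> float:
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    tokens = tokenizer([alt_text])
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(tokens)
    return torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()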
The sentence_similarity_inference.py script can be used to re-compute Sieve scores using a Sentence Transformer, the CLIP text encoder, or the BLIP text encoder. The script processes a directory of parquet files generated by the webdataset_inference.py script. To use sentence_similarity_inference.py, --save_all_captions must be passed to webdataset_inference.py so that all generated captions are saved and the alignment scores can be re-computed using different encoders. This script also enables masking medium phrases.
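Masking medium phrases removes generic filler wording before recomputing similarity; a minimal sketch with an illustrative phrase list (not necessarily the phrases used by the script):

# Sketch: strip "medium" filler phrases from captions before recomputing
# alignment scores. The phrase list below is illustrative, not the repo's.
import re

MEDIUM_PHRASES = ["a photo of", "an image of", "a picture of"]

def mask_medium_phrases(caption: str) -> str:
    for phrase in MEDIUM_PHRASES:
        caption = re.sub(phrase, "", caption, flags=re.IGNORECASE)
    return " ".join(caption.split())  # collapse leftover whitespace

print(mask_medium_phrases("A photo of a dog on the beach"))  # "a dog on the beach"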
- To compute alignment scores in the embedding space of the Sentence Transformer:
python sentence_similarity_inference.py \
--sentence_transformer_text_encoder \
--parquet_dir <PATH_TO_PARQUET_FILES>
Here, set --parquet_dir to the directory containing the parquet files generated by webdataset_inference.py.
- To compute alignment scores in the embedding space of the CLIP text encoder:
python sentence_similarity_inference.py \
--clip_text_encoder \
--parquet_dir <PATH_TO_PARQUET_FILES>
- To compute alignment scores in the embedding space of the BLIP text encoder:
python sentence_similarity_inference.py \
--blip_text_encoder \
--parquet_dir <PATH_TO_PARQUET_FILES>
- To generate a dataset that keeps the top 20% of the image-text pairs based on Sieve scores:
python baselines.py --name generic_filter \
--metadata_dir <PATH_TO_PARQUET_FILES> \
--keys_to_fuse cap_scores \
--weights_per_key 1.0 \
--fraction 0.2 \
--save_path <NAME_OF_DATASET.npy>
Using --keys_to_fuse, you can choose which columns in the parquet files to use for pruning. Here, cap_scores is the column associated with Sieve scores in the parquet files.
This script will save two files: 1) a .npy file containing the unique sample ids, and 2) a .pkl file containing the keys associated with those unique ids. This enables completely skipping the resharding of the data every time you create a new dataset. We use the .pkl file in the open_clip dataloader to skip loading any samples not in the .pkl file.
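A minimal sketch of consuming these outputs, assuming the .pkl holds the webdataset keys of the selected samples (the file names and the exact structure of the .pkl are assumptions; adjust to what baselines.py actually writes):

# Sketch: load the saved ids/keys and use them as a membership filter.
import pickle
import numpy as np

uids = np.load("<NAME_OF_DATASET.npy>", allow_pickle=True)   # unique sample ids
with open("<NAME_OF_DATASET.pkl>", "rb") as f:               # assumed file name
    selected_keys = pickle.load(f)                           # assumed: set/dict of sample keys

def keep_sample(sample_key: str) -> bool:
    # Skip any sample whose key is not in the generated subset.
    return sample_key in selected_keys

print(len(uids), "selected samples")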
- To generate a dataset that keeps the top 20% of the image-text pairs based on CLIPScore scores:
python baselines.py --name generic_filter \
--metadata_dir <PATH_TO_PARQUET_FILES> \
--keys_to_fuse clip_scores \
--weights_per_key 1.0 \
--fraction 0.2 \
--save_path <NAME_OF_DATASET.npy>
- To generate a dataset that keeps the top 20% of the image-text pairs based on fusing Sieve and CLIPScore, as described in the paper:
python baselines.py --name generic_filter \
--metadata_dir <PATH_TO_PARQUET_FILES> \
--keys_to_fuse cap_scores clip_scores \
--weights_per_key 0.5 0.5 \
--fraction 0.2 \
--save_path <NAME_OF_DATASET.npy>
Here, each pruning signal is normalized independently and then fused with a weight of 0.5 on each signal. More generally, generic_filter enables fusing an arbitrary number of pruning signals with weights defined by --weights_per_key.
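A minimal sketch of this fusion, assuming min-max normalization per signal and the column names shown above (the repo's exact normalization may differ):

# Sketch: fuse several pruning signals into one score and keep the top fraction.
import numpy as np
import pandas as pd

def fuse_and_prune(df: pd.DataFrame, keys, weights, fraction: float) -> pd.DataFrame:
    fused = np.zeros(len(df))
    for key, w in zip(keys, weights):
        col = df[key].to_numpy(dtype=float)
        # Normalize each signal independently before weighting.
        col = (col - col.min()) / (col.max() - col.min() + 1e-8)
        fused += w * col
    keep = int(len(df) * fraction)
    return df.iloc[np.argsort(-fused)[:keep]]

# e.g., fuse Sieve and CLIPScore with equal weights and keep the top 20%:
# subset = fuse_and_prune(df, ["cap_scores", "clip_scores"], [0.5, 0.5], 0.2)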
To avoid resharding the data when using the generated filters, we have adapted the open_clip dataloader to use a dictionary, passed to the training script via the subset_file argument, to determine whether a sample should be included in training.
To pretrain a CLIP model on the medium scale of DataComp using the generated .pkl file on multiple nodes, run:
sbatch scripts/slurm_train.sh \
--data_dir <PATH_TO_SHARD_TAR_FILES> \
--subset_file <PATH_TO_PICKLE_FILE> \
--scale medium \
--seed 1
Here, subset_file is an argument that can be used to pass the generated .pkl file containing the keys of the selected samples.
To evaluate pretrained CLIP models on 38 downstream tasks:
sbatch scripts/slurm_eval.sh \
-f <PATH_TO_PRETRAINED_FOLDER>
The majority of Sieve is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: datacomp is licensed under the MIT license; open_clip is licensed under https://github.com/mlfoundations/open_clip/blob/main/LICENSE; BLIP is licensed under the BSD 3-clause license.
If you use Sieve, please cite it using the BibTeX entry below:
@article{Sieve,
title={Sieve: Multimodal Dataset Pruning Using Image Captioning Models},
author={Mahmoud, Anas and Elhoushi, Mostafa and Abbas, Amro and Yang, Yu and Ardalani, Newsha and Leather, Hugh and Morcos, Ari},
journal={CVPR},
year={2024}
}