ml6team/fondant-clip-index

Building a Datacomp CLIP index with Fondant

Production-ready data processing made easy and shareable
Explore the Fondant docs »

Introduction

This repository contains the code to build a CLIP index for the Datacomp-12.8M dataset with Fondant. It should be straightforward to apply it to a different dataset.

The resulting embedded dataset and index have been published on the Hugging Face Hub here. The data repository is structured as follows:

  • data/: The dataset containing ids, urls, and CLIP embeddings
  • faiss: The faiss index
  • id_mapping/: The mapping of the faiss ids to the original urls
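A faiss search returns integer ids rather than URLs, which is why the id mapping ships alongside the index. The sketch below shows the lookup step on its own. The in-memory dictionary, the example URLs, and the `resolve_urls` helper are hypothetical and only illustrate the idea; the actual storage format of id_mapping/ is not assumed here.

```python
# Hypothetical mapping from faiss id (position in the index) to original URL.
# In practice this would be loaded from the id_mapping/ folder.
id_to_url = {
    0: "https://example.com/images/cat.jpg",
    1: "https://example.com/images/dog.jpg",
    2: "https://example.com/images/car.jpg",
}


def resolve_urls(result_ids, mapping):
    """Translate ids returned by an index search into URLs.

    faiss pads its results with -1 when fewer than k neighbours are found,
    so those entries (and any unknown ids) are skipped.
    """
    return [mapping[i] for i in result_ids if i != -1 and i in mapping]


# Ids as they might come back from a search with k=3.
urls = resolve_urls([2, 0, -1], id_to_url)
print(urls)
```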

Continue reading below to learn why a CLIP index is useful, how we created this one, and how you can use it.

Why a CLIP index?

Large (image) datasets are often unwieldy to use due to their sheer size. Assume for instance that we would like to extract all the cat images from such a dataset. We would have to look at every image to classify if it's a cat image or not. And if we want to extract all the dog images next, we again need to look at every image.

Instead, we can look at every image once, and calculate a (CLIP) embedding representing its content. Combining these embeddings into an index, we can efficiently search through the dataset with a query, finding specific images, without having to look at each one.
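To make the idea concrete, here is a minimal sketch using toy 3-dimensional vectors in place of real CLIP embeddings. The vectors, the helper names (`normalize`, `search`), and the brute-force scan are illustrative assumptions. A real index such as faiss avoids comparing the query against every embedding, but the principle is the same: the images are embedded once and can then be queried many times.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def search(embeddings, query, k=2):
    """Return the positions of the k embeddings most similar to the query."""
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, emb)), position)
        for position, emb in enumerate(embeddings)
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [position for _, position in scored[:k]]


# Toy "image embeddings", computed once for the whole dataset.
image_embeddings = [
    normalize(v)
    for v in [
        [0.9, 0.1, 0.0],  # 0: cat-like
        [0.8, 0.2, 0.1],  # 1: cat-like
        [0.1, 0.9, 0.0],  # 2: dog-like
        [0.0, 0.1, 0.9],  # 3: something else
    ]
]

# Different queries reuse the same embeddings; no image is looked at again.
cat_hits = search(image_embeddings, [1.0, 0.0, 0.0])
dog_hits = search(image_embeddings, [0.0, 1.0, 0.0], k=1)
print(cat_hits, dog_hits)
```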

This is what LAION did for their LAION-5B dataset, which made the dataset practical to use, as we did in our ControlNet example. Unfortunately, the LAION-5B dataset and index have been taken offline (temporarily), and there aren't any alternatives. This is why we built an index for the Datacomp-12.8M dataset. While it is a lot smaller than LAION-5B, it should already re-enable many use cases, and can hopefully be a first step towards building indices for more and larger datasets.

Creating the index

We leveraged Fondant to generate the CLIP index and published the pipeline in this git repository. You can find it in pipeline.py. The pipeline consists of 4 steps:

  • A load_from_hf_hub operation which loads the datacomp_small dataset from the Hugging Face Hub into the Fondant workspace and format.
  • A download_images operation which downloads the actual images from the urls in the dataset.
  • An embed_images operation which embeds the downloaded images using a CLIP model.
  • A write_to_file operation which writes the original urls and generated embeddings to the chosen destination.

You can run it by installing fondant:

pip install fondant==0.11.0

and running it with your runner of choice:

fondant run <runner> pipeline.py

Check the fondant documentation for more info.

After running the pipeline, we used autofaiss to build the CLIP index. You can use the included wrapper script build_index.py.

Once you have created the index, you can explore your index and validate that everything is working using the exploration.ipynb notebook.

Using the index

With Fondant

The easiest way to use the index is with Fondant. Fondant offers reusable operations which allow you to query the index with your data.

To see how it can be used in an end-to-end example, check our ControlNet example which uses the index to create a dataset to fine-tune a ControlNet model on a specific domain.

With Clip-Retrieval

There are other open source tools which allow you to leverage a CLIP index. We recommend clip-retrieval, which lets you host the index as a service accessible through an API.

Execution details

For the execution details of our 12.8M run, check the announcement.

What's next

Making data building collaborative

With Fondant we aim to make data building collaborative, and we will share more features built on top of the Datacomp datasets to showcase this in the future. To stay up to date, join our Discord.

Larger datasets

Depending on the popularity of this 12.8M index and the feedback we receive on it, we might generate a CLIP index for the datacomp-128M dataset. If there are other datasets you are interested in, or if you want to generate an index for a different dataset yourself, please let us know in our Discord.
