Example Fondant pipeline to download and filter the fondant-cc-25m dataset

ml6team/fondant-usecase-filter-creative-commons

Creative commons licensed data pipeline

Introduction

This repository contains a Fondant pipeline to load and filter the fondant-cc-25m dataset. The dataset contains more than 25 million images with a Creative Commons license, extracted from CommonCrawl.

You can either use the notebook to interactively build the pipeline, or follow along with the README below to use the CLI.

Pipeline overview

The primary goal of this sample is to showcase how you can use a Fondant pipeline and reusable components to load an image dataset from the Hugging Face Hub and download all images.

Pipeline steps:

  • Load from Hugging Face Hub: The pipeline begins by loading the image dataset from the Hugging Face Hub.
  • Download Images: The download image component downloads the images and stores them in Parquet files.
  • Filter Images: The filter image component filters images based on their resolution.
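
The filtering step can be illustrated with a minimal, standalone sketch in plain Python. This is not the code of the Fondant component: the `ImageRecord` type, the `passes_resolution_filter` helper, and the threshold values (`min_image_dim`, `max_aspect_ratio`) are all assumptions chosen for illustration. It only shows one plausible way to keep or reject an image based on its resolution.

```python
from dataclasses import dataclass


@dataclass
class ImageRecord:
    """Hypothetical record holding an image URL and its pixel dimensions."""
    url: str
    width: int
    height: int


def passes_resolution_filter(
    record: ImageRecord,
    min_image_dim: int = 200,       # assumed threshold, not taken from the pipeline
    max_aspect_ratio: float = 3.0,  # assumed threshold, not taken from the pipeline
) -> bool:
    """Keep an image if its shorter side is large enough and it is not too elongated."""
    shorter = min(record.width, record.height)
    longer = max(record.width, record.height)
    if shorter <= 0:
        # Invalid or missing dimensions: reject rather than divide by zero.
        return False
    return shorter >= min_image_dim and longer / shorter <= max_aspect_ratio


# Example data (made up for the sketch).
records = [
    ImageRecord("https://example.com/a.jpg", 1024, 768),  # large enough, ratio ~1.33
    ImageRecord("https://example.com/b.jpg", 120, 90),    # shorter side too small
    ImageRecord("https://example.com/c.jpg", 2000, 300),  # too elongated
]

kept = [r for r in records if passes_resolution_filter(r)]
print([r.url for r in kept])
```

In a real pipeline the same kind of predicate would typically be applied to a whole dataframe partition rather than to individual records.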

Running the sample pipeline and exploring the data

Following the getting started documentation, go to the src folder and run the pipeline with the LocalRunner as follows:

fondant run local pipeline.py

Note: The 'load_from_hub' component accepts an argument that sets how many rows of the dataset are loaded. To load more images from Hugging Face, increase the value on this line: "n_rows_to_load": 1000
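
As a rough sketch of that change, the snippet below treats the component arguments as an ordinary Python dictionary and overrides the row limit. Only the `"n_rows_to_load"` key and its default of 1000 come from this README; the `with_row_limit` helper is hypothetical and is not part of Fondant, and the actual arguments in pipeline.py may include other keys.

```python
def with_row_limit(arguments: dict, n_rows: int) -> dict:
    """Return a copy of the arguments with a new row limit (hypothetical helper)."""
    if n_rows <= 0:
        raise ValueError("n_rows must be a positive integer")
    # Copy instead of mutating, so the original defaults stay untouched.
    return {**arguments, "n_rows_to_load": n_rows}


# Default taken from the README; any other keys in the real pipeline are omitted here.
load_arguments = {"n_rows_to_load": 1000}

# Load ten times more rows than the default.
larger_arguments = with_row_limit(load_arguments, 10_000)
print(larger_arguments)
```

Loading more rows increases download time and disk usage, so it can be worth raising the value gradually.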

After the pipeline has completed successfully, you can explore the data with the Fondant data explorer:

fondant explore --base_path ./data-dir