
Data Prep Lab



Data Prep Lab is a community project to democratize and accelerate unstructured data preparation for LLM app developers. With the explosive growth of LLM-enabled use cases, developers are faced with the enormous challenge of preparing use case-specific unstructured data to fine-tune or instruct-tune the LLMs. As the variety of use cases grows, so does the need to support:

  • New modalities of data (code, language, speech, visual)
  • New ways of transforming the data to optimize the performance of the resulting LLMs for each specific use case.
  • Large variety in the scale of data to be processed, from laptop-scale to datacenter-scale

Data Prep Lab offers implementations of commonly needed data transformations, called modules, for both Code and Language modalities. The goal is to offer high-level APIs for developers to quickly get started in working with their data, without needing expertise in the underlying runtimes and frameworks.

πŸ“ Table of Contents

📖 About

Data Prep Lab is a toolkit for streamlining data preparation for developers looking to build LLM-enabled applications via fine-tuning or instruction-tuning. Data Prep Lab contributes a set of modules that the developer can get started with to easily build data pipelines suitable for their use case. These modules have been tested in producing pre-training datasets for the Granite open models.

The modules are built on common frameworks for Ray and Spark, collectively called the data processing library, which lets developers build new custom modules that readily scale across a variety of runtimes. Eventually, Data Prep Lab will offer consistent APIs and configurations across the following underlying runtimes:

  1. Python runtime
  2. Ray runtime (local and distributed)
  3. Spark runtime (local and distributed)
  4. No-code pipelines with KFP (local and distributed, wrapping Ray)

The current matrix for the combination of modules and supported runtimes is shown in the table below. Contributors are welcome to add new modules as well as add runtime support for existing modules!

| Modules | Python-only | Ray | Spark | KFP on Ray |
| --- | --- | --- | --- | --- |
| No-op / template | ✅ | ✅ | | ✅ |
| Doc ID annotation | ✅ | ✅ | | ✅ |
| Programming language annotation | ✅ | ✅ | | ✅ |
| Exact dedup filter | | ✅ | | ✅ |
| Fuzzy dedup filter | | ✅ | | ✅ |
| Code quality annotation | ✅ | ✅ | | ✅ |
| Malware annotation | ✅ | ✅ | | ✅ |
| Filter on annotations | ✅ | ✅ | ✅ | ✅ |
| Tokenize | ✅ | ✅ | | ✅ |

Features of the toolkit:

  • Aims to reduce the unstructured data prep burden for the "long tail" of LLM use cases
  • Growing set of module implementations across multiple runtimes and targeting laptop-scale to datacenter-scale processing
  • A growing set of sample pipelines developed for real enterprise use cases
  • Data processing library to enable contribution of new custom modules targeting new use cases
  • Kubeflow Pipelines-based workflow automation for no-code data prep

Data modalities supported:

  • Code - Support for code datasets, downloaded as .zip files of GitHub repositories and converted to .parquet files.
  • Language - Future releases will provide transforms specific to natural language; like the code transforms, they will operate on parquet files.

Support for additional data modalities is expected in the future.

Data Processing Library:

A Python-based library that provides ready-to-use transforms that can be supported across a variety of runtimes. We use the popular parquet format to store the data (code or language). Every parquet file follows a set schema. Data is converted from its raw form (e.g., zip files of GitHub repositories) to parquet files by the ingest2parquet tool, which also adds the necessary fields in the schema.
A user can use one or more of the available transforms to process their data.
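As a quick orientation (this snippet is not part of the toolkit, and the file name is a placeholder), a converted parquet file can be inspected directly with pyarrow:

import pyarrow.parquet as pq

# Load a converted parquet file and look at the schema produced by ingest2parquet
table = pq.read_table("data.parquet")
print(table.schema)
print(table.num_rows, "rows")
print(table.slice(0, 5).to_pandas())  # peek at the first few rows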

Transform design:

A transform can follow one of two patterns: annotator or filter.

  • Annotator An annotator transform adds information during processing by adding one or more columns to the parquet file. The annotator design also allows a user to verify the results of the processing before the actual filtering of the data.

  • Filter A filter transform processes the data and outputs the transformed data, e.g., exact deduplication. A general purpose SQL-based filter transform enables a powerful mechanism for identifying columns and rows of interest for downstream processing. For a new module to be added, a user can pick the right design based on the processing to be applied. More details here. A brief sketch of both patterns is shown below.
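As a minimal sketch of the two patterns (illustrative only; the real transforms use the data processing library's own interfaces, and the column names here are hypothetical):

import pandas as pd

df = pd.read_parquet("input.parquet")
# Annotator pattern: derive new information and add it as a column, keeping all rows
df["doc_length"] = df["contents"].str.len()
# Filter pattern: drop rows based on an annotation, leaving the schema intact
filtered = df[df["doc_length"] > 100]
filtered.to_parquet("output.parquet")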

Scaling of transforms:

To enable processing of large volumes of data on multi-node clusters, Ray and Spark wrappers are provided to readily scale out the Python implementations. A generalized workflow is shown here.
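As a rough illustration of the scale-out idea (this is not the toolkit's Ray wrapper, only a sketch of fanning a per-file transform across a Ray cluster; paths and column names are placeholders):

import glob
import pandas as pd
import ray

ray.init()  # connect to a local or remote Ray cluster

@ray.remote
def annotate_file(path: str) -> str:
    # Each worker annotates one parquet file independently
    df = pd.read_parquet(path)
    df["doc_length"] = df["contents"].str.len()
    out_path = path.replace(".parquet", ".annotated.parquet")
    df.to_parquet(out_path)
    return out_path

futures = [annotate_file.remote(p) for p in glob.glob("input/*.parquet")]
print(ray.get(futures))  # block until all files are processed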

Bring Your Own Transform:

You can add new transforms by bringing your own Python-based processing logic and using the data processing library to build and contribute transforms. More details on the data processing library are here.

Automation:

The toolkit also supports automation of transform execution based on Kubeflow Pipelines (KFP), tested on a Kind cluster. The KFP implementation is based on the KubeRay Operator for creating and managing the Ray cluster, and the KubeRay API server for interacting with the KubeRay operator. An additional framework along with several KFP components is used to simplify pipeline implementation.
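For orientation only, a KFP (v2 SDK) pipeline definition generally looks like the sketch below; the component and parameter names are hypothetical and do not reflect the toolkit's actual KFP components, which manage the Ray cluster through KubeRay:

from kfp import compiler, dsl

@dsl.component
def noop_transform(input_path: str, output_path: str):
    # Placeholder body; a real component would launch the transform on a Ray cluster
    print(f"processing {input_path} -> {output_path}")

@dsl.pipeline(name="noop-pipeline")
def noop_pipeline(input_path: str = "test/input", output_path: str = "test/output"):
    noop_transform(input_path=input_path, output_path=output_path)

compiler.Compiler().compile(noop_pipeline, "noop_pipeline.yaml")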

⚙ Setup

We have tried the project on different hardware/software configurations (see Apple/Mac considerations). We recommend using a laptop with at least 16GB of memory and 8 CPUs for development without KFP, and at least 32GB and preferably 16 CPUs if you plan to run KFP on Kind.

Prerequisites

  • Python 3.10 or 3.11
  • pre-commit
  • twine
  • Docker/Podman

Installation Steps

git clone git@github.com:IBM/data-prep-lab.git
cd data-prep-lab
pip install pre-commit
pip install twine
pre-commit install

Additionally, if you will be using a local Minio for S3 testing, you need to install Minio and mc. Refer to the Minio install instructions for more details.
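For reference, a local Minio instance can be exercised from Python with boto3; the endpoint and credentials below are the usual local Minio defaults, not values required by the toolkit:

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # default local Minio endpoint
    aws_access_key_id="minioadmin",        # default local Minio credentials
    aws_secret_access_key="minioadmin",
)
s3.create_bucket(Bucket="test-bucket")
s3.upload_file("data.parquet", "test-bucket", "input/data.parquet")
print(s3.list_objects_v2(Bucket="test-bucket").get("Contents", []))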

🚀 Getting Started

There are various entry points that one can choose based on their use case. Below are a few demos to get you started.

Run a single transform on local-ray

Get started by running the noop transform that performs an identity operation by following the tutorial and associated noop implementation.

Run a data pipeline on local-ray

Get started by building a data pipeline with our example pipeline (link to be added) that can run on a laptop.

Build your own sequence of transforms

Follow the documentation here to build your own pipelines.

Automate the pipeline

Data preprocessing can be automated by running transforms as a Kubeflow pipeline (KFP). See the simple transform pipeline tutorial. Future releases of Data Prep Lab will demonstrate how several simple transform pipelines can be combined into a single KFP pipeline.

The project facilitates the creation of a local Kind cluster with all the required software and test data. To work with the Kind cluster and KFP, you need to install several prerequisite software packages. Please refer to Kind preinstalled software for more details.

When you have all packages installed, you can execute

make setup

from this main package directory or from the kind directory.

When you finish working with the cluster, you can destroy it by

make clean

How to navigate and use the repository

See documentation on repository structure and its use

🀝 How to contribute

See contribution guide

⭐ Acknowledgements

Thanks to the BigCode Project, which was used to build the code quality module.