Dolma Toolkit Documentation

The Dolma toolkit enables dataset curation for pretraining AI models. Reasons to use the Dolma toolkit include:

  • High performance ⚡️ Dolma toolkit is designed to be highly performant, and can be used to process datasets with billions of documents in parallel.
  • Portable 🧳 Dolma toolkit can be run on a single machine, a cluster, or a cloud computing environment.
  • Built-in taggers 🏷 Dolma toolkit comes with a number of built-in taggers, including language detection, toxicity detection, perplexity scoring, and common filtering recipes, such as the ones used to create Gopher and C4.
  • Fast deduplication 🗑 Dolma toolkit can deduplicate documents using a Rust-based Bloom filter, which is significantly faster than other methods.
  • Extensible 🧩 Dolma toolkit is designed to be extensible and supports custom taggers (a minimal tagger sketch follows this list).
  • Cloud support ☁️ Dolma toolkit supports reading and writing data from local disk and AWS S3-compatible locations.
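
To illustrate the extensibility point above, a new tagger can be registered alongside the built-in ones. The snippet below is a minimal sketch, assuming the add_tagger decorator and the BaseTagger, Document, Span, and DocResult types exposed by the toolkit; the tagger name and length threshold are made up for illustration, so check the taggers documentation for the exact interface.

# Minimal custom tagger sketch (decorator and class names assumed from the
# toolkit's tagger interface; verify against the taggers documentation).
from dolma import add_tagger
from dolma.core.data_types import DocResult, Document, Span
from dolma.core.taggers import BaseTagger

@add_tagger("long_documents")  # hypothetical tagger name
class LongDocumentTagger(BaseTagger):
    def predict(self, doc: Document) -> DocResult:
        # Score the whole document: 1.0 if it is longer than 1,000 characters, else 0.0.
        score = 1.0 if len(doc.text) > 1000 else 0.0
        span = Span(start=0, end=len(doc.text), type="long", score=score)
        return DocResult(doc=doc, spans=[span])

Once registered, a tagger like this can be selected by name when running the tagging step, in the same way as the built-in language or toxicity taggers.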

Dataset curation with the Dolma toolkit usually happens in four steps:

  1. Using taggers, spans of documents in a dataset are tagged with properties (e.g. the language they are written in, toxicity, etc.);
  2. Documents are optionally deduplicated based on their content or metadata;
  3. Using the mixer, documents are removed or filtered depending on the values of their attributes;
  4. Finally, documents can be tokenized using any HuggingFace-compatible tokenizer (a command-line sketch of these steps follows below).

The four steps of dataset curation with Dolma.
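
Each of the four steps above corresponds to its own subcommand of the dolma command-line tool. The commands below only print each subcommand's help; the subcommand names shown are a sketch of the CLI, and the paths, tagger names, filters, and tokenizer to use are passed as options described in the documentation pages listed in the index. Run dolma --help to confirm the commands available in your installed version.

dolma tag --help     # step 1: tag document spans with attributes
dolma dedupe --help  # step 2: deduplicate documents with the Bloom filter
dolma mix --help     # step 3: filter and mix documents based on their attributes
dolma tokens --help  # step 4: tokenize documents with a HuggingFace-compatible tokenizer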

The Dolma toolkit can be installed using pip:

pip install dolma

The Dolma toolkit can be used either as a Python library or as a command-line tool. The command-line tool can be accessed using the dolma command. To see the available commands, use the --help flag.

dolma --help

Index

To read the Dolma toolkit's documentation, visit the following pages: