LumiOpen/redpajama-v2-processing

Background

  • This repo describes the deduplication of RedPajama-V2 using the LUMI supercomputer
  • The dataset comes in "raw" form, meaning that the user must do the quality filtering and deduplication themselves:
    1. Download documents, Bloom filter (exact) duplicate IDs, Minhash signatures, and quality signals
    2. Filter exact duplicates
    3. Filter low-quality documents with quality measures
    4. Fuzzy deduplication with MinhashLSH

Pipeline for full data

  • There have been several changes during the process; what follows is a description of the full process as it was actually carried out
  1. Create directory structure
  • Run get_urls.sh; the resulting layout is shown below (a minimal sketch follows the tree)
    full_data
    ├── crawl_number_1
    │   ├── {crawl_number}-document-urls.txt
    │   ├── {crawl_number}-minhash-urls.txt
    │   ├── {crawl_number}-duplicates-urls.txt
    │   ├── {crawl_number}-quality_signals-urls.txt
    │   ├── document
    │   ├── duplicates
    │   └── quality_signals
    ...
    ├── crawl_number_n
    │   └── ...
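  • As an illustration only, a minimal Python sketch of creating the skeleton above; the crawl names and the full_data root are placeholders, and the actual repository does this with get_urls.sh:

      from pathlib import Path

      # Placeholder crawl list; the real pipeline covers all RedPajama-V2 crawls.
      CRAWLS = ["2023-06", "2023-14"]
      ROOT = Path("full_data")

      for crawl in CRAWLS:
          # One subdirectory per downloaded data type, as in the tree above.
          for subdir in ("document", "duplicates", "quality_signals"):
              (ROOT / crawl / subdir).mkdir(parents=True, exist_ok=True)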
  2. Download and combine data
    • download_and_combine_all_sbatch.sh (a combination sketch follows the tree below)
    1. Download all data types
    2. Add missing IDs to documents
    3. Combine the different data-type files by language
    4. After combination, remove the sharded source files
    full_data
    ├── crawl_number_1
    │   ├── {crawl_number}-document-urls.txt
    │   ├── ...
    │   ├── {crawl_number}-quality_signals-urls.txt
    │   └── combined
    │       ├── en_document.jsonl.gz
    │       ├── en_duplicates.parquet
    │       ├── en_minhash.parquet
    │       ├── en_quality_signals.jsonl.gz
    │       ├── ...
    │       └── it_...
    ...
    ├── crawl_number_n
    │   └── ...
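  • A minimal sketch of the per-language combination step, assuming the sharded document files live under document/ and that some records may lack an id; the file naming and the fallback-id scheme are assumptions, and the real step is download_and_combine_all_sbatch.sh:

      import gzip
      import json
      from pathlib import Path

      def combine_document_shards(crawl_dir: Path, lang: str) -> None:
          """Concatenate all document shards of one language into a single
          combined jsonl.gz, adding a fallback id when a record has none."""
          out_path = crawl_dir / "combined" / f"{lang}_document.jsonl.gz"
          out_path.parent.mkdir(parents=True, exist_ok=True)
          shards = sorted((crawl_dir / "document").glob(f"*{lang}*.json.gz"))
          with gzip.open(out_path, "wt", encoding="utf-8") as out:
              for shard in shards:
                  with gzip.open(shard, "rt", encoding="utf-8") as f:
                      for line_no, line in enumerate(f):
                          doc = json.loads(line)
                          # Hypothetical fallback id: shard name plus line number.
                          doc.setdefault("id", f"{shard.stem}/{line_no}")
                          out.write(json.dumps(doc, ensure_ascii=False) + "\n")

      combine_document_shards(Path("full_data/crawl_number_1"), "en")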
  3. Remove Bloom filter duplicates
  • exact_dedup_all_sbatch.sh (a filtering sketch follows the tree below)
    • Writing large parquet files was too much for Lustre → the quality-signal files were not filtered
    • After exact dedup, gunzip all JSON files (this was also done separately for the other JSONs as I realized the project was running out of memory)
    full_data
    ├── crawl_number_1
    │   ├── {crawl_number}-document-urls.txt
    │   ├── ...
    │   ├── {crawl_number}-quality_signals-urls.txt
    │   ├── combined
    │   │   ├── en_document.jsonl.gz
    │   │   ├── ...
    │   │   └── it_...
    │   └── exact_deduplicated
    │       ├── en_document_exact_dedup.jsonl.gz
    │       ├── en_minhash_exact_dedup.parquet
    │       ├── ...
    │       └── it_...
    ...
    ├── crawl_number_n
    │   └── ...
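  • A minimal sketch of the exact-duplicate filtering, assuming the duplicates parquet lists the ids of documents to drop; the 'doc_id' and 'id' field names are assumptions, and the real step is exact_dedup_all_sbatch.sh:

      import gzip
      import json
      from pathlib import Path

      import pandas as pd

      def drop_exact_duplicates(combined_dir: Path, out_dir: Path, lang: str) -> None:
          """Remove documents whose id appears in the per-language duplicates parquet."""
          out_dir.mkdir(parents=True, exist_ok=True)
          # Assumed column name; adjust to the actual duplicates schema.
          dup_ids = set(pd.read_parquet(combined_dir / f"{lang}_duplicates.parquet")["doc_id"])
          src = combined_dir / f"{lang}_document.jsonl.gz"
          dst = out_dir / f"{lang}_document_exact_dedup.jsonl.gz"
          with gzip.open(src, "rt", encoding="utf-8") as fin, \
               gzip.open(dst, "wt", encoding="utf-8") as fout:
              for line in fin:
                  if json.loads(line)["id"] not in dup_ids:
                      fout.write(line)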
  4. Quality filter
  • Pre-computed quality signals (number_of_words, number_of_lines, number_of_characters, language_identification, perplexity, stop_words, special_characters, flagged_words, words_per_line_mean, short_line_ratio, character_repetition10gram, character_repetition5gram, word_repetition, unigram_entropy, lines_end_in_punctuation) were used
  • Filtering was based on 4 different threshold sets derived from p-quantiles, inspired by CulturaX:
    • 0.05% of the data was used to estimate the metric distributions for German, French, Italian, and Spanish
    • 0.02% of the data was used to estimate the distributions for English
    • the lower p-th percentile was used as the threshold for metrics that favor high values (e.g., number_of_words), while metrics that favor low values (e.g., flagged_words) used the upper p-th percentile
  • Regular: $p_1 = 10$, $p_3 = 90$
  • Strict: $p_1 = 20$, $p_3 = 80$
  • Even stricter: $p_1 = 30$, $p_3 = 70$
  • Strictest: $p_1 = 40$, $p_3 = 60$
  • Of these, documents and signatures were filtered using:
    • Regular for German, French, Italian, and Spanish
    • Strict for English
    • These levels were chosen based on the goal of 100B tokens per language
  • Full code for quality filtering is available here (a threshold sketch follows this list)
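  • A minimal sketch of the p-quantile thresholding, assuming the quality signals have been flattened to one value per metric and per document (the real RedPajama-V2 signal format is richer); the metric grouping below is illustrative only:

      import numpy as np

      # Illustrative grouping only: which end of a metric's distribution is "good".
      HIGH_IS_GOOD = {"number_of_words", "unigram_entropy"}
      LOW_IS_GOOD = {"flagged_words", "special_characters"}

      def build_thresholds(sample, p_low=20, p_high=80):
          """Compute per-metric cutoffs from a sample of signal dicts
          (Strict level: p1 = 20, p3 = 80)."""
          thresholds = {}
          for metric in HIGH_IS_GOOD | LOW_IS_GOOD:
              values = np.array([doc[metric] for doc in sample])
              if metric in HIGH_IS_GOOD:
                  thresholds[metric] = ("min", np.percentile(values, p_low))
              else:
                  thresholds[metric] = ("max", np.percentile(values, p_high))
          return thresholds

      def passes(signals, thresholds):
          """Keep a document only if every metric is on the right side of its cutoff."""
          for metric, (kind, cutoff) in thresholds.items():
              if kind == "min" and signals[metric] < cutoff:
                  return False
              if kind == "max" and signals[metric] > cutoff:
                  return False
          return True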
  5. MinHash deduplication
    • The English strict filter returned about 1.8B signatures → would need about 7.2TB of memory
    • The other languages returned about 600M signatures each → would need about 2.5TB of memory
    • LUMI has 8 largemem nodes with 4TB of memory, so almost all signatures would have fit into memory → however, the largemem nodes have a 24-hour time limit, so the jobs would not have finished
    • All languages EXCEPT Italian were downsampled to about 500M signatures → this number was chosen because it produces about 100B tokens per language
      • downsample_parquet.py
    • Based on partial-corpus deduplication tests, the initial estimate of the computation time was about 30h, and the following was done (a clustering sketch follows this list):
      • After downsampling, the signatures were sharded into 127 shards for better parallelization
        • shard_parquet.py
      • Next, the data was deduplicated in 200GB shards so it would fit into a smaller Slurm partition
        • minhashlsh_partial.py
      • Next, the 200GB shards were combined, sharded again into 127 shards, and fed into the full cross-crawl fuzzy dedup
        • minhashlsh.py
      • Jobs were launched with fuzzy_dedup_job_constructor.py
    • The partial dedup effort was not enough → there were still too many signatures and duplicates to complete the jobs within the time limit of the LUMI small partition
    • Next, all languages were iteratively deduplicated multiple times to reduce the number of duplicates
      • 3 rounds of $\frac{1}{8}$-, $\frac{1}{4}$-, and $\frac{1}{2}$-corpus dedups
        • shuffle_dataset.py was used to make the dedups more effective
        • after shuffling, the file was sharded
      • this effort reduced the corpus size by about 40%
      • which made the full deduplication of the target corpus feasible
      • minhashlsh_partial.py
      • minhashlsh_partial_round_2.py → same functionality, added because of the poor design of the first version
      • Jobs were launched with fuzzy_dedup_job_constructor_partial.py
      • Each shard took 0.2-6 hours with 3-16 CPUs
    • Finally, the duplicates were filtered out with filter_fuzzy_duplicates.py
      • This took ~30 hours with 32 CPUs
      • The reduction from the initial size was about 50%
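  • A simplified sketch of the core MinHashLSH step (the actual implementation builds on the cited text-dedup tooling and the scripts named above): documents that share any LSH band of their precomputed signature are linked into one cluster, and everything but one representative per cluster is reported as a duplicate; the column names, signature dtype, and band count are assumptions:

      from collections import defaultdict

      import numpy as np
      import pandas as pd

      def lsh_duplicate_ids(parquet_path, num_perm=128, bands=16):
          """Return the ids of documents to drop, keeping one document per cluster.
          Assumes columns 'id' and 'signature' (a list of num_perm hash values)."""
          df = pd.read_parquet(parquet_path)
          rows_per_band = num_perm // bands

          # Union-find over document row indices.
          parent = list(range(len(df)))

          def find(x):
              while parent[x] != x:
                  parent[x] = parent[parent[x]]  # path halving
                  x = parent[x]
              return x

          def union(a, b):
              parent[find(a)] = find(b)

          # Bucket documents by the bytes of each signature band; any collision links them.
          for band in range(bands):
              buckets = defaultdict(list)
              for idx, sig in enumerate(df["signature"]):
                  sig = np.asarray(sig, dtype=np.uint64)
                  key = sig[band * rows_per_band:(band + 1) * rows_per_band].tobytes()
                  buckets[key].append(idx)
              for members in buckets.values():
                  for other in members[1:]:
                      union(members[0], other)

          # Collect clusters and mark all but the first member as duplicates.
          clusters = defaultdict(list)
          for idx in range(len(df)):
              clusters[find(idx)].append(idx)
          duplicates = [idx for members in clusters.values() for idx in members[1:]]
          return set(df["id"].iloc[duplicates])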

Singularity

  • Dedup/combination can be run with /scratch/project_462000353/akselir/containers/preprocessing_container.sif

Resource requirements

Disk and inode requirements

  • The whole dataset is about 250TB in compressed format and uses 16.8M inodes
  • 500TB of disk and 50M inodes should be enough to process everything at the same time
  • We got only 5M inodes, so the data needed to be combined before any further processing and compressed again afterwards
  • While intermediate files were held, total disk consumption was 369TB

Computing

  • In total (including debugging and failed runs), about 250K CPU hours were used for the effort

References

   @software{chenghao_mou_2023_8364980,
     author    = {Chenghao Mou and Chris Ha and Kenneth Enevoldsen and Peiyuan Liu},
     title     = {ChenghaoMou/text-dedup: Reference Snapshot},
     month     = sep,
     year      = 2023,
     publisher = {Zenodo},
     version   = {2023.09.20},
     doi       = {10.5281/zenodo.8364980},
     url       = {https://doi.org/10.5281/zenodo.8364980}
   }

   @article{Nguyen_VanNguyen_Lai_Man_Ngo_Dernoncourt_Rossi_Nguyen_2023,
     title   = {CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages},
     author  = {Nguyen, Thuat and Van Nguyen, Chien and Lai, Viet Dac and Man, Hieu and Ngo, Nghia Trung and Dernoncourt, Franck and Rossi, Ryan A. and Nguyen, Thien Huu},
     journal = {arXiv},
     year    = {2023},
     doi     = {10.48550/arxiv.2309.09400}
   }
