
Baseline data #61

Draft · wants to merge 69 commits into main

Conversation

IanMagnusson commented Oct 20, 2023

Working on creating data with dolma v1.5 style decontamination from baseline datasets. Progress so far is commented below.


IanMagnusson commented Oct 20, 2023

To fix the issue of nearly all the data being removed by the decon, we tried deleting the bloom filter in S3 before rerunning, since the existing filter is read in and added to rather than rebuilt from scratch. It is unclear why this should change the filter (the data it's run on should be identical) unless something is causing the bloom filter indexing to shift such that the old filter's contents are hashed differently.

aws s3 rm s3://ai2-llm/bloom-filters/perplexity-suite-v3_option2.bin

and then we tried rerunning everything after this step:

dolma -c configs/dolma-v1_5/decontamination/step1_4-create-bloom-filter/option2.yaml dedupe

However, this still had the same issue: the dedup removed almost everything.
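
To illustrate the hashing-shift hypothesis: if bit positions are derived modulo the filter size, a filter whose bits were written under one size gives wrong answers when reinterpreted under another. A toy sketch of that failure mode (not dolma's actual implementation):

import hashlib

class ToyBloom:
    def __init__(self, size_in_bits: int, num_hashes: int = 3):
        self.size = size_in_bits
        self.bits = bytearray(size_in_bits // 8)
        self.num_hashes = num_hashes

    def _positions(self, item: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size  # depends on size

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

small = ToyBloom(size_in_bits=1024)
small.add("some eval paragraph")
# Reinterpret the same bit array under a different size:
big = ToyBloom(size_in_bits=4096)
big.bits[: len(small.bits)] = small.bits
print("some eval paragraph" in small)  # True
print("some eval paragraph" in big)    # almost certainly False: positions shifted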


IanMagnusson commented Oct 20, 2023

Tried this approach again, but this time restarting from the step below, where the eval data used to build the bloom filter is created. We first removed that step's output directory, in case the way the bloom filter creation step adds attributes to it is the problem:

dolma -c configs/dolma-v1_5/decontamination/step1_3-make-eval-set/option2.yaml mix

Additionally, we changed the bloom filter byte size in configs/baselines/decontamination/falcon-refinedweb.yaml to actually reflect the value reported during bloom filter creation (i.e. size_in_bytes: 33554432).
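
For reference, that value is exactly 32 MiB, a power of two (a quick check):

# 33554432 bytes = 32 MiB = 2**25; bloom filter sizes are typically
# rounded to powers of two, which matches the value the tool reported.
assert 33554432 == 32 * 1024**2 == 2**25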

After this I am unfortunately still seeing the same behavior, with nearly all documents being removed.

@soldni soldni self-assigned this Oct 25, 2023
IanMagnusson commented:

I tried something I just thought of to get more debugging information on the decon issue: I ran the decon pipeline using a saved copy of the option 1 bloom filter that I hadn't accidentally overwritten, so that filter should have been created correctly. However, when I run it on Falcon, it starts removing almost all documents in the same way as when I remade the bloom filter. This implies to me that the issue isn't with bloom filter creation, but rather with how we're using the filter.


soldni commented Oct 26, 2023

Issues should have been fixed with #66.


IanMagnusson commented Oct 26, 2023

Starting over from the top now with new Dolma version (commit 2ee1ae2):

conda remove -n dolma-baselines --all
aws s3 rm s3://ai2-llm/bloom-filters/perplexity-suite-v3_option2.bin

Setup Environment

Create a conda environment with Python >= 3.8. In this case, we use Python 3.10 and create the environment with Anaconda.

conda create -n dolma-baselines python=3.10

After creating the environment, activate it and install necessary tools using the included makefile.

conda activate dolma-baselines
make setup

and restart your shell. Finally, build a release wheel with maturin and install it:

maturin build -r 
pip install target/wheels/dolma-0.9.0-*.whl

Decon

Follow the steps in this README to decontaminate:

Step 1.1: copy data locally

aws s3 sync s3://ai2-llm/eval-data/perplexity/v2_small $HOME/perplexity/v2_small/documents
aws s3 sync s3://ai2-llm/eval-data/perplexity/v3 $HOME/perplexity/v3/documents

Step 1.1b: change type of IDs in v3 subset (TEMPORARY FIX)

python configs/dolma-v1_5/decontamination/fix_ids_type.py ~/perplexity/*/*/*/*/*.gz
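
For context, a plausible sketch of what an id-type fix like this might do, assuming the problem is integer id fields where downstream steps expect strings (hypothetical; the real fix_ids_type.py may differ):

import gzip
import json
import sys

# Rewrite integer "id" fields as strings in gzipped JSONL shards, in place.
for path in sys.argv[1:]:
    with gzip.open(path, "rt") as f:
        docs = [json.loads(line) for line in f]
    for doc in docs:
        doc["id"] = str(doc["id"])  # assumption: string ids are expected
    with gzip.open(path, "wt") as f:
        f.writelines(json.dumps(doc) + "\n" for doc in docs)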

Step 1.2: tag out paragraphs by uniseg length

dolma tag --documents "${HOME}/perplexity/v2_small/documents/*/*/*.gz" --taggers uniseg_length_paragraphs_with_empty_v1 not_alphanum_paragraph_v1 --processes 188
dolma tag --documents "${HOME}/perplexity/v3/documents/*/*/*.gz" --taggers uniseg_length_paragraphs_with_empty_v1 not_alphanum_paragraph_v1 --processes 188

Step 1.3: filter out paragraphs that are too short

dolma -c configs/dolma-v1_5/decontamination/step1_3-make-eval-set/option2.yaml mix

Step 1.4: create bloom filter

dolma -c configs/dolma-v1_5/decontamination/step1_4-create-bloom-filter/option2.yaml dedupe

Now let's apply this to the Pile, since we want to train on it first. First we mark contamination:

dolma -c configs/baselines/decontamination/pile.yaml dedupe
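
Note that dedupe doesn't delete anything itself: it writes attribute files marking contaminated paragraph spans, which the mix step then uses to drop text. A marked record looks roughly like this (values invented for illustration):

# Shape of a record in the attributes output (illustrative, not real data).
record = {
    "id": "pile-train-08_0-12345",
    "attributes": {
        # (start, end, score) spans of paragraphs that hit the eval-set
        # bloom filter; an empty list means the document is untouched.
        "bff_duplicate_paragraph_spans_decontamination": [[0, 412, 1.0]],
    },
}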

Then we remove contamination:

dolma -c configs/baselines/mixing/pile.json mix --processes 224

Unfortunately this still results in near total removal:

[2023-10-26T17:13:50Z INFO  dolma::shard] Dropped 1403904 of 1403954 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/08_0.json.gz              
[2023-10-26T17:13:52Z INFO  dolma::shard] Dropped 1404592 of 1404658 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/02_0.json.gz              
[2023-10-26T17:13:56Z INFO  dolma::shard] Dropped 1402981 of 1404511 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/23_4.json.gz              
[2023-10-26T17:13:57Z INFO  dolma::shard] Dropped 1403542 of 1403597 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/28_1.json.gz              
[2023-10-26T17:14:04Z INFO  dolma::shard] Dropped 1403859 of 1404028 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/21_3.json.gz 

Overall, only 145725 / 210607728 ≈ 0.0007 (about 0.07%) of documents are retained.


IanMagnusson commented Oct 26, 2023

Okay, I think the issue is that the old setup instructions had me installing the wrong wheels, so here we go again, this time with the right wheels.

Starting over from the top now with new Dolma version (commit 2ee1ae2):

conda remove -n dolma-baselines --all
aws s3 rm s3://ai2-llm/bloom-filters/perplexity-suite-v3_option2.bin
rm -r ~/perplexity/*
rm target/wheels/*
rm -r /mnt/tank/dolma_tmp/pile_*
aws s3 rm --recursive s3://ai2-llm/pretraining-data/sources/pile/v0/attributes/perplexity_suite_v3_option2/
aws s3 rm --recursive s3://ai2-llm/pretraining-data/sources/pile/v0_decon_ppl_suite_v3/

Setup Environment

Create a conda environment with Python >= 3.8. In this case, we use Python 3.10 and create the environment with Anaconda.

conda create -n dolma-baselines python=3.10

After creating the environment, activate it and install necessary tools using the included makefile.

conda activate dolma-baselines
make setup

and restart your shell. Finally, build a release wheel with maturin and install it:

maturin build -r 
pip install target/wheels/dolma-0.9.1-*.whl

Decon

Follow the steps in this README to decontaminate:

Step 1.1: copy data locally

aws s3 sync s3://ai2-llm/eval-data/perplexity/v2_small $HOME/perplexity/v2_small/documents
aws s3 sync s3://ai2-llm/eval-data/perplexity/v3 $HOME/perplexity/v3/documents

Step 1.1b: change type of IDs in v3 subset (TEMPORARY FIX)

python configs/dolma-v1_5/decontamination/fix_ids_type.py ~/perplexity/*/*/*/*/*.gz

Step 1.2: tag out paragraphs by uniseg length

dolma tag --documents "${HOME}/perplexity/v2_small/documents/*/*/*.gz" --taggers uniseg_length_paragraphs_with_empty_v1 not_alphanum_paragraph_v1 --processes 188
dolma tag --documents "${HOME}/perplexity/v3/documents/*/*/*.gz" --taggers uniseg_length_paragraphs_with_empty_v1 not_alphanum_paragraph_v1 --processes 188

Step 1.3: filter out paragraphs that are too short

dolma -c configs/dolma-v1_5/decontamination/step1_3-make-eval-set/option2.yaml mix

Step 1.4: create bloom filter

dolma -c configs/dolma-v1_5/decontamination/step1_4-create-bloom-filter/option2.yaml dedupe

Now let's apply this to the Pile, since we want to train on it first. First we mark contamination:

dolma -c configs/baselines/decontamination/pile.yaml dedupe

Then we remove contamination:

dolma -c configs/baselines/mixing/pile.json mix --processes 224

This initially errored out like this:

[2023-10-26T18:24:37Z INFO  dolma::shard] Dropped 38520 of 1404145 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/07_0.json.gz                
[2023-10-26T18:30:51Z ERROR dolma::mixer] 1 shards failed to process.                                                                                                       
Traceback (most recent call last):                                                                                                                                          
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/__init__.py", line 25, in mixer                                                       
    _dolma.mixer_entrypoint(json.dumps(config))                                                                                                                             
RuntimeError: Failed with 1 errors

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ianm/miniconda3/envs/dolma-baselines/bin/dolma", line 8, in <module>
    sys.exit(main())
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/cli/__main__.py", line 67, in main
    AVAILABLE_COMMANDS[args.__dict__.pop("command")].run_from_args(args=args, config=config)
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/cli/__init__.py", line 182, in run_from_args
    return cls.run(parsed_config)
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/cli/mixer.py", line 141, in run
    mixer(dict_config)
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/__init__.py", line 27, in mixer
    raise DolmaRustPipelineError(f"Error running mixer: {e}") from e
dolma.core.errors.DolmaRustPipelineError: Error running mixer: Failed with 1 errors

Rerunning the command didn't seem to reuse any of the already completed results, but it did finish without errors this time.

Removal is more moderate this time, though surprisingly consistent from file to file:

[2023-10-26T18:42:13Z INFO  dolma::shard] Dropped 38466 of 1402989 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/11_2.json.gz
[2023-10-26T18:42:16Z INFO  dolma::shard] Dropped 38337 of 1403669 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/13_3.json.gz
[2023-10-26T18:42:17Z INFO  dolma::shard] Dropped 38748 of 1404080 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/03_1.json.gz
[2023-10-26T18:42:17Z INFO  dolma::shard] Dropped 38472 of 1403675 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/12_4.json.gz
[2023-10-26T18:42:18Z INFO  dolma::shard] Dropped 38918 of 1403475 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/15_1.json.gz
[2023-10-26T18:42:18Z INFO  dolma::shard] Dropped 38708 of 1404626 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/10_4.json.gz
[2023-10-26T18:42:20Z INFO  dolma::shard] Dropped 38391 of 1403446 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/05_2.json.gz
[2023-10-26T18:42:21Z INFO  dolma::shard] Dropped 38592 of 1404508 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/23_3.json.gz
[2023-10-26T18:42:21Z INFO  dolma::shard] Dropped 38782 of 1404000 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/16_2.json.gz
[2023-10-26T18:42:30Z INFO  dolma::shard] Dropped 38647 of 1402989 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/11_3.json.gz

Overall, we now retain 204809882 / 210607728 ≈ 0.9725 (about 97.25%) of documents.
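
These totals can be tallied from the mixer's "Dropped X of Y" log lines, e.g. (a sketch; assumes the log was saved to a file such as mix_pile.log):

import re

dropped = total = 0
with open("mix_pile.log") as f:
    for line in f:
        m = re.search(r"Dropped (\d+) of (\d+) documents", line)
        if m:
            dropped += int(m.group(1))
            total += int(m.group(2))
print(total - dropped, total, (total - dropped) / total)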


IanMagnusson commented Oct 26, 2023

Next we try to tokenize:

dolma tokens --documents "/mnt/tank/dolma_tmp/results/pile/v0_decon_ppl_suite_v3/*.json.gz" --destination /mnt/tank/dolma_tmp/preprocessed/pile/v0_decon_ppl_suite_v3/gpt-neox-20b-pii-special --tokenizer_name_or_path allenai/eleuther-ai-gpt-neox-20b-pii-special --processes 224 --seed 3920

But this gets the following error:

Traceback (most recent call last):
  File "/home/ianm/miniconda3/envs/dolma-baselines/bin/dolma", line 8, in <module>
    sys.exit(main())
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/cli/__main__.py", line 67, in main
    AVAILABLE_COMMANDS[args.__dict__.pop("command")].run_from_args(args=args, config=config)
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/cli/__init__.py", line 182, in run_from_args
    return cls.run(parsed_config)
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/cli/tokenizer.py", line 103, in run
    tokenize_in_parallel(
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/tokenizer/executor.py", line 191, in tokenize_in_parallel
    multiprocessing.set_start_method("spawn")
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/multiprocessing/context.py", line 247, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been set

Luca says to just remove the offending line. So we rebuild after removing:

dolma/tokenizer/executor.py", line 191, in tokenize_in_parallel
    multiprocessing.set_start_method("spawn")
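
A more defensive alternative to deleting the line outright (a sketch, not the actual dolma patch) would be to set the start method only when none has been set yet:

import multiprocessing

# Avoid "context has already been set" by checking first...
if multiprocessing.get_start_method(allow_none=True) is None:
    multiprocessing.set_start_method("spawn")
# ...or by forcing it even if a context already exists:
# multiprocessing.set_start_method("spawn", force=True)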

Rebuild env

conda create -n dolma-baselines-fixed python=3.10
conda activate dolma-baselines-fixed
rm target/wheels/dolma-0.9.1-*.whl
maturin build -r 
pip install target/wheels/dolma-0.9.1-*.whl

Then try again:

dolma tokens --documents "/mnt/tank/dolma_tmp/results/pile/v0_decon_ppl_suite_v3/*.json.gz" --destination /mnt/tank/dolma_tmp/preprocessed/pile/v0_decon_ppl_suite_v3/gpt-neox-20b-pii-special --tokenizer_name_or_path allenai/eleuther-ai-gpt-neox-20b-pii-special --processes 224 --seed 3920

This works and we upload the results to s3://ai2-llm/preprocessed/pile/v0_decon_ppl_suite_v3/gpt-neox-20b-pii-special/

IanMagnusson commented:

Now applying all this to RedPajama we get:

parallel --eta --bar "zcat {} | jq .attributes.bff_duplicate_paragraph_spans_decontamination | grep '\[\]'" ::: /mnt/tank/dolma_tmp/results/redpajama/v1/attributes/perplexity_suite_v3_option2/split=train/dataset=*/*.gz | wc -l
parallel --eta --bar "zcat {} | wc -l" ::: /mnt/tank/dolma_tmp/results/redpajama/v1/attributes/perplexity_suite_v3_option2/split=train/dataset=*/*.gz | awk '{sum += $1} END {print sum}'

900799243 / 901687943 = 0.999014404 documents retained
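
The same check can be done in a single Python process (slower than GNU parallel, but self-contained; same attribute shards as above):

import glob
import gzip
import json

kept = total = 0
pattern = ("/mnt/tank/dolma_tmp/results/redpajama/v1/attributes/"
           "perplexity_suite_v3_option2/split=train/dataset=*/*.gz")
for path in glob.glob(pattern):
    with gzip.open(path, "rt") as f:
        for line in f:
            total += 1
            spans = json.loads(line)["attributes"][
                "bff_duplicate_paragraph_spans_decontamination"]
            if not spans:  # empty span list -> document fully retained
                kept += 1
print(kept, total, kept / total)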

And tokenize

dolma -c configs/baselines/tokenization/redpajama.yaml tokens


IanMagnusson commented Nov 8, 2023

And now falcon:

decon

dolma -c configs/baselines/decontamination/falcon-refinedweb.yaml dedupe

mix

dolma -c configs/baselines/mixing/falcon-refinedweb.json mix --processes 224

check doc removal

aws s3 sync s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0-0.05-heldout-complement/ /mnt/tank/dolma_tmp/results/falcon-refinedweb/v0-0.05-heldout-complement/ 

parallel --eta --bar "zcat {} | jq .attributes.bff_duplicate_paragraph_spans_decontamination | grep '\[\]'" ::: /mnt/tank/dolma_tmp/results/falcon-refinedweb/v0-0.05-heldout-complement/attributes/perplexity_suite_v3_option2/*.gz | wc -l

parallel --eta --bar "zcat {} | wc -l" ::: /mnt/tank/dolma_tmp/results/falcon-refinedweb/v0-0.05-heldout-complement/attributes/perplexity_suite_v3_option2/*.gz | awk '{sum += $1} END {print sum}'

912114192 / 918848690 = 0.9926707214 docs retained

Tokenize

dolma -c configs/baselines/tokenization/falcon-refinedweb.yaml tokens


IanMagnusson commented Nov 18, 2023

We're redoing the Pile tokenization now because of a bug that occurs when tokenizing with more parallel processes than there are files in the dataset. We push a new config and run:

dolma -c configs/baselines/tokenization/pile.yaml tokens

resulting in:

dolma -c configs/baselines/tokenization/pile.yaml tokens
batch_size: 10000
debug: false
destination: s3://ai2-llm/preprocessed/pile/v0_decon_ppl_suite_v3_fixed/gpt-neox-20b-pii-special
documents:
- s3://ai2-llm/pretraining-data/sources/pile/v0_decon_ppl_suite_v3/*.json.gz
dryrun: false
dtype: uint16
files_per_process: null
max_size: 1073741824
processes: 150
ring_size: 8
seed: 3920
tokenizer_name_or_path: allenai/eleuther-ai-gpt-neox-20b-pii-special
work_dir:
  input: /mnt/tank/tmp/pile_v0_decon_ppl_suite_v3_fixed_input
  output: /mnt/tank/tmp/pile_v0_decon_ppl_suite_v3_fixed_output
memmaps: 300m [2:45:43, 33.1s/m]
tokens: 307Gt [2:45:43, 30.9Mt/s]
documents: 205Md [2:45:43, 20.6kd/s]
files: 150f [2:45:43, 66.3s/f]


IanMagnusson commented Nov 18, 2023

Now let's do C4:

conda activate dolma-baselines-fixed
export TMPDIR=/mnt/tank/tmp/

This data was already decontaminated for Dolma, so we go straight to checking removal:

aws s3 sync s3://ai2-llm/pretraining-data/sources/c4/v0/attributes/perplexity_suite_v3_option2/ /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2/ 

parallel --eta --bar "zcat {} | jq .attributes.bff_duplicate_paragraph_spans_decontamination | grep '\[\]'" ::: /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2/train/*.gz | wc -l

parallel --eta --bar "zcat {} | wc -l" ::: /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2/train/*.gz | awk '{sum += $1} END {print sum}'

364156258 / 364156258 = 100% documents retained

This seems unlikely, so we try decontaminating again:

dolma -c configs/baselines/decontamination/c4.yaml dedupe

check again:

aws s3 sync s3://ai2-llm/pretraining-data/sources/c4/v0/attributes/perplexity_suite_v3_option2_redo/ /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2_redo/ 

parallel --eta --bar "zcat {} | jq .attributes.bff_duplicate_paragraph_spans_decontamination | grep '\[\]'" ::: /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2_redo/train/*.gz | wc -l

parallel --eta --bar "zcat {} | wc -l" ::: /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2_redo/train/*.gz | awk '{sum += $1} END {print sum}'

364121142 / 364156258 = 0.9999035689 doc retention rate

Mix

dolma -c configs/baselines/mixing/c4.json mix --processes 224

Check the number of files to make sure it's > 224 (the number of CPUs on this machine), to avoid the tokenizer bug above:

aws s3 ls s3://ai2-llm/pretraining-data/sources/c4/v0_decon_ppl_suite_v3/ | grep .json.gz | wc -l

496 files

Tokenize

dolma -c configs/baselines/tokenization/c4.yaml tokens
batch_size: 10000
debug: false
destination: s3://ai2-llm/preprocessed/c4/v0_decon_ppl_suite_v3/gpt-neox-20b-pii-special
documents:
- s3://ai2-llm/pretraining-data/sources/c4/v0_decon_ppl_suite_v3/*.json.gz
dryrun: false
dtype: uint16
files_per_process: null
max_size: 1073741824
processes: 224
ring_size: 8
seed: 3920
tokenizer_name_or_path: allenai/eleuther-ai-gpt-neox-20b-pii-special
work_dir:
  input: /mnt/tank/dolma_tmp/c4_input_tokenized
  output: /mnt/tank/dolma_tmp/c4_output_tokenized
memmaps: 233m [1:17:23, 19.9s/m]
tokens: 174Gt [1:17:23, 37.6Mt/s]
documents: 364Md [1:17:23, 78.4kd/s]
files: 496f [1:17:23, 9.36s/f]


IanMagnusson commented Nov 19, 2023

Now mc4:

conda activate dolma-baselines-fixed
export TMPDIR=/mnt/tank/tmp/

Dedupe

dolma -c configs/baselines/decontamination/mc4.yaml dedupe

Check removal

parallel --eta --bar "zcat {} | jq .attributes.bff_duplicate_paragraph_spans_decontamination | grep '\[\]'" ::: /mnt/tank/ai2-llm/pretraining-data/sources/mc4/en-wimbd-splits/attributes/perplexity_suite_v3_option2/train/*.gz | wc -l

parallel --eta --bar "zcat {} | wc -l" ::: /mnt/tank/ai2-llm/pretraining-data/sources/mc4/en-wimbd-splits/attributes/perplexity_suite_v3_option2/train/*.gz  | awk '{sum += $1} END {print sum}'

3928652800 / 3928733374 = 0.9999794911 documents retained

Mix

dolma -c configs/baselines/mixing/mc4.json mix --processes 224

Tokenize

dolma -c configs/baselines/tokenization/mc4.yaml tokens


IanMagnusson commented Dec 1, 2023

Now we'll make a dolma-cc-only dataset. This just needs tokenization, but for some reason it requires the code at main, commit afab18c:

conda create -n dolma-main-latest python=3.10
conda activate dolma-main-latest
mv target/wheels/ target/wheels_bak
make setup
maturin build -r
pip install target/wheels/dolma-0.9.2-cp310-cp310-manylinux_2_34_x86_64.whl

Then tokenize:

dolma -c configs/baselines/tokenization/dolma_v1_5_cc_only.yaml tokens

Base automatically changed from soldni/warc to main May 2, 2024 16:33