
Baseline data #61

Draft · wants to merge 69 commits into main

Conversation

IanMagnusson commented Oct 20, 2023

Working on creating data with dolma v1.5 style decontamination from baseline datasets. Progress so far is commented below.


IanMagnusson commented Oct 20, 2023

To fix the issue of nearly all the data being removed by the decon, we tried deleting the bloom filter in S3 before rerunning, since the existing filter is read in and added to rather than rebuilt from scratch. It is unclear why this should change the filter (the data it's run on should be identical) unless something is causing the bloom filter indexing to shift such that the old filter's contents are hashed differently.

aws s3 rm s3://ai2-llm/bloom-filters/perplexity-suite-v3_option2.bin

and then we tried rerunning everything after this step:

dolma -c configs/dolma-v1_5/decontamination/step1_4-create-bloom-filter/option2.yaml dedupe

However, this still had the same issue: the dedup removed almost everything.
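
To illustrate the hashing-shift hypothesis: if bit positions are derived modulo the filter size, a filter whose bits were written under one size gives wrong answers when reinterpreted under another. A toy sketch of that failure mode (not dolma's actual implementation):

import hashlib

class ToyBloom:
    def __init__(self, size_in_bits: int, num_hashes: int = 3):
        self.size = size_in_bits
        self.bits = bytearray(size_in_bits // 8)
        self.num_hashes = num_hashes

    def _positions(self, item: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size  # depends on size

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

small = ToyBloom(size_in_bits=1024)
small.add("some eval paragraph")
# Reinterpret the same bit array under a different size:
big = ToyBloom(size_in_bits=4096)
big.bits[: len(small.bits)] = small.bits
print("some eval paragraph" in small)  # True
print("some eval paragraph" in big)    # almost certainly False: positions shifted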


IanMagnusson commented Oct 20, 2023

Tried this approach again, but this time restarting from the step below, where the eval data used to build the bloom filter is created. We first removed that step's output directory, in case the way the bloom filter creation step adds attributes to it is the problem:

dolma -c configs/dolma-v1_5/decontamination/step1_3-make-eval-set/option2.yaml mix

Additionally, we changed the bloom filter byte size in configs/baselines/decontamination/falcon-refinedweb.yaml to actually reflect the value reported during bloom filter creation (i.e. size_in_bytes: 33554432).
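
For reference, that value is exactly 32 MiB, a power of two (a quick check):

# 33554432 bytes = 32 MiB = 2**25; bloom filter sizes are typically
# rounded to powers of two, which matches the value the tool reported.
assert 33554432 == 32 * 1024**2 == 2**25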

After this I am unfortunately still seeing the same behavior, with nearly all documents being removed.

@soldni soldni self-assigned this Oct 25, 2023
IanMagnusson commented:

I tried something I just thought of to get more debugging information on the decon issue: I ran the decon pipeline using a saved copy of the option 1 bloom filter that I hadn't accidentally overwritten, so that filter should have been created correctly. However, when I run it on Falcon, it starts removing almost all documents in the same way as when I remade the bloom filter. This implies to me that the issue isn't with bloom filter creation, but rather with how we're using the filter.


soldni commented Oct 26, 2023

Issues should have been fixed with #66.


IanMagnusson commented Oct 26, 2023

Starting over from the top now with new Dolma version (commit 2ee1ae2):

conda remove -n dolma-baselines --all
aws s3 rm s3://ai2-llm/bloom-filters/perplexity-suite-v3_option2.bin

Setup Environment

Create a conda environment with Python >= 3.8. In this case, we use Python 3.10 and create the environment with Anaconda.

conda create -n dolma-baselines python=3.10

After creating the environment, activate it and install necessary tools using the included makefile.

conda activate dolma-baselines
make setup

and restart your shell. Finally, build a release wheel with maturin and install it:

maturin build -r 
pip install target/wheels/dolma-0.9.0-*.whl

Decon

Follow the steps in this README to decontaminate:

Step 1.1: copy data locally

aws s3 sync s3://ai2-llm/eval-data/perplexity/v2_small $HOME/perplexity/v2_small/documents
aws s3 sync s3://ai2-llm/eval-data/perplexity/v3 $HOME/perplexity/v3/documents

Step 1.1b: change type of IDs in v3 subset (TEMPORARY FIX)

python configs/dolma-v1_5/decontamination/fix_ids_type.py ~/perplexity/*/*/*/*/*.gz
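
For context, a plausible sketch of what an id-type fix like this might do, assuming the problem is integer id fields where downstream steps expect strings (hypothetical; the real fix_ids_type.py may differ):

import gzip
import json
import sys

# Rewrite integer "id" fields as strings in gzipped JSONL shards, in place.
for path in sys.argv[1:]:
    with gzip.open(path, "rt") as f:
        docs = [json.loads(line) for line in f]
    for doc in docs:
        doc["id"] = str(doc["id"])  # assumption: string ids are expected
    with gzip.open(path, "wt") as f:
        f.writelines(json.dumps(doc) + "\n" for doc in docs)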

Step 1.2: tag out paragraphs by uniseg length

dolma tag --documents "${HOME}/perplexity/v2_small/documents/*/*/*.gz" --taggers uniseg_length_paragraphs_with_empty_v1 not_alphanum_paragraph_v1 --processes 188
dolma tag --documents "${HOME}/perplexity/v3/documents/*/*/*.gz" --taggers uniseg_length_paragraphs_with_empty_v1 not_alphanum_paragraph_v1 --processes 188

Step 1.3: filter out paragraphs that are too short

dolma -c configs/dolma-v1_5/decontamination/step1_3-make-eval-set/option2.yaml mix

Step 1.4: create bloom filter

dolma -c configs/dolma-v1_5/decontamination/step1_4-create-bloom-filter/option2.yaml dedupe

Now let's apply this to the Pile, since we want to train on it first. First we mark contamination:

dolma -c configs/baselines/decontamination/pile.yaml dedupe
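
Note that dedupe doesn't delete anything itself: it writes attribute files marking contaminated paragraph spans, which the mix step then uses to drop text. A marked record looks roughly like this (values invented for illustration):

# Shape of a record in the attributes output (illustrative, not real data).
record = {
    "id": "pile-train-08_0-12345",
    "attributes": {
        # (start, end, score) spans of paragraphs that hit the eval-set
        # bloom filter; an empty list means the document is untouched.
        "bff_duplicate_paragraph_spans_decontamination": [[0, 412, 1.0]],
    },
}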

Then we remove contamination:

dolma -c configs/baselines/mixing/pile.json mix --processes 224

Unfortunately this still results in near total removal:

[2023-10-26T17:13:50Z INFO  dolma::shard] Dropped 1403904 of 1403954 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/08_0.json.gz              
[2023-10-26T17:13:52Z INFO  dolma::shard] Dropped 1404592 of 1404658 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/02_0.json.gz              
[2023-10-26T17:13:56Z INFO  dolma::shard] Dropped 1402981 of 1404511 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/23_4.json.gz              
[2023-10-26T17:13:57Z INFO  dolma::shard] Dropped 1403542 of 1403597 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/28_1.json.gz              
[2023-10-26T17:14:04Z INFO  dolma::shard] Dropped 1403859 of 1404028 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/21_3.json.gz 

Overall, only 145725 / 210607728 ≈ 0.0007 (about 0.07%) of documents are retained.


IanMagnusson commented Oct 26, 2023

Okay, I think the issue is that the old setup instructions had me installing the wrong wheels, so here we go again, this time with the right wheels.

Starting over from the top now with new Dolma version (commit 2ee1ae2):

conda remove -n dolma-baselines --all
aws s3 rm s3://ai2-llm/bloom-filters/perplexity-suite-v3_option2.bin
rm -r ~/perplexity/*
rm target/wheels/*
rm -r /mnt/tank/dolma_tmp/pile_*
aws s3 rm --recursive s3://ai2-llm/pretraining-data/sources/pile/v0/attributes/perplexity_suite_v3_option2/
aws s3 rm --recursive s3://ai2-llm/pretraining-data/sources/pile/v0_decon_ppl_suite_v3/

Setup Environment

Create a conda environment with Python >= 3.8. In this case, we use Python 3.10 and create the environment with Anaconda.

conda create -n dolma-baselines python=3.10

After creating the environment, activate it and install necessary tools using the included makefile.

conda activate dolma-baselines
make setup

and restart your shell. Finally, build a release wheel with maturin and install it:

maturin build -r 
pip install target/wheels/dolma-0.9.1-*.whl

Decon

Follow the steps in this README to decontaminate:

Step 1.1: copy data locally

aws s3 sync s3://ai2-llm/eval-data/perplexity/v2_small $HOME/perplexity/v2_small/documents
aws s3 sync s3://ai2-llm/eval-data/perplexity/v3 $HOME/perplexity/v3/documents

Step 1.1b: change type of IDs in v3 subset (TEMPORARY FIX)

python configs/dolma-v1_5/decontamination/fix_ids_type.py ~/perplexity/*/*/*/*/*.gz

Step 1.2: tag out paragraphs by uniseg length

dolma tag --documents "${HOME}/perplexity/v2_small/documents/*/*/*.gz" --taggers uniseg_length_paragraphs_with_empty_v1 not_alphanum_paragraph_v1 --processes 188
dolma tag --documents "${HOME}/perplexity/v3/documents/*/*/*.gz" --taggers uniseg_length_paragraphs_with_empty_v1 not_alphanum_paragraph_v1 --processes 188

Step 1.3: filter out paragraphs that are too short

dolma -c configs/dolma-v1_5/decontamination/step1_3-make-eval-set/option2.yaml mix

Step 1.4: create bloom filter

dolma -c configs/dolma-v1_5/decontamination/step1_4-create-bloom-filter/option2.yaml dedupe

Now let's apply this to the Pile, since we want to train on it first. First we mark contamination:

dolma -c configs/baselines/decontamination/pile.yaml dedupe

Then we remove contamination:

dolma -c configs/baselines/mixing/pile.json mix --processes 224

This initially errored out like this:

[2023-10-26T18:24:37Z INFO  dolma::shard] Dropped 38520 of 1404145 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/07_0.json.gz                
[2023-10-26T18:30:51Z ERROR dolma::mixer] 1 shards failed to process.                                                                                                       
Traceback (most recent call last):                                                                                                                                          
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/__init__.py", line 25, in mixer                                                       
    _dolma.mixer_entrypoint(json.dumps(config))                                                                                                                             
RuntimeError: Failed with 1 errors

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ianm/miniconda3/envs/dolma-baselines/bin/dolma", line 8, in <module>
    sys.exit(main())
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/cli/__main__.py", line 67, in main
    AVAILABLE_COMMANDS[args.__dict__.pop("command")].run_from_args(args=args, config=config)
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/cli/__init__.py", line 182, in run_from_args
    return cls.run(parsed_config)
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/cli/mixer.py", line 141, in run
    mixer(dict_config)
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/__init__.py", line 27, in mixer
    raise DolmaRustPipelineError(f"Error running mixer: {e}") from e
dolma.core.errors.DolmaRustPipelineError: Error running mixer: Failed with 1 errors

Rerunning the command didn't seem to reuse any of the already completed results, but it did finish without errors this time.

Removal is more moderate this time, though surprisingly consistent from file to file:

[2023-10-26T18:42:13Z INFO  dolma::shard] Dropped 38466 of 1402989 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/11_2.json.gz
[2023-10-26T18:42:16Z INFO  dolma::shard] Dropped 38337 of 1403669 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/13_3.json.gz
[2023-10-26T18:42:17Z INFO  dolma::shard] Dropped 38748 of 1404080 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/03_1.json.gz
[2023-10-26T18:42:17Z INFO  dolma::shard] Dropped 38472 of 1403675 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/12_4.json.gz
[2023-10-26T18:42:18Z INFO  dolma::shard] Dropped 38918 of 1403475 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/15_1.json.gz
[2023-10-26T18:42:18Z INFO  dolma::shard] Dropped 38708 of 1404626 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/10_4.json.gz
[2023-10-26T18:42:20Z INFO  dolma::shard] Dropped 38391 of 1403446 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/05_2.json.gz
[2023-10-26T18:42:21Z INFO  dolma::shard] Dropped 38592 of 1404508 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/23_3.json.gz
[2023-10-26T18:42:21Z INFO  dolma::shard] Dropped 38782 of 1404000 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/16_2.json.gz
[2023-10-26T18:42:30Z INFO  dolma::shard] Dropped 38647 of 1402989 documents from s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/11_3.json.gz

Overall, we now retain 204809882 / 210607728 ≈ 0.9725 (about 97.25%) of documents.
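
These totals can be tallied from the mixer's "Dropped X of Y" log lines, e.g. (a sketch; assumes the log was saved to a file such as mix_pile.log):

import re

dropped = total = 0
with open("mix_pile.log") as f:
    for line in f:
        m = re.search(r"Dropped (\d+) of (\d+) documents", line)
        if m:
            dropped += int(m.group(1))
            total += int(m.group(2))
print(total - dropped, total, (total - dropped) / total)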


IanMagnusson commented Oct 26, 2023

Next we try to tokenize:

dolma tokens --documents "/mnt/tank/dolma_tmp/results/pile/v0_decon_ppl_suite_v3/*.json.gz" --destination /mnt/tank/dolma_tmp/preprocessed/pile/v0_decon_ppl_suite_v3/gpt-neox-20b-pii-special --tokenizer_name_or_path allenai/eleuther-ai-gpt-neox-20b-pii-special --processes 224 --seed 3920

But this gets the following error:

Traceback (most recent call last):
  File "/home/ianm/miniconda3/envs/dolma-baselines/bin/dolma", line 8, in <module>
    sys.exit(main())
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/cli/__main__.py", line 67, in main
    AVAILABLE_COMMANDS[args.__dict__.pop("command")].run_from_args(args=args, config=config)
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/cli/__init__.py", line 182, in run_from_args
    return cls.run(parsed_config)
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/cli/tokenizer.py", line 103, in run
    tokenize_in_parallel(
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/site-packages/dolma/tokenizer/executor.py", line 191, in tokenize_in_parallel
    multiprocessing.set_start_method("spawn")
  File "/home/ianm/miniconda3/envs/dolma-baselines/lib/python3.10/multiprocessing/context.py", line 247, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been set

Luca says to just remove the offending line. So we rebuild after removing:

dolma/tokenizer/executor.py", line 191, in tokenize_in_parallel
    multiprocessing.set_start_method("spawn")
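
A more defensive alternative to deleting the line outright (a sketch, not the actual dolma patch) would be to set the start method only when none has been set yet:

import multiprocessing

# Avoid "context has already been set" by checking first...
if multiprocessing.get_start_method(allow_none=True) is None:
    multiprocessing.set_start_method("spawn")
# ...or by forcing it even if a context already exists:
# multiprocessing.set_start_method("spawn", force=True)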

Rebuild env

conda create -n dolma-baselines-fixed python=3.10
conda activate dolma-baselines-fixed
rm target/wheels/dolma-0.9.1-*.whl
maturin build -r 
pip install target/wheels/dolma-0.9.1-*.whl

Then try again:

dolma tokens --documents "/mnt/tank/dolma_tmp/results/pile/v0_decon_ppl_suite_v3/*.json.gz" --destination /mnt/tank/dolma_tmp/preprocessed/pile/v0_decon_ppl_suite_v3/gpt-neox-20b-pii-special --tokenizer_name_or_path allenai/eleuther-ai-gpt-neox-20b-pii-special --processes 224 --seed 3920

This works and we upload the results to s3://ai2-llm/preprocessed/pile/v0_decon_ppl_suite_v3/gpt-neox-20b-pii-special/

IanMagnusson commented:

Now applying all this to RedPajama we get:

parallel --eta --bar "zcat {} | jq .attributes.bff_duplicate_paragraph_spans_decontamination | grep '\[\]'" ::: /mnt/tank/dolma_tmp/results/redpajama/v1/attributes/perplexity_suite_v3_option2/split=train/dataset=*/*.gz | wc -l
parallel --eta --bar "zcat {} | wc -l" ::: /mnt/tank/dolma_tmp/results/redpajama/v1/attributes/perplexity_suite_v3_option2/split=train/dataset=*/*.gz | awk '{sum += $1} END {print sum}'

900799243 / 901687943 = 0.999014404 documents retained
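
The same check can be done in a single Python process (slower than GNU parallel, but self-contained; same attribute shards as above):

import glob
import gzip
import json

kept = total = 0
pattern = ("/mnt/tank/dolma_tmp/results/redpajama/v1/attributes/"
           "perplexity_suite_v3_option2/split=train/dataset=*/*.gz")
for path in glob.glob(pattern):
    with gzip.open(path, "rt") as f:
        for line in f:
            total += 1
            spans = json.loads(line)["attributes"][
                "bff_duplicate_paragraph_spans_decontamination"]
            if not spans:  # empty span list -> document fully retained
                kept += 1
print(kept, total, kept / total)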

And tokenize

dolma -c configs/baselines/tokenization/redpajama.yaml tokens


IanMagnusson commented Nov 8, 2023

And now falcon:

decon

dolma -c configs/baselines/decontamination/falcon-refinedweb.yaml dedupe

mix

dolma -c configs/baselines/mixing/falcon-refinedweb.json mix --processes 224

check doc removal

aws s3 sync s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0-0.05-heldout-complement/ /mnt/tank/dolma_tmp/results/falcon-refinedweb/v0-0.05-heldout-complement/ 

parallel --eta --bar "zcat {} | jq .attributes.bff_duplicate_paragraph_spans_decontamination | grep '\[\]'" ::: /mnt/tank/dolma_tmp/results/falcon-refinedweb/v0-0.05-heldout-complement/attributes/perplexity_suite_v3_option2/*.gz | wc -l

parallel --eta --bar "zcat {} | wc -l" ::: /mnt/tank/dolma_tmp/results/falcon-refinedweb/v0-0.05-heldout-complement/attributes/perplexity_suite_v3_option2/*.gz | awk '{sum += $1} END {print sum}'

912114192 / 918848690 = 0.9926707214 docs retained

Tokenize

dolma -c configs/baselines/tokenization/falcon-refinedweb.yaml tokens


IanMagnusson commented Nov 18, 2023

We're redoing the Pile tokenization now because of a bug that occurs when tokenizing with more parallel processes than there are files in the dataset. We push a new config and run:

dolma -c configs/baselines/tokenization/pile.yaml tokens

resulting in:

dolma -c configs/baselines/tokenization/pile.yaml tokens
batch_size: 10000
debug: false
destination: s3://ai2-llm/preprocessed/pile/v0_decon_ppl_suite_v3_fixed/gpt-neox-20b-pii-special
documents:
- s3://ai2-llm/pretraining-data/sources/pile/v0_decon_ppl_suite_v3/*.json.gz
dryrun: false
dtype: uint16
files_per_process: null
max_size: 1073741824
processes: 150
ring_size: 8
seed: 3920
tokenizer_name_or_path: allenai/eleuther-ai-gpt-neox-20b-pii-special
work_dir:
  input: /mnt/tank/tmp/pile_v0_decon_ppl_suite_v3_fixed_input
  output: /mnt/tank/tmp/pile_v0_decon_ppl_suite_v3_fixed_output
memmaps: 300m [2:45:43, 33.1s/m]
tokens: 307Gt [2:45:43, 30.9Mt/s]
documents: 205Md [2:45:43, 20.6kd/s]
files: 150f [2:45:43, 66.3s/f]


IanMagnusson commented Nov 18, 2023

Now let's do C4:

conda activate dolma-baselines-fixed
export TMPDIR=/mnt/tank/tmp/

This data was already decontaminated for Dolma, so we go straight to checking removal:

aws s3 sync s3://ai2-llm/pretraining-data/sources/c4/v0/attributes/perplexity_suite_v3_option2/ /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2/ 

parallel --eta --bar "zcat {} | jq .attributes.bff_duplicate_paragraph_spans_decontamination | grep '\[\]'" ::: /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2/train/*.gz | wc -l

parallel --eta --bar "zcat {} | wc -l" ::: /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2/train/*.gz | awk '{sum += $1} END {print sum}'

364156258 / 364156258 = 100% documents retained

This seems unlikely, so we try decontaminating again:

dolma -c configs/baselines/decontamination/c4.yaml dedupe

check again:

aws s3 sync s3://ai2-llm/pretraining-data/sources/c4/v0/attributes/perplexity_suite_v3_option2_redo/ /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2_redo/ 

parallel --eta --bar "zcat {} | jq .attributes.bff_duplicate_paragraph_spans_decontamination | grep '\[\]'" ::: /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2_redo/train/*.gz | wc -l

parallel --eta --bar "zcat {} | wc -l" ::: /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2_redo/train/*.gz | awk '{sum += $1} END {print sum}'

364121142 / 364156258 = 0.9999035689 doc retention rate

Mix

dolma -c configs/baselines/mixing/c4.json mix --processes 224

Check the number of files to make sure it's > 224 (the number of CPUs on this machine), to avoid the tokenizer bug above:

aws s3 ls s3://ai2-llm/pretraining-data/sources/c4/v0_decon_ppl_suite_v3/ | grep .json.gz | wc -l

496 files

Tokenize

dolma -c configs/baselines/tokenization/c4.yaml tokens
batch_size: 10000
debug: false
destination: s3://ai2-llm/preprocessed/c4/v0_decon_ppl_suite_v3/gpt-neox-20b-pii-special
documents:
- s3://ai2-llm/pretraining-data/sources/c4/v0_decon_ppl_suite_v3/*.json.gz
dryrun: false
dtype: uint16
files_per_process: null
max_size: 1073741824
processes: 224
ring_size: 8
seed: 3920
tokenizer_name_or_path: allenai/eleuther-ai-gpt-neox-20b-pii-special
work_dir:
  input: /mnt/tank/dolma_tmp/c4_input_tokenized
  output: /mnt/tank/dolma_tmp/c4_output_tokenized
memmaps: 233m [1:17:23, 19.9s/m]
tokens: 174Gt [1:17:23, 37.6Mt/s]
documents: 364Md [1:17:23, 78.4kd/s]
files: 496f [1:17:23, 9.36s/f]


IanMagnusson commented Nov 19, 2023

Now mc4:

conda activate dolma-baselines-fixed
export TMPDIR=/mnt/tank/tmp/

Dedupe

dolma -c configs/baselines/decontamination/mc4.yaml dedupe

Check removal

parallel --eta --bar "zcat {} | jq .attributes.bff_duplicate_paragraph_spans_decontamination | grep '\[\]'" ::: /mnt/tank/ai2-llm/pretraining-data/sources/mc4/en-wimbd-splits/attributes/perplexity_suite_v3_option2/train/*.gz | wc -l

parallel --eta --bar "zcat {} | wc -l" ::: /mnt/tank/ai2-llm/pretraining-data/sources/mc4/en-wimbd-splits/attributes/perplexity_suite_v3_option2/train/*.gz  | awk '{sum += $1} END {print sum}'

3928652800 / 3928733374 = 0.9999794911 documents retained

Mix

dolma -c configs/baselines/mixing/mc4.json mix --processes 224

Tokenize

dolma -c configs/baselines/tokenization/mc4.yaml tokens


IanMagnusson commented Dec 1, 2023

Now we'll make a dolma-cc-only dataset. This just needs tokenization, but for some reason it requires the code at main, commit afab18c:

conda create -n dolma-main-latest python=3.10
conda activate dolma-main-latest
mv target/wheels/ target/wheels_bak
make setup
maturin build -r
pip install target/wheels/dolma-0.9.2-cp310-cp310-manylinux_2_34_x86_64.whl

Then tokenize:

dolma -c configs/baselines/tokenization/dolma_v1_5_cc_only.yaml tokens

Base automatically changed from soldni/warc to main May 2, 2024 16:33