Running paragraph level deduplication on c4 #150

Open
andrewhojel opened this issue Apr 20, 2024 · 2 comments

@andrewhojel

I am trying to run paragraph-level deduplication with the dolma library and wanted to test it on c4. I downloaded allenai/c4 from Hugging Face, converted the schema to text (string, document content), id (long, unique ID), and source ("c4"), and saved the result as json.gz files of ~250 MB each. Whenever I run dolma -c c4-dedupe.yaml dedupe, the output attribute is always an empty list. Here is the YAML I am using, which is almost identical to the one provided at configs/dolma-v1_5/para_dedupe/c4.yaml:

documents:
  - /home/c4/v0/documents/*.gz

dedupe:
  name: dedupe_paragraphs
  paragraphs:
    attribute_name: bff_duplicate_paragraph_spans
  skip_empty: true

bloom_filter:
  file: /tmp/c4.bloom
  read_only: false
  estimated_doc_count: 30000000000
  desired_false_positive_rate: 1e-06

processes: 350

The machine I am using has 360 vCPUs and runs Debian 11 with Python 3.10. I tried both pip install dolma and installing the library directly from the repo; neither worked. I also built a small example input, as I saw in this discussion, and that worked totally fine, so I'm pretty confused by this result.

I would really appreciate any help or thoughts on why this might be happening.
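
For reference, the bloom_filter settings above (estimated_doc_count: 30000000000, desired_false_positive_rate: 1e-06) imply a very large filter if it is sized with the textbook Bloom filter formula. The sketch below works that arithmetic out. It assumes the classic optimal-sizing formula; whether dolma sizes its filter exactly this way is an assumption, and the function name is hypothetical.

```python
import math


def bloom_filter_size(n_items: int, false_positive_rate: float) -> tuple[int, int]:
    """Return (bits, hash_count) for a classic, optimally sized Bloom filter.

    Textbook formula, not taken from dolma's source:
        bits   = -n * ln(p) / (ln 2)^2
        hashes = (bits / n) * ln 2
    """
    bits = math.ceil(-n_items * math.log(false_positive_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / n_items * math.log(2)))
    return bits, hashes


# Values copied from the bloom_filter section of the YAML above.
bits, hashes = bloom_filter_size(30_000_000_000, 1e-6)
size_gib = bits / 8 / 2**30
print(f"{bits:.3e} bits, about {size_gib:.1f} GiB, {hashes} hash functions")
```

Under that assumption the filter comes to roughly 100 GiB, so it is worth checking that the machine has that much memory available and that /tmp can hold the file.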

@soldni
Member

soldni commented May 8, 2024

Uh, that is pretty confusing! Could you post a sample of the data referenced in your YAML file?
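
One way to pull such a sample is sketched below. It assumes each .gz file holds one JSON object per line (the issue only says "json.gz") and that paragraphs are newline-separated within text; the helper names are hypothetical and not part of dolma.

```python
import gzip
import json
from itertools import islice


def sample_records(path, n=3):
    """Read the first n lines of a gzipped JSON-lines file, skipping blank ones."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in islice(f, n) if line.strip()]


def describe_fields(record):
    """Map each top-level field name to the Python type name of its value."""
    return {key: type(value).__name__ for key, value in record.items()}


def paragraph_count(record):
    """Count non-empty newline-separated paragraphs in the 'text' field.

    Assumption: paragraphs are delimited by newlines. Returns 0 when the
    field is missing or is not a string.
    """
    text = record.get("text", "")
    if not isinstance(text, str):
        return 0
    return len([p for p in text.split("\n") if p.strip()])


# Usage, with a path you supply yourself:
#   for rec in sample_records("<one of your .gz files>"):
#       print(describe_fields(rec), paragraph_count(rec))
```

The printed field types and paragraph counts make it easy to compare the converted c4 files against the small example input that worked.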

@riturajj-cerebras

Were you able to resolve this? @andrewhojel
