Reading data is slowly! #126

Open
Lisennlp opened this issue Nov 2, 2023 · 1 comment
Comments

Lisennlp commented Nov 2, 2023

I followed the README:

  git lfs clone https://huggingface.co/datasets/EleutherAI/pythia_deduped_pile_idxmaps
  python utils/unshard_memmap.py --input_file ./pythia_deduped_pile_idxmaps/pile_0.87_deduped_text_document-00000-of-00082.bin --num_shards 83 --output_dir ./pythia_pile_idxmaps/

This gave me a 600+G file. I then used gpt-neox's dataloader to read the data, and it was very slow: it takes about 6s to read a 2048-length piece of data. Why is it so slow?

image
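
One way to narrow this down is to time raw reads from the merged file, bypassing the dataloader entirely. The sketch below is not part of gpt-neox: `time_raw_reads` is a hypothetical helper, and the default of 2 bytes per token is an assumption (tokens stored as 16-bit integers) that should be adjusted if the file uses a different dtype. If raw reads are fast, the 6s is more likely spent in the dataloader or index handling than in the disk itself.

```python
import os
import random
import time


def time_raw_reads(path, seq_len=2048, bytes_per_token=2, n_reads=100, seed=0):
    """Return the mean seconds per random read of seq_len tokens.

    Hypothetical diagnostic helper. bytes_per_token=2 is an assumption
    (16-bit token ids); change it to match the actual file dtype.
    """
    window = seq_len * bytes_per_token
    size = os.path.getsize(path)
    if size < window:
        raise ValueError("file is smaller than one read window")

    rng = random.Random(seed)
    # Last token-aligned position from which a full window still fits.
    max_start_token = (size - window) // bytes_per_token

    start = time.perf_counter()
    with open(path, "rb") as f:
        for _ in range(n_reads):
            offset = rng.randint(0, max_start_token) * bytes_per_token
            f.seek(offset)
            data = f.read(window)
            if len(data) != window:
                raise IOError("short read at offset %d" % offset)
    return (time.perf_counter() - start) / n_reads
```

Calling `time_raw_reads(...)` on the merged `.bin` file and comparing the result with the observed 6s gives a rough baseline. Note that repeated runs may be served from the OS page cache, so the first run is the most representative of cold-disk performance.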

@liu09114

I only got a 386G file: "386G Jan 30 13:28 pile_0.87_deduped_text_document.bin"
I also didn't get a '*.idx' file. Should we use the downloaded idx file directly?
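
A quick sanity check for the size discrepancy (386G here versus 600+G above) is to compare the merged file against the total size of the downloaded shards. This is a minimal sketch under the assumption that unsharding is a plain concatenation, so the two totals should match; `compare_sizes` is a hypothetical helper, not something shipped with the repository.

```python
import glob
import os


def compare_sizes(shard_glob, merged_path):
    """Compare the summed size of the shard files with the merged file.

    Hypothetical helper. Assumes the merged file is a plain concatenation
    of the shards, so a mismatch suggests missing or incomplete shards.
    """
    shards = sorted(glob.glob(shard_glob))
    shard_total = sum(os.path.getsize(p) for p in shards)
    merged = os.path.getsize(merged_path)
    return {
        "num_shards": len(shards),
        "shard_total_bytes": shard_total,
        "merged_bytes": merged,
        "match": shard_total == merged,
    }
```

For the commands in the first post, the shard pattern would be something like `./pythia_deduped_pile_idxmaps/pile_0.87_deduped_text_document-*.bin`, and `num_shards` should come out as 83 if every shard was downloaded.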
