Reshape error in batch viewer #158

Closed
activatedgeek opened this issue May 10, 2024 · 1 comment

activatedgeek commented May 10, 2024

Thank you for the great project!

I have successfully been able to merge all the shards from EleutherAI/pythia_deduped_pile_idxmaps.

However, when I try to get batches out of utils/batch_viewer.py, I get the following error:

    reading sizes...
    reading pointers...
    reading document index...
    creating numpy buffer of mmap...
    creating memory view of numpy buffer...
    /datasets/mmap_dataset.py:226: RuntimeWarning: overflow encountered in scalar add
      offsets = list(accumulate(sizes))
    Traceback (most recent call last):
      File "/datasets/batch_viewer.py", line 42, in <module>
        indicies = dataset[args.start_iteration*1024: args.end_iteration*1024 + 1]
                   ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/datasets/mmap_dataset.py", line 231, in __getitem__
        return np_array.reshape(-1, 2049)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
    ValueError: cannot reshape array of size 207170414058 into shape (2049)

The samples here seem to be of uneven length, so the total number of elements is not a multiple of 2049, and it makes sense that the reshape fails.
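The failure can be reproduced in miniature. The following is a minimal sketch, not the repository's code: `CHUNK` is taken from the 2049 in the traceback, and the helper `can_reshape` is a hypothetical name introduced only for illustration.

```python
import numpy as np

CHUNK = 2049  # row width used by the reshape in the traceback


def can_reshape(total_size: int, chunk: int = CHUNK) -> bool:
    """Return True if a flat array of total_size elements splits evenly into rows of `chunk`."""
    return total_size % chunk == 0


# Size reported in the ValueError above.
reported_size = 207170414058
leftover = reported_size % CHUNK  # elements that do not fill a whole row

# The same failure on a small array: 2 full rows plus 5 extra elements.
small = np.arange(2 * CHUNK + 5)
try:
    small.reshape(-1, CHUNK)
    reshape_failed = False
except ValueError:
    reshape_failed = True
```

A nonzero `leftover` means `reshape(-1, 2049)` cannot succeed, regardless of how the individual samples are laid out.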

Would you be able to help me (or just point me to a code reference) so that I can split the documents into 2049-sized chunks? For context, I only want to run evaluations on a subset of the training data. I want the chunks to be constructed in precisely the same way as during training, so that I can put them in a dataloader and simply subsample on top (perhaps with something like torch.utils.data.Subset).
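To make the request concrete, here is a rough sketch of one naive way to do this: concatenate the documents, cut the result into fixed-size rows, and pick a subset of rows. This is an assumption for illustration only and is not claimed to match how the training batches were actually built (the real pipeline may shuffle or pack documents differently). The names `chunk_tokens` and `subsample` are hypothetical.

```python
import numpy as np

CHUNK = 2049  # chunk length from the traceback above


def chunk_tokens(documents, chunk=CHUNK):
    """Concatenate variable-length documents and cut them into rows of `chunk`.

    Trailing elements that do not fill a whole row are dropped.
    """
    flat = np.concatenate([np.asarray(d, dtype=np.int64) for d in documents])
    n_rows = flat.size // chunk
    return flat[: n_rows * chunk].reshape(n_rows, chunk)


def subsample(chunks, n, seed=0):
    """Pick n distinct rows at random; return the sorted row indices and the rows."""
    rng = np.random.default_rng(seed)
    idx = np.sort(rng.choice(chunks.shape[0], size=n, replace=False))
    return idx, chunks[idx]


# Toy documents of uneven length (3000 + 1500 + 2000 = 6500 elements).
docs = [np.arange(3000), np.arange(1500), np.arange(2000)]
chunks = chunk_tokens(docs)
idx, subset = subsample(chunks, n=2)
```

The index array `idx` plays the role of the indices one would pass to something like torch.utils.data.Subset, as mentioned in the question.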

activatedgeek changed the title from "Error in batch viewer" to "Reshape error in batch viewer" on May 10, 2024
activatedgeek (Author) commented

It looks like the right dataset to use here is EleutherAI/pile-deduped-pythia-preshuffled, in which every sample has the same size of 2049.
