
Data out of bounds when using ‘dolma tokens --dtype uint32’ #142

Open
Jackwaterveg opened this issue Mar 25, 2024 · 1 comment

Jackwaterveg commented Mar 25, 2024

[screenshot: token IDs read back from the memmap file exceed the tokenizer's vocabulary]

After running the command

dolma tokens \
    --documents "dataset/${data_source}_add_id" \
    --tokenizer.name_or_path Qwen/Qwen1.5-7B-Chat \
    --destination dataset/${data_source}_npy \
    --tokenizer.eos_token_id 151643 \
    --tokenizer.pad_token_id 151646 \
    --dtype "uint32" \
    --processes 20
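
For context, --dtype uint32 is needed because Qwen1.5's token IDs (e.g. the eos_token_id 151643 above) do not fit in uint16, whose maximum is 65535. A quick numpy check of the wrap-around that uint16 would cause:

import numpy as np

# uint16 tops out at 2**16 - 1 = 65535, below Qwen1.5's largest token IDs.
print(np.iinfo(np.uint16).max)  # 65535

# Casting an out-of-range ID to uint16 silently wraps modulo 65536:
print(np.array([151643]).astype(np.uint16))  # [20571]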

I use the code below to read the memmap file. The token IDs it returns are out of bounds, as shown above, even though the vocab size is only about 150000.

from olmo.data.memmap_dataset import MemMapDataset  # assuming OLMo's MemMapDataset reader

# filePath points at one of the memmap files produced by `dolma tokens` above
data = MemMapDataset(filePath, chunk_size=2048, memmap_dtype="uint32")
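
To check whether the values on disk are themselves out of range, here is a minimal sketch that bypasses MemMapDataset and reads the raw memmap under both dtype interpretations (the file path is a placeholder, and the vocab size is the approximate figure from above):

import numpy as np

# Placeholder path to one of the memmap files written by `dolma tokens`.
file_path = "dataset/example_npy/part-00000-00000.npy"

vocab_size = 150_000  # approximate Qwen1.5 vocab size, as reported above

# Interpret the same bytes as uint32 and as uint16.
as_u32 = np.memmap(file_path, dtype=np.uint32, mode="r")
as_u16 = np.memmap(file_path, dtype=np.uint16, mode="r")

# If the file was really written as uint32, the uint32 view should stay
# within the vocabulary; out-of-range values suggest a writer/reader
# dtype mismatch instead.
print("max token id as uint32:", int(as_u32.max()), "/ vocab:", vocab_size)
print("max token id as uint16:", int(as_u16.max()))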

soldni (Member) commented Apr 5, 2024

Thank you for the report, @Jackwaterveg. Could you re-run the command above with --dryrun to show the full configuration? Thanks!
