Weird inconsistency in Tokenizer vocabulary #151

Open
javirandor opened this issue Mar 4, 2024 · 1 comment
javirandor commented Mar 4, 2024

Hello everyone!

I found a weird inconsistency in the tokenizer vocabulary. I wanted to ask why this could be happening.

I have loaded a tokenizer from HF:

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")

If I run

tokenizer.encode("\u200b")

The output is [12882]. However, looking at the vocabulary used for training (here), I cannot find the token \u200b, and that token id corresponds to a different string:

"\u00e2\u0122\u012d": 12882,

This seems to happen generally with non-ASCII Unicode characters.
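For illustration, converting the id back into a raw vocabulary token (rather than decoding it) appears to return exactly the string from the vocab file. A quick check, assuming the same tokenizer object loaded above:

tokenizer.convert_ids_to_tokens(12882)  # appears to return 'âĢĭ', i.e. "\u00e2\u0122\u012d"
tokenizer.decode([12882])               # decodes back to '\u200b'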

Why could this be happening? I just want to make sure that the tokenizer I use for training is equivalent to the HF tokenizer, since my own training (as anticipated in your README) results in a weird tokenizer.

Thanks a lot :)

haileyschoelkopf self-assigned this Mar 4, 2024
haileyschoelkopf (Collaborator) commented

I don't know exactly what's going on here yet, but I can confirm that the file at utils/20B_tokenizer.json is precisely the one used as the vocab_file during Pythia training.

Also, the following snippet shows the result of loading the two tokenizers and encoding \u200b:

>>> import transformers
>>> tok1 = transformers.PreTrainedTokenizerFast(tokenizer_file="utils/20B_tokenizer.json")
>>> tok2 = transformers.AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")

>>> tok1("\u200b")
{'input_ids': [12882], 'token_type_ids': [0], 'attention_mask': [1]}
>>> tok2("\u200b")
{'input_ids': [12882], 'attention_mask': [1]}
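For context (a hypothesis at this point, not a confirmed diagnosis), the vocab-file strings look like the byte-to-unicode representation used by GPT-2-style byte-level BPE tokenizers: each token is stored after its UTF-8 bytes are remapped to printable Unicode characters, so \u200b (UTF-8 bytes 0xE2 0x80 0x8B) would be stored as "\u00e2\u0122\u012d" while still encoding and decoding as \u200b. A minimal sketch of that mapping:

# Sketch of the GPT-2-style byte-to-unicode table (as in openai/gpt-2 encoder.py),
# shown only to illustrate why "\u200b" could appear as "\u00e2\u0122\u012d" in the vocab file.
def bytes_to_unicode():
    # Printable byte values map to themselves; all remaining bytes are shifted to 256+n.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

byte2char = bytes_to_unicode()
print("".join(byte2char[b] for b in "\u200b".encode("utf-8")))  # prints âĢĭ, i.e. "\u00e2\u0122\u012d"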
