
Special tokens not showing up correctly when tokenized. #29

Open
amazingvince opened this issue Nov 5, 2023 · 1 comment

Comments

@amazingvince

I tried adding some special tokens to the vocabulary of a pretrained model (I made a PR for a minor code fix along the way). When I encode strings, these new tokens are sometimes broken up into several tokens instead of being encoded as a single token.

How do I make sure my special tokens always map to the same ID?

Code to reproduce what I am seeing:

import tokenmonster

# Load a pretrained vocabulary
vocab = tokenmonster.load("englishcode-32000-consistent-v1")

# Add the new special tokens
vocab.modify(["<|im_start|>", "<|im_end|>", "<s>"], None, None, 0)

# Shrink back to the original size without renumbering existing tokens
vocab.resize(32000, reset_token_ids=False)

# Tokenize some text
text = [
    "<s>Some text to turn into token IDs. Why is this happening?<|im_end|>",
    "<s>Some text to turn into token IDs. <|im_end|>",
    "<s>Some text to turn into token IDs....<|im_end|>",
]
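
For reference, a minimal sketch of how the encoding step could be checked, assuming the vocab.tokenize() and vocab.decode() calls from the tokenmonster Python bindings; the exact IDs printed will depend on the vocabulary:

# Encode each string and inspect how the special tokens come out.
# If "<|im_end|>" maps to a single stable ID, every line should end
# with that same ID; if it is being split, several IDs appear instead.
for t in text:
    ids = vocab.tokenize(t)
    print(list(ids))
    print(vocab.decode(ids))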
@alasdairforsythe
Owner

It's unclear what you're trying to do, what you expect to happen, and what is actually happening. Please provide the output you get, along with a description of the output you expected.
