tokens listed in user_defined_symbols tokenized as unknowns when using the "word" model_type #801

Open
lintangsutawika opened this issue Dec 15, 2022 · 0 comments

@lintangsutawika

When passing model_type="word" to spm.SentencePieceTrainer.train, tokens listed in user_defined_symbols (for example user_defined_symbols=["<s>", "</s>", "."]) are still encoded to the unk_id. With model_type="bpe" and model_type="char" the same symbols are encoded correctly.

Is this intended for word models?
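
A minimal sketch of how the comparison could be reproduced. The corpus path, model prefix, vocab_size, hard_vocab_limit setting, and both helper names are placeholders chosen for illustration, not values from my actual setup:

```python
from typing import Callable, Iterable, List


def symbols_encoded_as_unk(
    encode: Callable[[str], List[int]],
    unk_id: int,
    symbols: Iterable[str],
) -> List[str]:
    """Return the symbols whose encoding contains unk_id."""
    return [s for s in symbols if unk_id in encode(s)]


def check_model_type(model_type: str, corpus_path: str, symbols: List[str]) -> List[str]:
    """Train a small model and report which symbols encode to unk.

    Needs the sentencepiece package. corpus_path, the model prefix,
    vocab_size and hard_vocab_limit are hypothetical placeholder settings.
    """
    import sentencepiece as spm  # lazy import: the helper above works without it

    prefix = f"repro_{model_type}"
    spm.SentencePieceTrainer.train(
        input=corpus_path,
        model_prefix=prefix,
        model_type=model_type,
        vocab_size=100,           # placeholder; adjust to the corpus
        hard_vocab_limit=False,   # tolerate a corpus too small for vocab_size
        user_defined_symbols=symbols,
    )
    sp = spm.SentencePieceProcessor(model_file=f"{prefix}.model")
    return symbols_encoded_as_unk(
        lambda s: sp.encode(s, out_type=int), sp.unk_id(), symbols
    )
```

Calling check_model_type once each for "word", "bpe" and "char" on the same corpus and the same symbol list should make the difference visible: an empty list means every symbol got its own id.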

@taku910 taku910 added the bug label Apr 3, 2023
@taku910 taku910 self-assigned this Apr 22, 2023