tokens listed in user_defined_symbols tokenized as unknowns when using the "word" model_type #801

lintangsutawika · 2022-12-15T11:18:24Z

When passing model_type="word" to spm.SentencePieceTrainer.train, it seems that tokens listed in user_defined_symbols (for example, user_defined_symbols=["<s>", "</s>", "."]) are still encoded as unk_id. The same setup works with the BPE and char model types.

Is this intended for word models?
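
A minimal sketch of how the report could be checked. The issue does not include a reproduction script, so the toy corpus, the helper names (`pieces_encoded_as_unk`, `train_and_check`), and the `use_all_vocab=True` setting are assumptions added for illustration. The sketch trains a tiny model and lists which of the user-defined symbols produce the unknown id when encoded on their own.

```python
import os
import tempfile


def pieces_encoded_as_unk(processor, symbols):
    """Return the symbols whose encoding contains the model's unknown id.

    `processor` is expected to behave like spm.SentencePieceProcessor,
    i.e. to provide unk_id() and encode(text, out_type=int).
    """
    unk = processor.unk_id()
    return [s for s in symbols if unk in processor.encode(s, out_type=int)]


def train_and_check(model_type="word", symbols=("<s>", "</s>", ".")):
    """Train a toy model and report which symbols encode to the unknown id."""
    # Imported lazily so the helper above can be used without sentencepiece.
    import sentencepiece as spm

    with tempfile.TemporaryDirectory() as tmp:
        corpus = os.path.join(tmp, "corpus.txt")
        with open(corpus, "w", encoding="utf-8") as f:
            # Hypothetical corpus; the original report does not provide one.
            f.write("hello world .\nthis is a test .\nhello again .\n")

        prefix = os.path.join(tmp, "toy_" + model_type)
        spm.SentencePieceTrainer.train(
            input=corpus,
            model_prefix=prefix,
            model_type=model_type,
            user_defined_symbols=list(symbols),
            # Assumption: keep every token so the tiny corpus does not trip
            # the vocabulary-size check (this option applies to word/char).
            use_all_vocab=True,
        )
        sp = spm.SentencePieceProcessor(model_file=prefix + ".model")
        return pieces_encoded_as_unk(sp, symbols)
```

Calling `train_and_check("word")` and `train_and_check("char")` and comparing the two returned lists would show whether the behaviour differs between model types, as the report suggests. BPE is left out here because it needs an explicit vocab_size that fits the corpus.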

taku910 added the bug label Apr 3, 2023

taku910 self-assigned this Apr 22, 2023
