
Sentencepiece with pre-defined vocabulary #571

Open
vladmosin opened this issue Oct 22, 2020 · 6 comments
Labels

Comments

@vladmosin

Could you please explain whether it is possible to initialize the sentencepiece algorithm with a pre-defined vocabulary? If it is not, it seems that would be a really useful option.

@taku910
Collaborator

taku910 commented Oct 24, 2020

Could you elaborate on your request?

Does "initialize" mean training the spm model with a pre-defined vocab, or just feeding a pre-defined vocab at segmentation time?
The former is technically possible in unigram mode, but not implemented yet.
For the latter, you can use the set_vocabulary method to restrict the vocab. However, the pre-defined vocab must be a subset of the default vocab.

If you really need to overwrite the vocab, you can rewrite the model file directly, but this is advanced usage on an at-your-own-risk basis:
https://github.com/google/sentencepiece/blob/master/python/add_new_vocab.ipynb

@AdolfVonKleist
Contributor

I have a similar question, perhaps almost the same. In some applications that use sentencepiece models in a later downstream process [ASR, MT, etc.], the segmentation vocabulary is typically incorporated into the model. However, if the sentencepiece model is lost, it is no longer possible to perform adaptation with the downstream ASR model. As a concrete example:

  • I have an espnet ASR model which I trained with a significant amount of data
  • I lost (overwrote by accident) the sentencepiece model
  • I still have the vocabulary, and working espnet model, which contains the vocabulary list from the original sentencepiece model
  • I have all the text training data - but retraining produces a similar but not exactly identical model (8002 vs. 8004 pieces for an 8000-piece target)

It would be cool and very useful to be able to retrain the sentencepiece model with the exact, fixed vocabulary from the existing espnet model, since it would save the time required to retrain the ASR model from scratch.

@kr0niker

I think the example by @AdolfVonKleist is very useful. As far as I understand, the question by @vladmosin is this: since sentencepiece starts with some seed vocabulary that it then trims, it could be useful to have an option that allows passing a pre-defined vocabulary to sentencepiece.

@taku910
Collaborator

taku910 commented Oct 30, 2020

Strictly speaking, it is not possible to reproduce the same model from the vocab alone. Both BPE and the unigram language model manage a score for each token, and these scores cannot be recovered from the vocab.

If you are using the unigram language model, the score is essentially the unigram negative log probability. I am not sure how to reproduce the segmentation for BPE.
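To illustrate the unigram case, here is a small pure-Python sketch (my own illustration, not sentencepiece code) that derives unigram-style scores as negative log probabilities from piece counts over an already-segmented corpus. It only approximates what the trainer's EM procedure would assign, which is why the vocab alone cannot reproduce the original model exactly:

```python
import math
from collections import Counter


def unigram_scores(segmented_corpus):
    """Compute a unigram-style score (negative log probability) per piece,
    given a corpus already segmented into pieces. This is a rough stand-in
    for the scores a real unigram LM trainer would estimate."""
    counts = Counter(piece for sent in segmented_corpus for piece in sent)
    total = sum(counts.values())
    return {piece: -math.log(counts[piece] / total) for piece in counts}


# Toy segmented corpus: 8 piece occurrences in total.
corpus = [["▁hel", "lo", "▁wor", "ld"], ["▁hel", "lo", "▁the", "re"]]
scores = unigram_scores(corpus)
# "▁hel" occurs 2 times out of 8 → score -log(2/8) = log 4 ≈ 1.386
# "ld" occurs 1 time out of 8  → score -log(1/8) = log 8 ≈ 2.079
```

Pieces that occur more often get lower scores (higher probability), which is what drives the segmentation lattice in unigram mode.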

@adisuyash

Can someone tell me which tech stack I should be familiar with in order to contribute to azure? I want to learn more for my career growth and for more active open-source participation.

@ashutoshbsathe

"initialize" means that we train the spm model with pre-defined vocab? or just feed pre-defined vocab in segmentation time? The former can be technically possible in unigram mode, but not implemented yet.

@taku910 Is this implemented already? Or any pointers on how it could be implemented? I am interested in training a unigram model with a predefined vocabulary, i.e., one that won't be expanded or reduced during training at all.
