
Sentencepiece with pre-defined vocabulary #571

Open
vladmosin opened this issue Oct 22, 2020 · 6 comments
Labels

Comments

@vladmosin

Could you please explain whether it is possible to initialize the sentencepiece algorithm with a pre-defined vocabulary? If it is not, it seems that would be a really useful option.

@taku910
Collaborator

taku910 commented Oct 24, 2020

Could you elaborate on your request?

Does "initialize" mean training the spm model with a pre-defined vocab, or just feeding a pre-defined vocab at segmentation time?
The former is technically possible in unigram mode, but not implemented yet.
For the latter, you can use the set_vocabulary method to restrict the vocab. However, the pre-defined vocab must be a subset of the default vocab.

If you really need to overwrite the vocab, you can rewrite the model file directly, but this is advanced usage on an at-your-own-risk basis:
https://github.com/google/sentencepiece/blob/master/python/add_new_vocab.ipynb

@AdolfVonKleist
Contributor

I have a similar question, perhaps almost the same. In some applications that use sentencepiece models in a later downstream process [ASR, MT, etc.], the segmentation vocabulary is typically incorporated into the model. However, if the sentencepiece model is lost, it is no longer possible to perform adaptation with the downstream ASR model. As a concrete example:

  • I have an espnet ASR model which I trained with a significant amount of data
  • I lost (overwrote by accident) the sentencepiece model
  • I still have the vocabulary, and working espnet model, which contains the vocabulary list from the original sentencepiece model
  • I have all the text training data - but retraining produces a similar but not exactly identical model (8002 vs. 8004 pieces for an 8000-piece target)

It would be cool and very useful to be able to retrain the sentencepiece model with the exact, fixed vocabulary from the existing espnet model, since it would save the time required to retrain the ASR model from scratch.

@kr0niker

I think the example by @AdolfVonKleist is very useful. As far as I understand, the question by @vladmosin is this: since sentencepiece starts with some seed vocabulary that it then trims, it could be useful to have an option that allows passing a pre-defined vocabulary to sentencepiece.

@taku910
Collaborator

taku910 commented Oct 30, 2020

Strictly speaking, it is not possible to reproduce the same model from the vocab alone. Both BPE and the unigram language model manage a score for each token, and these scores cannot be recovered from the vocab.

If you are using the unigram language model, the score is essentially the unigram negative log probability. I am not sure how to reproduce the segmentation for BPE.
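To illustrate the unigram case, here is a small pure-Python sketch (my own illustration, not sentencepiece code) that derives unigram-style scores as negative log probabilities from piece counts over an already-segmented corpus. It only approximates what the trainer's EM procedure would assign, which is why the vocab alone cannot reproduce the original model exactly:

```python
import math
from collections import Counter


def unigram_scores(segmented_corpus):
    """Compute a unigram-style score (negative log probability) per piece,
    given a corpus already segmented into pieces. This is a rough stand-in
    for the scores a real unigram LM trainer would estimate."""
    counts = Counter(piece for sent in segmented_corpus for piece in sent)
    total = sum(counts.values())
    return {piece: -math.log(counts[piece] / total) for piece in counts}


# Toy segmented corpus: 8 piece occurrences in total.
corpus = [["▁hel", "lo", "▁wor", "ld"], ["▁hel", "lo", "▁the", "re"]]
scores = unigram_scores(corpus)
# "▁hel" occurs 2 times out of 8 → score -log(2/8) = log 4 ≈ 1.386
# "ld" occurs 1 time out of 8  → score -log(1/8) = log 8 ≈ 2.079
```

Pieces that occur more often get lower scores (higher probability), which is what drives the segmentation lattice in unigram mode.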

@adisuyash

Can someone tell me which tech stack I should be familiar with in order to contribute to azure? I want to learn more for my career growth and for more active open-source participation.

@ashutoshbsathe

"initialize" means that we train the spm model with pre-defined vocab? or just feed pre-defined vocab in segmentation time? The former can be technically possible in unigram mode, but not implemented yet.

@taku910 Is this implemented already? Or any pointers on how it could be implemented? I am interested in training a unigram model with a predefined vocabulary, i.e., one that won't be expanded or reduced during training at all.
