Sentencepiece with pre-defined vocabulary #571
Comments
Could you elaborate on your request? Does "initialize" mean that we train the spm model with a pre-defined vocab, or just feed a pre-defined vocab at segmentation time? If you really want to overwrite the vocab, you can rewrite the model file directly, but that is advanced usage and on an at-your-own-risk basis.
I have a similar question, perhaps almost the same. In some applications which utilize sentencepiece models in a later downstream process [ASR, MT, etc.], the segmentation vocabulary is typically incorporated into the model. However, if the sentencepiece model is lost, it is no longer possible to perform adaptation with the downstream ASR model. As a concrete example:
It would be cool and very useful to be able to retrain the sentencepiece model with the exact, fixed vocabulary from the existing espnet model, since it would save the time required to retrain the ASR model from scratch.
I guess the example by @AdolfVonKleist is very useful. As far as I understand it, the question by @vladmosin is: since sentencepiece starts with some vocabulary that it then trims, it could be useful to have an option that would allow passing a pre-defined vocabulary to sentencepiece.
Strictly speaking, it is not possible to reproduce the same result from the vocab alone. BPE and the unigram language model both manage a score for each token, and this score cannot be reproduced from the vocab. If you are using the unigram language model, the score is essentially the same as the unigram negative log prob. Not sure how to reproduce the segmentation for BPE.
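To make the comment above concrete: once you have a frozen vocabulary and a score (log probability) per piece, unigram segmentation is just a best-path search over the input. The sketch below is a pure-Python illustration, not SentencePiece's actual implementation; the function name `viterbi_segment` and the dictionary layout are invented for this example.

```python
import math

def viterbi_segment(text, logprobs):
    """Segment `text` into the highest-scoring sequence of pieces.

    `logprobs` maps each vocabulary piece to its log probability;
    the per-piece score stored in a unigram SentencePiece model is
    essentially this quantity (as taku910 notes above).
    """
    max_len = max(len(p) for p in logprobs)
    n = len(text)
    # best[i] = score of the best segmentation of text[:i]
    best = [float("-inf")] * (n + 1)
    best[0] = 0.0
    back = [-1] * (n + 1)  # back[i] = start index of the last piece
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if piece in logprobs and best[j] + logprobs[piece] > best[i]:
                best[i] = best[j] + logprobs[piece]
                back[i] = j
    if best[n] == float("-inf"):
        return None  # text cannot be covered by the given vocabulary
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]
```

With a toy vocab where `"un"` and `"able"` are much more probable than single characters, `viterbi_segment("unable", vocab)` picks `["un", "able"]` over a character-by-character split, because the summed log probabilities are higher.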
Can someone help me with what tech stack I should be familiar with in order to contribute to Azure?
@taku910 Is this implemented already? Or any pointers on how it could be implemented? I am interested in training a unigram model with a predefined vocabulary, i.e. one that won't be expanded or reduced during training at all.
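Following taku910's point that unigram scores are essentially negative log probs, one hedged sketch of what "training" with a frozen vocabulary could reduce to: segment a corpus with the fixed vocabulary, count the resulting pieces, and convert smoothed relative frequencies into log-prob scores. Everything below is an invented illustration: `greedy_segment` uses longest-match for simplicity, whereas SentencePiece's real unigram trainer runs EM over a segmentation lattice.

```python
import math
from collections import Counter

def greedy_segment(text, vocab, max_len):
    """Longest-match segmentation (a simplification of the Viterbi
    decoding SentencePiece actually uses)."""
    pieces, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + l] in vocab:
                pieces.append(text[i:i + l])
                i += l
                break
        else:
            return None  # a character is not covered by the vocabulary
    return pieces

def estimate_scores(corpus, vocab):
    """Estimate per-piece log-prob scores for a frozen vocabulary
    from smoothed piece frequencies over a segmented corpus."""
    max_len = max(len(p) for p in vocab)
    counts = Counter()
    for sent in corpus:
        seg = greedy_segment(sent, vocab, max_len)
        if seg:
            counts.update(seg)
    total = sum(counts.values())
    # Add-one smoothing so pieces unseen in the corpus still get a
    # finite (very low) score rather than -inf.
    return {p: math.log((counts[p] + 1) / (total + len(vocab)))
            for p in vocab}
```

The resulting dict ranks frequent pieces above rare ones, which is the ordering the segmenter needs; it does not reproduce the exact scores of a lost model, matching the caveat in the comment above that the original scores cannot be recovered from the vocab alone.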
Could you please explain whether it is possible to initialize the sentencepiece algorithm with a pre-defined vocabulary? If it is not, it seems that would be a really useful option.