
Is it possible to extend a trained BPE model's merge operations? #118

Open
pluiez opened this issue Mar 14, 2023 · 2 comments

Comments

@pluiez

pluiez commented Mar 14, 2023

Hi, here is the situation:

  1. I pretrained a language model on an English-only corpus, using BPE tokenization with vocab_size=32000.
  2. I want to continue training the model on a Japanese corpus.

Since the tokenizer is unable to handle Japanese text, I'm wondering if it's possible to extend the original BPE tokenizer, trained on the English corpus, so that it can also tokenize Japanese. Here is my idea:

  1. Train another BPE model on the Japanese corpus with vocab_size=32000 (see the sketch below).
  2. Merge the two BPE models into a new model while keeping the English tokenization unchanged, so that English sentences are tokenized exactly as before.
  3. The resulting vocab_size should be roughly 64000, or slightly less if there are duplicates between the English and Japanese vocabularies.

I'm not sure whether it's possible to merge the two BPE models into a new model while keeping the English tokenization unchanged. Any help would be appreciated!
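
For step 1, training the separate Japanese BPE model might look roughly like this (a minimal sketch, assuming the learn_bpe() helper exported by the subword_nmt package; corpus.ja and codes.ja are placeholder file names):

```python
# Sketch: learn 32000 merge operations from a Japanese corpus with subword-nmt.
# "corpus.ja" and "codes.ja" are placeholder paths.
from subword_nmt.learn_bpe import learn_bpe

with open("corpus.ja", encoding="utf-8") as infile, \
     open("codes.ja", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=32000)
```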

@rsennrich
Owner

Technically, you can just concatenate the two BPE files (called codes_file in the README), and this should achieve your desired result. I did this back in 2015 to combine Cyrillic and Latin merge operations for Russian. A few things to pay attention to:

  • the first line of the file gives some version info. You can remove this from the second file that you concatenate to the first (see the sketch after this list).
  • the order of the files matters, since you will get different segmentations depending on the order of merge operations.
  • if there's Latin-alphabet text in the Japanese file, there is a chance that the English tokenization changes in rare cases. To prevent this, you'd have to use only the first 32000 merge operations for English text.
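
A minimal sketch of the concatenation described above (file names such as codes.en, codes.ja and codes.en-ja are placeholders, not from this thread):

```python
# Concatenate two subword-nmt codes files: English merges first, then Japanese,
# dropping the "#version: ..." header of the second file.
with open("codes.en", encoding="utf-8") as f_en, \
     open("codes.ja", encoding="utf-8") as f_ja, \
     open("codes.en-ja", "w", encoding="utf-8") as f_out:
    f_out.writelines(f_en)                     # English merges, version header kept
    ja_lines = f_ja.readlines()
    if ja_lines and ja_lines[0].startswith("#version"):
        ja_lines = ja_lines[1:]                # remove the duplicate version header
    f_out.writelines(ja_lines)                 # Japanese merges appended afterwards
```

For the last bullet, apply_bpe's BPE class (and the apply-bpe command line) accepts a merges argument, so English text could be segmented with something like BPE(codecs.open("codes.en-ja", encoding="utf-8"), merges=32000) to restrict segmentation to the first 32000 operations; double-check the option name against your installed version.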

@pluiez
Author

pluiez commented Mar 20, 2023

Thank you very much!
