Vocabulary builder for BERT

A modified, simplified version of text_encoder_build_subword.py and its dependencies from the tensor2tensor library, adapted so that its output fits Google Research's open-source BERT project.

Although Google released pre-trained BERT models and training scripts, they did not release the code for generating a wordpiece (subword) vocabulary that matches the vocab.txt of the released models.
As they mentioned, the libraries they suggested were not compatible with BERT's tokenization.py.
So I modified text_encoder_build_subword.py from the tensor2tensor library, one of the options Google suggested, to generate a wordpiece vocabulary.

Modifications

The original SubwordTextEncoder adds "_" at the end of subwords that appear in the first position of words. I changed it to add "_" at the beginning of subwords that follow other subwords, using the _my_escape_token() function, and later substituted "_" with "##".
The generated vocabulary contains every character and every character prefixed with "##". For example, a and ##a.
Added standard special characters such as !?@~ and the special tokens used by BERT, e.g. [SEP], [CLS], [MASK], [UNK].
Removed irrelevant classes in text_encoder.py, commented out unused functions, some of which seem to exist for decoding, and removed the mlperf_log module to make this project independent of the tensor2tensor library.
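
The marking convention described above can be illustrated with a small sketch. This is not the code in text_encoder.py; the function names (to_wordpiece, mark_continuations, base_vocab) are hypothetical and exist only to show how a leading "_" continuation marker maps to "##", and how each character ends up in the vocabulary both bare and "##"-prefixed.

```python
# Illustrative sketch only; these helpers are hypothetical and are not
# part of this repository or of tensor2tensor.

# Special tokens named in this README.
SPECIAL_TOKENS = ["[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def to_wordpiece(subword):
    """Replace a leading '_' continuation marker with '##'."""
    if subword.startswith("_"):
        return "##" + subword[1:]
    return subword


def mark_continuations(pieces):
    """Prefix every piece after the first one in a word with '##'."""
    return [p if i == 0 else "##" + p for i, p in enumerate(pieces)]


def base_vocab(chars):
    """Special tokens, then each character as 'c' and '##c'."""
    vocab = list(SPECIAL_TOKENS)
    for c in sorted(set(chars)):
        vocab.append(c)
        vocab.append("##" + c)
    return vocab
```

For instance, a word split into the pieces un, believ, able would be stored as un, ##believ, ##able under this convention.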

Requirement

This project was developed in the following environment:

Python 3.6
TensorFlow 1.11

Basic usage

python subword_builder.py \
--corpus_filepattern "{corpus_for_vocab}" \
--output_filename {name_of_vocab} \
--min_count {minimum_subtoken_counts}
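
After running the command, a quick sanity check of the output can be useful. The sketch below assumes the generated file has one token per line, as BERT's vocab.txt does; that format is an assumption about this tool's output, and load_vocab and missing_special_tokens are hypothetical helpers, not functions from this repository.

```python
# Sanity-check sketch; assumes a one-token-per-line vocabulary file.
# Both helpers are hypothetical and not part of this repository.


def load_vocab(path):
    """Map each token to its zero-based line index."""
    vocab = {}
    with open(path, encoding="utf-8") as f:
        for index, line in enumerate(f):
            vocab[line.rstrip("\n")] = index
    return vocab


def missing_special_tokens(vocab, required=("[UNK]", "[CLS]", "[SEP]", "[MASK]")):
    """Return the required special tokens that are absent from vocab."""
    return [token for token in required if token not in vocab]
```

Calling missing_special_tokens(load_vocab(path)) with the path passed to --output_filename should return an empty list if the special tokens were written as expected.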

kwonmha/bert-vocab-builder

Folders and files

Latest commit

History

Repository files navigation

Vocabulary builder for BERT

Modifications

Requirement

Basic usage

About

Topics

Resources

Stars

Watchers

Forks

Languages