
my-pytorch-bert

This repository is a BERT implementation in PyTorch.

This implementation is based on Google BERT and pytorch-pretrained-BERT, with the SentencePiece tokenizer from bert-japanese added.

How to convert a TensorFlow model to this implementation's format

python load_tf_bert.py \
    --config_path=multi_cased_L-12_H-768_A-12/bert_config.json \
    --tfmodel_path=multi_cased_L-12_H-768_A-12/model.ckpt-1400000 \
    --output_path=pretrain/multi_cased_L-12_H-768_A-12.pt

Example config JSON file:

{
	"vocab_size": 32000,
	"hidden_size": 768,
	"num_hidden_layers": 12,
	"num_attention_heads": 12,
	"intermediate_size": 3072,
	"attention_probs_dropout_prob": 0.1,
	"hidden_dropout_prob": 0.1,
	"max_position_embeddings": 512,
	"type_vocab_size": 2,
	"initializer_range": 0.02
}
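
For reference, the conversion conceptually reads each variable out of the TensorFlow checkpoint and stores it under the corresponding PyTorch parameter name. The sketch below is illustrative only: the real name mapping lives in load_tf_bert.py, and the simple renaming rule shown here is a placeholder, not the repository's actual logic.

# Illustrative sketch of a TF checkpoint -> PyTorch state_dict conversion.
# The rename rule below is a placeholder; see load_tf_bert.py for the real mapping.
import tensorflow as tf
import torch

reader = tf.train.load_checkpoint("multi_cased_L-12_H-768_A-12/model.ckpt-1400000")
state_dict = {}
for tf_name in reader.get_variable_to_shape_map():
    tensor = torch.from_numpy(reader.get_tensor(tf_name))
    if tf_name.endswith("kernel"):
        tensor = tensor.t()  # TF stores dense kernels transposed relative to PyTorch
    pt_name = tf_name.replace("/", ".").replace("kernel", "weight")  # placeholder rename
    state_dict[pt_name] = tensor

torch.save(state_dict, "pretrain/multi_cased_L-12_H-768_A-12.pt")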

How to train the classifier

python run_classifier.py \
 --config_path=config/bert_base.json  \
 --train_dataset_path=/content/drive/My\ Drive/data/sample_train.tsv \
 --pretrain_path=/content/drive/My\ Drive/pretrain/bert.pt \
 --vocab_path=/content/drive/My\ Drive/data/sample.vocab \
 --sp_model_path=/content/drive/My\ Drive/data/sample.model \
 --save_dir=classifier/  \
 --batch_size=4  \
 --max_pos=512  \
 --lr=2e-5  \
 --warmup_steps=0.1  \
 --epoch=10  \
 --per_save_epoch=1 \
 --mode=train \
 --label_num=9
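
The layout of the train/eval TSV files is not documented here. As a hedged example, assuming a simple two-column "label<TAB>text" format (an assumption; check the repository's dataset loader), a labeled dataset could be split into train and eval files like this:

# Hedged helper for preparing train/eval TSV splits.
# ASSUMPTION: each row is "label<TAB>text"; data/sample.tsv is a hypothetical input file.
import csv
import random

with open("data/sample.tsv", encoding="utf-8") as f:
    rows = list(csv.reader(f, delimiter="\t"))

random.seed(0)
random.shuffle(rows)
split = int(len(rows) * 0.9)  # 90% train / 10% eval

for path, subset in [("data/sample_train.tsv", rows[:split]),
                     ("data/sample_eval.tsv", rows[split:])]:
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(subset)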

How to evaluate the classifier

python run_classifier.py \
 --config_path=config/bert_base.json \
 --eval_dataset_path=/content/drive/My\ Drive/data/sample_eval.tsv \
 --model_path=/content/drive/My\ Drive/classifier/classifier.pt \
 --vocab_path=/content/drive/My\ Drive/data/sample.vocab \
 --sp_model_path=/content/drive/My\ Drive/data/sample.model \
 --max_pos=512 \
 --mode=eval \
 --label_num=9
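
The evaluation results reported further down are in scikit-learn's classification_report format. A minimal standalone sketch of how such a report is produced from true and predicted labels (dummy labels here, not this repository's code):

# Minimal example of the report format used in the results section below.
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]
print(classification_report(y_true, y_pred, digits=2))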

How to train SentencePiece

python train-sentencepiece.py --config_path=json-file

Example JSON file:

{
    "text_dir" : "tests/",
    "prefix" : "tests/sample_text",
    "vocab_size" : 100,
    "ctl_symbols" : "[PAD],[CLS],[SEP],[MASK]"
}
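
The config above maps fairly directly onto the sentencepiece library's trainer. A minimal sketch of the equivalent call (argument names are those of the sentencepiece package; train-sentencepiece.py may wire them up differently, and the *.txt glob is an assumption about how files are collected from text_dir):

# Sketch of training SentencePiece directly with the settings from the JSON above.
import glob
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    input=",".join(glob.glob("tests/*.txt")),     # text_dir (file pattern is an assumption)
    model_prefix="tests/sample_text",             # prefix -> sample_text.model / sample_text.vocab
    vocab_size=100,
    control_symbols="[PAD],[CLS],[SEP],[MASK]",   # ctl_symbols
)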

How to pre-train

python run_pretrain.py \
 --config_path=config/bert_base.json \
 --dataset_path=/content/drive/My\ Drive/data/sample.txt \
 --vocab_path=/content/drive/My\ Drive/data/sample.vocab \
 --sp_model_path=/content/drive/My\ Drive/data/sample.model \
 --save_dir=pretrain/ \
 --batch_size=4 \
 --max_pos=256 \
 --lr=5e-5 \
 --warmup_steps=0.1 \
 --epoch=20 \
 --per_save_epoch=4 \
 --mode=train
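
After pre-training (or conversion), the saved .pt file can be inspected with plain torch.load. This assumes the file holds a flat state dict of tensors, which may not match exactly how run_pretrain.py serializes its checkpoints.

# Hedged sketch: peek at a saved checkpoint. Assumes a plain state_dict;
# adjust if the repo wraps it in a larger dictionary (e.g. with optimizer state).
import torch

state = torch.load("pretrain/multi_cased_L-12_H-768_A-12.pt", map_location="cpu")
for name, tensor in list(state.items())[:10]:
    print(f"{name}: {tuple(tensor.shape)}")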

Use FP16 (Pascal CUDA)

git clone https://github.com/NVIDIA/apex.git
cd apex
python setup.py install --cuda_ext --cpp_ext

Then add the '--fp16' option to the training command.

Tested only on Google Colaboratory (GPU runtime).
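
For reference, NVIDIA Apex mixed precision is typically wired up as below; whether run_classifier.py / run_pretrain.py use exactly this O1 pattern behind '--fp16' is an assumption.

# Hedged sketch of the usual Apex AMP pattern behind an --fp16 flag.
# The Linear layer is a stand-in for the BERT model, not this repository's code.
import torch
from apex import amp

model = torch.nn.Linear(10, 2).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(4, 10, device="cuda")
loss = model(x).sum()
with amp.scale_loss(loss, optimizer) as scaled_loss:   # loss scaling avoids FP16 underflow
    scaled_loss.backward()
optimizer.step()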

Selecting the tokenizer to use

python run_classifier.py \
 --config_path=config/bert_base.json  \
 --train_dataset_path=/content/drive/My\ Drive/data/sample_train.tsv \
 --pretrain_path=/content/drive/My\ Drive/pretrain/bert.pt \
 --vocab_path=/content/drive/My\ Drive/data/sample.vocab \
 --save_dir=classifier/  \
 --batch_size=4  \
 --max_pos=512  \
 --lr=2e-5  \
 --warmup_steps=0.1  \
 --epoch=10  \
 --per_save_epoch=1 \
 --mode=train \
 --label_num=9 \
 --tokenizer=mecab

The '--tokenizer' option takes effect only when the '--sp_model_path' option is not specified.

tokenizer : mecab | juman | any other string (google-bert basic tokenizer)

Use MeCab

apt-get -q -y install sudo file mecab libmecab-dev mecab-ipadic-utf8 git curl python-mecab 
git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git 
echo yes | mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n 
pip install mecab-python3
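
A quick way to confirm the MeCab binding works (this snippet is not part of the repository's scripts):

# Sanity check for mecab-python3: tokenize a sentence in wakati (space-separated) mode.
import MeCab

tagger = MeCab.Tagger("-Owakati")
print(tagger.parse("日本語の文章を分かち書きします。").strip())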

Use Juman++

wget https://github.com/ku-nlp/jumanpp/releases/download/v2.0.0-rc2/jumanpp-2.0.0-rc2.tar.xz
tar xfv jumanpp-2.0.0-rc2.tar.xz  
cd jumanpp-2.0.0-rc2
mkdir bld
cd bld
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr/local # where to install Juman++
make install -j4 
pip install pyknp
pip install mojimoji
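
Similarly, a quick check that Juman++ is reachable through pyknp (again, not part of the repository's scripts):

# Sanity check for pyknp with the Juman++ binary installed above.
from pyknp import Juman

juman = Juman(jumanpp=True)                       # call the jumanpp executable
result = juman.analysis("日本語の文章を解析します。")
print(" ".join(m.midasi for m in result.mrph_list()))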

Classification results of my-pytorch-bert

  1. Pretrained BERT model and trained SentencePiece model (model converted).
              precision    recall  f1-score   support

           0       0.99      0.92      0.95       178
           1       0.95      0.97      0.96       172
           2       0.99      0.97      0.98       176
           3       0.95      0.92      0.93        95
           4       0.98      0.99      0.98       158
           5       0.92      0.98      0.95       174
           6       0.97      1.00      0.98       167
           7       0.98      0.99      0.99       190
           8       0.99      0.96      0.97       163

   micro avg       0.97      0.97      0.97      1473
   macro avg       0.97      0.97      0.97      1473
weighted avg       0.97      0.97      0.97      1473

  2. Pretrained Japanese BERT model (model converted).
              precision    recall  f1-score   support

           0       0.98      0.92      0.95       178
           1       0.92      0.94      0.93       172
           2       0.98      0.96      0.97       176
           3       0.93      0.83      0.88        95
           4       0.97      0.99      0.98       158
           5       0.91      0.97      0.94       174
           6       0.95      0.98      0.96       167
           7       0.97      0.99      0.98       190
           8       0.97      0.96      0.96       163

   micro avg       0.95      0.95      0.95      1473
   macro avg       0.95      0.95      0.95      1473
weighted avg       0.95      0.95      0.95      1473

Acknowledgments

This project incorporates code from the following repositories: google BERT, pytorch-pretrained-BERT, and bert-japanese.

This project also incorporates dictionary resources from external repositories (e.g. mecab-ipadic-neologd).
