CokeBERT: Contextual Knowledge Selection and Embedding towards Enhanced Pre-Trained Language Models
- Accepted to Findings of EMNLP 2020.
- Accepted to AI Open 2021.
- CokeBert-1.0 provides the original code and the details needed to reproduce the results in the paper.
- CokeBert-2.0-latest refactors CokeBert-1.0 into a more user-friendly codebase. This README mainly demonstrates the usage of `CokeBert-2.0-latest`.
Requirements:

- python==3.8

Please install all required packages by running:

```bash
bash requirements.sh
```
If you want to use our pre-trained Coke models directly, you can skip this section and go straight to the fine-tuning part.
Go to `CokeBert-2.0-latest`:

```bash
cd CokeBert-2.0-latest
```
Please follow the ERNIE pipeline to pre-process your pre-training data. Note that you need to choose a backbone model and use its corresponding tokenizer to process the data; the Coke framework currently supports two families of models (BERT and RoBERTa). You will then obtain `merge.bin` and `merge.idx`; move them to the following directories.
```bash
# BACKBONE can be `bert-base-uncased`, `roberta-base`, `bert-large-uncased`, `roberta-large`
export BACKBONE=bert-base-uncased
export HOP=2

mkdir -p data/pretrain/$BACKBONE
mv merge.bin data/pretrain/$BACKBONE
mv merge.idx data/pretrain/$BACKBONE
```
Download the backbone model checkpoints from Hugging Face and move them to the corresponding checkpoint folder for pre-training. Note that you should not download `config.json`, since we create a new config for Coke.
```bash
# BACKBONE can be `bert-base-uncased`, `roberta-base`, `bert-large-uncased`, `roberta-large`
BACKBONE=bert-base-uncased

# `wget -O` writes the files directly into the checkpoint folder,
# so no extra `mv` is needed afterwards.
mkdir -p checkpoint/coke-$BACKBONE
wget https://huggingface.co/$BACKBONE/resolve/main/vocab.txt -O checkpoint/coke-$BACKBONE/vocab.txt
wget https://huggingface.co/$BACKBONE/resolve/main/pytorch_model.bin -O checkpoint/coke-$BACKBONE/pytorch_model.bin
```
Download the knowledge embeddings (including the entity-to-id and relation-to-id mappings) and the knowledge graph neighbor information from here1 or here2. Move them to the `data/pretrain` folder and unpack them:
```bash
cd data/pretrain
tar zxvf kg_embed.tar.gz
rm -rf kg_embed.tar.gz
tar zxvf kg_neighbor.tar.gz
rm -rf kg_neighbor.tar.gz
cd ../..
```
(Optional) If you want to generate the knowledge graph neighbor data yourself, run the following commands to produce new `kg_neighbor` data (a minimal sketch of the underlying idea follows the commands).
```bash
cd data/pretrain
python preprocess_n.py
```
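For intuition, the `kg_neighbor` data is essentially a lookup from each entity to its immediate (relation, neighbor) pairs in the knowledge graph. The following is only a minimal sketch of that idea, not the actual `preprocess_n.py`; the file names `triple2id.txt` and `kg_neighbor.pkl` and the exact triple format are assumptions for illustration.

```python
# Illustrative sketch only -- preprocess_n.py is the authoritative script.
# Assumed input: a whitespace-separated file of (head, relation, tail) id
# triples; the file names and the output format are hypothetical.
import pickle
from collections import defaultdict

neighbors = defaultdict(list)  # entity id -> list of (relation id, neighbor entity id)

with open("triple2id.txt") as f:                   # hypothetical input name
    for line in f:
        head, relation, tail = map(int, line.split())
        neighbors[head].append((relation, tail))
        neighbors[tail].append((relation, head))   # also index the reverse direction

with open("kg_neighbor.pkl", "wb") as f:           # hypothetical output name
    pickle.dump(dict(neighbors), f)
```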
Go to `example` and run `run_pretrain.sh`:

```bash
cd example
bash run_pretrain.sh
```
You can set `BACKBONE` (the backbone model) and `HOP` (the number of hops) in `run_pretrain.sh`:
```bash
# BACKBONE can be `bert-base-uncased`, `roberta-base`, `bert-large-uncased`, `roberta-large`
export BACKBONE=bert-base-uncased
export HOP=2
export PYTHONPATH=../src:$PYTHONPATH

rm outputs/pretrain_coke-$BACKBONE-$HOP/*

python run_pretrain.py \
    --output_dir outputs \
    --data_dir ../data/pretrain \
    --backbone $BACKBONE \
    --neighbor_hop $HOP \
    --do_train \
    --max_seq_length 256 \
    --K_V_dim 100 \
    --Q_dim 768 \
    --train_batch_size 32 \
    --self_att
```
It will write logs and checkpoints to `./outputs`. Check `CokeBert-2.0-latest/src/coke/training_args.py` for more arguments.
Download the fine-tuning datasets and the corresponding annotations from here1 or here2. Then unzip and save them to the corresponding directory:
```bash
cd CokeBert-2.0-latest/data
wget "https://cloud.tsinghua.edu.cn/f/3036fa28168c4fb7a320/?dl=1" -O data.zip
tar -xvf data.zip finetune
```
(Option 1: Load from Hugging Face) You can load the pre-trained Coke checkpoints from here and fine-tune them directly in Python. For example, the following code demonstrates how to load a 2-hop Coke `bert-base` model.
```python
from coke import CokeBertModel

model = CokeBertModel.from_pretrained('yushengsu/coke-bert-base-uncased-2hop')
# You can use this model to start fine-tuning.
```
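As a quick sanity check after loading, you can pair the model with the matching backbone tokenizer. The tokenizer call below is standard Hugging Face usage; the forward signature of `CokeBertModel` also involves knowledge (entity) inputs and is not reproduced here, so the forward call is left as a commented assumption.

```python
# Sketch under assumptions: the tokenizer side is standard Hugging Face,
# but CokeBertModel's forward pass also expects knowledge (entity) inputs,
# so the commented call below is only indicative -- see
# CokeBert-2.0-latest/src/coke for the exact argument list.
from transformers import AutoTokenizer

from coke import CokeBertModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')  # matching backbone
model = CokeBertModel.from_pretrained('yushengsu/coke-bert-base-uncased-2hop')

inputs = tokenizer("Steve Jobs founded Apple.", return_tensors='pt')
# outputs = model(**inputs, ...)  # assumption: entity ids/embeddings also required
```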
(Option 2: Load locally) You can also download the pre-trained Coke checkpoints from here and run the following script to fine-tune. Note that you need to move the pre-trained Coke model checkpoint `pytorch_model.bin` to the corresponding directory, such as `DKPLM/data/DKPLM_BERTbase_2layer` for the 2-hop `bert-base-uncased` model and `DKPLM/data/DKPLM_RoBERTabase_2layer` for the 2-hop `roberta-base` model.
```bash
# BACKBONE can be `bert-base-uncased`, `roberta-base`, etc.
# HOP can be 1 or 2
mv outputs/pretrain_coke-$BACKBONE-$HOP/pytorch_model.bin ../checkpoint/coke-$BACKBONE/pytorch_model.bin
```
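If you would rather point `from_pretrained` at the local checkpoint instead of the Hugging Face hub, the standard Hugging Face pattern of passing a directory path may apply; whether `CokeBertModel` supports this exactly is an assumption, so verify against the `coke` source.

```python
# Assumption: from_pretrained accepts a local checkpoint directory in the
# usual Hugging Face style; verify against CokeBert-2.0-latest/src/coke.
from coke import CokeBertModel

model = CokeBertModel.from_pretrained('checkpoint/coke-bert-base-uncased')
```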
Then you can start fine-tuning by running the following commands (refer to `CokeBert-2.0-latest/example/run_finetune.sh`).
```bash
cd CokeBert-2.0-latest
bash example/run_finetune.sh
```
The content of `run_finetune.sh`:
```bash
# BACKBONE can be `bert-base-uncased`, `roberta-base`, `bert-large-uncased`, `roberta-large`
export BACKBONE=bert-base-uncased
export HOP=2
export PYTHONPATH=../src:$PYTHONPATH

# DATASET can be `FIGER`, `OpenEntity`, `fewrel`, `tacred`
DATASET=OpenEntity

python3 run_finetune.py \
    --output_dir outputs \
    --do_train \
    --do_lower_case \
    --data_dir ../data/finetune/$DATASET/ \
    --backbone $BACKBONE \
    --neighbor_hop $HOP \
    --max_seq_length 256 \
    --train_batch_size 64 \
    --learning_rate 2e-5 \
    --num_train_epochs 16 \
    --loss_scale 128 \
    --K_V_dim 100 \
    --Q_dim 768 \
    --self_att
```
Please cite our paper if you use CokeBert in your work:
```bibtex
@article{SU2021,
    title   = {CokeBERT: Contextual Knowledge Selection and Embedding towards Enhanced Pre-Trained Language Models},
    author  = {Yusheng Su and Xu Han and Zhengyan Zhang and Yankai Lin and Peng Li and Zhiyuan Liu and Jie Zhou and Maosong Sun},
    journal = {AI Open},
    year    = {2021},
    issn    = {2666-6510},
    doi     = {10.1016/j.aiopen.2021.06.004},
    url     = {https://arxiv.org/abs/2009.13964},
}
```