
PromptEM: Prompt-tuning for Low-resource Generalized Entity Matching

PromptEM is a novel low-resource GEM (Generalized Entity Matching) solution powered by prompt-tuning and self-training. To bridge the gap between pre-training and fine-tuning, we cast GEM as a cloze-style task and design GEM-specific prompt-tuning, which stimulates the rich knowledge stored in language models. To select high-quality pseudo-labels, we develop a lightweight uncertainty-aware self-training method that boosts performance. To further cut the cost of self-training, we dynamically prune useless training data using the proposed MC-EL2N, making the self-training process more lightweight and efficient.
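To make the cloze-style formulation concrete, here is a minimal, hypothetical sketch (not the repository's actual code) of scoring an entity pair by filling a masked slot with label words. The template wording and the "yes"/"no" label words are illustrative assumptions; PromptEM's real templates are selected via --template_no.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

def match_probability(entity_a: str, entity_b: str) -> float:
    # Serialize the pair into a cloze template with a masked slot.
    prompt = f"{entity_a} and {entity_b} are the same entity ? {tokenizer.mask_token} ."
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    # Assumed label words; the actual label words are defined in the repo.
    yes_id = tokenizer(" yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" no", add_special_tokens=False).input_ids[0]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()

The key point is that no new classification head is trained: the masked-LM head itself acts as the matcher, which is what narrows the pre-training/fine-tuning gap.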

For more technical details, see PromptEM: Prompt-tuning for Low-resource Generalized Entity Matching.

Figure: an illustration of fine-tuning vs. prompt-tuning.

Datasets

We use eight real-world benchmark datasets with different structures from Machamp and Geo-ER.

Quick Start

To train and evaluate with PromptEM:

python main.py [<args>] [-h | --help]

For example:

python main.py -d=rel-heter -k=0.1 -st -dd=8 -ur=0.05 -er=0.05

The meaning of the flags:

  • --model_name_or_path: the name or local path of the pre-trained language model. e.g. roberta-base
  • --data_name: the name of the dataset. options: [rel-heter, rel-text, semi-heter, semi-homo, semi-rel, semi-text-c, semi-text-w, geo-heter, all]
  • --k: the proportion of training data used. e.g. 0.1
  • --num_iter: the number of iterations. e.g. 1
  • --template_no: the number of templates used in PromptEM. options: [0,1,2,3]
  • --self_training: the flag to enable self-training of PromptEM.
  • --dynamic_dataset: the frequency of dynamic dataset pruning. e.g. 8 (pruning every 8 epochs)
  • --pseudo_label_method: the method of generating pseudo labels. e.g. uncertainty (see the sketch after this list)
  • --mc_dropout_pass: the number of MC-Dropout passes. e.g. 10
  • --uncertainty_ratio: the proportion of the unlabeled samples for generating pseudo labels. e.g. 0.05
  • --el2n_ratio: the proportion of the labeled samples for dynamic dataset pruning. e.g. 0.1
  • --text_summarize: the flag to enable text summarization in entity serialization.
  • --add_token: the flag to add special tokens in entity serialization.
  • --max_length: the maximum length (in number of tokens) for the inputs to the transformer model. e.g. 512
  • --teacher_epochs: the number of epochs for training the teacher model. e.g. 20
  • --student_epochs: the number of epochs for training the student model. e.g. 30
  • --batch_size: batch size. e.g. 32
  • --lr: learning rate. e.g. 1e-5
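To illustrate how the uncertainty-based pseudo-labeling flags interact, here is a minimal sketch of MC-Dropout pseudo-label selection. The function names and the variance-based uncertainty measure are illustrative assumptions, not PromptEM's exact formulation; see the paper for the latter.

import torch

def mc_dropout_uncertainty(model, batch, n_passes=10):
    # n_passes corresponds to --mc_dropout_pass.
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack([
            torch.softmax(model(**batch).logits, dim=-1) for _ in range(n_passes)
        ])                                  # (n_passes, batch, 2)
    mean_probs = probs.mean(dim=0)          # averaged match probabilities
    uncertainty = probs.var(dim=0).sum(-1)  # simple variance-based uncertainty
    return mean_probs, uncertainty

def select_pseudo_labels(mean_probs, uncertainty, ratio=0.05):
    # Keep the `ratio` fraction (--uncertainty_ratio) of unlabeled samples
    # the model is most certain about, and use their predictions as labels.
    k = max(1, int(ratio * len(uncertainty)))
    idx = torch.argsort(uncertainty)[:k]
    return idx, mean_probs[idx].argmax(dim=-1)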

Reproduction

All the experiments are conducted on an Ubuntu Server with an Intel Xeon Silver 4216 CPU and an NVIDIA A100 GPU.

Initialize the environment

conda create -n promptem python=3.7
conda activate promptem
pip install torch==1.10.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install transformers==4.16.2
pip install scikit-learn==1.0.2

Note that you do not need to install OpenPrompt manually via pip.

Note that the best hyper-parameters can be sensitive to your server environment and package versions. If your environment differs, we highly recommend running the hyper-parameter search in your own environment.

We provide an example search script in search.sh.
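If you prefer driving the search from Python instead of the shell script, a grid search along these lines is one option. The searched values below are illustrative examples only, not the grid used by search.sh or the paper.

import itertools, subprocess

# Sweep a few hyper-parameters by re-invoking main.py (example values only).
for lr, bs, ur in itertools.product([1e-5, 2e-5], [16, 32], [0.02, 0.05]):
    subprocess.run([
        "python", "main.py",
        "-d=rel-heter", "-k=0.1", "-st",
        f"--lr={lr}", f"--batch_size={bs}", f"--uncertainty_ratio={ur}",
    ], check=True)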

Download the PLM [Optional]

We use RoBERTa-base as the backbone structure of our model in all the experiments.

You can download the pre-trained checkpoint from Hugging Face manually.

Reproduce the results

You can train the model using the best hyper-parameters we provided in low_configs.json.
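As a convenience, a run could also be launched programmatically from the config file. This is a minimal sketch that assumes low_configs.json maps each dataset name to flag/value pairs; the actual file layout may differ, so check the file before relying on this.

import json, subprocess

# Assumes low_configs.json is keyed by dataset name (verify against the file).
with open("low_configs.json") as f:
    configs = json.load(f)

dataset = "rel-heter"
args = [f"--{flag}={value}" for flag, value in configs[dataset].items()]
subprocess.run(["python", "main.py", f"--data_name={dataset}", *args], check=True)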

We also provide the corresponding logs in logs.

See Quick Start for more details of training parameters.

About

Code for the paper "PromptEM: Prompt-tuning for Low-resource Generalized Entity Matching". VLDB 2023.
