
SAPar

This repository contains the implementation of Improving Constituency Parsing with Span Attention, published at Findings of EMNLP 2020.

Please contact us at yhtian@uw.edu if you have any questions.

Visit our homepage to find more of our recent research and software for NLP (e.g., pre-trained LMs, POS tagging, NER, sentiment analysis, relation extraction, datasets, etc.).

Upgrades of SAPar

We are continuously improving SAPar. For updates, please visit HERE.

Citation

If you use or extend our work, please cite our paper at Findings of EMNLP 2020:

@inproceedings{tian-etal-2020-improving,
    title = "Improving Constituency Parsing with Span Attention",
    author = "Tian, Yuanhe and Song, Yan and Xia, Fei and Zhang, Tong",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    pages = "1691--1703",
}

Prerequisites

  • Python 3.6
  • PyTorch 1.1

Install python dependencies by running:

pip install -r requirements.txt

EVALB and EVALB_SPMRL contain the code to evaluate parsing results for English and for other languages, respectively. Before running evaluation, you need to go to the EVALB (for English) or EVALB_SPMRL (for other languages) directory and run make, as sketched below.
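For example, to compile both evaluators from the repository root (a minimal sketch; the directory names follow the description above):

cd EVALB && make && cd ..
cd EVALB_SPMRL && make && cd ..

Once compiled, the English evaluator can be invoked directly, e.g., ./EVALB/evalb -p ./EVALB/COLLINS.prm gold_trees.txt predicted_trees.txt, where COLLINS.prm is the parameter file shipped with the standard EVALB distribution and the two file names are placeholders for your gold and predicted trees.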

Downloading BERT, ZEN, XLNet, and Our Pre-trained Models

In our paper, we use BERT, ZEN, and XLNet as the encoder.

For BERT, please download the pre-trained BERT models from Google and convert them from the TensorFlow version to the PyTorch version (see the conversion sketch after the following list).

  • For Arabic, we use BERT-Base, Multilingual Cased;
  • For Chinese, we use BERT-Base, Chinese;
  • For English, we use BERT-Large, Cased and BERT-Large, Uncased.
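As a hedged sketch, the conversion can be done with the Hugging Face transformers-cli convert utility (the archive paths below are placeholders for whichever Google checkpoint you downloaded; older pytorch_pretrained_bert releases ship an equivalent conversion script):

transformers-cli convert --model_type bert \
    --tf_checkpoint ./uncased_L-24_H-1024_A-16/bert_model.ckpt \
    --config ./uncased_L-24_H-1024_A-16/bert_config.json \
    --pytorch_dump_output ./uncased_L-24_H-1024_A-16/pytorch_model.bin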

For ZEN, you can download the pre-trained model from here.

For XLNet, you can download the pre-trained model from here.

For our pre-trained models, you can download them from Baidu Wangpan (passcode: 2o1n) or Google Drive.

Run on Sample Data

To train a model on a small dataset, run:

./run.sh

Datasets

We use datasets in three languages: Arabic, Chinese, and English.

To preprocess the data, please go to the data_processing directory and follow the instructions there to process the data. You need to obtain the official datasets yourself before running our code.

Ideally, all data will appear in the ./data directory. The data with gold POS tags are located in folders named after the corresponding dataset (i.e., ATB, CTB, and PTB); the data with predicted POS tags are located in folders whose names carry a "_POS" suffix (i.e., ATB_POS, CTB_POS, and PTB_POS), as in the layout sketch below.
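For reference, the resulting layout should look like this (directory names as described above):

./data
├── ATB          # Arabic, gold POS tags
├── ATB_POS      # Arabic, predicted POS tags
├── CTB          # Chinese, gold POS tags
├── CTB_POS      # Chinese, predicted POS tags
├── PTB          # English, gold POS tags
└── PTB_POS      # English, predicted POS tags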

Training, Testing, and Predicting

You can find the commands to train and test models on a specific dataset in run.sh.

To-do List

  • Regular maintenance.

You can leave comments in the Issues section if you would like us to implement any additional features.

You can check our updates at updates.md.