
SAPar

This repository contains the implementation of Improving Constituency Parsing with Span Attention, published at Findings of EMNLP 2020.

Please contact us at yhtian@uw.edu if you have any questions.

Visit our homepage to find more of our recent research and software for NLP (e.g., pre-trained LMs, POS tagging, NER, sentiment analysis, relation extraction, datasets, etc.).

Upgrades of SAPar

We are continuously improving SAPar. For updates, please visit HERE.

Citation

If you use or extend our work, please cite our paper at Findings of EMNLP 2020:

@inproceedings{tian-etal-2020-improving,
    title = "Improving Constituency Parsing with Span Attention",
    author = "Tian, Yuanhe and Song, Yan and Xia, Fei and Zhang, Tong",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    pages = "1691--1703",
}

Prerequisites

  • Python 3.6
  • PyTorch 1.1

Install python dependencies by running:

pip install -r requirements.txt

EVALB and EVALB_SPMRL contain the code to evaluate parsing results for English and for other languages, respectively. Before running evaluation, you need to go to the EVALB (for English) or EVALB_SPMRL (for other languages) directory and run make, as sketched below.
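For example, to compile both evaluators from the repository root (a minimal sketch; the directory names follow the description above):

cd EVALB && make && cd ..
cd EVALB_SPMRL && make && cd ..

Once compiled, the English evaluator can be invoked directly, e.g., ./EVALB/evalb -p ./EVALB/COLLINS.prm gold_trees.txt predicted_trees.txt, where COLLINS.prm is the parameter file shipped with the standard EVALB distribution and the two file names are placeholders for your gold and predicted trees.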

Downloading BERT, ZEN, XLNet, and Our Pre-trained Models

In our paper, we use BERT, ZEN, and XLNet as the encoder.

For BERT, please download the pre-trained BERT models from Google and convert them from the TensorFlow version to the PyTorch version (see the conversion sketch after the following list).

  • For Arabic, we use BERT-Base, Multilingual Cased;
  • For Chinese, we use BERT-Base, Chinese;
  • For English, we use BERT-Large, Cased and BERT-Large, Uncased.
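As a hedged sketch, the conversion can be done with the Hugging Face transformers-cli convert utility (the archive paths below are placeholders for whichever Google checkpoint you downloaded; older pytorch_pretrained_bert releases ship an equivalent conversion script):

transformers-cli convert --model_type bert \
    --tf_checkpoint ./uncased_L-24_H-1024_A-16/bert_model.ckpt \
    --config ./uncased_L-24_H-1024_A-16/bert_config.json \
    --pytorch_dump_output ./uncased_L-24_H-1024_A-16/pytorch_model.bin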

For ZEN, you can download the pre-trained model from here.

For XLNet, you can download the pre-trained model from here.

For our pre-trained models, you can download them from Baidu Wangpan (passcode: 2o1n) or Google Drive.

Run on Sample Data

To train a model on a small dataset, run:

./run.sh

Datasets

We use datasets in three languages: Arabic, Chinese, and English.

To preprocess the data, please go to the data_processing directory and follow the instructions there to process the data. You need to obtain the official datasets yourself before running our code.

Ideally, all data will appear in the ./data directory. The data with gold POS tags are located in folders named after the corresponding dataset (i.e., ATB, CTB, and PTB); the data with predicted POS tags are located in folders whose names carry a "_POS" suffix (i.e., ATB_POS, CTB_POS, and PTB_POS), as in the layout sketch below.
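For reference, the resulting layout should look like this (directory names as described above):

./data
├── ATB          # Arabic, gold POS tags
├── ATB_POS      # Arabic, predicted POS tags
├── CTB          # Chinese, gold POS tags
├── CTB_POS      # Chinese, predicted POS tags
├── PTB          # English, gold POS tags
└── PTB_POS      # English, predicted POS tags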

Training, Testing, and Predicting

You can find the commands to train and test models on a specific dataset in run.sh.

To-do List

  • Regular maintenance.

You can leave comments in the Issues section if you would like us to implement any additional features.

You can check our updates at updates.md.