
Parsing as Tagging

This repository contains code for training and evaluating the parsers from two papers:
  • On Parsing as Tagging
  • Hexatagging: Projective Dependency Parsing as Tagging

Setting Up The Environment

Set up a virtual environment and install the dependencies:

pip install -r requirements.txt
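
For example, using Python's built-in venv module (the environment name .venv is only illustrative), create and activate the environment before installing the dependencies:

python3 -m venv .venv
source .venv/bin/activate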

Getting The Data

Constituency Parsing

Follow the instructions in this repo to do the initial preprocessing of the English WSJ and SPMRL datasets. The default data path is the data/spmrl folder, where each file is named in the [LANGUAGE].[train/dev/test] format.
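
For example, with English WSJ the expected layout is:

data/spmrl/English.train
data/spmrl/English.dev
data/spmrl/English.test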

Dependency Parsing with Hexatagger

  1. Convert CoNLL to Binary Headed Trees:
python data/dep2bht.py

This will generate the phrase-structured BHT trees in the data/bht directory. The processed files are already placed under the data/bht directory.

Building The Tagging Vocab

In order to use the taggers, we need to build the vocabulary of tags for the in-order, pre-order, and post-order linearizations. You can cache these vocabularies using:

python run.py vocab --lang [LANGUAGE] --tagger [TAGGER]

The tagger can be td-sr for the top-down (pre-order) shift-reduce linearization, bu-sr for the bottom-up (post-order) shift-reduce linearization, tetra for the in-order linearization, and hexa for the hexatagging linearization.
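
For example, to cache the hexatagging vocabulary for English (assuming the corresponding data files from the previous section are already in place):

python run.py vocab --lang English --tagger hexa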

Training

Train the model and store the best checkpoint. An illustrative filled-in invocation is shown after the option list below.

python run.py train --batch-size [BATCH_SIZE]  --tagger [TAGGER] --lang [LANGUAGE] --model [MODEL] --epochs [EPOCHS] --lr [LR] --model-path [MODEL_PATH] --output-path [PATH] --max-depth [DEPTH] --keep-per-depth [KPD] [--use-tensorboard]
  • batch size: use 32 to reproduce the results
  • tagger: td-sr, bu-sr, tetra, or hexa
  • lang: language, one of the nine languages reported in the paper
  • model: bert, bert+crf, or bert+lstm
  • model path: path or name of the pretrained model
  • output path: path to save the best trained model
  • max depth: maximum depth to keep in the decoding lattice
  • keep per depth: number of elements to keep track of at each decoding step
  • use-tensorboard: whether to store the logs in TensorBoard (True or False)
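
As an illustration, here is a hypothetical filled-in invocation; the values mirror the PTB hexatagger settings from the Exact Commands section below, and --keep-per-depth is left at its default:

python run.py train --batch-size 32 --tagger hexa --lang English --model bert --epochs 50 --lr 2e-5 --model-path xlnet-large-cased --output-path ./checkpoints/ --max-depth 6 --use-tensorboard True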

Evaluation

Calculate evaluation metrics: F-score, precision, recall, and loss. An example invocation with greedy decoding is shown after the option list below.

python run.py evaluate --lang [LANGUAGE] --model-name [MODEL]  --model-path [MODEL_PATH] --bert-model-path [BERT_PATH] --max-depth [DEPTH] --keep-per-depth [KPD]  [--is-greedy]
  • lang: language, one of the nine languages reported in the paper
  • model name: name of the checkpoint
  • model path: path of the checkpoint
  • bert model path: path or name of the pretrained model
  • max depth: maximum depth to keep in the decoding lattice
  • keep per depth: number of elements to keep track of at each decoding step
  • is greedy: whether to use greedy decoding; defaults to false
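
As an illustration, here is a hypothetical invocation that evaluates the English hexatagger checkpoint from the Exact Commands section below with greedy decoding enabled:

python run.py evaluate --lang English --max-depth 10 --tagger hexa --bert-model-path xlnet-large-cased --model-name English-hexa-bert-3e-05-50 --model-path ./checkpoints/ --is-greedy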

Exact Commands for Hexatagging

The commands above can be used with different taggers, models, and languages. To reproduce our Hexatagging results, the exact commands used for training and evaluating the hexatagger are listed below.

Train

PTB (English)

CUDA_VISIBLE_DEVICES=0 python run.py train --lang English --max-depth 6 --tagger hexa --model bert --epochs 50 --batch-size 32 --lr 2e-5 --model-path xlnet-large-cased --output-path ./checkpoints/ --use-tensorboard True
# model saved at ./checkpoints/English-hexa-bert-3e-05-50

CTB (Chinese)

CUDA_VISIBLE_DEVICES=0 python run.py train --lang Chinese --max-depth 6 --tagger hexa --model bert --epochs 50 --batch-size 32 --lr 2e-5 --model-path hfl/chinese-xlnet-mid --output-path ./checkpoints/ --use-tensorboard True
# model saved at ./checkpoints/Chinese-hexa-bert-2e-05-50

UD

CUDA_VISIBLE_DEVICES=0 python run.py train --lang bg --max-depth 6 --tagger hexa --model bert --epochs 50  --batch-size 32 --lr 2e-5 --model-path bert-base-multilingual-cased --output-path ./checkpoints/ --use-tensorboard True

Evaluate

PTB

python run.py evaluate --lang English --max-depth 10 --tagger hexa --bert-model-path xlnet-large-cased --model-name English-hexa-bert-3e-05-50 --batch-size 64 --model-path ./checkpoints/

CTB

python run.py evaluate --lang Chinese --max-depth 10 --tagger hexa --bert-model-path bert-base-chinese --model-name Chinese-hexa-bert-3e-05-50 --batch-size 64 --model-path ./checkpoints/

UD

python run.py evaluate --lang bg --max-depth 10 --tagger hexa --bert-model-path bert-base-multilingual-cased --model-name bg-hexa-bert-1e-05-50 --batch-size 64 --model-path ./checkpoints/

Predict

PTB

python run.py predict --lang English --max-depth 10 --tagger hexa --bert-model-path xlnet-large-cased --model-name English-hexa-bert-3e-05-50 --batch-size 64 --model-path ./checkpoints/

Citation

If you find this repository useful, please cite our papers:

@inproceedings{amini-cotterell-2022-parsing,
    title = "On Parsing as Tagging",
    author = "Amini, Afra  and
      Cotterell, Ryan",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.607",
    pages = "8884--8900",
}
@inproceedings{amini-etal-2023-hexatagging,
    title = "Hexatagging: Projective Dependency Parsing as Tagging",
    author = "Amini, Afra  and
      Liu, Tianyu  and
      Cotterell, Ryan",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-short.124",
    pages = "1453--1464",
}
