Span-Based Constituency Parser

This is an implementation of the span-based natural language constituency parser described in the paper "Span-Based Constituency Parsing with a Structure-Label System and Provably Optimal Dynamic Oracles" (EMNLP 2016).

Required Dependencies

  • Python 2.7
  • NumPy
  • DyNet (to ensure compatibility, check out and compile commit 71fc893eda8e3f3fccc77b9a4ae942dce77ba368; see the example commands below)
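
If you are building DyNet from source, pinning it to that commit might look like the following (the repository URL is the standard DyNet GitHub location; build steps such as CMake configuration and installing the Python bindings are omitted and should follow DyNet's own instructions):

git clone https://github.com/clab/dynet.git
cd dynet
git checkout 71fc893eda8e3f3fccc77b9a4ae942dce77ba368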

Vocabulary Files

Vocabulary may be loaded from the training tree file every time, or it may be stored (separately from a trained model) in a JSON file, which is much faster and is the recommended approach. To learn the vocabulary from a file of training trees and write it to a JSON file, use a command such as the following:

python src/main.py --train data/02-21.10way.clean --write-vocab data/vocab.json

Training

Training requires a file containing training trees (--train) and a file containing validation trees (--dev); the validation trees are parsed four times per training epoch to determine which model to keep. A file name must also be provided for the saved model (--model). The following is an example of a command to train a model with all of the default settings:

python src/main.py --train data/02-21.10way.clean --dev data/22.auto.clean --vocab data/vocab.json --model data/my_model

The following table provides an overview of additional training options:

| Argument | Description | Default |
| --- | --- | --- |
| --dynet-mem | Memory (MB) to allocate for DyNet | 2000 |
| --dynet-l2 | L2 regularization factor | 0 |
| --dynet-seed | Seed for random parameter initialization | random |
| --word-dims | Word embedding dimensions | 50 |
| --tag-dims | POS embedding dimensions | 20 |
| --lstm-units | LSTM units (per direction, for each of 2 layers) | 200 |
| --hidden-units | Units for ReLU FC layer (each of 2 action types) | 200 |
| --epochs | Number of training epochs | 10 |
| --batch-size | Number of sentences per training update | 10 |
| --droprate | Dropout probability | 0.5 |
| --unk-param | Parameter z for random UNKing | 0.8375 |
| --alpha | Softmax weight for exploration | 1.0 |
| --beta | Oracle action override probability | 0.0 |
| --np-seed | Seed for shuffling and softmax sampling | random |
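
As an example, a training run that overrides some of these defaults might look like the following (the specific values shown are illustrative, not recommendations):

python src/main.py --dynet-mem 4000 --dynet-seed 123 --train data/02-21.10way.clean --dev data/22.auto.clean --vocab data/vocab.json --model data/my_model --epochs 20 --droprate 0.5 --beta 0.1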

Test Evaluation

A trained model can also be evaluated directly against a reference corpus by supplying the --test argument:

python src/main.py --test data/23.auto.clean --vocab data/vocab.json --model data/my_model

Citation

If you use this software for research, we would appreciate a citation to our paper:

@inproceedings{cross2016span,
  title={Span-Based Constituency Parsing with a Structure-Label System and Provably Optimal Dynamic Oracles},
  author={Cross, James and Huang, Liang},
  booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
  year={2016}
}
