hoolock

hoolock is a PyTorch-based, GPU-friendly StackLSTM implementation that makes it much easier and faster to build a StackLSTM parser. To give you an idea, training on Penn Treebank takes about 30 minutes with batch size 256 on a single GTX 1080 Ti GPU.

Dependencies

You can install all Python dependencies by running pip install -r requirements.txt. The remaining dependencies only need to be downloaded.

Python 3.6
dill
PyTorch v1.0.0: any version after v0.4 should work, but we pin this version for reproducibility
pytorch-gradual-warmup-lr
six
torchtext v0.2.1: later versions may also work, but are not tested
arc-swift: only for data pre- and post-processing
Stanford POS Tagger: optional, just for reproducing the results in the paper

NOTE: Unfortunately, arc-swift seems to assume Python 2, so you may want to set up separate conda/virtualenv environments or use an alternative PTB data preparation script.

Usage

This short tutorial aims to help you reproduce the results in the paper, which are on Penn Treebank (PTB), but the procedure for training and testing on other data should be similar.

Preprocessing

Follow the instructions in arc-swift to run the initial preprocessing on Penn Treebank. Their script will create the standard data split on PTB, filter non-projective trees out of the training data, and generate oracle transition sequences. We will assume your preprocessed data (preprocessed conllx files and oracle sequences) are all stored in a single directory referred to as ${data_dir}.
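
To make the filtering step concrete, here is a minimal sketch of a projectivity check. It is not arc-swift's code: the function name and the `heads` list representation are assumptions made for illustration, and arcs from the root are treated like any other arc, which real preprocessing scripts may handle differently.

```python
def is_projective(heads):
    """Return True if no two dependency arcs cross.

    heads[i] is the head of token i + 1; tokens are 1-indexed and 0 is the root.
    (Hypothetical representation, chosen for this sketch.)
    """
    # Store each arc as (left endpoint, right endpoint), ignoring direction.
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for l1, r1 in arcs:
        for l2, r2 in arcs:
            # The second arc starts strictly inside the first and ends
            # strictly outside it, so the two arcs cross.
            if l1 < l2 < r1 < r2:
                return False
    return True


# "the cat sleeps": the -> cat, cat -> sleeps, sleeps -> root
print(is_projective([2, 3, 0]))
# Arcs (1, 3) and (2, 4) cross, so this tree would be filtered out.
print(is_projective([3, 4, 0, 3]))
```

The quadratic loop is fine for sentence-length inputs; a stack-based linear check would be the usual choice if speed mattered.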

Set up the Stanford POS Tagger and run

bash scripts/postag.sh ${data_dir}/ptb3-wsj-[train|dev|dev.proj|test].conllx

to generate data with Stanford POS tags. The output will be stored in $PWD/data/postags.

You'll also need a special word embedding used in Dyer et al. 2015 which you can download here.

The last step is to build the vocabulary, integerize everything, and dump the result as a binary object:

mkdir -p binarized
python preprocess.py --train_conll_file $PWD/data/postags/ptb3-wsj-train.conllx.pos \
                     --train_oracle_file ${data_dir}/data/train.AH.seq \
                     --dev_conll_file $PWD/data/postags/ptb3-wsj-dev.proj.conllx.pos \
                     --dev_oracle_file ${data_dir}/data/dev.AH.seq \
                     --save_data $PWD/binarized/data+pre+inferpos.en.AH \
                     --pre_word_emb_file ${data_dir}/sskip.100.vectors \
                     --sent_len 150 --action_seq_len 300
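
For readers unfamiliar with the term, "integerizing" means mapping each token to an index in a vocabulary. The sketch below illustrates the idea only; it is not what preprocess.py does internally, and the special tokens, function names, and toy sentences are all assumptions.

```python
from collections import Counter

# Hypothetical special tokens; the real vocabulary may use different ones.
PAD, UNK = "<pad>", "<unk>"


def build_vocab(sentences, min_freq=1):
    """Build token-to-index and index-to-token mappings from tokenized sentences."""
    counts = Counter(tok for sent in sentences for tok in sent)
    # Sort so the mapping is deterministic across runs.
    itos = [PAD, UNK] + sorted(t for t, c in counts.items() if c >= min_freq)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


def integerize(sentence, stoi):
    """Replace each token by its index, falling back to UNK for unseen tokens."""
    return [stoi.get(tok, stoi[UNK]) for tok in sentence]


train = [["the", "cat", "sleeps"], ["the", "dog", "barks"]]
stoi, itos = build_vocab(train)
ids = integerize(["the", "bird", "sleeps"], stoi)
print(ids)  # "bird" is unseen, so it maps to the UNK index
```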

Training

You can explore all the available options by running python train.py --help, but to reproduce the paper the defaults are sufficient:

python train.py --data_file $PWD/binarized/data+pre+inferpos.en.AH \
                --model_file model --gpuid [gpu id]

Parsing

Here is what we use to generate our output. You are free to adjust the batch size as you wish; it does not change the output. The stack size, on the other hand, may need to be changed from corpus to corpus. It doesn't need to match the one you used for training, because it doesn't change the number of parameters. We find 150 to be enough for PTB.

python parse.py --input_file $PWD/data/postags/[conllx input]  \
                --model_file [model file] --output_file out.seq \
                --batch_size 80 --stack_size 150 \
                --data_file $PWD/binarized/data+pre+inferpos.en.AH \
                --pre_emb_file ${data_dir}/sskip.100.vectors
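
One rough way to choose a stack size for a new corpus is to measure the deepest stack reached while replaying oracle sequences. The sketch below rests on several assumptions: that the stack size bounds how many items the stack holds at once, that a shift pushes one item and every arc action pops one (as in an arc-hybrid-style system), and that actions are plain strings. The action names and the `initial_depth` parameter are hypothetical, and the real oracle file format may differ.

```python
def max_stack_depth(actions, initial_depth=0):
    """Return the largest stack depth reached while replaying the actions.

    Assumes "Shift" pushes one item and any other action pops one item.
    """
    depth = peak = initial_depth
    for action in actions:
        if action == "Shift":
            depth += 1
        else:
            depth -= 1
        peak = max(peak, depth)
    return peak


# Toy sequences; take the maximum over the corpus as a lower bound on stack size.
sequences = [
    ["Shift", "Left-Arc", "Shift", "Left-Arc", "Shift", "Right-Arc"],
    ["Shift", "Shift", "Shift", "Right-Arc", "Right-Arc", "Right-Arc"],
]
print(max(max_stack_depth(seq, initial_depth=1) for seq in sequences))
```

Because parser predictions can differ from the oracle, it is probably wise to leave some headroom above the measured value.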

The parser outputs a transition sequence, which is not very useful on its own. To convert it back to conllx format, run the following script:

python oracle2conll.py --fin out.seq --fout out.conllx --transSys AH \
                       --conllin $PWD/data/postags/[conllx input]
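
To show what converting a transition sequence back into a tree involves, here is a minimal sketch that replays unlabeled actions and recovers each token's head. It is not the code in oracle2conll.py. It assumes that AH denotes the arc-hybrid transition system, that the root starts at the bottom of the stack, and that actions are named "Shift", "Left-Arc", and "Right-Arc"; all three are assumptions, and dependency labels are ignored.

```python
def apply_arc_hybrid(n_tokens, actions):
    """Replay arc-hybrid actions and return heads, where heads[i] is the head
    of token i + 1 (tokens are 1-indexed, 0 is the root)."""
    ROOT = 0
    stack = [ROOT]
    buffer = list(range(1, n_tokens + 1))
    heads = [None] * n_tokens
    for action in actions:
        if action == "Shift":
            # Move the front of the buffer onto the stack.
            stack.append(buffer.pop(0))
        elif action == "Left-Arc":
            # The stack top becomes a dependent of the buffer front.
            dep = stack.pop()
            heads[dep - 1] = buffer[0]
        elif action == "Right-Arc":
            # The stack top becomes a dependent of the item below it.
            dep = stack.pop()
            heads[dep - 1] = stack[-1]
        else:
            raise ValueError(f"unknown action: {action}")
    return heads


# "the cat sleeps": the -> cat, cat -> sleeps, sleeps -> root
actions = ["Shift", "Left-Arc", "Shift", "Left-Arc", "Shift", "Right-Arc"]
print(apply_arc_hybrid(3, actions))
```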

Finally, to evaluate the arc F1 score, run the evaluation script from arc-swift (remember that you need to switch back to Python 2).

python $arc_swift/src/eval.py -g [reference] -s out.conllx

Note that while the preprocessed conllx input contains the parse of each sentence, the parse is not used during parsing or postprocessing. If you are parsing plain text files, you'll want to convert them into conllx format and put placeholders in these fields.
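
As an illustration of the conversion, the sketch below writes one tokenized sentence in the 10-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), filling every unknown field with an underscore. The function name is hypothetical, and whether hoolock's scripts accept "_" in every field is an assumption; you may need a different placeholder for HEAD, or real POS tags from the tagging step.

```python
def to_conllx(tokens, placeholder="_"):
    """Format one tokenized sentence as CoNLL-X lines with placeholder fields."""
    lines = []
    for idx, form in enumerate(tokens, start=1):
        # ID and FORM are known; the remaining eight columns are placeholders.
        columns = [str(idx), form] + [placeholder] * 8
        lines.append("\t".join(columns))
    # Sentences in CoNLL-X files are separated by a blank line.
    return "\n".join(lines) + "\n\n"


block = to_conllx("The cat sleeps .".split())
print(block, end="")
```

Whitespace splitting is only a stand-in here; real text should go through a proper tokenizer first.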

Citation

TBD

Naming

According to Wikipedia, hoolock gibbons are generally found in Eastern Bangladesh, Northeast India and Southwest China. Thanks to their special brachiating skills, they can travel between the trees at up to 35 mph, making them the fastest and most agile of all tree-dwelling, non-flying mammals.

Because of shrinking habitat and hunting, most species of the hoolock genus have been classified by IUCN as "Endangered" or "Vulnerable" (see here and here). You can find more information about this IUCN classification and donate to them to help combat the declining global biodiversity.
