
mlt_thesis_open_sesame

Based on open-sesame: https://github.com/swabhs/open-sesame

Installation

This project is developed using Python 2.7. Other requirements include the DyNet library and some NLTK packages.

$ pip install dynet
$ pip install nltk
$ python -m nltk.downloader averaged_perceptron_tagger wordnet
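
To sanity-check the installation, a quick import test can be run (this one-liner is illustrative, not part of the repository):

$ python -c "import dynet; import nltk; print('ok')"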

Data Preprocessing

Data must be preprocessed into the CoNLL 2009-like format used by open-SESAME, extended with BIO tags; this format is easier to read than the original XML format. See sample CoNLL formatting here.
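
As a rough illustration of the idea, each token gets one line, with columns for the lexical unit, the frame, and a BIO-tagged frame element (B- begins a span, I- continues it, O is outside). The example below is simplified and hypothetical; the real files contain more columns, elided here as "...", so consult the sample formatting for the exact layout:

1  He      ...  _      _             B-Buyer
2  bought  ...  buy.v  Commerce_buy  O
3  a       ...  _      _             B-Goods
4  car     ...  _      _             I-Goods
5  .       ...  _      _             O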

Open-SESAME provides a preprocessing script, which should be adapted to lingFN so that users can preprocess the data by executing:

$ python -m sesame.preprocess

From the open-SESAME documentation: "The above script writes the train, dev and test files in the required format into the data/neural/fn1.7/ directory. A large fraction of the annotations are either incomplete, or inconsistent. Such annotations are discarded, but logged under preprocess-fn1.7.log, along with the respective error messages."

This does NOT work yet; the script still has to be adapted to the lingFN XML data. A possible starting point is sketched below.
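
A minimal sketch of reading the full-text XML, assuming the lingFN files follow the FrameNet 1.7 fullTextAnnotation schema; the namespace, element, and attribute names below are assumptions that must be verified against the actual lingFN data:

from __future__ import print_function
import xml.etree.ElementTree as ET

# Assumed namespace: FrameNet 1.7 full-text files use this URI.
NS = {"fn": "http://framenet.icsi.berkeley.edu"}

def read_fulltext(path):
    """Yield (sentence text, LU name, frame name) triples from one file."""
    tree = ET.parse(path)
    for sent in tree.getroot().findall("fn:sentence", NS):
        text = sent.find("fn:text", NS).text
        for aset in sent.findall("fn:annotationSet", NS):
            # luName/frameName identify the annotated target, if present.
            yield text, aset.get("luName"), aset.get("frameName")

# Hypothetical file name, for illustration only.
for text, lu, frame in read_fulltext("data/fndata-1.7/fulltext/example.xml"):
    print(text, lu, frame)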

  1. The data/ directory under the root contains the XML full-text annotations (fndata-1.7/fulltext), the CoNLL-formatted data (data/neural/), and the frames, LUs, and FEs for each frame in LingFN (fndata-1.7/frame and fndata-1.7/frame_no_data_fes).

  2. This project uses pretrained GloVe word embeddings of 100 dimensions, trained on 6B tokens. Download and extract them under data/, as shown in the example after this list.

  3. Optionally, adjust the configurations in configurations/global_config.json, e.g. to use different pretrained embeddings.
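
For example, to fetch the standard GloVe 6B distribution (the URL is Stanford NLP's official download; the extraction target is an assumption to match the setup above):

$ wget http://nlp.stanford.edu/data/glove.6B.zip
$ unzip glove.6B.zip -d data/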

Training

Frame-semantic parsing involves target identification, frame identification, and argument identification; each step is trained independently of the others. To use the character-based model, pass the argument --character_based.

To train a model, execute:

$ python -m sesame.$MODEL --mode train --model_name $MODEL_NAME --character_based

The $MODELs are called targetid (LU identification), frameid (frame identification), and argid (FE identification). Training saves the model that performs best on the validation data under logs/$MODEL_NAME/best-$MODEL-1.7-model. The same directory also receives a configurations.json containing the current model configuration.
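
For example, to train a character-based frame identification model (the model name here is arbitrary):

$ python -m sesame.frameid --mode train --model_name my_frameid_model --character_based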

If training gets interrupted, it can be restarted from the last saved checkpoint by specifying --mode refresh.
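
For example:

$ python -m sesame.$MODEL --mode refresh --model_name $MODEL_NAME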

Pre-trained Models

The pretrained model from my MLT thesis is called character_bilstm_fixed_fulldata_PCA_space. To rename it, rename the directory that contains the model file.
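
For example, assuming the model directory lives under logs/ as it does after training (the new name below is hypothetical):

$ mv logs/character_bilstm_fixed_fulldata_PCA_space logs/my_model_name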

Note: according to open-SESAME, there is a known open issue where pretrained models cannot replicate the reported performance on a different machine. I did not experience this, but performance should be replicable by training and testing from scratch.

Test

The models for target identification, frame identification, and argument identification need to be executed in that order; this means that the argid model, for example, should be tested with the LUs and frames already given. To test a given model, execute the following command, using --character_based for the character-based model:

$ python -m sesame.$MODEL --mode test --model_name $MODEL_NAME --character_based

The output, in a CoNLL 2009-like format, will be written to logs/$MODEL_NAME/predicted-1.7-$MODEL-test.conll; for frame and argument identification it is also written in the frame-elements file format to logs/$MODEL_NAME/predicted-1.7-$MODEL-test.fes.
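
For example, to evaluate the pretrained argument identification model from the thesis (see Pre-trained Models above):

$ python -m sesame.argid --mode test --model_name character_bilstm_fixed_fulldata_PCA_space --character_based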

1. Target Identification

$MODEL = targetid

A bidirectional LSTM model takes into account the lexical unit index in FrameNet to identify targets. This model has not been described in the paper.

2. Frame Identification

$MODEL = frameid

Frame identification is based on a bidirectional LSTM model. Targets and their respective lexical units need to be identified before this step. At test time, example-wise analysis is logged in the model directory.

3. Argument (Frame-Element) Identification

$MODEL = argid

Argument identification is based on a segmental recurrent neural net, used as the baseline in the paper. Targets and their respective lexical units need to be identified, and frames corresponding to the LUs predicted before this step. At test time, example-wise analysis is logged in the model directory.

Prediction on unannotated data

To predict targets, frames, and arguments on unannotated data, pretrained models are required. The input must be given as a file containing one sentence per line. The following steps produce the full frame-semantic parse of the sentences:

$ python -m sesame.targetid --mode predict --model_name $MODEL_NAME --raw_input $filename.conll
$ python -m sesame.frameid --mode predict --model_name $MODEL_NAME --raw_input $filename.conll
$ python -m sesame.argid --mode predict --model_name $MODEL_NAME --raw_input $filename.conll --character_based

The resulting frame-semantic parses will be written to logs/$MODEL_NAME/predicted-args.conll in the same CoNLL 2009-like format.
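
Note that in upstream open-SESAME the targetid step reads the plain-text sentence file directly, and each later step consumes the .conll predictions of the previous one. A hypothetical end-to-end run following that convention (model and file names are placeholders; verify the predicted-* file names against this fork) might look like:

$ python -m sesame.targetid --mode predict --model_name my_targetid --raw_input sentences.txt
$ python -m sesame.frameid --mode predict --model_name my_frameid --raw_input logs/my_targetid/predicted-targets.conll
$ python -m sesame.argid --mode predict --model_name my_argid --raw_input logs/my_frameid/predicted-frames.conll --character_based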
