GitHub - langfield/tagger-v

This fork.

In this repo I've made a single small modification to glample's NER tagger: a toggle that turns the trainability of the pretrained embedding matrix on or off. Turning it off freezes the word vectors you use as a starting point for the model (if you use any). The purpose is to evaluate those embeddings: if the train.py workflow is allowed to modify the matrix, we can't tell whether the accuracy comes from the model's ability to retrain and correct errors or from the pretrained embeddings themselves. I'm not 100% sure this feature doesn't already exist in the codebase, but I couldn't find it and didn't want to spend more time searching. I've also added more comments and examples in one or more of the scripts.
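
To make the idea concrete, here is a minimal sketch of what freezing the embedding matrix means during training. It is not the code in this repo (which is built on Theano); the names `sgd_step` and `train_embeddings` are hypothetical and the numbers are toy values. The only point is that a frozen matrix is left out of the parameter update while every other parameter still learns.

```python
def sgd_step(params, grads, lr=0.1, train_embeddings=True):
    """Return updated parameters after one plain SGD step.

    `params` and `grads` map parameter names to matrices (lists of rows).
    When `train_embeddings` is False, the "embeddings" entry is passed
    through unchanged, i.e. the pretrained vectors stay frozen.
    """
    updated = {}
    for name, rows in params.items():
        if name == "embeddings" and not train_embeddings:
            updated[name] = rows  # frozen: no gradient update applied
            continue
        updated[name] = [
            [w - lr * g for w, g in zip(row, grad_row)]
            for row, grad_row in zip(rows, grads[name])
        ]
    return updated


# Toy parameters and gradients (hypothetical values for illustration only).
params = {
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
    "output_weights": [[1.0, -1.0], [0.5, 0.5]],
}
grads = {
    "embeddings": [[1.0, 1.0], [1.0, 1.0]],
    "output_weights": [[1.0, 1.0], [1.0, 1.0]],
}

frozen = sgd_step(params, grads, train_embeddings=False)
trained = sgd_step(params, grads, train_embeddings=True)
```

With `train_embeddings=False`, any change in accuracy has to come from the other parameters, which is what makes the frozen setting useful for comparing sets of pretrained embeddings.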

NER Tagger

NER Tagger is an implementation of a Named Entity Recognizer that obtains state-of-the-art performance in NER on the 4 CoNLL datasets (English, Spanish, German and Dutch) without resorting to any language-specific knowledge or resources such as gazetteers. Details about the model can be found at: http://arxiv.org/abs/1603.01360

Initial setup

To use the tagger, you need Python 2.7 with NumPy and Theano installed.

Tag sentences

The fastest way to use the tagger is to use one of the pretrained models:

./tagger.py --model models/english/ --input input.txt --output output.txt

The input file should contain one sentence per line, and the sentences must be tokenized; otherwise, the tagger will perform poorly.
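
As a rough illustration of the expected layout, the sketch below turns raw sentences into one space-separated, tokenized sentence per line. The regex tokenizer and the function names (`tokenize`, `to_tagger_input`) are assumptions made for this example; the exact tokenization the pretrained models were trained on is not specified here, so a proper tokenizer for your language may give better results.

```python
import re


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens.

    Words may contain internal hyphens or apostrophes; every other
    non-space, non-word character becomes its own token.
    """
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)


def to_tagger_input(sentences):
    """Format sentences as one tokenized, space-separated sentence per line."""
    return "\n".join(" ".join(tokenize(s)) for s in sentences) + "\n"


text = to_tagger_input([
    "Barack Obama visited Paris.",
    "He met Angela Merkel, too.",
])
```

Writing `text` to input.txt would give a file in the shape the tagger command above expects.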

Train a model

To train your own model, use the train.py script and provide the locations of the training, development and test sets:

./train.py --train train.txt --dev dev.txt --test test.txt

The training script will automatically name the model and store it in ./models/. There are many parameters you can tune (CRF, dropout rate, embedding dimension, LSTM hidden layer size, etc.). To see all parameters, simply run:

./train.py --help

Input files for the training script have to follow the same format as the CoNLL-2003 shared task: each word has to be on a separate line, and there must be an empty line after each sentence. A line must contain at least 2 columns, the first being the word itself and the last being the named entity tag. Extra columns in between (for example, part-of-speech tags or chunks) are allowed and do not matter. Tags have to be given in the IOB format (either IOB1 or IOB2).
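
The format described above can be sketched with a small reader. This is an illustration only, not the loader used by this repo; `read_conll` is a hypothetical helper, and the sample lines are illustrative data in the four-column CoNLL-2003 layout (word, POS tag, chunk, entity tag).

```python
def read_conll(lines):
    """Group CoNLL-style lines into sentences of (word, tag) pairs.

    The word is taken from the first column and the entity tag from the
    last; any columns in between are ignored. Blank lines end a sentence.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        if len(columns) < 2:
            raise ValueError("expected at least 2 columns: %r" % line)
        current.append((columns[0], columns[-1]))
    if current:  # file may not end with a blank line
        sentences.append(current)
    return sentences


sample = """EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
""".splitlines()

sentences = read_conll(sample)
```

Running a check like this over train.txt, dev.txt and test.txt before training is a cheap way to catch lines with a missing tag column.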

