Giuseppe Attardi edited this page Aug 19, 2015 · 1 revision

Training

To train a POS tagger, you need to supply the word embeddings, their vocabulary, and a training corpus annotated in tab-separated format, one token per line, with the POS tag in the last field. Sentences must be separated by an empty line. Word embeddings are accepted in three formats:

  1. SENNA, two separate files: lowercased vocabulary and embeddings
  2. polyglot (word2vectors), two separate files: vocabulary and embeddings
  3. word2vec, single file, containing initial line with counts and size, and then one word per line followed by its weights
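
As an illustration of the third format, here is a minimal sketch of a reader for a word2vec text file. It is not the tagger's own loader: the function name `read_word2vec_text` is hypothetical, and the sketch assumes that the first line holds the word count followed by the vector size, with all fields separated by whitespace.

```python
from typing import Dict, Iterable, List, Tuple


def read_word2vec_text(lines: Iterable[str]) -> Tuple[int, Dict[str, List[float]]]:
    """Parse word2vec text format: a header line, then one word per line."""
    it = iter(lines)
    # Assumption: the header is "<number of words> <vector size>".
    header = next(it).split()
    size = int(header[1])
    vectors: Dict[str, List[float]] = {}
    for line in it:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word = parts[0]
        weights = [float(x) for x in parts[1:]]
        if len(weights) != size:
            raise ValueError(f"{word!r}: expected {size} weights, got {len(weights)}")
        vectors[word] = weights
    return size, vectors


sample = "2 3\nthe 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n"
size, vectors = read_word2vec_text(sample.splitlines())
```

The SENNA and polyglot variants keep the vocabulary and the embeddings in two separate files, so they would need a different reader.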

You can optionally use word suffixes as features. Training is invoked like this:

bin/dl-pos.py pos.dnn -t train.tsv \
  --vocab vocab.txt --vectors vectors.txt \
  --caps --suffix \
  -e 40 -l 0.01 -w 5 -n 300 -v
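
To make the expected layout of a file such as train.tsv concrete, the following sketch reads an annotated corpus into sentences. It is only an illustration, not code from the tagger: `read_tagged_sentences` is a hypothetical helper, and it assumes the token is the first field (the description above only fixes the POS tag as the last field).

```python
from typing import Iterable, Iterator, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, POS tag) pairs


def read_tagged_sentences(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences from tab-separated lines, one token per line."""
    sentence: Sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # An empty line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        # Assumption: first field is the token; the last field is the POS tag.
        sentence.append((fields[0], fields[-1]))
    if sentence:
        # The final sentence may not be followed by an empty line.
        yield sentence


sample = "The\tDT\ndog\tNN\nbarks\tVBZ\n\nIt\tPRP\nruns\tVBZ\n"
sentences = list(read_tagged_sentences(sample.splitlines()))
```

Any fields between the token and the tag are ignored by this sketch.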

Tagging

You can invoke the same script for tagging a file:

dl-pos.py pos.dnn < input

where pos.dnn is a model produced by training and input is a file containing one token per line with an empty line to separate sentences.
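
If your text is already tokenized in memory, the input file can be produced with a few lines of Python. This is a sketch under assumptions: `to_tagger_input` is a hypothetical helper, and whether the tagger also expects an empty line after the last sentence is not stated here, so the sketch only separates sentences.

```python
from typing import Iterable, List


def to_tagger_input(sentences: Iterable[List[str]]) -> str:
    """Format tokenized sentences as one token per line.

    Sentences are separated by a single empty line; empty sentences are skipped.
    """
    blocks = ["\n".join(tokens) for tokens in sentences if tokens]
    if not blocks:
        return ""
    # Joining with a double newline leaves one empty line between sentences.
    return "\n\n".join(blocks) + "\n"


text = to_tagger_input([["The", "dog", "barks"], ["It", "runs"]])
```

Writing `text` to a file gives something that can be passed on standard input as in the command above.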

Usage

The full invocation options are:

dl-pos.py [-h] [-c FILE] [--threads THREADS] [-v] [-t TRAIN] [-w WINDOW]
          [-s EMBEDDINGS_SIZE] [-e ITERATIONS] [-l LEARNING_RATE] [-n HIDDEN]
          [--vocab VOCAB] [--vectors VECTORS] [--min-occurr MINOCCURR]
          [--load LOAD] [--variant VARIANT] [--caps [CAPS]]
          [--suffix [SUFFIX]] [--suffixes SUFFIXES] [--prefix [PREFIX]]
          [--prefixes PREFIXES] model

POS tagger using word embeddings.

positional arguments:
  model                 Model file to train/use.

optional arguments:
  -h, --help            show this help message and exit
  -c FILE, --config FILE
                        Specify config file
  --threads THREADS     Number of threads (default 1)
  -v, --verbose         Verbose mode

Train:
  -t TRAIN, --train TRAIN
                        File with annotated data for training.
  -w WINDOW, --window WINDOW
                        Size of the word window (default 5)
  -s EMBEDDINGS_SIZE, --embeddings-size EMBEDDINGS_SIZE
                        Number of features per word (default 50)
  -e ITERATIONS, --epochs ITERATIONS
                        Number of training epochs (default 100)
  -l LEARNING_RATE, --learning_rate LEARNING_RATE
                        Learning rate for network weights (default 0.001)
  -n HIDDEN, --hidden HIDDEN
                        Number of hidden neurons (default 200)

Embeddings:
  --vocab VOCAB         Vocabulary file, either read or created
  --vectors VECTORS     Embeddings file, either read or created
  --min-occurr MINOCCURR
                        Minimum occurrences for inclusion in vocabulary
  --load LOAD           Load previously saved model
  --variant VARIANT     Either "senna" (default), "polyglot" or "word2vec".

Extractors:
  --caps [CAPS]         Include capitalization features. Optionally, supply
                        the number of features
  --suffix [SUFFIX]     Include suffix features. Optionally, supply the number
                        of features (default 5)
  --suffixes SUFFIXES   Load suffixes from this file
  --prefix [PREFIX]     Include prefix features. Optionally, supply the number
                        of features (default 0)
  --prefixes PREFIXES   Load prefixes from this file
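
The bracketed values in options such as `--suffix [SUFFIX]` are optional, while `--suffixes SUFFIXES` always needs a file name. The help text above looks like `argparse` output, so the sketch below shows how such options typically behave in `argparse`. It is an assumption about the mechanism, not the script's actual source: the parser, its `prog` name, and the `const`/`default` values are illustrative, with `const=5` chosen to mirror the "default 5" in the help text.

```python
import argparse

parser = argparse.ArgumentParser(prog="sketch")
parser.add_argument("model")
# nargs="?" makes the value optional:
#   flag absent -> default, bare flag -> const, flag with value -> that value.
parser.add_argument("--suffix", nargs="?", type=int, const=5, default=None)
# No nargs: this option always requires exactly one argument.
parser.add_argument("--suffixes")

bare = parser.parse_args(["pos.dnn", "--suffix"])
explicit = parser.parse_args(["pos.dnn", "--suffix", "3"])
absent = parser.parse_args(["pos.dnn"])
with_file = parser.parse_args(["pos.dnn", "--suffix", "--suffixes", "suffixes.txt"])
```

In this sketch a bare `--suffix` selects the built-in number of features, while passing `--suffixes` without a file name would be rejected by the parser.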