GitHub - turian/crfchunking-with-wordrepresentations: Train a CRF for syntactic chunking (CoNLL2000), and use word representations

CRF chunking with word representations
--------------------------------------

    scripts and steps by Joseph Turian

A standard baseline in NLP is chunking (shallow parsing), which was the
CoNLL 2000 shared task. Training a CRF is a standard approach to this task
(Sha + Pereira 2003). In fact, many CRF implementations include
instructions for reimplementing the Sha+Pereira chunker with identical
features:
    crfsgd: http://leon.bottou.org/projects/sgd
    crf++: http://crfpp.sourceforge.net/
    CRFsuite: http://www.chokkan.org/software/crfsuite/

We use CRFsuite because it makes it simple to modify the feature
generation code, so one can easily add new features.

This repository contains instructions and scripts for adding word
representations (Brown clusters and/or word embeddings) as features
during training.
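
As a rough illustration of what "adding word representations" can mean
for a CRF, here is a minimal sketch in Python. It assumes Brown clusters
are used as bit-string prefixes and embeddings as one real-valued feature
per dimension. The function names, the feature-name strings, the prefix
lengths and the scale parameter are all hypothetical; this is not the
code in scripts/.

```python
def brown_features(word, clusters, prefix_lengths=(4, 6, 10, 20)):
    """Return one feature string per Brown-cluster path prefix.

    clusters maps a word to its cluster bit string, e.g. "0110101".
    The prefix lengths are an assumption made for this sketch.
    """
    path = clusters.get(word)
    if path is None:
        return []  # unknown word: no cluster features
    # Slicing past the end of the string simply returns the whole path.
    return ["brown%d=%s" % (n, path[:n]) for n in prefix_lengths]


def embedding_features(word, embeddings, scale=1.0):
    """Return (name, value) pairs, one per embedding dimension.

    embeddings maps a word to a list of floats. scale is a hypothetical
    knob for shrinking or growing the real-valued features.
    """
    vector = embeddings.get(word)
    if vector is None:
        return []  # unknown word: no embedding features
    return [("emb%d" % i, scale * value) for i, value in enumerate(vector)]
```

These features would be appended to the baseline features of each token
before the data is handed to the CRF trainer.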

INSTALLATION:
-------------

Download and install CRFsuite: http://www.chokkan.org/software/crfsuite/

You will need my common Python library:
    http://github.com/turian/common

Go into data/ and download the CoNLL train and test files:
    cd data/
    wget http://www.cnts.ua.ac.be/conll2000/chunking/train.txt.gz
    wget http://www.cnts.ua.ac.be/conll2000/chunking/test.txt.gz
    gunzip *.gz
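
The CoNLL-2000 files contain one token per line as "word POS chunk-tag",
with a blank line between sentences. A minimal Python sketch for reading
them follows; the function name read_conll2000 is hypothetical and is
not part of scripts/.

```python
def read_conll2000(lines):
    """Yield sentences as lists of (word, pos, chunk) tuples.

    lines is any iterable of text lines, e.g. an open file.
    """
    sentence = []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        word, pos, chunk = line.split()
        sentence.append((word, pos, chunk))
    # The last sentence may not be followed by a blank line.
    if sentence:
        yield sentence
```

Usage, assuming the files were unpacked as above:

    with open("data/train.txt") as f:
        sentences = list(read_conll2000(f))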
    
Download word representations:
    cd representations/

    wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c100-freq1.txt
    wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c320-freq1.txt
    wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c1000-freq1.txt
    wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c3200-freq1.txt

    wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-1750000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.EMBEDDING_SIZE%3d200.txt.gz
    wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2030000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.EMBEDDING_SIZE%3d100.txt.gz
    wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2270000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.txt.gz
    wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2280000000.LEARNING_RATE%3d1e-08.EMBEDDING_LEARNING_RATE%3d1e-07.EMBEDDING_SIZE%3d25.txt.gz
    ln -s model-1750000000.LEARNING_RATE=1e-09.EMBEDDING_LEARNING_RATE=1e-06.EMBEDDING_SIZE=200.txt.gz cw-embeddings-200dim.txt.gz
    ln -s model-2030000000.LEARNING_RATE=1e-09.EMBEDDING_LEARNING_RATE=1e-06.EMBEDDING_SIZE=100.txt.gz cw-embeddings-100dim.txt.gz
    ln -s model-2270000000.LEARNING_RATE=1e-09.EMBEDDING_LEARNING_RATE=1e-06.txt.gz cw-embeddings-50dim.txt.gz
    ln -s model-2280000000.LEARNING_RATE=1e-08.EMBEDDING_LEARNING_RATE=1e-07.EMBEDDING_SIZE=25.txt.gz cw-embeddings-25dim.txt.gz

    wget http://pylearn.org/turian/hlbl_reps_clean_1.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz
    wget http://pylearn.org/turian/hlbl_reps_clean_2.50d.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz
    ln -s hlbl_reps_clean_1.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz hlbl-embeddings-100dim.txt.gz
    ln -s hlbl_reps_clean_2.50d.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz hlbl-embeddings-50dim.txt.gz 
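
Once downloaded, the representation files can be loaded into
dictionaries. The sketch below makes two assumptions about the file
layouts that you should check against the actual files: Brown cluster
files hold tab-separated "bitstring, word, count" lines, and embedding
files hold whitespace-separated "word v1 v2 ..." lines. The function
names are hypothetical.

```python
def parse_brown_lines(lines):
    """Map each word to its cluster bit string.

    Assumes tab-separated lines: bitstring, word, count.
    """
    clusters = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        bits, word = fields[0], fields[1]
        clusters[word] = bits
    return clusters


def parse_embedding_lines(lines):
    """Map each word to its embedding as a list of floats.

    Assumes whitespace-separated lines: word v1 v2 ... vN.
    """
    embeddings = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        embeddings[fields[0]] = [float(x) for x in fields[1:]]
    return embeddings
```

Usage, reading the gzipped embeddings through the symlink created above:

    import gzip
    with gzip.open("cw-embeddings-50dim.txt.gz", "rt") as f:
        embeddings = parse_embedding_lines(f)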



BATCH EVALUATIONS
-----------------

    ./scripts/train-and-evaluate.py --name baseline --dev --l2 2

WARNING: Every time you change the --features parameter, you should also
change the --name.


NOTE:
-----

CRFsuite has benchmark results on the CoNLL shared task:
    http://www.chokkan.org/software/crfsuite/benchmark.html

However, I did not achieve a comparable F1 score on the CoNLL test set
until I used the following parameters:

Dev F1    Test F1   params
94.04     93.63     l2=2
94.03     93.65     l2=3.2, possible_transitions=1
94.15     93.73     l2=3.2, possible_transitions=1, possible_states=1
94.16     93.79     SGD, l2=3.2, possible_transitions=1, possible_states=1
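
For reference, F1 in the CoNLL-2000 task is computed over whole chunks
rather than individual tags: a predicted chunk counts as correct only if
its span and its type both match the gold chunk. The following is a
simplified Python sketch of that computation, not the official conlleval
script; the function names are hypothetical.

```python
def extract_chunks(tags):
    """Return the set of (start, end, type) chunks in a BIO tag sequence.

    end is exclusive. An I- tag that does not continue an open chunk of
    the same type is treated as starting a new chunk.
    """
    chunks = set()
    start, chunk_type = None, None
    # The trailing "O" sentinel closes any chunk still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == chunk_type
        if start is not None and not continues:
            chunks.add((start, i, chunk_type))
            start, chunk_type = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, chunk_type = i, tag[2:]
    return chunks


def chunk_f1(gold_sentences, predicted_sentences):
    """Return chunk-level F1 (0 to 100) aggregated over all sentences."""
    correct = n_gold = n_predicted = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        gold_chunks = extract_chunks(gold)
        predicted_chunks = extract_chunks(predicted)
        correct += len(gold_chunks & predicted_chunks)
        n_gold += len(gold_chunks)
        n_predicted += len(predicted_chunks)
    if correct == 0:
        return 0.0
    precision = correct / n_predicted
    recall = correct / n_gold
    return 100.0 * 2 * precision * recall / (precision + recall)
```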

I chose the l2 penalty on the dev set, which was a subset of the
training data. I then used this l2 penalty and retrained on the entire
training set.
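
The selection procedure can be sketched in Python as follows. The
dev_fraction value, the function names, and the dev_f1 callback (which
would train on the reduced training set with a given l2 and return the
dev F1) are hypothetical; the README does not state how large the dev
subset was.

```python
def split_train_dev(sentences, dev_fraction=0.1):
    """Hold out the last dev_fraction of the sentences as a dev set.

    The fraction is an assumption made for this sketch.
    """
    cut = len(sentences) - int(len(sentences) * dev_fraction)
    return sentences[:cut], sentences[cut:]


def choose_l2(candidates, dev_f1):
    """Return the candidate l2 penalty with the highest dev F1.

    dev_f1 is a caller-supplied function mapping an l2 value to the
    F1 score obtained on the dev set after training with that value.
    """
    return max(candidates, key=dev_f1)
```

After choosing the penalty this way, the final model is trained with it
on the full training set (train plus dev) and evaluated once on the test
set.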