
CRF-LSTM-NER

A CRF-BiLSTM model for quickly and conveniently benchmarking the performance of different word embeddings on your own corpus.

The objectives of this model are:

  • Build a CRF-BiLSTM network in TensorFlow with methods for easily switching among different word embeddings (Word2vec, GloVe, fastText, ELMo, Flair, and any combination of them) while keeping the same CRF-BiLSTM network unchanged.

  • Methods for easily grid-searching suitable hyper-parameters.

Requirements

Python 3, TensorFlow 1.0+, Gensim, and Flair (optional).

How To Use

  1. Set the path to the corpus and configure the hyper-parameters accordingly in config.py:
    # embedding sizes
    dim_word = 300
    dim_char = 50

    # network sizes
    hidden_size_char = 64   # LSTM over characters
    hidden_size_lstm = 128  # LSTM over word embeddings

    # dataset
    path_data_root = 'data/CoNLL2003/'
    path_train = path_data_root + 'eng.train'
    path_eval = path_data_root + 'eng.testa'
    path_test = path_data_root + 'eng.testb'
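
    For reference, the eng.* files above are in the standard CoNLL-2003 column format: one token per line followed by its POS tag, chunk tag, and NER label, with a blank line separating sentences. A short excerpt:

    U.N. NNP I-NP I-ORG
    official NN I-NP O
    Ekeus NNP I-NP I-PER
    heads VBZ I-VP O
    for IN I-PP O
    Baghdad NNP I-NP I-LOC
    . . O O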
  2. Designate the embedding you want. Since different embeddings come with different file formats, this part may vary slightly according to the embedding you choose. There is an example for each in "How To Use.ipynb":
    # glove
    config = Config('glove')
    glove_file_path = 'data/glove/glove.6B.100d.txt'
    config.init_glove(glove_file_path)

    # fasttext
    config = Config('fasttext')
    command = '../fastText/fasttext'
    bin_file = '../fastText/data/cc.en.300.bin'
    config.init_fasttext(command, bin_file)
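
    Other embeddings follow the same two-step pattern: construct a Config, then call its init method. For example, a Word2vec setup might look like the sketch below; the init_word2vec method name and the file path are assumptions made by analogy with the GloVe and fastText calls above, so check "How To Use.ipynb" for the exact API:

    # word2vec (method name and path are assumptions; see "How To Use.ipynb")
    config = Config('word2vec')
    w2v_file_path = 'data/word2vec/GoogleNews-vectors-negative300.bin'
    config.init_word2vec(w2v_file_path)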
  3. Parse the corpus and generate the "index" and "input". Based on the vocabularies of the embedding and the corpus, the following code generates indices for tokens/characters/labels and maps each sentence into a sequence of indices. This step also handles the corpus-specific configuration of the model, such as the number of label types and the number of unique characters in the corpus:
# parse the corpus and generate the input data
token2idx, char2idx, label2idx, lookup_table = get_idx(config)
train_x, train_y = get_inputs('train', token2idx, char2idx, label2idx, config)
eval_x, eval_y = get_inputs('eval', token2idx, char2idx, label2idx, config)
test_x, test_y = get_inputs('test', token2idx, char2idx, label2idx, config)
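
As a quick sanity check, you can inspect the sizes of the generated mappings. This assumes token2idx, char2idx, and label2idx are plain Python dicts and train_x is a list of encoded sentences, which the names suggest but which should be confirmed against the repo:

# sizes of the generated vocabularies (assumes dict/list outputs)
print(len(token2idx), len(char2idx), len(label2idx))
print(len(train_x))  # number of encoded training sentences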
  4. Initialize the model's graph and session, then train/evaluate/test:
# initialize the NER model
ner_model = Model(config)
ner_model.build_graph()
ner_model.initialize_session()
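
Note that the snippet above only builds the graph and opens a session; the actual training and evaluation calls are shown in "How To Use.ipynb". A sketch of what that step typically looks like, with method names that are assumptions rather than the repo's verified API:

# method names below are assumptions; see "How To Use.ipynb" for the exact calls
ner_model.train(train_x, train_y, eval_x, eval_y)
ner_model.evaluate(test_x, test_y)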
  5. Result: the label-based F1 score will be printed, and details of the training process can be found in "./output/log.log".

You can find more details in "How To Use.ipynb".

Reference

This model is based on the following papers:
