Part-of-speech tagging for Treebank of Learner English corpora with Recurrent Neural Networks

Motivation

Part-of-speech (POS) tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. (Wikipedia)

POS tagging is fundamental to many NLP/NLU tasks, such as Named Entity Recognition (NER) and Abstract Meaning Representation (AMR). In this project, I want to explore state-of-the-art Recurrent Neural Network (RNN) based models for POS tagging. The candidate models are:

  • Long Short-Term Memory (LSTM)
  • Bidirectional LSTM (BI-LSTM)
  • LSTM with a Conditional Random Field (CRF) layer (LSTM-CRF)
  • Bidirectional LSTM with a CRF layer (BI-LSTM-CRF)

I will apply the above models to two tasks:

  1. Continuous POS tagging with RNNs
  2. POS resemblance between learners with different native language backgrounds

(Update 2018/04/18: task 2 is added)
(Update 2018/04/14: the BI-LSTM is added)
(Update 2018/04/12: the basic LSTM and task 1 are added)

Dataset

UD English-ESL/TLE is a collection of 5,124 English as a Second Language (ESL) sentences (97,681 words), manually annotated with POS tags and dependency trees in the Universal Dependencies formalism. Each sentence is annotated both in its original and error-corrected forms. The annotations follow the standard English UD guidelines, along with a set of supplementary guidelines for ESL. The dataset represents upper-intermediate level adult English learners from 10 native language backgrounds, with over 500 sentences for each native language. The sentences were randomly drawn from the Cambridge Learner Corpus First Certificate in English (FCE) corpus. The treebank is split randomly into a training set of 4,124 sentences, a development set of 500 sentences, and a test set of 500 sentences. Further information is available at esltreebank.org.

Citation: (Berzak et al., 2016; Yannakoudakis et al., 2011)

Data Loader

I've built a data loader for this dataset. To use it, you first need to install the CoNLL-U Parser built by Emil Stenström (available on PyPI as conllu, e.g. pip install conllu). The following is an example of using data_loader:

import data_loader

meta_list, data_list = data_loader.load_data(load_train=True, load_dev=True, load_test=True)

train_meta, train_meta_corrected, \
dev_meta, dev_meta_corrected, \
test_meta, test_meta_corrected = meta_list

train_data, train_data_corrected, \
dev_data, dev_data_corrected, \
test_data, test_data_corrected = data_list

Metadata

  • doc_id: filename (also learner ID) of the original XML file
  • sent: raw text of the sentence written by the learner, with error-correction tags
  • native_language: native language of the learner
  • age_range: age range of the learner
  • score: exam score of the learner

Some observations:

  • "native_language" enables us to design tasks related to native language identificaiton.
  • "age_range" enables us to identify the learner's age based on his/her writing style.
  • "score" can help us to group learners into categories, such as Beginner, Intermediate, Expert, Fluent, Proficient. It enables us to discover the writing style and common mistakes of different groups of learners.
train_meta.head()

| id | doc_id | sent | errors | native_language | age_range | score |
|----|---------|------|--------|-----------------|-----------|-------|
| 1 | doc2664 | I was <ns type="S"><i>shoked</i><c>shocked</c>... | {'S': 2, 'RV': 1} | Russian | 21-25 | 21.0 |
| 2 | doc648 | I am very sorry to say it was definitely not a... | {'RT': 1, 'MT': 1} | French | 26-30 | 38.0 |
| 3 | doc1081 | Of course, I became aware of her feelings sinc... | {'AGQ': 1} | Spanish | 16-20 | 36.0 |
| 4 | doc724 | I also suggest that more plays and films shoul... | {'RV': 1, 'FV': 1} | Japanese | 21-25 | 33.0 |
| 5 | doc567 | Although my parents were very happy <ns type="... | {'FD': 1, 'RJ': 1, 'RT': 1, 'MT': 1} | Spanish | 31-40 | 34.0 |
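As a quick, hypothetical illustration of the last observation, the following sketch groups learners into score bands with pandas. The band boundaries and labels are arbitrary assumptions for illustration, not categories defined by the dataset:

import pandas as pd

# Illustrative score bands only; the cut points are assumptions, not part of the corpus.
bands = pd.cut(train_meta["score"],
               bins=[0, 20, 30, 40],
               labels=["Beginner", "Intermediate", "Proficient"])
print(train_meta.groupby(bands)["doc_id"].nunique())  # learners per band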

Sentence Format

In this project, we will only use "form" (words) and "upostag" (part-of-speech tags).

train_data[0]
| id | form | lemma | upostag | xpostag | feats | head | deprel | deps | misc | meta_id |
|----|------|-------|---------|---------|-------|------|--------|------|------|---------|
| 1 | I | _ | PRON | PRP | None | 3 | nsubj | None | None | 1 |
| 2 | was | _ | VERB | VBD | None | 3 | cop | None | None | 1 |
| 3 | shoked | _ | ADJ | JJ | None | 0 | root | None | None | 1 |
| 4 | because | _ | SCONJ | IN | None | 8 | mark | None | None | 1 |
| 5 | I | _ | PRON | PRP | None | 8 | nsubj | None | None | 1 |
| 6 | had | _ | AUX | VBD | None | 8 | aux | None | None | 1 |
| 7 | alredy | _ | ADV | RB | None | 8 | advmod | None | None | 1 |
| 8 | spoken | _ | VERB | VBN | None | 3 | advcl | None | None | 1 |
| 9 | with | _ | ADP | IN | None | 10 | case | None | None | 1 |
| 10 | them | _ | PRON | PRP | None | 8 | nmod | None | None | 1 |
| 11 | and | _ | CONJ | CC | None | 8 | cc | None | None | 1 |
| 12 | I | _ | PRON | PRP | None | 14 | nsubj | None | None | 1 |
| 13 | had | _ | AUX | VBD | None | 14 | aux | None | None | 1 |
| 14 | taken | _ | VERB | VBN | None | 8 | conj | None | None | 1 |
| 15 | two | _ | NUM | CD | None | 16 | nummod | None | None | 1 |
| 16 | autographs | _ | NOUN | NNS | None | 14 | dobj | None | None | 1 |
| 17 | . | _ | PUNCT | . | None | 3 | punct | None | None | 1 |
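Since only these two columns matter here, a minimal sketch of turning each sentence into a (words, tags) pair might look like the following. It assumes each element of train_data is a per-sentence table like the one printed above; if the loader returns conllu token lists instead, the same fields are available as token["form"] and token["upostag"]:

def to_pairs(sentences):
    # Each sentence becomes (list of words, list of POS tags).
    return [(list(s["form"]), list(s["upostag"])) for s in sentences]

training_pairs = to_pairs(train_data)
words, tags = training_pairs[0]
print(words[:3], tags[:3])  # e.g. ['I', 'was', 'shoked'] ['PRON', 'VERB', 'ADJ']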

RNN Models

In this project, we mainly use PyTorch to implement the RNN models. The following models have been implemented so far:

Long Short-Term Memory (LSTM)

Long short-term memory (LSTM) units (or blocks) are a building unit for layers of a recurrent neural network (RNN). An RNN composed of LSTM units is often called an LSTM network. A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell is responsible for "remembering" values over arbitrary time intervals; hence the word "memory" in LSTM. (Wikipedia)

The following is the high-level architecture for the LSTM model:

[Figure: Task1_LSTM]
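A minimal PyTorch sketch of such a tagger is shown below. The hidden size is an illustrative assumption; the input is a sequence of pre-trained 300-dimensional word vectors, and the output is a score for each of the 17 UD POS tags:

import torch.nn as nn

class LSTMTagger(nn.Module):
    def __init__(self, embedding_dim=300, hidden_dim=128, num_tags=17):
        super().__init__()
        # Inputs are pre-trained word vectors, so no embedding layer is needed here.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_tags)

    def forward(self, sentence_vectors):
        # sentence_vectors: (batch, seq_len, embedding_dim)
        outputs, _ = self.lstm(sentence_vectors)
        return self.fc(outputs)  # (batch, seq_len, num_tags) tag scores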

Bidirectional LSTM (BI-LSTM)

The BI-LSTM model is derived from the Bidirectional RNN (BRNN) (Schuster and Paliwal, 1997).

The principle of BRNN is to split the neurons of a regular RNN into two directions, one for the positive time direction (forward states) and another for the negative time direction (backward states). The outputs of these two states are not connected to the inputs of the opposite-direction states. By using two time directions, input information from the past and future of the current time frame can be used, unlike a standard RNN, which requires delays to include future information. (Wikipedia)

The BI-LSTM follows the BRNN design but replaces the RNN units with LSTM units. The following is the high-level architecture for the BI-LSTM model:

[Figure: Task1_BILSTM]
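In PyTorch, the change from the LSTM sketch above is small: setting bidirectional=True concatenates the forward and backward states, so the output layer must be twice as wide (again a sketch with assumed hyperparameters):

import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, embedding_dim=300, hidden_dim=128, num_tags=17):
        super().__init__()
        self.lstm = nn.LSTM(embedding_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        # Forward and backward hidden states are concatenated, hence 2 * hidden_dim.
        self.fc = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, sentence_vectors):
        outputs, _ = self.lstm(sentence_vectors)
        return self.fc(outputs)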

Task 1: Continuous POS tagging with RNNs

Architecture

In this task, a POS tagger was trained on all train data (4,124 sentences), validated on the dev data (500 sentences), and tested on the test data (500 sentences). The following is the architecture:

[Figure: Task1 Architecture]

Word Features

We use the pre-trained Word2Vec model built on the Google News corpus (3 million 300-dimensional English word vectors). Although it might not be the best choice (e.g. the Google News corpus might not be representative of English-learner text), it is still a reasonable one: 1) it saves the time of building a large dictionary covering all words in the UD English-ESL/TLE corpus; 2) it avoids building large, sparse unigram vectors for words, so there is no need to worry about dimensionality reduction for now; 3) a 300-dimensional word2vec vector is compact enough for this task, and its fixed dimension means it can be fed directly into the network; 4) it's free and available on Google Drive :).
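A minimal sketch of loading these vectors with gensim follows. The filename is the standard name of the public release, and the zero-vector fallback for out-of-vocabulary words is a simple placeholder choice, not necessarily what this project does:

import numpy as np
from gensim.models import KeyedVectors

# The GoogleNews vectors are distributed as a large (~3.6 GB) binary file.
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def embed(word, dim=300):
    # Zero vector for out-of-vocabulary words; a placeholder OOV strategy.
    return w2v[word] if word in w2v else np.zeros(dim, dtype=np.float32)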

Experiments

Performance

The dataset was divided into train, dev, and test sets. We used the train and dev sets to observe the fluctuation of accuracy and loss over 100 epochs of training. There are 17 distinct POS tags in this experiment. A prediction is counted as a true positive only if it exactly matches the gold POS tag. The models are optimized with Stochastic Gradient Descent (SGD) at different learning rates (lr), with cross-entropy loss; a sketch of this setup follows.
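The sketch below shows the corresponding training setup (per-sentence updates; batching and evaluation omitted). BiLSTMTagger refers to the sketch above, and the learning rate is one of the values from the tables below:

import torch.nn as nn
import torch.optim as optim

model = BiLSTMTagger()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.5)

def train_step(sentence_vectors, tag_ids):
    # sentence_vectors: (1, seq_len, 300) float tensor; tag_ids: (seq_len,) long tensor
    optimizer.zero_grad()
    scores = model(sentence_vectors).squeeze(0)  # (seq_len, num_tags)
    loss = criterion(scores, tag_ids)
    loss.backward()
    optimizer.step()
    return loss.item()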

The following is the best performance after 100 epochs:

  • lr = 0.5

| Model | Train Accuracy | Dev Accuracy | Test Accuracy |
|---------|----------------|--------------|---------------|
| LSTM | 89.28% | 83.90% | 83.31% |
| BI-LSTM | 93.25% | 88.00% | 88.00% |

  • lr = 0.1

| Model | Train Accuracy | Dev Accuracy | Test Accuracy |
|---------|----------------|--------------|---------------|
| LSTM | 73.77% | 71.86% | 70.90% |
| BI-LSTM | 78.37% | 76.17% | 75.62% |

The BI-LSTM model consistently performs better than the LSTM model and achieves 88.00% test accuracy (lr = 0.5).

Parameter Tuning

The following are the train/dev accuracy and loss curves over 100 epochs:

  • lr = 0.5

[Figure: Task1_Accu_lr0.5] [Figure: Task1_Loss_lr0.5]

  • lr = 0.1

[Figure: Task1_Accu_lr0.1] [Figure: Task1_Loss_lr0.1]

According to the figures above, neither LSTM nor BI-LSTM shows apparent overfitting, and the BI-LSTM learned faster and better than the LSTM model.
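For reference, curves like these can be produced by recording accuracy once per epoch and plotting with matplotlib. The history dict here is an assumed structure collected during training, not an object from this project's code:

import matplotlib.pyplot as plt

def plot_curves(history):
    # history = {"train_acc": [...], "dev_acc": [...]} gathered once per epoch (assumption).
    epochs = range(1, len(history["train_acc"]) + 1)
    plt.plot(epochs, history["train_acc"], label="train")
    plt.plot(epochs, history["dev_acc"], label="dev")
    plt.xlabel("epoch")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()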

Task 2: POS resemblance between learners with different native language backgrounds

In this task, I would like to discover the POS resemblance between learners with different native language backgrounds. The basic hypothesis is that a person's writing style in English is subconsciously influenced by the grammar of his/her native language. For example, the basic sentence structure in English is Subject+Verb+Object, but in Japanese it is Subject+Object+Verb. Moreover, some languages do not have strict rules about the grammatical order of words, relying instead on rich morphology to construct sentences.

In the following experiments, we use the train data of the dataset. Here are some statistics of the train data regarding the learners' native language backgrounds.

import data_loader
import pandas as pd

meta_list, data_list = data_loader.load_data(load_train=True, load_dev=False, load_test=False)
train_meta, train_meta_corrected = meta_list
train_data, train_data_corrected = data_list
languages = train_meta["native_language"].unique()
print("# of Sentence: {}".format(len(train_meta)))

print("Sentence distribution:")
stats = []
for language in languages:
    stats.append(len(train_meta[train_meta["native_language"]==language]))
stats_df = pd.DataFrame(stats, columns=["# of sentences"], index=languages)
print(stats_df)

print("Author distribution:")
stats = []
for language in languages:
    stats.append(len(train_meta[train_meta["native_language"]==language]["doc_id"].unique()))
stats_df = pd.DataFrame(stats, columns=["# of authors"], index=languages)
print(stats_df)

stats = []
print("Exam score stats:")
for language in languages:
    stats.append(train_meta[train_meta["native_language"]==language]["score"].describe()[['count', 'mean', 'std', 'max', 'min']])
stats_df = pd.DataFrame(stats, index=languages)
print(stats_df)
# of Sentence: 4124
Sentence distribution:
            # of sentences
Russian                427
French                 401
Spanish                428
Japanese               407
Chinese                414
Turkish                404
Portuguese             407
Korean                 413
German                 400
Italian                423
Author distribution:
            # of authors
Russian               81
French               131
Spanish              175
Japanese              81
Chinese               66
Turkish               73
Portuguese            68
Korean                84
German                69
Italian               76
Exam score stats:
            count       mean       std   max   min
Russian     427.0  26.288056  6.179166  40.0   9.0
French      401.0  27.630923  4.666738  40.0  17.0
Spanish     428.0  26.789720  5.349402  40.0  11.0
Japanese    407.0  27.547912  5.040432  39.0  15.0
Chinese     414.0  26.268116  6.210832  40.0  14.0
Turkish     404.0  27.834158  5.494389  39.0   7.0
Portuguese  407.0  27.791155  4.963723  39.0  11.0
Korean      413.0  25.980630  6.019355  40.0  12.0
German      400.0  27.725000  5.880546  40.0  13.0
Italian     423.0  28.699764  4.388392  38.0  20.0

We train a BI-LSTM model (500 epochs, SGD with learning rate 0.5) on the sentences of each language separately, and then test its tagging accuracy on the sentences of the other languages. That is, we train a POS tagger on sentences written by learners with, for example, a Japanese native language background, and use the tagger to tag sentences written by learners with other native language backgrounds. A sketch of this cross-evaluation loop is shown below; the figure after it gives the resulting POS tagging accuracies.
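In the sketch, train_by_language and evaluate are hypothetical helpers standing in for the project's actual training and evaluation routines:

import pandas as pd

# languages comes from train_meta["native_language"].unique() above.
results = pd.DataFrame(index=languages, columns=languages, dtype=float)
for train_lang in languages:
    # Hypothetical helper: train a BI-LSTM on one language's sentences.
    model = train_by_language(train_lang, epochs=500, lr=0.5)
    for test_lang in languages:
        # Hypothetical helper: tagging accuracy on another language's sentences.
        results.loc[train_lang, test_lang] = evaluate(model, test_lang)
print(results)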

[Figure: Task2_Stats]

The diagonal numbers show how well each model fits its own training data. Although some models learned faster and some slower, so far there is unfortunately no significant evidence that any pair of languages is more or less similar to another from the perspective of POS resemblance.

However, under the same experimental settings, we can still learn something from the results:

  • Models trained on sentences by learners with Chinese, Portuguese, Korean, and German native language backgrounds learn faster and perform better in POS tagging.
  • For some pairs of languages, there is a notable asymmetry between (train on language A -> test on language B) and (train on language B -> test on language A).

References

  1. Berzak, Y., Kenney, J., Spadine, C., Wang, J. X., Lam, L., Mori, K. S., ... & Katz, B. (2016). Universal dependencies for learner English. arXiv preprint arXiv:1605.04278.
  2. Yannakoudakis, H., Briscoe, T., & Medlock, B. (2011, June). A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1 (pp. 180-189). Association for Computational Linguistics.
  3. Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673-2681.