Part-of-speech tagging for Treebank of Learner English corpora with Recurrent Neural Networks

Motivation

Part-of-speech (POS) tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. (Wikipedia)

POS tagging is fundamental to many NLP/NLU tasks, such as Named Entity Recognition (NER) and Abstract Meaning Representation (AMR). In this project, I want to explore state-of-the-art Recurrent Neural Network (RNN) based models for POS tagging. The candidate models are:

  • Long Short-Term Memory (LSTM)
  • Bidirectional LSTM (BI-LSTM)
  • LSTM with a Conditional Random Field (CRF) layer (LSTM-CRF)
  • Bidirectional LSTM with a CRF layer (BI-LSTM-CRF)

I will apply the above models to two tasks:

  1. Continuous POS tagging with RNNs
  2. POS resemblance between learners with different native language backgrounds

(Update 2018/04/18: task 2 is added)
(Update 2018/04/14: the BI-LSTM is added)
(Update 2018/04/12: the basic LSTM and task 1 are added)

Dataset

UD English-ESL/TLE is a collection of 5,124 English as a Second Language (ESL) sentences (97,681 words), manually annotated with POS tags and dependency trees in the Universal Dependencies formalism. Each sentence is annotated both in its original and error-corrected forms. The annotations follow the standard English UD guidelines, along with a set of supplementary guidelines for ESL. The dataset represents upper-intermediate level adult English learners from 10 native language backgrounds, with over 500 sentences for each native language. The sentences were randomly drawn from the Cambridge Learner Corpus First Certificate in English (FCE) corpus. The treebank is split randomly into a training set of 4,124 sentences, a development set of 500 sentences, and a test set of 500 sentences. Further information is available at esltreebank.org.

Citation: (Berzak et al., 2016; Yannakoudakis et al., 2011)

Data Loader

I've built a data loader for this dataset. To use it, you first need to install the CoNLL-U Parser built by Emil Stenström (available on PyPI as conllu, e.g. pip install conllu). The following is an example of using data_loader:

import data_loader

meta_list, data_list = data_loader.load_data(load_train=True, load_dev=True, load_test=True)

train_meta, train_meta_corrected, \
dev_meta, dev_meta_corrected, \
test_meta, test_meta_corrected = meta_list

train_data, train_data_corrected, \
dev_data, dev_data_corrected, \
test_data, test_data_corrected = data_list

Metadata

  • doc_id: filename (also learner ID) of the original XML file
  • sent: raw text of the sentence written by the learner, with error-correction tags
  • native_language: native language of the learner
  • age_range: age range of the learner
  • score: exam score of the learner

Some observations:

  • "native_language" enables us to design tasks related to native language identificaiton.
  • "age_range" enables us to identify the learner's age based on his/her writing style.
  • "score" can help us to group learners into categories, such as Beginner, Intermediate, Expert, Fluent, Proficient. It enables us to discover the writing style and common mistakes of different groups of learners.
train_meta.head()

| id | doc_id | sent | errors | native_language | age_range | score |
|----|---------|------|--------|-----------------|-----------|-------|
| 1 | doc2664 | I was <ns type="S"><i>shoked</i><c>shocked</c>... | {'S': 2, 'RV': 1} | Russian | 21-25 | 21.0 |
| 2 | doc648 | I am very sorry to say it was definitely not a... | {'RT': 1, 'MT': 1} | French | 26-30 | 38.0 |
| 3 | doc1081 | Of course, I became aware of her feelings sinc... | {'AGQ': 1} | Spanish | 16-20 | 36.0 |
| 4 | doc724 | I also suggest that more plays and films shoul... | {'RV': 1, 'FV': 1} | Japanese | 21-25 | 33.0 |
| 5 | doc567 | Although my parents were very happy <ns type="... | {'FD': 1, 'RJ': 1, 'RT': 1, 'MT': 1} | Spanish | 31-40 | 34.0 |
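As a quick, hypothetical illustration of the last observation, the following sketch groups learners into score bands with pandas. The band boundaries and labels are arbitrary assumptions for illustration, not categories defined by the dataset:

import pandas as pd

# Illustrative score bands only; the cut points are assumptions, not part of the corpus.
bands = pd.cut(train_meta["score"],
               bins=[0, 20, 30, 40],
               labels=["Beginner", "Intermediate", "Proficient"])
print(train_meta.groupby(bands)["doc_id"].nunique())  # learners per band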

Sentence Format

In this project, we will only use "form" (words) and "upostag" (part-of-speech tags).

train_data[0]
| id | form | lemma | upostag | xpostag | feats | head | deprel | deps | misc | meta_id |
|----|------|-------|---------|---------|-------|------|--------|------|------|---------|
| 1 | I | _ | PRON | PRP | None | 3 | nsubj | None | None | 1 |
| 2 | was | _ | VERB | VBD | None | 3 | cop | None | None | 1 |
| 3 | shoked | _ | ADJ | JJ | None | 0 | root | None | None | 1 |
| 4 | because | _ | SCONJ | IN | None | 8 | mark | None | None | 1 |
| 5 | I | _ | PRON | PRP | None | 8 | nsubj | None | None | 1 |
| 6 | had | _ | AUX | VBD | None | 8 | aux | None | None | 1 |
| 7 | alredy | _ | ADV | RB | None | 8 | advmod | None | None | 1 |
| 8 | spoken | _ | VERB | VBN | None | 3 | advcl | None | None | 1 |
| 9 | with | _ | ADP | IN | None | 10 | case | None | None | 1 |
| 10 | them | _ | PRON | PRP | None | 8 | nmod | None | None | 1 |
| 11 | and | _ | CONJ | CC | None | 8 | cc | None | None | 1 |
| 12 | I | _ | PRON | PRP | None | 14 | nsubj | None | None | 1 |
| 13 | had | _ | AUX | VBD | None | 14 | aux | None | None | 1 |
| 14 | taken | _ | VERB | VBN | None | 8 | conj | None | None | 1 |
| 15 | two | _ | NUM | CD | None | 16 | nummod | None | None | 1 |
| 16 | autographs | _ | NOUN | NNS | None | 14 | dobj | None | None | 1 |
| 17 | . | _ | PUNCT | . | None | 3 | punct | None | None | 1 |
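Since only these two columns matter here, a minimal sketch of turning each sentence into a (words, tags) pair might look like the following. It assumes each element of train_data is a per-sentence table like the one printed above; if the loader returns conllu token lists instead, the same fields are available as token["form"] and token["upostag"]:

def to_pairs(sentences):
    # Each sentence becomes (list of words, list of POS tags).
    return [(list(s["form"]), list(s["upostag"])) for s in sentences]

training_pairs = to_pairs(train_data)
words, tags = training_pairs[0]
print(words[:3], tags[:3])  # e.g. ['I', 'was', 'shoked'] ['PRON', 'VERB', 'ADJ']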

RNN Models

In this project, we mainly use PyTorch to implement the RNN models. The following models have been implemented so far:

Long Short-Term Memory (LSTM)

Long short-term memory (LSTM) units (or blocks) are a building unit for layers of a recurrent neural network (RNN). An RNN composed of LSTM units is often called an LSTM network. A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell is responsible for "remembering" values over arbitrary time intervals; hence the word "memory" in LSTM. (Wikipedia)

The following is the high-level architecture for the LSTM model:

[Figure: Task1_LSTM]
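A minimal PyTorch sketch of such a tagger is shown below. The hidden size is an illustrative assumption; the input is a sequence of pre-trained 300-dimensional word vectors, and the output is a score for each of the 17 UD POS tags:

import torch.nn as nn

class LSTMTagger(nn.Module):
    def __init__(self, embedding_dim=300, hidden_dim=128, num_tags=17):
        super().__init__()
        # Inputs are pre-trained word vectors, so no embedding layer is needed here.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_tags)

    def forward(self, sentence_vectors):
        # sentence_vectors: (batch, seq_len, embedding_dim)
        outputs, _ = self.lstm(sentence_vectors)
        return self.fc(outputs)  # (batch, seq_len, num_tags) tag scores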

Bidirectional LSTM (BI-LSTM)

The BI-LSTM model is derived from the Bidirectional RNN (BRNN) (Schuster and Paliwal, 1997).

The principle of BRNN is to split the neurons of a regular RNN into two directions, one for the positive time direction (forward states) and another for the negative time direction (backward states). The outputs of these two states are not connected to the inputs of the opposite-direction states. By using two time directions, input information from the past and future of the current time frame can be used, unlike a standard RNN, which requires delays to include future information. (Wikipedia)

The BI-LSTM follows the BRNN design but replaces the RNN units with LSTM units. The following is the high-level architecture for the BI-LSTM model:

[Figure: Task1_BILSTM]
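In PyTorch, the change from the LSTM sketch above is small: setting bidirectional=True concatenates the forward and backward states, so the output layer must be twice as wide (again a sketch with assumed hyperparameters):

import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, embedding_dim=300, hidden_dim=128, num_tags=17):
        super().__init__()
        self.lstm = nn.LSTM(embedding_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        # Forward and backward hidden states are concatenated, hence 2 * hidden_dim.
        self.fc = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, sentence_vectors):
        outputs, _ = self.lstm(sentence_vectors)
        return self.fc(outputs)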

Task 1: Continuous POS tagging with RNNs

Architecture

In this task, a POS tagger was trained on all train data (4,124 sentences), validated on the dev data (500 sentences), and tested on the test data (500 sentences). The following is the architecture:

[Figure: Task1 Architecture]

Word Features

We use the pre-trained Word2Vec model built on the Google News corpus (3 million 300-dimensional English word vectors). Although it might not be the best choice (e.g. the Google News corpus might not be representative of English-learner text), it is still a reasonable one: 1) it saves the time of building a large dictionary covering all words in the UD English-ESL/TLE corpus; 2) it avoids building large, sparse unigram vectors for words, so there is no need to worry about dimensionality reduction for now; 3) a 300-dimensional word2vec vector is compact enough for this task, and its fixed dimension means it can be fed directly into the network; 4) it's free and available on Google Drive :).
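A minimal sketch of loading these vectors with gensim follows. The filename is the standard name of the public release, and the zero-vector fallback for out-of-vocabulary words is a simple placeholder choice, not necessarily what this project does:

import numpy as np
from gensim.models import KeyedVectors

# The GoogleNews vectors are distributed as a large (~3.6 GB) binary file.
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def embed(word, dim=300):
    # Zero vector for out-of-vocabulary words; a placeholder OOV strategy.
    return w2v[word] if word in w2v else np.zeros(dim, dtype=np.float32)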

Experiments

Performance

The dataset was divided into train, dev, and test sets. We used the train and dev sets to observe the fluctuation of accuracy and loss over 100 epochs of training. There are 17 distinct POS tags in this experiment. A prediction is counted as a true positive only if it exactly matches the gold POS tag. The models are optimized with Stochastic Gradient Descent (SGD) at different learning rates (lr), with cross-entropy loss; a sketch of this setup follows.
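The sketch below shows the corresponding training setup (per-sentence updates; batching and evaluation omitted). BiLSTMTagger refers to the sketch above, and the learning rate is one of the values from the tables below:

import torch.nn as nn
import torch.optim as optim

model = BiLSTMTagger()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.5)

def train_step(sentence_vectors, tag_ids):
    # sentence_vectors: (1, seq_len, 300) float tensor; tag_ids: (seq_len,) long tensor
    optimizer.zero_grad()
    scores = model(sentence_vectors).squeeze(0)  # (seq_len, num_tags)
    loss = criterion(scores, tag_ids)
    loss.backward()
    optimizer.step()
    return loss.item()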

The following is the best performance after 100 epochs:

  • lr = 0.5

| Model | Train Accuracy | Dev Accuracy | Test Accuracy |
|---------|----------------|--------------|---------------|
| LSTM | 89.28% | 83.90% | 83.31% |
| BI-LSTM | 93.25% | 88.00% | 88.00% |

  • lr = 0.1

| Model | Train Accuracy | Dev Accuracy | Test Accuracy |
|---------|----------------|--------------|---------------|
| LSTM | 73.77% | 71.86% | 70.90% |
| BI-LSTM | 78.37% | 76.17% | 75.62% |

The BI-LSTM model consistently performs better than the LSTM model and achieves 88.00% test accuracy (lr = 0.5).

Parameter Tuning

The following are the train/dev accuracy and loss curves over 100 epochs:

  • lr = 0.5

[Figure: Task1_Accu_lr0.5] [Figure: Task1_Loss_lr0.5]

  • lr = 0.1

[Figure: Task1_Accu_lr0.1] [Figure: Task1_Loss_lr0.1]

According to the figures above, neither LSTM nor BI-LSTM shows apparent overfitting, and the BI-LSTM learned faster and better than the LSTM model.
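For reference, curves like these can be produced by recording accuracy once per epoch and plotting with matplotlib. The history dict here is an assumed structure collected during training, not an object from this project's code:

import matplotlib.pyplot as plt

def plot_curves(history):
    # history = {"train_acc": [...], "dev_acc": [...]} gathered once per epoch (assumption).
    epochs = range(1, len(history["train_acc"]) + 1)
    plt.plot(epochs, history["train_acc"], label="train")
    plt.plot(epochs, history["dev_acc"], label="dev")
    plt.xlabel("epoch")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()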

Task 2: POS resemblance between learners with different native language backgrounds

In this task, I would like to discover the POS resemblance between learners with different native language backgrounds. The basic hypothesis is that a person's writing style in English is subconsciously influenced by the grammar of his/her native language. For example, the basic sentence structure in English is Subject+Verb+Object, but in Japanese it is Subject+Object+Verb. Moreover, some languages do not have strict rules about the grammatical order of words, relying instead on rich morphology to construct sentences.

In the following experiments, we use the train data of the dataset. Here are some statistics of the train data regarding the learners' native language backgrounds.

import data_loader
import pandas as pd

meta_list, data_list = data_loader.load_data(load_train=True, load_dev=False, load_test=False)
train_meta, train_meta_corrected = meta_list
train_data, train_data_corrected = data_list
languages = train_meta["native_language"].unique()
print("# of Sentence: {}".format(len(train_meta)))

print("Sentence distribution:")
stats = []
for language in languages:
    stats.append(len(train_meta[train_meta["native_language"]==language]))
stats_df = pd.DataFrame(stats, columns=["# of sentences"], index=languages)
print(stats_df)

print("Author distribution:")
stats = []
for language in languages:
    stats.append(len(train_meta[train_meta["native_language"]==language]["doc_id"].unique()))
stats_df = pd.DataFrame(stats, columns=["# of authors"], index=languages)
print(stats_df)

stats = []
print("Exam score stats:")
for language in languages:
    stats.append(train_meta[train_meta["native_language"]==language]["score"].describe()[['count', 'mean', 'std', 'max', 'min']])
stats_df = pd.DataFrame(stats, index=languages)
print(stats_df)
# of Sentence: 4124
Sentence distribution:
            # of sentences
Russian                427
French                 401
Spanish                428
Japanese               407
Chinese                414
Turkish                404
Portuguese             407
Korean                 413
German                 400
Italian                423
Author distribution:
            # of authors
Russian               81
French               131
Spanish              175
Japanese              81
Chinese               66
Turkish               73
Portuguese            68
Korean                84
German                69
Italian               76
Exam score stats:
            count       mean       std   max   min
Russian     427.0  26.288056  6.179166  40.0   9.0
French      401.0  27.630923  4.666738  40.0  17.0
Spanish     428.0  26.789720  5.349402  40.0  11.0
Japanese    407.0  27.547912  5.040432  39.0  15.0
Chinese     414.0  26.268116  6.210832  40.0  14.0
Turkish     404.0  27.834158  5.494389  39.0   7.0
Portuguese  407.0  27.791155  4.963723  39.0  11.0
Korean      413.0  25.980630  6.019355  40.0  12.0
German      400.0  27.725000  5.880546  40.0  13.0
Italian     423.0  28.699764  4.388392  38.0  20.0

We train a BI-LSTM model (500 epochs, SGD with learning rate 0.5) on the sentences of each language separately, and then test its tagging accuracy on the sentences of the other languages. That is, we train a POS tagger on sentences written by learners with, for example, a Japanese native language background, and use the tagger to tag sentences written by learners with other native language backgrounds. A sketch of this cross-evaluation loop is shown below; the figure after it gives the resulting POS tagging accuracies.
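In the sketch, train_by_language and evaluate are hypothetical helpers standing in for the project's actual training and evaluation routines:

import pandas as pd

# languages comes from train_meta["native_language"].unique() above.
results = pd.DataFrame(index=languages, columns=languages, dtype=float)
for train_lang in languages:
    # Hypothetical helper: train a BI-LSTM on one language's sentences.
    model = train_by_language(train_lang, epochs=500, lr=0.5)
    for test_lang in languages:
        # Hypothetical helper: tagging accuracy on another language's sentences.
        results.loc[train_lang, test_lang] = evaluate(model, test_lang)
print(results)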

[Figure: Task2_Stats]

The diagonal numbers show how well each model fits its own training data. Although some models learned faster and some slower, so far there is unfortunately no significant evidence that any pair of languages is more or less similar to another from the perspective of POS resemblance.

However, under the same experimental settings, we can still learn something from the results:

  • Models trained on sentences by learners with Chinese, Portuguese, Korean, and German native language backgrounds learn faster and perform better in POS tagging.
  • For some pairs of languages, there is a notable asymmetry between (train on language A -> test on language B) and (train on language B -> test on language A).

References

  1. Berzak, Y., Kenney, J., Spadine, C., Wang, J. X., Lam, L., Mori, K. S., ... & Katz, B. (2016). Universal dependencies for learner English. arXiv preprint arXiv:1605.04278.
  2. Yannakoudakis, H., Briscoe, T., & Medlock, B. (2011, June). A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1 (pp. 180-189). Association for Computational Linguistics.
  3. Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673-2681.