University of Washington CSE 517 A Wi 20: Natural Language Processing - Project

As part of the CSE 517 NLP class at UW, we seek to reproduce the results from the paper "ner and pos when nothing is capitalized", provide a working implementation and insight into the hyperparameters, and conduct additional experiments.

Project Proposal

Final Paper

ner and pos when nothing is capitalized

For those languages which use it, capitalization is an important signal for the fundamental NLP tasks of Named Entity Recognition (NER) and Part of Speech (POS) tagging. In fact, it is such a strong signal that model performance on these tasks drops sharply in common lowercased scenarios, such as noisy web text or machine translation outputs. In this work, we perform a systematic analysis of solutions to this problem, modifying only the casing of the train or test data using lowercasing and truecasing methods. While prior work and first impressions might suggest training a caseless model, or using a truecaser at test time, we show that the most effective strategy is a concatenation of cased and lowercased training data, producing a single model with high performance on both cased and uncased text. As shown in our experiments, this result holds across tasks and input representations. Finally, we show that our proposed solution gives an 8% F1 improvement in mention detection on noisy out-of-domain Twitter data.

Paper

Findings

Truecasing experiment

Paper Reproduction (BiLSTM on Wikipedia)

Hypothesis

We expect to get similar results to those described in the paper.

Comparison

| Test Set | F1 Score (OOV when creating dictionary) | F1 Score (OOV at read) | F1 Score from the paper |
| --- | --- | --- | --- |
| Wikipedia | 92.71 | 92.65 | 93.01 |
| CoNLL Train | 65.32 | 66.03 | 78.85 |
| CoNLL Test | 63.28 | 63.49 | 77.35 |
| PTB 01-18 | 78.73 | 78.53 | 86.91 |
| PTB 22-24 | 78.69 | 78.47 | 86.22 |

While the paper does not provide many implementation details, we were able to reproduce its results closely enough to be confident in our implementation.

In particular, we used the Adam optimizer (with default settings) and a standard two-layer bidirectional LSTM with 300 hidden units. We used a batch size of 100, as mentioned in [Susanto]. The model was trained for 30 epochs, and the model with the smallest loss on the validation set was chosen. That model's scores on the test sets are reported above, and the same model is made available for both the NER and POS experiments.

Instead of using pre-trained encodings, we learn our own character embeddings (the number of unique characters in the train set is around 50). Because both the validation and test sets contain characters that do not appear in the train set, we introduce an OOV token: each time a sentence is read, each character has a 0.5% chance of being replaced by the OOV token.
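For illustration, this on-the-fly OOV replacement could look like the following sketch (the OOV index and the character-to-index mapping names are illustrative, not taken verbatim from our code):

```python
import random

OOV_IDX = 1  # hypothetical index reserved for the OOV character

def encode_sentence(sentence, char2idx, oov_prob=0.005):
    """Map characters to indices; each character has a 0.5% chance of becoming OOV."""
    return [OOV_IDX if (ch not in char2idx or random.random() < oov_prob) else char2idx[ch]
            for ch in sentence]
```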

Lastly, to greatly increase training speed, all of our sentences are padded. Each padding character has a target of 0 (i.e. should not be capitalized), which counts towards the training loss. However, padding counts neither towards the validation loss used to choose the model epoch nor towards the F1 scores reported above.
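Putting the pieces together, a minimal PyTorch sketch of this truecaser (two-layer BiLSTM with 300 hidden units, learned character embeddings, one capitalize/do-not-capitalize logit per character). The embedding dimension and vocabulary size are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TruecaserBiLSTM(nn.Module):
    def __init__(self, vocab_size=60, embed_dim=50, hidden_size=300, num_layers=2, pad_idx=0):
        super().__init__()
        # learned character embeddings (no pre-trained encodings)
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, hidden_size, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        # one logit per character: capitalize (1) or not (0)
        self.out = nn.Linear(2 * hidden_size, 1)

    def forward(self, char_ids):                  # char_ids: (batch, seq_len)
        h, _ = self.lstm(self.embed(char_ids))    # (batch, seq_len, 2 * hidden_size)
        return self.out(h).squeeze(-1)            # (batch, seq_len) logits

model = TruecaserBiLSTM()
optimizer = torch.optim.Adam(model.parameters())  # Adam with default settings
criterion = nn.BCEWithLogitsLoss()                # padding targets are 0 and count towards the training loss
```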

Conclusion

On the Wikipedia dataset, on which we trained the truecaser, we achieved performance similar to that reported in the paper. The original procedure differed slightly from our version (mainly in that it used char-rnn, while we used vanilla PyTorch), which can explain the small differences in results. Overall, achieving similar performance was not problematic with the information provided in the paper.

POS Experiment

Paper Reproduction (BiLSTM-CRF+POS+ELMo on PTB)

Hypothesis

  1. Training on cased data does not perform well on uncased data, while training on uncased data performs well on uncased data.
  2. Training on a concatenated dataset of uncased and cased data performs well on both cased and uncased data. It does so not because of the larger dataset; it performs equally well if we instead (randomly) lowercase 50% of the dataset.
  3. Trying to solve the problem of (1) by truecasing the lowercased test data does not perform well, but it does perform well if the training data has been lowercased and truecased too.

We expected to get similar results to those described in the paper.

Comparison

| Experiment | Train Data | Test Data | Accuracy | Avg | Accuracy (Paper) | Avg (Paper) |
| --- | --- | --- | --- | --- | --- | --- |
| 1.1 | Cased | Cased | 97.30 | - | 97.85 | - |
| 1.2 | Cased | Uncased | 88.29 | 92.78 | 88.66 | 93.26 |
| 2 | Uncased | Uncased | 96.51 | 96.51 | 97.45 | 97.45 |
| 3.1 | C+U | Cased | 97.51 | - | 97.79 | - |
| 3.2 | C+U | Uncased | 96.59 | 97.05 | 97.35 | 97.57 |
| 3.5.1 | Half Mixed | Cased | 97.12 | - | 97.85 | - |
| 3.5.2 | Half Mixed | Uncased | 96.19 | 96.66 | 97.36 | 97.61 |
| 4 | Cased | Truecase | 95.04 | 95.04 | 95.21 | 95.21 |
| 5 | Truecase | Truecase | 96.61 | 96.61 | 97.38 | 97.38 |

Model Characteristics

Train/Test/Dev data:

  • Train: Penn Treebank section 0-18 (usage: training)
  • Dev: Penn Treebank section 19-21 (usage: validation, i.e. early stopping and hyperparameter search)
  • Test: Penn Treebank section 22-24 (usage: reporting accuracy)

Pre-Processing: Depending on the experiment, we either used the imported data as-is (cased), lower-cased it (lowercase), or lower-cased it and then applied truecase prediction (truecase). Furthermore, we also combined the lowercased and cased datasets (C+U) and randomly lowercased 50% of the dataset (half mixed).
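For illustration, the dataset variants could be derived roughly as follows (the representation of a sentence as a (tokens, tags) pair is an assumption of this sketch; only the tokens are lowercased, the tags stay unchanged):

```python
import random

def lowercase(sentences):
    """Lowercase the tokens of every (tokens, tags) pair."""
    return [([tok.lower() for tok in toks], tags) for toks, tags in sentences]

def cased_plus_uncased(sentences):
    """C+U: concatenation of the cased and the lowercased dataset (twice the size)."""
    return sentences + lowercase(sentences)

def half_mixed(sentences, ratio=0.5, seed=0):
    """Half mixed: randomly lowercase `ratio` of the sentences (dataset size unchanged)."""
    rng = random.Random(seed)
    return [([tok.lower() for tok in toks], tags) if rng.random() < ratio else (toks, tags)
            for toks, tags in sentences]
```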

Padding: Pad all sentences to the length of the longest sentence using "NULL" characters. (Note: The reported accuracy values are true accuracies, i.e. computed with the padding removed. If we did not do this and the dataset contained both very short and very long sentences, the accuracy of a prediction consisting only of "NULL" characters for a short sentence would be very high, even though the predictor is very bad.)

Embedding: ELMo word embedding, vector size: 1024
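For reference, computing the 1024-dimensional ELMo vectors could look like the following sketch (using the ElmoEmbedder from allennlp 0.x with its default pre-trained weights; averaging the three ELMo layers is an assumption of this sketch, not necessarily what our pipeline does):

```python
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()                 # downloads the default pre-trained ELMo weights
tokens = ["The", "quick", "brown", "fox"]
layers = elmo.embed_sentence(tokens)  # numpy array of shape (3, len(tokens), 1024)
vectors = layers.mean(axis=0)         # one 1024-dim vector per token (average of the 3 layers)
```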

LSTM Model: We used Keras for the neural network implementation with the following configuration (a code sketch follows the list below):

  • Sequential Model:
  • Input Layer:
    • BiLSTM Layer:
      • input shape: (max_sentence_length, 1024)
      • hidden units: 512
      • lstm dropout: 0.0
      • lstm recurrent dropout: 0.0
  • Hidden Layer(s):
    • TimeDistributed Dense Layer
      • shape: num_labels
      • activation function: rectified linear unit
  • Output Layer:
    • CRF Layer:
      • shape: num_labels
  • Training:
    • Solver:
      • Adam
      • learning rate: 0.001
    • Epochs:
      • max: 40 epochs
      • early stopping: training stops if the validation accuracy does not increase by more than 0.001 over 4 consecutive epochs
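A minimal sketch of this configuration, assuming Keras 2.x together with the CRF layer from the keras_contrib package (max_sentence_length, num_labels and the early-stopping metric name are placeholders/assumptions):

```python
from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, TimeDistributed, Dense
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping
from keras_contrib.layers import CRF

max_sentence_length = 271   # placeholder: length of the longest (padded) sentence
num_labels = 45             # placeholder: number of POS tags

model = Sequential()
# input layer / BiLSTM: one 1024-dim ELMo vector per token
model.add(Bidirectional(LSTM(units=512, return_sequences=True,
                             dropout=0.0, recurrent_dropout=0.0),
                        input_shape=(max_sentence_length, 1024)))
# hidden layer: per-token dense projection onto the label space
model.add(TimeDistributed(Dense(num_labels, activation="relu")))
# output layer: CRF
crf = CRF(num_labels)
model.add(crf)
model.compile(optimizer=Adam(lr=0.001), loss=crf.loss_function, metrics=[crf.accuracy])

# early stopping: stop if validation accuracy does not improve by > 0.001 for 4 consecutive epochs
# (the exact metric name depends on the keras / keras_contrib version)
early_stopping = EarlyStopping(monitor="val_crf_viterbi_accuracy",
                               min_delta=0.001, patience=4, mode="max")
```

The keras_contrib CRF layer provides both the loss function and the (Viterbi) accuracy metric that is monitored for early stopping.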

Evaluation: After training, we predicted the labels of the test set using the trained model, removed the padding, and computed the accuracy (number of correctly predicted labels / total number of labels).
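A minimal sketch of that evaluation step (assuming integer label arrays where pad_label marks the "NULL" padding positions):

```python
import numpy as np

def masked_accuracy(y_true, y_pred, pad_label=0):
    """Accuracy over non-padding positions only."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mask = y_true != pad_label                    # drop the "NULL" padding positions
    return (y_true[mask] == y_pred[mask]).mean()
```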

Note: All of the hyperparameters reported above are the result of a hyperparameter grid search done prior to this experiment evaluation. Details about this, and additional information about the code usage, can be found here.

Conclusion

Even though we have not seen any implementation details (beyond those described in the paper text), we obtained essentially the same accuracy scores (within +/- 0.8%) and can confirm the hypotheses (1., 2., 3.) described above.

Additional Experiments

The aim of the additional experiments is to find out whether the hypotheses from the paper are more generally applicable. Therefore, we run the same experiments on LSTM models with different word embeddings (word2vec, GloVe, ELMo) and without the CRF layer. Furthermore, we extended the tests to different datasets, namely the Brown corpus, the CoNLL2000 corpus, and a subset of the PTB corpus (train: sections 0-4, dev: sections 5-6, test: sections 7-8).

Hypothesis

  1. For the comparison of the different embeddings (word2vec, GloVe, ELMo), we expect ELMo to perform better than the other two: firstly because of the better performance we saw during the lecture (for a well-trained ELMo model), and secondly because it was the choice for this paper.
  2. We expect the ELMo model with CRF layer to outperform the one without, due to the findings from the paper linked below.
  3. We expect to be able to confirm the hypotheses from the original experiments on the Twitter dataset as well.

POS on the reduced Penn Treebank, Brown, and CoNLL2000 datasets; word2vec, GloVe, and ELMo; with/without CRF layer (additional experiments)

PTB Dataset:

| Train Data | Test Data | Accuracy word2vec CRF | Accuracy GloVe CRF | Accuracy ELMo | Accuracy ELMo CRF (paper experiment) |
| --- | --- | --- | --- | --- | --- |
| Cased | Cased | 88.80 | 95.90 | 97.19 | 97.30 |
| Cased | Uncased | 78.63 | 86.11 | 88.57 | 88.29 |
| Uncased | Uncased | 80.97 | 94.97 | 96.52 | 96.51 |
| C+U | Cased | 85.62 | 96.88 | 97.44 | 97.51 |
| C+U | Uncased | 86.67 | 95.84 | 96.60 | 96.59 |
| Half Mixed | Cased | 87.45 | 95.79 | 97.30 | 97.12 |
| Half Mixed | Uncased | 82.86 | 94.90 | 96.36 | 96.19 |
| Cased | Truecase | 85.74 | 93.82 | 94.78 | 95.04 |
| Truecase | Truecase | 86.64 | 95.20 | 96.56 | 96.61 |

PTB Reduced Dataset, Brown, and CoNLL 2000 (ELMo CRF):

| Dataset | Train Data | Test Data | Accuracy ELMo CRF |
| --- | --- | --- | --- |
| PTB Reduced | Cased | Cased | 96.35 |
| PTB Reduced | Cased | Uncased | 88.38 |
| PTB Reduced | Uncased | Uncased | 95.48 |
| PTB Reduced | C+U | Cased | 96.70 |
| PTB Reduced | C+U | Uncased | 95.73 |
| PTB Reduced | Half Mixed | Cased | 96.34 |
| PTB Reduced | Half Mixed | Uncased | 95.08 |
| PTB Reduced | Cased | Truecase | 94.62 |
| PTB Reduced | Truecase | Truecase | 95.35 |
| Brown | Cased | Cased | 95.69 |
| Brown | Cased | Uncased | 83.30 |
| Brown | Uncased | Uncased | 92.91 |
| Brown | C+U | Cased | 97.11 |
| Brown | C+U | Uncased | 95.83 |
| Brown | Half Mixed | Cased | 95.28 |
| Brown | Half Mixed | Uncased | 92.56 |
| Brown | Cased | Truecase | 92.11 |
| Brown | Truecase | Truecase | 92.62 |
| CoNLL 2000 | Cased | Cased | 97.80 |
| CoNLL 2000 | Cased | Uncased | 87.91 |
| CoNLL 2000 | Uncased | Uncased | 96.83 |
| CoNLL 2000 | C+U | Cased | 99.00 |
| CoNLL 2000 | C+U | Uncased | 99.46 |
| CoNLL 2000 | Half Mixed | Cased | 97.65 |
| CoNLL 2000 | Half Mixed | Uncased | 96.66 |
| CoNLL 2000 | Cased | Truecase | 95.40 |
| CoNLL 2000 | Truecase | Truecase | 96.79 |

Model Characteristics

In order to compare the outcomes of the additional experiments to those of the paper reproduction, we used the exact same model except for the component specified in the table (e.g. the ELMo embedding replaced with a GloVe embedding).

Conclusion

  1. ELMo indeed outperforms word2vec by ~10% and GloVe by ~2%.
  2. The accuracies with and without CRF layer are roughly the same (+/- 0.3% depending on the test case). Considering the extra effort of adding the (non-standard) Keras layer and its lack of multi-GPU support, we conclude that we would probably be better off without the CRF layer.
  3. The hypotheses (1., 2., 3.) from the paper hold true (in terms of relative accuracy differences) for the reduced PTB dataset as well as for the other datasets (Brown and CoNLL2000). Depending on the dataset, the absolute accuracy values differ, e.g. the POS tagger for CoNLL2000 reaches >99% accuracy in the C+U experiment.

Implementation details

Additional details about the part-of-speech tagging part of the paper can be found in the separate pos/README.md.

NER experiment

Hypothesis

  1. Training on cased data does not perform well on uncased data, while training on uncased data performs well on uncased data.
  2. Training on a concatenated dataset of uncased and cased data performs well on both cased and uncased data. It does so not because of the larger dataset; it performs equally well if we instead (randomly) lowercase 50% of the dataset.
  3. Trying to solve the problem of (1) by truecasing the lowercased test data does not perform well, but it does perform well if the training data has been lowercased and truecased too.

Model

BiLSTM-CRF using GloVe + character embeddings, trained on CoNLL:

  • character embeddings of dimension 25, trained using a BiLSTM
  • 300-dimensional pre-trained GloVe embeddings
  • BiLSTM layer with a hidden dimension of 200 (dropout of 0.5)
  • highway layer
  • output CRF layer
  • initial learning rate of 0.15 with Adam
  • stopping criterion: training stops when the F-score does not improve over iterations
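For illustration, a minimal PyTorch sketch of this architecture could look as follows (the pytorch-crf package, variable names, and vocabulary handling are assumptions of the sketch rather than a verbatim copy of our code):

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # pytorch-crf package, assumed here for the CRF output layer

class CharBiLSTM(nn.Module):
    """25-dim character embeddings, summarized by a BiLSTM into one vector per word."""
    def __init__(self, num_chars, char_dim=25):
        super().__init__()
        self.embed = nn.Embedding(num_chars, char_dim, padding_idx=0)
        self.lstm = nn.LSTM(char_dim, char_dim, batch_first=True, bidirectional=True)

    def forward(self, char_ids):                      # (num_words, max_word_len)
        _, (h, _) = self.lstm(self.embed(char_ids))   # h: (2, num_words, char_dim)
        return torch.cat([h[0], h[1]], dim=-1)        # (num_words, 2 * char_dim)

class Highway(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))
        return g * torch.relu(self.transform(x)) + (1 - g) * x

class NERTagger(nn.Module):
    def __init__(self, glove_weights, num_chars, num_tags, hidden=200):
        super().__init__()
        self.word_embed = nn.Embedding.from_pretrained(glove_weights)  # 300-dim GloVe
        self.char_encoder = CharBiLSTM(num_chars)
        self.dropout = nn.Dropout(0.5)
        self.lstm = nn.LSTM(300 + 2 * 25, hidden, batch_first=True, bidirectional=True)
        self.highway = Highway(2 * hidden)
        self.emissions = nn.Linear(2 * hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, word_ids, char_ids, tags=None, mask=None):
        batch, seq_len = word_ids.shape
        chars = self.char_encoder(char_ids.view(batch * seq_len, -1)).view(batch, seq_len, -1)
        x = self.dropout(torch.cat([self.word_embed(word_ids), chars], dim=-1))
        h, _ = self.lstm(x)
        e = self.emissions(self.highway(h))
        if tags is not None:                      # training: negative log-likelihood
            return -self.crf(e, tags, mask=mask)
        return self.crf.decode(e, mask=mask)      # inference: best tag sequence

# training setup as described above (initial learning rate of 0.15 with Adam):
# model = NERTagger(glove_weights, num_chars, num_tags)
# optimizer = torch.optim.Adam(model.parameters(), lr=0.15)
```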

The model is validated on eng_testa and tested on eng_testb.

| Experiment | Train data | Test data | F1 Score | Avg | F1 Score from the paper | Avg from the paper |
| --- | --- | --- | --- | --- | --- | --- |
| 1.1 | Cased | Cased | 90.63 | - | 92.45 | - |
| 1.2 | Cased | Uncased | 81.47 | 86.05 | 34.46 | 63.46 |
| 2 | Uncased | Uncased | 89.72 | 89.72 | 89.32 | 89.32 |
| 3.1 | Augment | Cased | 90.10 | - | 91.67 | - |
| 3.2 | Augment | Uncased | 88.65 | 89.38 | 89.31 | 90.49 |
| 3.5.1 | Half Mixed | Cased | 90.84 | - | 91.68 | - |
| 3.5.2 | Half Mixed | Uncased | 89.54 | 90.19 | 89.05 | 90.37 |
| 4 | Cased | Truecase | 80.89 | 80.89 | 82.93 | 82.93 |
| 5 | Truecase | Truecase | 88.43 | 88.43 | 90.25 | 90.25 |

BiLSTM-CRF using GloVe + character embeddings, trained on CoNLL, tested on the Twitter corpus

| Experiment | Train data | F1 Score | F1 Score from the paper |
| --- | --- | --- | --- |
| 1.1 | Cased | 33.24 | 58.63 |
| 2 | Uncased | 14.54 | 53.13 |
| 3 | Augment | 31.31 | 66.14 |
| 3.5 | Half Mixed | 32.94 | 64.69 |
| 4 | Cased-Truecase | 23.45 | 58.22 |
| 5 | Truecase | 29.19 | 62.66 |

Implications

The most interesting difference is in the cased variant, where there is a gap of more than 30 points between our implementation and the original. After closer investigation we discovered that the reason for this is a huge difference in performance on uncased data (81.47 F1 in our implementation vs. 34.46 in the original). We do not have a firm intuition for why this happens; however, it might be that models trained on cased data are highly unstable when tested on uncased data.

Overall, however, we can see that the relative performance is similar, and mixing cased and uncased data provides the best performance with our implementation as well. Because of this, we believe our results support the second hypothesis of the paper.

On the Twitter corpus, our model did much worse than the original one. This is quite counterintuitive, considering that on the original dataset our cased experiment generalized much better. Overall, we cannot support the third hypothesis of the original paper.

Citation

You can cite this unpublished paper:

@unpublished{nlp_ner_pos_capitalization,
  author    = {Kuster, Andreas and Filipek, Jakub and Muppirala, Viswa Virinchi},
  title     = {reproducing "ner and pos when nothing is capitalized"},
  year      = {2020},
  url       = {https://arxiv.org/abs/2109.08396}
}

References

ner and pos when nothing is capitalized

ner and pos when nothing is capitalized poster

Bidirectional LSTM-CRF Models for Sequence Tagging

Deep contextualized word representations

GloVe: Global Vectors for Word Representation

Learning to Capitalize with Character-Level Recurrent Neural Networks: An Empirical Study