
# Part-of-speech tagging

Part-of-speech tagging (POS tagging) is the task of tagging a word in a text with its part of speech. A part of speech is a category of words with similar grammatical properties. Common English parts of speech are noun, verb, adjective, adverb, pronoun, preposition, conjunction, etc.

Example:

| Vinken | , | 61 | years | old |
| ------ | --- | --- | ----- | --- |
| NNP | , | CD | NNS | JJ |
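As a minimal illustration of the task (a toy most-frequent-tag baseline, not any of the systems listed below), a tagger maps each token to a tag from a lexicon learned on annotated text:

```python
from collections import Counter, defaultdict

# Toy training data: (word, tag) pairs with PTB tags.
# A real system would train on far more data, e.g. WSJ sections 0-18.
train = [
    ("Vinken", "NNP"), (",", ","), ("61", "CD"),
    ("years", "NNS"), ("old", "JJ"), ("years", "NNS"),
]

# Most-frequent-tag baseline: tag each word with its most common
# training tag; unseen words fall back to the default tag NN.
counts = defaultdict(Counter)
for word, tag_ in train:
    counts[word][tag_] += 1
lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(tokens):
    return [(t, lexicon.get(t, "NN")) for t in tokens]

print(tag(["Vinken", ",", "61", "years", "old"]))
```

Such a baseline already scores surprisingly high on WSJ because most word types are unambiguous; the models in the tables below improve on the remaining ambiguous and unseen tokens.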

## Penn Treebank

A standard dataset for POS tagging is the Wall Street Journal (WSJ) portion of the Penn Treebank, containing 45 different POS tags. Sections 0-18 are used for training, sections 19-21 for development, and sections 22-24 for testing. Models are evaluated based on accuracy.
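Accuracy here is token-level: the fraction of tokens whose predicted tag matches the gold tag. A short sketch with made-up tag sequences:

```python
def pos_accuracy(gold, pred):
    """Token-level tagging accuracy: correct tags / total tokens."""
    assert len(gold) == len(pred)
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)

gold = ["NNP", ",", "CD", "NNS", "JJ"]
pred = ["NNP", ",", "CD", "NNS", "NN"]
print(pos_accuracy(gold, pred))  # 0.8
```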

| Model | Accuracy | Paper / Source | Code |
| --- | --- | --- | --- |
| Meta BiLSTM (Bohnet et al., 2018) | 97.96 | Morphosyntactic Tagging with a Meta-BiLSTM Model over Context Sensitive Token Encodings | |
| Flair embeddings (Akbik et al., 2018) | 97.85 | Contextual String Embeddings for Sequence Labeling | Flair framework |
| Char Bi-LSTM (Ling et al., 2015) | 97.78 | Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation | |
| Adversarial Bi-LSTM (Yasunaga et al., 2018) | 97.59 | Robust Multilingual Part-of-Speech Tagging via Adversarial Training | |
| Yang et al. (2017) | 97.55 | Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks | |
| Ma and Hovy (2016) | 97.55 | End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF | |
| LM-LSTM-CRF (Liu et al., 2018) | 97.53 | Empowering Character-aware Sequence Labeling with Task-Aware Neural Language Model | |
| NCRF++ (Yang and Zhang, 2018) | 97.49 | NCRF++: An Open-source Neural Sequence Labeling Toolkit | NCRF++ |
| Feed Forward (Vaswani et al., 2016) | 97.4 | Supertagging with LSTMs | |
| Bi-LSTM (Ling et al., 2015) | 97.36 | Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation | |
| Bi-LSTM (Plank et al., 2016) | 97.22 | Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss | |

## Social media

The Ritter et al. (2011) dataset has become the benchmark for social media part-of-speech tagging. It comprises roughly 50K tokens of English social media text sampled in late 2011 and is tagged with an extended version of the PTB tagset.

| Model | Accuracy | Paper |
| --- | --- | --- |
| FastText + CNN + CRF | 90.53 | Twitter word embeddings (Godin et al., 2019, Chapter 3) |
| CMU | 90.0 ± 0.5 | Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters |
| GATE | 88.69 | Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data |

## UD

Universal Dependencies (UD) is a framework for cross-linguistic grammatical annotation, which contains more than 100 treebanks in over 60 languages. Models are typically evaluated based on the average test accuracy across 21 high-resource languages (♦ evaluated on 17 languages).
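UD treebanks are distributed in the CoNLL-U format: one token per line, ten tab-separated columns, with the surface form in the second column and the universal POS tag (UPOS) in the fourth. A minimal reader sketch (the sample sentence is made up):

```python
def read_upos(conllu_text):
    """Extract (form, UPOS) pairs per sentence from CoNLL-U text."""
    sentences, current = [], []
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:                 # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):   # comment / metadata line
            continue
        else:
            cols = line.split("\t")
            # Skip multiword-token (e.g. "1-2") and empty-node ("1.1") lines.
            if "-" in cols[0] or "." in cols[0]:
                continue
            current.append((cols[1], cols[3]))  # FORM, UPOS
    if current:
        sentences.append(current)
    return sentences

sample = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\tNNS\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\t_\n"
)
print(read_upos(sample))  # [[('Dogs', 'NOUN'), ('bark', 'VERB')]]
```

Per-language accuracy is computed on each treebank's test split as above, then averaged over the language set.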

| Model | Avg accuracy | Paper / Source |
| --- | --- | --- |
| Multilingual BERT and BPEmb (Heinzerling and Strube, 2019) | 96.77 | Sequence Tagging with Contextual and Non-Contextual Subword Representations: A Multilingual Evaluation |
| Adversarial Bi-LSTM (Yasunaga et al., 2018) | 96.65 | Robust Multilingual Part-of-Speech Tagging via Adversarial Training |
| MultiBPEmb (Heinzerling and Strube, 2019) | 96.62 | Sequence Tagging with Contextual and Non-Contextual Subword Representations: A Multilingual Evaluation |
| Bi-LSTM (Plank et al., 2016) | 96.40 | Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss |
| Joint Bi-LSTM (Nguyen et al., 2017)♦ | 95.55 | A Novel Neural Network Model for Joint POS Tagging and Graph-based Dependency Parsing |

Go back to the README