Chinese Word Segmentation

Task

Chinese word segmentation is the task of splitting Chinese text (a sequence of Chinese characters) into words.

Example:

'上海浦东开发与建设同步' → ['上海', '浦东', '开发', '与', '建设', '同步']
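Many neural segmenters, including CRF-based ones, recast this task as per-character tagging. The sketch below shows one common encoding, the B/M/E/S scheme (begin, middle, end, single), applied to the example above. The tag set and the helper names `words_to_bmes` and `bmes_to_words` are illustrative assumptions, not something this page prescribes, and individual systems listed here may use a different scheme.

```python
def words_to_bmes(words):
    """Map a segmented sentence to one B/M/E/S tag per character.

    Assumption: the 4-tag BMES scheme; other tag sets (e.g. B/I) also exist.
    """
    tags = []
    for word in words:
        if not word:
            continue  # ignore empty strings so tags stay aligned with characters
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, zero or more middle characters, last character
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def bmes_to_words(text, tags):
    """Rebuild the word list from the raw text and its tag sequence."""
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag in ("E", "S"):  # a word ends at this character
            words.append(text[start:i + 1])
            start = i + 1
    if start < len(text):  # tolerate a tag sequence that stops mid-word
        words.append(text[start:])
    return words


words = ["上海", "浦东", "开发", "与", "建设", "同步"]
text = "".join(words)
tags = words_to_bmes(words)
print("".join(tags))              # BEBEBESBEBE
print(bmes_to_words(text, tags))  # the original word list
```

The round trip from words to tags and back is lossless, which is why a tagger's output can be scored directly as a segmentation.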

Systems

♠ marks systems that take character unigrams as input. ♣ marks systems that take character bigrams as input.

  • Tian et al. (2020): ZEN + key-value memory networks ♠
  • Huang et al. (2019): BERT + model compression + multi-criteria learning ♠
  • Yang et al. (2018): Lattice LSTM-CRF + BPE subword embeddings ♠♣
  • Ma et al. (2018): BiLSTM-CRF + hyperparameter search ♠♣
  • Yang et al. (2017): Transition-based + Beam-search + Rich pretraining ♠♣
  • Zhou et al. (2017): Greedy Search + word context ♠
  • Chen et al. (2017): BiLSTM-CRF + adversarial loss ♠♣
  • Cai et al. (2017): Greedy Search + Span representation ♠
  • Kurita et al. (2017): Transition-based + Joint model ♠
  • Liu et al. (2016): neural semi-CRF ♠
  • Cai and Zhao (2016): Greedy Search ♠
  • Chen et al. (2015a): Gated Recursive NN ♠♣
  • Chen et al. (2015b): BiLSTM-CRF ♠♣
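The ♠ and ♣ markers in the list above refer to the units a model reads at each character position. As a rough illustration, the sketch below extracts both views from a raw string. The function names and the `</s>` padding symbol for the final bigram are assumptions made for this example; the systems listed differ in how they pad and embed these units.

```python
def char_unigrams(text):
    """One unit per character (the ♠ input type)."""
    return list(text)


def char_bigrams(text, pad="</s>"):
    """Pair each character with its right neighbour (the ♣ input type).

    Assumption: the last character is paired with a padding symbol so that
    the bigram sequence has the same length as the character sequence.
    """
    chars = list(text) + [pad]
    return [chars[i] + chars[i + 1] for i in range(len(text))]


sentence = "上海浦东"
print(char_unigrams(sentence))  # ['上', '海', '浦', '东']
print(char_bigrams(sentence))   # ['上海', '海浦', '浦东', '东</s>']
```

Keeping both sequences the same length lets a model look up a unigram embedding and a bigram embedding for every position and combine them.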

Evaluation

Metrics

F1-score
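For word segmentation, F1 is normally computed over words: a predicted word counts as correct only if both of its boundaries match a gold word. The sketch below follows that word-level definition by comparing character-offset spans. The helper names are hypothetical, and the scores in the tables on this page come from the cited papers, not from this code.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_prf(gold_words, pred_words):
    """Word-level precision, recall, and F1 for one sentence."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)  # words whose two boundaries both match
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["上海", "浦东", "开发", "与", "建设", "同步"]
pred = ["上海", "浦东", "开发与", "建设", "同步"]  # one merge error
p, r, f1 = segmentation_prf(gold, pred)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.800 R=0.667 F1=0.727
```

A single wrongly merged boundary removes two gold words from the correct set, so recall drops faster than precision here. For a whole test set, the counts are summed over all sentences before the ratios are taken.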

Dataset

Chinese Treebank 6

| Model | F1 | Paper / Source | Code |
| --- | --- | --- | --- |
| Huang et al. (2019) | 97.6 | Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning | |
| Tian et al. (2020) | 97.3 | Improving Chinese Word Segmentation with Wordhood Memory Networks | Github |
| Ma et al. (2018) | 96.7 | State-of-the-art Chinese Word Segmentation with Bi-LSTMs | |
| Yang et al. (2018) | 96.3 | Subword Encoding in Lattice LSTM for Chinese Word Segmentation | Github |
| Yang et al. (2017) | 96.2 | Neural Word Segmentation with Rich Pretraining | Github |
| Zhou et al. (2017) | 96.2 | Word-Context Character Embeddings for Chinese Word Segmentation | |
| Chen et al. (2017) | 96.2 | Adversarial Multi-Criteria Learning for Chinese Word Segmentation | Github |
| Liu et al. (2016) | 95.5 | Exploring Segment Representations for Neural Segmentation Models | Github |
| Chen et al. (2015b) | 96.0 | Long Short-Term Memory Neural Networks for Chinese Word Segmentation | Github |

Chinese Treebank 7

| Model | F1 | Paper / Source | Code |
| --- | --- | --- | --- |
| Ma et al. (2018) | 96.6 | State-of-the-art Chinese Word Segmentation with Bi-LSTMs | |
| Kurita et al. (2017) | 96.2 | Neural Joint Model for Transition-based Chinese Syntactic Analysis | |

AS

| Model | F1 | Paper / Source | Code |
| --- | --- | --- | --- |
| Tian et al. (2020) | 96.6 | Improving Chinese Word Segmentation with Wordhood Memory Networks | Github |
| Huang et al. (2019) | 96.6 | Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning | |
| Ma et al. (2018) | 96.2 | State-of-the-art Chinese Word Segmentation with Bi-LSTMs | |
| Yang et al. (2017) | 95.7 | Neural Word Segmentation with Rich Pretraining | Github |
| Cai et al. (2017) | 95.3 | Fast and Accurate Neural Word Segmentation for Chinese | Github |
| Chen et al. (2017) | 94.8 | Adversarial Multi-Criteria Learning for Chinese Word Segmentation | Github |

CityU

| Model | F1 | Paper / Source | Code |
| --- | --- | --- | --- |
| Tian et al. (2020) | 97.9 | Improving Chinese Word Segmentation with Wordhood Memory Networks | Github |
| Huang et al. (2019) | 97.6 | Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning | |
| Ma et al. (2018) | 97.2 | State-of-the-art Chinese Word Segmentation with Bi-LSTMs | |
| Yang et al. (2017) | 96.9 | Neural Word Segmentation with Rich Pretraining | Github |
| Cai et al. (2017) | 95.6 | Fast and Accurate Neural Word Segmentation for Chinese | Github |
| Chen et al. (2017) | 95.6 | Adversarial Multi-Criteria Learning for Chinese Word Segmentation | Github |

PKU

| Model | F1 | Paper / Source | Code |
| --- | --- | --- | --- |
| Huang et al. (2019) | 96.6 | Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning | |
| Tian et al. (2020) | 96.5 | Improving Chinese Word Segmentation with Wordhood Memory Networks | Github |
| Yang et al. (2017) | 96.3 | Neural Word Segmentation with Rich Pretraining | Github |
| Ma et al. (2018) | 96.1 | State-of-the-art Chinese Word Segmentation with Bi-LSTMs | |
| Yang et al. (2018) | 95.9 | Subword Encoding in Lattice LSTM for Chinese Word Segmentation | Github |
| Cai et al. (2017) | 95.8 | Fast and Accurate Neural Word Segmentation for Chinese | Github |
| Chen et al. (2017) | 94.3 | Adversarial Multi-Criteria Learning for Chinese Word Segmentation | Github |
| Liu et al. (2016) | 95.7 | Exploring Segment Representations for Neural Segmentation Models | Github |
| Cai and Zhao (2016) | 95.7 | Neural Word Segmentation Learning for Chinese | Github |

MSR

| Model | F1 | Paper / Source | Code |
| --- | --- | --- | --- |
| Tian et al. (2020) | 98.4 | Improving Chinese Word Segmentation with Wordhood Memory Networks | Github |
| Ma et al. (2018) | 98.1 | State-of-the-art Chinese Word Segmentation with Bi-LSTMs | |
| Huang et al. (2019) | 97.9 | Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning | |
| Yang et al. (2018) | 97.8 | Subword Encoding in Lattice LSTM for Chinese Word Segmentation | Github |
| Yang et al. (2017) | 97.5 | Neural Word Segmentation with Rich Pretraining | Github |
| Cai et al. (2017) | 97.1 | Fast and Accurate Neural Word Segmentation for Chinese | Github |
| Chen et al. (2017) | 96.0 | Adversarial Multi-Criteria Learning for Chinese Word Segmentation | Github |
| Liu et al. (2016) | 97.6 | Exploring Segment Representations for Neural Segmentation Models | Github |
| Cai and Zhao (2016) | 96.4 | Neural Word Segmentation Learning for Chinese | Github |
