Chinese Text Classification

Background

Text classification assigns tags or categories to text according to its topical content, typically by training a model on labeled documents. Topics can be broad and akin to genre (news, sports, arts) or as fine-grained as hashtags.

Example input/output

Input:

[国足]有信心了 中国国奥队取得热身赛三连胜

Output:

Sports
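
As a rough illustration of training on labeled documents, the sketch below fits a multinomial Naive Bayes classifier on character bigrams, which avoids the need for Chinese word segmentation. This is not the method of any system listed on this page. The class `NaiveBayesClassifier`, the helper `char_bigrams`, and the four training sentences are all hypothetical and exist only for this example; only the query string is taken from the example input above.

```python
import math
from collections import Counter, defaultdict


def char_bigrams(text):
    """Split text into overlapping character bigrams (no word segmentation)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


class NaiveBayesClassifier:
    """Toy multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.class_counts = Counter()               # documents per label
        self.feature_counts = defaultdict(Counter)  # bigram counts per label
        self.vocabulary = set()                     # all bigrams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for feature in char_bigrams(text):
                self.feature_counts[label][feature] += 1
                self.vocabulary.add(feature)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each bigram.
            score = math.log(doc_count / total_docs)
            total_features = sum(self.feature_counts[label].values())
            for feature in char_bigrams(text):
                count = self.feature_counts[label][feature]
                score += math.log((count + 1) / (total_features + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training set, invented for this sketch.
train_texts = [
    "国奥队热身赛取得三连胜",
    "中国男篮赢得比赛",
    "股市今日大幅上涨",
    "央行宣布下调利率",
]
train_labels = ["Sports", "Sports", "Finance", "Finance"]

classifier = NaiveBayesClassifier().fit(train_texts, train_labels)
query = "[国足]有信心了 中国国奥队取得热身赛三连胜"
print(classifier.predict(query))  # Sports
```

A four-sentence training set says nothing about real performance; the systems in the tables below are trained on tens of thousands of documents or more.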

Standard Metrics

  • Accuracy: the percentage of correctly classified samples.
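
A minimal sketch of how accuracy can be computed from gold and predicted labels. The function name `accuracy` and the label lists are illustrative assumptions, not part of any benchmark tooling.

```python
def accuracy(gold, predicted):
    """Return the fraction of predictions that match the gold labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sample set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Hypothetical labels, for illustration only.
gold = ["Sports", "Finance", "Sports", "Technology"]
predicted = ["Sports", "Finance", "Technology", "Technology"]
print(f"{accuracy(gold, predicted):.2%}")  # 75.00% (3 of 4 correct)
```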

THUCNews.

Built from Sina News RSS subscription channel data from 2005 to 2011, THUCNews contains about 740,000 news documents (2.19 GB) across 14 topics, all in UTF-8 plain text format.

| Source | # Classes | Size (sentences) |
| --- | --- | --- |
| THUCNews | 14 | 740,000 |

Metrics

  • Accuracy

Results

| System | Accuracy |
| --- | --- |
| J. Chen, C. Cao, X. Jiang | 98.7% |
| Y. Song | 97.56% |
| W. Liu, P. Zhou, et al. | 96.71% |
| S. Xin | 96.04% |
| Sun, Baohua, et al. | 94.85% |

SogouCS.

News articles from Sohu News, collected from June to July 2012 across 18 channels.

| Source | # Classes | Size (sentences) |
| --- | --- | --- |
| Sogou news dataset | 5 | 86,597 |

Metrics

  • Accuracy

Results

| System | Error rate |
| --- | --- |
| Chung, Tonglee, et al. | 3.37% |
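
This result is reported as an error rate, while the other sections report accuracy. Since error rate is the complement of accuracy, the two can be compared after a simple conversion. The helper name `error_rate_to_accuracy` is an assumption made for this sketch.

```python
def error_rate_to_accuracy(error_rate_percent):
    """Convert an error rate in percent to the equivalent accuracy in percent."""
    if not 0.0 <= error_rate_percent <= 100.0:
        raise ValueError("error rate must be between 0 and 100 percent")
    return 100.0 - error_rate_percent


# The 3.37% error rate above corresponds to 96.63% accuracy.
print(f"{error_rate_to_accuracy(3.37):.2f}%")  # 96.63%
```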

Resources

| Dataset | Classes | Train (sample size) |
| --- | --- | --- |
| Sogou news dataset | 5 | 490,717 |

Fudan corpus.

The Fudan corpus contains 9,804 documents of long sentences and paragraphs in 20 categories.

| Source | # Classes | Size (sentences) |
| --- | --- | --- |
| Fudan corpus | 5 | 1,836 |

Metrics

  • Accuracy

Results

| System | Accuracy |
| --- | --- |
| Sun, Baohua, et al. | 97.8% |
| Meng et al., 2019 | 96.3% |

Resources

| Source | # Classes | Size (sentences) |
| --- | --- | --- |
| Fudan corpus | 5 | 4,284 |

Ifeng.

The first paragraphs of Chinese news articles from 2006 to 2016, evenly split across 5 news channels.

| Source | # Classes | Size (sentences) |
| --- | --- | --- |
| Ifeng | 5 | 50,000 |

Metrics

  • Accuracy

Results

| System | Accuracy |
| --- | --- |
| Meng et al., 2019 | 85.8% |
| Sun, Baohua, et al. | 84.4% |
| Zhang and LeCun, 2017 | 83.7% |

Resources

| Dataset | Classes | Train (sample size) |
| --- | --- | --- |
| Ifeng | 5 | 800,000 |

Chinanews.

Chinese news articles from 2008 to 2016, with duplicates removed, evenly split across 7 news channels.

| Source | # Classes | Size (sentences) |
| --- | --- | --- |
| Chinanews | 7 | 112,000 |

Metrics

  • Accuracy

Results

| System | Accuracy |
| --- | --- |
| Sun, Baohua, et al. | 92.0% |
| Meng et al., 2019 | 91.9% |
| Zhang and LeCun, 2017 | 90.9% |

Resources

| Dataset | Classes | Train (sample size) |
| --- | --- | --- |
| Chinanews | 7 | 1,400,000 |

Suggestions? Changes? Please send email to chinesenlp.xyz@gmail.com