Here are some notes on the classification datasets.
It is much easier to preprocess and load the data via torchtext; check out its documentation here.
Here are statistics of some popular classification datasets:
Dataset | Classes | Train Samples | Test Samples | Total | Download |
---|---|---|---|---|---|
AG News | 4 | 120,000 | 7,600 | 127,600 | Google Drive |
Sogou News (Chinese) | 5 | 450,000 | 60,000 | 510,000 | Google Drive |
DBpedia | 14 | 560,000 | 70,000 | 630,000 | Google Drive |
Yelp Review Polarity | 2 | 560,000 | 38,000 | 598,000 | Google Drive |
Yelp Review Full | 5 | 650,000 | 50,000 | 700,000 | Google Drive |
Yahoo Answers | 10 | 1,400,000 | 60,000 | 1,460,000 | Google Drive |
Amazon Review Full | 5 | 3,000,000 | 650,000 | 3,650,000 | Google Drive |
Amazon Review Polarity | 2 | 3,600,000 | 400,000 | 4,000,000 | Google Drive |
IMDB | 2 | 25,000 | 25,000 | 50,000 | Link |
SST-2 | 2 | / | / | 56.4k | Link |
SST-5 | 5 | / | / | 94.2k | Link |
TREC | 6 / 50 | 5,452 | 500 | 5,952 | Link |
All of the following datasets can be downloaded here (Google Drive). They were proposed and described in the following paper:
Character-level Convolutional Networks for Text Classification. Xiang Zhang, et al. NIPS 2015.
-
AG News
News articles; the original data are from here.
4 Classes: 0: World, 1: Sports, 2: Business, 3: Sci/Tech
-
Sogou News (Chinese)
News articles from SogouCA and SogouCS (manually labeled using URLs).
5 Classes: 0: Sports, 1: Finance, 2: Entertainment, 3: Automobile, 4: Technology
-
DBpedia
Title and abstract of each Wikipedia article; the original data are from here.
14 Classes: 0: Company, 1: Educational Institution, 2: Artist, 3: Athlete, 4: Office Holder, 5: Mean Of Transportation, 6: Building, 7: Natural Place, 8: Village, 9: Animal, 10: Plant, 11: Album, 12: Film, 13: Written Work
-
Yelp Review Full
Reviews on Yelp, from the Yelp Dataset Challenge 2015. Here is the Yelp Dataset's homepage.
5 Classes: five levels of ratings from 0-4 (higher is better)
-
Yelp Review Polarity
Modified from Yelp Review Full by treating stars 1 and 2 as negative and stars 3 and 4 as positive.
2 Classes: 0: Negative polarity, 1: Positive polarity
-
Yahoo Answers
Question title, question content and best answer from Yahoo! Answers Comprehensive Questions and Answers version 1.0.
10 Classes: 0: Society & Culture, 1: Science & Mathematics, 2: Health, 3: Education & Reference, 4: Computers & Internet, 5: Sports, 6: Business & Finance, 7: Entertainment & Music, 8: Family & Relationships, 9: Politics & Government
-
Amazon Review Full
Reviews from Amazon, including title and content; the original data are from here.
5 Classes: five levels of ratings from 0-4 (higher is better)
-
Amazon Review Polarity
Modified from Amazon Review Full by treating stars 1 and 2 as negative and stars 3 and 4 as positive.
2 Classes: 0: Negative polarity, 1: Positive polarity
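The polarity rule described for Yelp Review Polarity and Amazon Review Polarity can be sketched as a small function. This is only an illustration of the rule as stated in these notes, not code from the paper; the name `stars_to_polarity` is hypothetical, and returning `None` for ratings the rule does not mention is an assumption.

```python
from typing import Optional

NEGATIVE, POSITIVE = 0, 1  # label indices as listed in these notes


def stars_to_polarity(stars: int) -> Optional[int]:
    """Map a star rating to a polarity label using the rule stated above."""
    if stars in (1, 2):
        return NEGATIVE
    if stars in (3, 4):
        return POSITIVE
    # Assumption: ratings not covered by the stated rule are skipped.
    return None


# Convert a few ratings; entries mapped to None would be dropped.
labels = [stars_to_polarity(s) for s in (1, 2, 3, 4)]
```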
-
IMDB
Proposed in the paper:
Learning Word Vectors for Sentiment Analysis. Andrew L. Maas, et al. ACL 2011.
2 Classes: Negative, Positive
Samples: train: 25,000, test: 25,000
Description: Movie reviews with ratings ranging from 1 to 10. A negative review has a score ≤ 4, and a positive review has a score ≥ 7.
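The score thresholds above can be written as a short labeling function. This is a minimal sketch under stated assumptions: the name `imdb_label` is hypothetical, and scores of 5 and 6, which the rule assigns to neither class, are treated here as excluded.

```python
from typing import Optional


def imdb_label(score: int) -> Optional[str]:
    """Label a review from its 1-10 rating using the thresholds above."""
    if not 1 <= score <= 10:
        raise ValueError(f"score must be between 1 and 10, got {score}")
    if score <= 4:
        return "Negative"
    if score >= 7:
        return "Positive"
    # Assumption: scores 5 and 6 belong to neither class and are excluded.
    return None
```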
-
SST
Movie reviews.
-
SST-5 (Fine-grained)
5 Classes: Very Negative, Negative, Neutral, Positive, Very Positive
Samples: 94.2k
-
SST-2 (Binary)
2 Classes: Negative, Positive
Samples: 56.4k
Description: Same as SST-5 but with neutral reviews removed and binary labels.
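The relationship between SST-5 and SST-2 can be sketched as a filter plus a label merge. This is an illustrative sketch rather than the official preprocessing: the function name `sst5_to_sst2`, the 0-4 index order of the classes, and the merging of "Very Negative" into "Negative" and "Very Positive" into "Positive" are all assumptions.

```python
from typing import List, Tuple

# Assumed index order for the five fine-grained classes.
SST5_CLASSES = ["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"]


def sst5_to_sst2(samples: List[Tuple[str, int]]) -> List[Tuple[str, str]]:
    """Drop neutral samples and collapse the remaining labels to two classes."""
    binary = []
    for text, label in samples:
        name = SST5_CLASSES[label]
        if name == "Neutral":
            continue  # neutral reviews are removed
        binary.append((text, "Positive" if "Positive" in name else "Negative"))
    return binary


# Toy samples, made up for illustration only.
toy = [("awful", 0), ("weak", 1), ("fine", 2), ("good", 3), ("superb", 4)]
converted = sst5_to_sst2(toy)
```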
-
TREC
A dataset for classifying questions into semantic categories.
Samples: train: 5,452, test: 500
-
TREC-6
6 Classes: Abbreviation, Description, Entity, Human, Location, Numeric Value
-
TREC-50 (Fine-grained)
50 Classes