Datasets

Here are some notes on the classification datasets.

It is much easier to preprocess and load the data with torchtext; see its documentation here.

Here are statistics of some popular classification datasets:

| Dataset | Classes | Train Samples | Test Samples | Total | Download |
| --- | --- | --- | --- | --- | --- |
| AG News | 4 | 120,000 | 7,600 | 127,600 | Google Drive |
| Sogou News (Chinese) | 5 | 450,000 | 60,000 | 510,000 | Google Drive |
| DBpedia | 14 | 560,000 | 70,000 | 630,000 | Google Drive |
| Yelp Review Polarity | 2 | 560,000 | 38,000 | 598,000 | Google Drive |
| Yelp Review Full | 5 | 650,000 | 50,000 | 700,000 | Google Drive |
| Yahoo Answers | 10 | 1,400,000 | 60,000 | 1,460,000 | Google Drive |
| Amazon Review Full | 5 | 3,000,000 | 650,000 | 3,650,000 | Google Drive |
| Amazon Review Polarity | 2 | 3,600,000 | 400,000 | 4,000,000 | Google Drive |
| IMDB | 2 | 25,000 | 25,000 | 50,000 | Link |
| SST-2 | 2 | / | / | 56.4k | Link |
| SST-5 | 5 | / | / | 94.2k | Link |
| TREC | 6 / 50 | 5,452 | 500 | 5,952 | Link |

 

Text Classification

All of the following datasets can be downloaded here (Google Drive). They are proposed and described in this paper:

Character-level Convolutional Networks for Text Classification. Xiang Zhang, et al. NIPS 2015.

  • AG News

    News articles; the original data are from here.

    4 Classes: 0: World, 1: Sports, 2: Business, 3: Sci/Tech

  • Sogou News (Chinese)

    News articles from SogouCA and SogouCS (manually labeled using URLs).

    5 Classes: 0: Sports, 1: Finance, 2: Entertainment, 3: Automobile, 4: Technology

  • DBpedia

    Title and abstract of each Wikipedia article; the original data are from here.

    14 Classes: 0: Company, 1: Educational Institution, 2: Artist, 3: Athlete, 4: Office Holder, 5: Mean Of Transportation, 6: Building, 7: Natural Place, 8: Village, 9: Animal, 10: Plant, 11: Album, 12: Film, 13: Written Work

  • Yelp Review Full

    Reviews on Yelp, from Yelp Dataset Challenge 2015. Here is Yelp Dataset's homepage.

    5 Classes: five levels of ratings from 0-4 (higher is better)

  • Yelp Review Polarity

    Modified from Yelp Review Full by treating stars 1 and 2 as negative and stars 3 and 4 as positive.

    2 Classes: 0: Negative polarity, 1: Positive polarity

  • Yahoo Answers

    Question title, question content and best answer from Yahoo! Answers Comprehensive Questions and Answers version 1.0.

    10 Classes: 0: Society & Culture, 1: Science & Mathematics, 2: Health, 3: Education & Reference, 4: Computers & Internet, 5: Sports, 6: Business & Finance, 7: Entertainment & Music, 8: Family & Relationships, 9: Politics & Government

  • Amazon Review Full

    Reviews from Amazon, including title and content; the original data are from here.

    5 Classes: five levels of ratings from 0-4 (higher is better)

  • Amazon Review Polarity

    Modified from Amazon Review Full by treating stars 1 and 2 as negative and stars 4 and 5 as positive; reviews with 3 stars are ignored.

    2 Classes: 0: Negative polarity, 1: Positive polarity
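The class indices listed above can be attached to raw rows with a few lines of Python. This is a minimal sketch, not the repository's loader: it assumes each archive holds CSV files whose first column is a 1-based class index followed by one or more text columns (for example title and description), which is an assumption about the file layout rather than something stated in these notes. The names `read_examples` and `AG_NEWS_CLASSES` are hypothetical, and the two sample rows are made up.

```python
import csv
import io

# Class names copied from the AG News entry above (0-based indices).
AG_NEWS_CLASSES = ["World", "Sports", "Business", "Sci/Tech"]


def read_examples(lines):
    """Yield (label, text) pairs from CSV rows.

    Assumption: the first column is a 1-based class index and every
    remaining column is a text field (title, body, ...).
    """
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        label = int(row[0]) - 1  # shift to the 0-based indices used above
        text = " ".join(field.strip() for field in row[1:])
        yield label, text


# Two made-up rows in the assumed layout, standing in for a real train.csv.
sample = io.StringIO(
    '"3","Stocks rally","Markets closed higher on Friday."\n'
    '"2","Cup final","The home side won 2-1."\n'
)
examples = list(read_examples(sample))
for label, text in examples:
    print(AG_NEWS_CLASSES[label], "|", text)
```

To read a real file, pass an open file handle instead of the `io.StringIO` object, for example `open(path, newline="", encoding="utf-8")`. Joining all text columns into one string is a design choice of this sketch; keeping title and body separate may suit some models better.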

 

Sentiment Analysis

  • IMDB

    Proposed in paper:

    Learning Word Vectors for Sentiment Analysis. Andrew L. Maas, et al. ACL 2011.

    2 Classes: Negative, Positive

    samples: train: 25,000, test: 25,000

    Description: Movie reviews with ratings from 1 to 10. A negative review has a score ≤ 4, and a positive review has a score ≥ 7.

  • SST

    Movie reviews.

    • SST-5 (Fine-grained)

      5 Classes: Very Negative, Negative, Neutral, Positive, Very Positive

      samples: 94.2k

    • SST-2 (Binary)

      2 Classes: Negative, Positive

      samples: 56.4k

      Description: Same as SST-5 but with neutral reviews removed and binary labels.
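The two labeling rules in this section can be written down as small functions. This is a sketch under stated assumptions: the IMDB thresholds come from the description above, while mapping "Very Negative" to "Negative" and "Very Positive" to "Positive" for SST-2 is an assumption about how the binary labels are formed, since the notes only say that neutral reviews are removed. The function names `imdb_label` and `sst5_to_sst2` are hypothetical.

```python
from typing import Optional

# Class names copied from the SST-5 entry above.
SST5_CLASSES = ["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"]


def imdb_label(rating: int) -> Optional[str]:
    """Map a 1-10 rating to a polarity label, or None if it falls in between."""
    if not 1 <= rating <= 10:
        raise ValueError(f"rating out of range: {rating}")
    if rating <= 4:
        return "Negative"
    if rating >= 7:
        return "Positive"
    return None  # ratings 5 and 6 get no label under the stated rule


def sst5_to_sst2(label: str) -> Optional[str]:
    """Collapse an SST-5 label to a binary label; neutral reviews are dropped."""
    if label not in SST5_CLASSES:
        raise ValueError(f"unknown SST-5 label: {label}")
    if label == "Neutral":
        return None
    # Assumption: the "very" classes fold into their plain counterparts.
    return "Negative" if "Negative" in label else "Positive"


print([imdb_label(r) for r in (2, 5, 8)])
print([sst5_to_sst2(c) for c in SST5_CLASSES])
```

Returning `None` for unlabeled cases lets a caller filter them out in one pass, for example with a list comprehension that keeps only non-`None` results.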

 

Question Classification

  • TREC

    A dataset for classifying questions into semantic categories.

    samples: train: 5,452, test: 500

    • TREC-6

      6 Classes: Abbreviation, Description, Entity, Human, Location, Numeric Value

    • TREC-50 (Fine-grained)

      50 Classes
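The relation between the 6 coarse and 50 fine-grained classes can be illustrated with a small parser. This sketch assumes the raw TREC files store one question per line in the form `COARSE:fine question text` and that the coarse tags are abbreviated as shown in `COARSE_NAMES`; both are assumptions about the file format, not something stated in these notes. The name `parse_trec_line` is hypothetical and the sample question is made up.

```python
# Assumed coarse tags, mapped to the six class names listed above.
COARSE_NAMES = {
    "ABBR": "Abbreviation",
    "DESC": "Description",
    "ENTY": "Entity",
    "HUM": "Human",
    "LOC": "Location",
    "NUM": "Numeric Value",
}


def parse_trec_line(line: str):
    """Split one line into (coarse class name, fine-grained tag, question)."""
    tag, question = line.strip().split(" ", 1)  # the tag precedes the first space
    coarse, _fine = tag.split(":", 1)           # "LOC:city" -> "LOC", "city"
    return COARSE_NAMES[coarse], tag, question


# A made-up line in the assumed format.
parsed = parse_trec_line("LOC:city What is the capital of France ?")
print(parsed)
```

Under this layout, TREC-6 uses the first element of the returned tuple as the label and TREC-50 uses the second.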