Data configuration

Install torchtext for data processing

The datasets module currently contains:

Sentiment analysis: SST and IMDb
Question classification: TREC
Entailment: SNLI
Language modeling: WikiText-2
Machine translation: Multi30k, IWSLT, WMT14

Others are planned or a work in progress:

Question answering: SQuAD

Data that currently needs to be configured

GloVe

Download the GloVe vectors to the .vector_cache folder in the project's root directory

Classification Datasets

Download the IMDb dataset to .data/imdb
Download the SST dataset to .data/sst
Download the TREC Question Classification dataset to .data/trec
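
The download folders can be created ahead of time. The snippet below is a minimal sketch, not part of the project: the helper name make_data_dirs and the root argument are assumptions for illustration, and TREC is placed in .data/trec to match the file structure listed in this document.

```python
from pathlib import Path

# Folder names follow the file structure in this README.
# The helper name and the `root` argument are hypothetical.
DATA_DIRS = (".data/imdb", ".data/sst", ".data/trec")


def make_data_dirs(root="."):
    """Create the dataset download folders under `root` if they are missing."""
    created = []
    for rel in DATA_DIRS:
        path = Path(root) / rel
        # parents=True also creates .data; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling make_data_dirs() from the project root creates the three folders and leaves existing ones untouched.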

File Structure

TextClassificationBenchmark
- .data
  - imdb
    - aclImdb_v1.tar.gz
  - sst
    - trainDevTestTrees_PTB.zip
  - trec
    - train_5500.label
    - TREC_10.label
- .vector_cache
  - glove.42B.300d.zip
  - glove.840B.300d.zip
  - glove.twitter.27B.zip
  - glove.6B.zip
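
To check a local checkout against the layout above, something like the following could be used. It is only a sketch under assumptions: the function name missing_files is hypothetical, the file list is copied from the tree above, and this README does not say whether all four GloVe archives are required, so trim the list as needed.

```python
from pathlib import Path

# Paths copied from the file structure above, relative to the project root.
EXPECTED_FILES = (
    ".data/imdb/aclImdb_v1.tar.gz",
    ".data/sst/trainDevTestTrees_PTB.zip",
    ".data/trec/train_5500.label",
    ".data/trec/TREC_10.label",
    ".vector_cache/glove.42B.300d.zip",
    ".vector_cache/glove.840B.300d.zip",
    ".vector_cache/glove.twitter.27B.zip",
    ".vector_cache/glove.6B.zip",
)


def missing_files(root="."):
    """Return the expected files that are not present under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Running missing_files() from the project root returns an empty list once every listed file is in place.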

More datasets and updates are coming soon.