Data configuration

Install torchtext, which is used for data processing.

The datasets module currently contains:

  • Sentiment analysis: SST and IMDb
  • Question classification: TREC
  • Entailment: SNLI
  • Language modeling: WikiText-2
  • Machine translation: Multi30k, IWSLT, WMT14

Others are planned or a work in progress:

  • Question answering: SQuAD

The following data currently needs to be set up manually:

GloVe

Download the GloVe archives into the .vector_cache folder under the project's root directory.

Classification Datasets

File Structure

  • TextClassificationBenchmark
    • .data
      • imdb
        • aclImdb_v1.tar.gz
      • sst
        • trainDevTestTrees_PTB.zip
      • trec
        • train_5500.label
        • TREC_10.label
    • .vector_cache
      • glove.42B.300d.zip
      • glove.840B.300d.zip
      • glove.twitter.27B.zip
      • glove.6B.zip
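
As a quick sanity check, the layout above can be verified with a short script. This is only a sketch: the helper name find_missing_files and the EXPECTED_FILES list are illustrative and not part of the project, the paths are copied from the tree above, and a given experiment may not need every GloVe archive.

```python
from pathlib import Path

# Expected layout, copied from the file structure listed above.
# Paths are relative to the project root (TextClassificationBenchmark).
EXPECTED_FILES = [
    ".data/imdb/aclImdb_v1.tar.gz",
    ".data/sst/trainDevTestTrees_PTB.zip",
    ".data/trec/train_5500.label",
    ".data/trec/TREC_10.label",
    ".vector_cache/glove.42B.300d.zip",
    ".vector_cache/glove.840B.300d.zip",
    ".vector_cache/glove.twitter.27B.zip",
    ".vector_cache/glove.6B.zip",
]


def find_missing_files(project_root="."):
    """Return the expected relative paths that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = find_missing_files(".")
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected data files are in place.")
```

Run it from the project root; any path it prints still needs to be downloaded or moved into place.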

More datasets and updates are coming soon.