Bag-of-words sentiment classifier

Natural Language Processing with Representation Learning (DS-GA 1011)

For the full write-up of the results, please see report.pdf.

Requirements

$ #(HPC) module load anaconda3/4.3.1
$ conda create -n bow python=3.6
$ source activate bow
$ conda install -c conda-forge spacy
$ python -m spacy download en
$ conda install -c conda-forge matplotlib
$ #(loc) conda install pytorch torchvision -c pytorch
$ #(HPC) pip install torch torchvision

Data

Download "Large Movie Review Dataset v1.0" from http://ai.stanford.edu/~amaas/data/sentiment/

From the dataset description: "This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more details."

$ mkdir data
$ cd data/
$ wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
$ tar -xvzf aclImdb_v1.tar.gz
$ #rm aclImdb_v1.tar.gz

Overview

Run the following command in the root directory of the project to reproduce the final trial:

$ #[set up conda env and install requirements]
$ python main.py

main.py is the main script to run; default parameter settings can be changed in settings.py. Slightly modified versions of the lab code can be found in bow_model.py and torch_data_loader.py. The utils.py script contains all of the preprocessing code. Finally, the plots and result tables are generated in plot.py (see also plots/ and results/).
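
As a rough orientation, bow_model.py presumably defines something like the following averaged bag-of-words classifier; the class and argument names below are illustrative, not the exact lab code.

import torch.nn as nn


class BagOfWords(nn.Module):
    """Average token embeddings, then apply a linear layer for binary sentiment."""

    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        # padding_idx=0 keeps the padding token fixed at a zero vector
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.linear = nn.Linear(emb_dim, 2)  # two classes: negative/positive

    def forward(self, token_ids, lengths):
        # token_ids: (batch, max_len) vocabulary indices, 0 = padding
        # lengths:   (batch,) number of real tokens per review
        summed = self.embed(token_ids).sum(dim=1)
        averaged = summed / lengths.unsqueeze(1).float()
        return self.linear(averaged)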

Preliminary results

Ablation study

  • Tokenization schemes
  • Number of epochs
  • N-gram size
  • Vocabulary size
  • Embedding size
  • Optimizer (Adam vs. SGD)
  • Learning rate
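
For illustration, these settings might be collected in settings.py roughly as follows; the variable names and defaults here are assumptions (chosen to match the final trial described below), not the actual file contents.

# Hypothetical defaults, not the actual settings.py
TOKENIZER = "spacy_filtered"  # "split", "spacy", "spacy_filtered", or "spacy_lemma"
NUM_EPOCHS = 2
NGRAM_SIZE = 1
VOCAB_SIZE = 50000
EMB_DIM = 200
OPTIMIZER = "adam"            # "adam" or "sgd"
LEARNING_RATE = 1e-3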

Tokenization

  [0] Baseline: string.split()
  [1] Tokenization using spaCy
  [2] Tokenization using spaCy, filtering of stop words and punctuation
  [3] Tokenization using spaCy, filtering of stop words and punctuation, lemmatization

So far, tokenization scheme [2] (spaCy tokenization with stop words and punctuation filtered) works best. Lemmatization seems to be overkill, but filtering stop words and punctuation is helpful. It looks like the model is overfitting, though, so let's adjust the learning rate next.
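
A minimal sketch of what these four schemes could look like with spaCy; the tokenize function below is illustrative and not the actual code in utils.py.

import spacy

nlp = spacy.load("en")  # the "en" model downloaded in the requirements step


def tokenize(text, scheme=2):
    # [0] baseline: plain whitespace split
    if scheme == 0:
        return text.split()
    doc = nlp(text)
    # [1] spaCy tokenization only
    if scheme == 1:
        return [tok.text.lower() for tok in doc]
    # [2] additionally filter stop words and punctuation
    if scheme == 2:
        return [tok.text.lower() for tok in doc if not (tok.is_stop or tok.is_punct)]
    # [3] same as [2], plus lemmatization
    return [tok.lemma_ for tok in doc if not (tok.is_stop or tok.is_punct)]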

Learning rate

The default learning rate was set to 0.01, which is pretty high for the Adam optimizer. Results:

  • 1e-2: clearly overfitting
  • 1e-3: nice learning curve
  • 1e-4: too slow

We will stick with a learning rate of 1e-3 for Adam for now. It looks like we can also reduce the number of epochs from 10 to 2 for the following experiments.
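
In PyTorch, the change only touches the optimizer constructor; a minimal sketch (nn.Linear stands in for the actual BoW model):

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(100, 2)  # stand-in for the actual BoW model
# 1e-2 clearly overfits and 1e-4 learns too slowly; 1e-3 gives the nicest curve
optimizer = optim.Adam(model.parameters(), lr=1e-3)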

N-gram and vocabulary size

Surprisingly, testing n-gram sizes in the range [1-4] did not yield very promising results. Evaluating different vocabulary sizes [10k, 50k, 100k] did not significantly affect model performance either. Reverting the tokenization scheme back to [1], i.e. including stop words and punctuation, improves the results, especially when using bigrams over unigrams. Increasing the number of epochs back to 5 was necessary for the learning curve to converge. Still, the best results are consistently achieved with the more rigorous tokenization scheme [2], using unigrams [n=1] with a maximum vocabulary size of [50k].
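
For context, extracting n-grams and capping the vocabulary could look roughly like the sketch below; the function names and reserved indices are assumptions, not the actual utils.py code.

from collections import Counter


def extract_ngrams(tokens, n):
    """Return all n-grams up to size n as tuples of tokens."""
    ngrams = []
    for size in range(1, n + 1):
        for i in range(len(tokens) - size + 1):
            ngrams.append(tuple(tokens[i:i + size]))
    return ngrams


def build_vocab(tokenized_reviews, n=1, max_vocab=50000):
    """Keep only the max_vocab most frequent n-grams across all reviews."""
    counts = Counter()
    for tokens in tokenized_reviews:
        counts.update(extract_ngrams(tokens, n))
    # indices 0 and 1 are reserved for <pad> and <unk>
    return {ngram: i + 2 for i, (ngram, _) in enumerate(counts.most_common(max_vocab))}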

Embedding size

Results: 200d > 100d > 50d.

Optimizer

We compare Adam vs. SGD, both with default parameters. SGD doesn't seem to work well at all; we will keep using Adam for now.
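
Concretely, the comparison amounts to swapping the optimizer constructor, roughly as below (assuming the same 1e-3 learning rate for both, since SGD has no default learning rate in PyTorch):

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(100, 2)  # stand-in for the actual BoW model
adam = optim.Adam(model.parameters(), lr=1e-3)
sgd = optim.SGD(model.parameters(), lr=1e-3)  # lr must be given explicitly for SGD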

Linear annealing of learning rate

Not helpful.
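
For completeness, linear annealing could be implemented with a PyTorch learning rate scheduler along these lines; this is a sketch under assumed settings, not the schedule actually used.

import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(100, 2)  # stand-in for the actual BoW model
optimizer = optim.Adam(model.parameters(), lr=1e-3)

num_epochs = 10
# linearly decay the learning rate towards zero over the training run
scheduler = LambdaLR(optimizer, lr_lambda=lambda epoch: 1.0 - epoch / num_epochs)

for epoch in range(num_epochs):
    # ... one training epoch here ...
    scheduler.step()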

Number of epochs

Training for 2 epochs seems sufficient; beyond that, the model starts to overfit on the training data.

Testing accuracy

Final accuracy on the test set: 86.212%.
