This repository contains source code used by the team MAMLS in the 2022 Computational Intelligence Lab Project 2: Twitter Sentiment Classification.
The online version of this repository can be found at https://gitlab.ethz.ch/maml/cil-2022
Please refer to the project report "Project_Report_Sentiment_Classification" provided in the repository for further information about the methodology and experimental results.
- Download `checkpoints_contents.zip` from https://drive.google.com/file/d/1PqzOOAIJTyIdFJQd73_8XC5gYpwUj1wX/view?usp=sharing and extract the contents of the archive into the `checkpoints` directory
- Download `twitter-datasets_contents.zip` from https://drive.google.com/file/d/15MMDViAHEB_W8PycIoamKqvBq3YEUQdE/view?usp=sharing and extract the contents of the archive into the `twitter-datasets` directory
For this step we recommend being in a virtual environment running Python >= 3.8.5:

```bash
pip install -r requirements.txt
```
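If no virtual environment exists yet, one can be created with Python's built-in `venv` module, for example (the environment name `cil_env` is illustrative):

```bash
python3 -m venv cil_env
source cil_env/bin/activate
```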
To train the model our group submitted and have it output the submission file, simply run the following command:

```bash
python3 scripts/run.py stats_bert
```

Since training the model is time-consuming (around 13 hours on the Euler cluster), you can instead run the model from the checkpoint we provide: set the `LoadCheckpoint` parameter to `True` in the `models/BERT/config.ini` file and run the following command:

```bash
python3 scripts/run.py stats_bert --only_predict
```
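After the change, the relevant part of `models/BERT/config.ini` would look roughly like this (the section name below is an assumption; keep whatever section the parameter already sits in):

```ini
; Excerpt from models/BERT/config.ini -- the section name is illustrative.
[Model]
LoadCheckpoint = True
```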
- Set up the environment

  ```bash
  module load gcc/8.2.0 python_gpu/3.8.5 eth_proxy
  pip install virtualenvwrapper
  source /usr/local/bin/virtualenvwrapper.sh
  mkvirtualenv cil_mamls
  ```

- Transfer the repository to the Euler cluster and set up all of the required files as outlined above
- Install the requirements

  ```bash
  pip install -r requirements.txt
  ```
- Run the code

  To be able to run the code, the command needs to be submitted as a batch job on Euler. We recommend requesting GPU access when running the code. In the example below, `-n 8` requests 8 cores, `-W 16:00` a 16-hour wall time, `mem=8192` 8192 MB of memory per core, and `ngpus_excl_p=1` one GPU; a submitted job can be monitored with `bjobs`.

  Example usage:

  ```bash
  bsub -n 8 -W 16:00 -R "rusage[mem=8192, ngpus_excl_p=1]" python scripts/run.py stats_bert
  ```
To run any experiment, you need to run the `scripts/run.py` script. The script takes two input arguments.

- The first is a positional argument which determines the model type. The possible values of the argument are:
  - `mlp` - runs a purely MLP-based classification model; config file is located in the `models/MLP` directory
  - `svm` - runs an SVM-based classification model; config file is located in the `models/SVM` directory
  - `bilstm` - runs a BiLSTM-based classification model; config file is located in the `models/BiLSTM` directory
  - `bert` - runs a BERT-based classification model using an MLP as its classification head; config file is located in the `models/BERT` directory
  - `bilstm_bert` - runs a BERT-based classification model using a BiLSTM as its classification head; config file is located in the `models/BERT` directory
  - `stats_bert` - runs a BERT-based classification model using a BiLSTM as its classification head that additionally takes statistical, lexical and topic features; config file is located in the `models/BERT` directory
- The second is an optional flag, `--only_predict`. If it is given, the model will not be trained before generating predictions for the test data. If this flag is given without a model checkpoint to initialize the model from, the predictions will be generated using a model with randomly initialized parameters.
For example, if we want to run a BiLSTM-based model without training, we can execute the following command:

```bash
python3 scripts/run.py bilstm --only_predict
```
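For reference, here is a minimal sketch of the command-line interface described above, assuming a standard `argparse` setup; it mirrors the interface but is not the actual contents of `scripts/run.py`:

```python
# Illustrative sketch of the CLI described above -- not the actual
# contents of scripts/run.py.
import argparse

MODEL_TYPES = ["mlp", "svm", "bilstm", "bert", "bilstm_bert", "stats_bert"]

parser = argparse.ArgumentParser(
    description="Train a sentiment model and generate a submission file.")
parser.add_argument("model_type", choices=MODEL_TYPES,
                    help="Model type; its config.ini is in models/<model_type>.")
parser.add_argument("--only_predict", action="store_true",
                    help="Skip training and only generate test predictions.")
args = parser.parse_args()
print(args.model_type, args.only_predict)
```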
To change any hyperparameters or configuration of the training and testing environment (e.g. learning rate, batch size, number of convolutional layers, location of the checkpoint file, ...), the `config.ini` file corresponding to the specific model type needs to be changed. These files are located in the `models/<model_type>` directories.

For example, if we want to change the learning rate used while training a BiLSTM-based model, we change the value of the `LR` parameter in the `models/BiLSTM/config.ini` file. All of the `config.ini` files are provided with the configuration of the best-performing model in their category.
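These files can also be inspected programmatically. A minimal sketch, assuming the files follow standard `configparser` syntax (an assumption, not verified against the repository's own parsing code):

```python
# A minimal sketch, assuming models/BiLSTM/config.ini follows standard
# configparser syntax -- an assumption, not verified against the repository.
import configparser

config = configparser.ConfigParser()
config.read("models/BiLSTM/config.ini")

# Print every section that defines an LR parameter (key lookup is case-insensitive).
for section in config.sections():
    if "LR" in config[section]:
        print(f"[{section}] LR = {config[section]['LR']}")
```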
- `scripts` - Scripts to run the code as an end user
- `twitter-datasets/data` - Datasets on which the code can be run
  - `final_test_data.txt`, `final_train_pos_full.txt` and `final_train_neg_full.txt` are the versions of the dataset with all of the preprocessing methods applied
  - `test_data.txt`, `train_pos_full.txt` and `train_neg_full.txt` are the unprocessed versions of the datasets
  - `test_stats.csv`, `pos_stats.csv` and `neg_stats.csv` are precomputed statistical features corresponding to each of the datasets
- `twitter-datasets/glove_twitter` - Where GloVe embedding files are downloaded and FastText embedding files are saved after training
- `models` - Source code of the models and utility files used for the experiments and final solution. Every subdirectory contains a `config.ini` file that defines the model hyperparameters and the configuration of the training environment. If a different configuration is needed, the parameters in the `config.ini` files need to be changed.
  - `SVM` - Source code for the SVM-based baseline
  - `MLP` - Source code for the MLP-based baselines
  - `BiLSTM` - Source code for the BiLSTM models used in our experimentation
  - `BERT` - Source code for models that use BERT transformers
- `preprocessing` - Source code for preprocessing the datasets
- `submissions` - Submission files generated by the user will be saved here
- `checkpoints` - Model checkpoints generated during training will be saved here
  - `lda` - Latent Dirichlet Allocation topic model files will be saved here
  - `biLSTM` - BiLSTM-based model files will be saved here
  - `tokenizer` - Tokenizer files will be saved here
  - `param_search_logs` - Parameter search results for BiLSTM-based models will be saved here
  - `RoBERTa_stats_final.pth` - Checkpoint with the model weights used to generate our team's final submission to the Kaggle leaderboard
We provide the fully preprocessed and unprocessed versions of our dataset, but further variations of the dataset can be generated with any combination of the preprocessing steps. Generating the preprocessed datasets has to be done manually within code, but the notebook `Preprocessing.ipynb` outlines example usage.
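As a rough illustration of the pattern, here is a hypothetical sketch; the preprocessing functions and output file name below are placeholders, not the repository's actual API, so refer to `Preprocessing.ipynb` for the real usage:

```python
# Hypothetical sketch -- the preprocessing functions below are placeholders,
# not the repository's actual API; see Preprocessing.ipynb for the real usage.

def lowercase(tweet: str) -> str:
    return tweet.lower()

def collapse_whitespace(tweet: str) -> str:
    return " ".join(tweet.split())

# Any combination of preprocessing steps can be composed into a pipeline.
PIPELINE = [lowercase, collapse_whitespace]

with open("twitter-datasets/data/train_pos_full.txt", encoding="utf-8") as f_in, \
        open("twitter-datasets/data/custom_train_pos_full.txt", "w", encoding="utf-8") as f_out:
    for tweet in f_in:
        tweet = tweet.rstrip("\n")
        for step in PIPELINE:
            tweet = step(tweet)
        f_out.write(tweet + "\n")
```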
Sequence-to-sequence text normalization is implemented separately from the other preprocessing steps because it relies on third-party code. This preprocessing step can be performed through code, but it requires Docker to be installed and running on the device executing it. Internally, the function creates a Docker container running the sequence-to-sequence model and makes API calls to it. Since this can be very slow, we recommend following the steps in `preprocessing/seq2seq/README.md` to apply sequence-to-sequence normalization to another dataset.
Team name: MAMLS
Anton Alexandrov, Moritz Dück, Mert Ertugrul, Lovro Rabuzin
D-INFK, ETH Zurich