Targeted Sentiment Analysis (TSA) on a fine-grained dataset: NoReCfine.

arthurdjn/targeted-sentiment-analysis

Documentation

The official docs can be found here. We used PyTorch throughout this project, including for the documentation builds.

Code

| Development                  | Status      | Features                                 |
| ---------------------------- | ----------- | ---------------------------------------- |
| Baseline                     | finished    | DataLoader, Word2Vec, BiLSTM             |
| Alternative Label Encoding   | not started | BIOUL                                    |
| Pipeline vs Joint Prediction | not started | Pipeline, Joint Prediction, Comparison   |
| Architecture Impact          | in progress | LSTM, GRU, Character Level, Depth        |
| Pretrained Embeddings        | in progress | ELMo, BERT, Multilingual BERT            |
| Error Analysis               | finished    | Confusion Matrix, Common Errors          |
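The roadmap above lists BIOUL as an alternative label encoding. As a sketch (the exact tag strings, e.g. `B-targ-Positive`, are assumptions), converting BIO tags to BIOUL amounts to marking single-token spans as Units and span-final tokens as Last:

```python
def bio_to_bioul(tags):
    """Convert a BIO tag sequence to BIOUL.

    Assumes tags of the form 'B-<type>', 'I-<type>', or 'O'
    (e.g. 'B-targ-Positive'); the label scheme here is illustrative.
    """
    bioul = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag.startswith("B-"):
            # A span of length one becomes a Unit tag
            bioul.append(("B-" if nxt.startswith("I-") else "U-") + tag[2:])
        elif tag.startswith("I-"):
            # The final token of a span becomes a Last tag
            bioul.append(("I-" if nxt.startswith("I-") else "L-") + tag[2:])
        else:
            bioul.append(tag)
    return bioul
```

The richer encoding gives the tagger explicit span-boundary signals, which is the usual motivation for comparing it against plain BIO.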

Norwegian Data

For this targeted sentiment analysis, we used a training dataset in Norwegian with corresponding word embeddings.

NoRec Dataset

We will be working with the recently released NoReCfine, a dataset for fine-grained sentiment analysis in Norwegian. The texts in the dataset have been annotated with respect to polar expressions, targets, and holders of opinion, but here we focus only on the identification of targets and their polarity. The underlying texts are taken from a corpus of professionally authored reviews from multiple news sources, across a wide variety of domains including literature, games, music, products, movies and more.
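Datasets of this kind are typically distributed in a CoNLL-like format, one token per line with its label, and blank lines separating sentences. A minimal reader might look like this (the two-column tab-separated layout and the `B-targ-Positive`-style labels are assumptions for illustration):

```python
def read_conll(lines):
    """Parse CoNLL-style lines (token<TAB>label) into sentences.

    Returns a list of (tokens, labels) pairs. The column layout and
    label strings are assumed for illustration and may differ from
    the actual NoReCfine release.
    """
    sentences, tokens, labels = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            # Blank line marks a sentence boundary
            if tokens:
                sentences.append((tokens, labels))
                tokens, labels = [], []
            continue
        token, label = line.split("\t")
        tokens.append(token)
        labels.append(label)
    if tokens:  # flush the last sentence if the file has no trailing blank line
        sentences.append((tokens, labels))
    return sentences
```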

NLPL Word Embeddings

The word embeddings used are taken from the NLPL datasets, using the Norwegian-Bokmaal CoNLL17 corpus, with a vocabulary size of 1,182,371.

Getting Started

Set-up

Download this repository:

$ git clone https://github.uio.no/arthurd/wnnlp

The dataset is part of the repository; however, you will need to provide access to the word embeddings yourself. You can either download the Norwegian-Bokmaal CoNLL17 corpus, a.k.a. the 58.zip file, from the NLPL website, or use the copy available on the SAGA server.

Make sure that you decode this file with encoding='latin1'.
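As a sketch of that decoding step (the text-based word2vec layout with a `vocab_size dim` header line is an assumption; the NLPL archive also ships a binary model), the embeddings can be read into a dict like this:

```python
def load_embeddings(path, encoding="latin1"):
    """Read a text-format word2vec file into {word: vector}.

    The first line is assumed to be a 'vocab_size dim' header;
    decoding with latin1 avoids UnicodeDecodeError on the
    Norwegian vocabulary.
    """
    embeddings = {}
    with open(path, encoding=encoding) as f:
        n_words, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings
```

The resulting dict can then be used to build the embedding matrix handed to the model.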

Baseline

$ python baseline.py --NUM_LAYERS         number of hidden layers for BiLSTM
                     --HIDDEN_DIM         dimensionality of LSTM layers
                     --BATCH_SIZE         number of examples to include in a batch
                     --DROPOUT            dropout to be applied after embedding layer
                     --EMBEDDING_DIM      dimensionality of embeddings
                     --EMBEDDINGS         location of pretrained embeddings
                     --TRAIN_EMBEDDINGS   whether to train or leave fixed
                     --LEARNING_RATE      learning rate for Adam optimizer
                     --EPOCHS             number of epochs to train model
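For orientation, the baseline architecture (pretrained word embeddings fed through a bidirectional LSTM, with a linear projection to per-token label scores) can be sketched in PyTorch roughly as follows; the exact layer sizes and details of baseline.py may differ:

```python
import torch
import torch.nn as nn


class BiLSTMTagger(nn.Module):
    """Sketch of a BiLSTM sequence tagger for targeted sentiment.

    Hyperparameter names mirror the CLI flags above; internals are
    an illustrative assumption, not the repository's exact model.
    """

    def __init__(self, vocab_size, embedding_dim, hidden_dim,
                 num_layers, num_labels, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.dropout = nn.Dropout(dropout)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim,
                            num_layers=num_layers,
                            bidirectional=True, batch_first=True)
        # Both LSTM directions are concatenated, hence 2 * hidden_dim
        self.fc = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, token_ids):
        x = self.dropout(self.embedding(token_ids))
        out, _ = self.lstm(x)
        return self.fc(out)  # (batch, seq_len, num_labels)
```

Training then pairs these per-token logits with a cross-entropy loss over the label vocabulary.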

Grid Search

The grid search is currently available for the BiLSTM and BiGRU models. You can adjust their parameters (and hyperparameters) through the gridsearch.ini configuration file. This file is divided into multiple sections corresponding to the different parameter groups, and you will find more information there.
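As an illustration of how such a file drives the search (the section and option names below are hypothetical; the real ones are defined in gridsearch.ini), each comma-separated value becomes one axis of the grid:

```python
import configparser
from itertools import product

# Hypothetical excerpt of a gridsearch.ini-style file
EXAMPLE = """
[MODEL]
hidden_dims = 50, 100, 200
num_layers = 1, 2

[TRAINING]
learning_rates = 0.001, 0.0001
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE)

hidden_dims = [int(v) for v in config["MODEL"]["hidden_dims"].split(",")]
learning_rates = [float(v) for v in config["TRAINING"]["learning_rates"].split(",")]

# The Cartesian product enumerates every hyperparameter combination
grid = list(product(hidden_dims, learning_rates))
```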

To run the gridsearch algorithm, simply modify the above parameters and run:

$ python gridsearch.py --conf   PATH_TO_CONFIGURATION_FILE

Evaluation

To test and evaluate a saved model, use the eval.py script as follows:

$ python eval.py --model  PATH_TO_SAVED_MODEL
                 --data   PATH_TO_EVAL_DATA
