Sentiment Specific Word Embeddings

Word embeddings are typically learned from unannotated plain text and provide a dense vector representation of the syntactic and semantic aspects of a word. These representations, however, cannot distinguish contrasting aspects of a word's sense, for example sentiment polarity or opposite senses (e.g. high/low).

In order to distinguish these contrasting aspects, one can use a training set of texts annotated with a specific polarity or sense, and specialize the generic word embeddings to take them into account.

The script dl-sentiwords.py creates sentiment specific word embeddings. You will need generic word embeddings built from texts with broad coverage, for example Wikipedia, using one of the following (a sketch of option 1 follows the list):

  1. word2vec/gensim
  2. dl-words.py
  3. dl-wordspca.py
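
For option 1, here is a minimal sketch using gensim's word2vec implementation; the corpus file name wiki_text.txt and the hyperparameters are illustrative assumptions, and the API shown is gensim 4.x:

```python
# Minimal sketch of option 1: training generic embeddings with gensim (4.x API).
# wiki_text.txt is a hypothetical file with one plain-text sentence per line.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

model = Word2Vec(LineSentence('wiki_text.txt'),
                 vector_size=50, window=5, min_count=5, workers=4)
# Save in the plain-text word2vec format expected by --variant word2vec below.
model.wv.save_word2vec_format('vectors.txt', binary=False)
```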

You can extract the plain text from a Wikipedia dump using WikiExtractor.
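A typical invocation, assuming WikiExtractor.py is in your path and using an illustrative dump file name:

WikiExtractor.py -o extracted enwiki-latest-pages-articles.xml.bz2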

Then you need some training text annotated with polarity, for example sentiment-annotated tweets. The training file should follow the format of the SemEval 2013 task on Sentiment Analysis in Twitter, i.e. one tweet per line with tab-separated fields:

<SID><tab><UID><tab><positive|negative|neutral|objective><tab><TWITTER_MESSAGE>

Example:

100032373000896513	15486118	positive	Wow!! Lady Gaga is actually at the Britney Spears Femme Fatale Concert tonight!!! She still listens to her music!!!! WOW!!!
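
Here is a hypothetical sketch of reading such a file in Python; the file name training.tsv matches the invocations below:

```python
# Hypothetical sketch: iterate over the SemEval-style TSV, one tweet per line.
import csv

with open('training.tsv', encoding='utf-8', newline='') as f:
    reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    for sid, uid, label, text in reader:
        # The four fields described above: tweet id, user id, polarity, message.
        assert label in ('positive', 'negative', 'neutral', 'objective')
```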

If you are using word2vec embeddings, the script should be invoked like this:

dl-sentiwords.py training.tsv --vectors vectors.txt --variant word2vec

while if you are using DeepNL embeddings, you must supply two separate files, one with the words and one with their vectors:

dl-sentiwords.py training.tsv --vocab words.txt --vectors vectors.txt

Notice that the words and vectors files will be updated by the program, so work on copies to avoid clobbering the originals.
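
For example, with DeepNL embeddings (the copied file names are illustrative):

cp words.txt sent-words.txt
cp vectors.txt sent-vectors.txt
dl-sentiwords.py training.tsv --vocab sent-words.txt --vectors sent-vectors.txt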

Once you have trained the sentiment specific embeddings, you can use them as features for a sentiment classifier. Notice that the new vocabulary will also contain relevant bigrams and trigrams, represented by concatenating their words with a '_' in between (e.g. new_york). The classifier will typically use additional features, for example (a sketch of some of these follows the list):

  1. polarity score of a word/ngram from a sentiment lexicon
  2. number of positive/negative emoticons
  3. number of all-caps words (e.g. WOW)
  4. number of negation words (no, none, nobody)
  5. number of elongated words (e.g. sooo)
  6. number of elongated punctuation sequences (!!, ??)
  7. counts of each POS class
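
As an illustration, here is a hypothetical sketch of features 3-6 above; the regular expressions are rough approximations, not the classifier's actual definitions:

```python
# Hypothetical sketch of surface features 3-6 from the list above.
import re

NEGATIONS = {'no', 'none', 'nobody'}

def surface_features(tweet):
    tokens = tweet.split()
    return {
        'all_caps': sum(1 for t in tokens if len(t) > 1 and t.isupper()),
        'negations': sum(1 for t in tokens if t.lower() in NEGATIONS),
        # a word with any character repeated 3+ times, e.g. "sooo"
        'elongated_words': len(re.findall(r'\w*(\w)\1{2,}\w*', tweet)),
        'elongated_punct': len(re.findall(r'[!?]{2,}', tweet)),
    }

print(surface_features('Wow!! Lady Gaga is actually at the concert tonight!!!'))
```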

See Tang et al., Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification, ACL 2014, for more details.