
  • Student name: Steve Newman
  • Student pace: part time
  • Scheduled project review date/time: Tues. June 30 3:00 pm EST
  • Instructor name: James Irving PhD
  • Blog post URL: https://medium.com/p/a4baec51040b/edit

NLP Sentiment Analysis Overview

The business case for this project was to identify negative-sentiment tweets about brand-specific products in order to help improve the brand's reputation. Natural Language Processing models were built from various classifiers paired with a TF-IDF vectorizer. The primary metric used to evaluate performance was recall on the "Negative" class.

A supervised learning approach was used in which the training and test sets were labeled with their corresponding sentiment. An important caveat is that the models were optimized to identify negative sentiment; results for the neutral and positive classes were not considered when judging the success of the analysis.

Methodology

Obtaining optimal results consisted of the following steps:

  • Import Packages and Functions
  • Exploratory Data Analysis (EDA)
  • Train/Test/Split
  • Test multiple classifiers in simple pipelines
  • Select a high performing pipeline and use balancing techniques to further optimize
  • Attempt to further improve model by employing a grid search of relevant parameters
  • Visualize results
from IPython.display import clear_output
!pip install -U fsds_100719
clear_output()
from fsds_100719.imports import *
import warnings
warnings.filterwarnings('ignore')

Functions

def evaluate_model(clf, y_trn, y_true, y_pred, X_trn, X_true):
    
    '''
    Calculates and displays the following: train and test score, classification report,
    and confusion matrix.

        Parameters:

            clf: fitted classifier or pipeline
            y_trn: y train from the train/test split
            y_true: y test from the train/test split
            y_pred: predicted y for the test set
            X_trn: X train from the train/test split
            X_true: X test from the train/test split

    '''
    # Calculates and displays train and test scores.
    train_score = clf.score(X_trn,y_trn)
    test_score = clf.score(X_true,y_true)
    print(f"Train score= {train_score}")
    print(f"Test score= {test_score}\n")
    
    # Displays Classification Report / Scores 
    print(metrics.classification_report(y_true,y_pred))
    
    # Displays Confusion Matrix
    fig, ax = plt.subplots(figsize=(10,4))
    metrics.plot_confusion_matrix(clf,X_true,y_true,cmap="Reds",
                                  normalize='true',ax=ax)
    ax.set(title='Confusion Matrix')
    ax.grid(False)
    

EDA

#Import data

data = pd.read_csv('product_review.csv',encoding= 'unicode_escape')
data.head()
tweet_text emotion_in_tweet_is_directed_at is_there_an_emotion_directed_at_a_brand_or_product
0 .@wesley83 I have a 3G iPhone. After 3 hrs twe... iPhone Negative emotion
1 @jessedee Know about @fludapp ? Awesome iPad/i... iPad or iPhone App Positive emotion
2 @swonderlin Can not wait for #iPad 2 also. The... iPad Positive emotion
3 @sxsw I hope this year's festival isn't as cra... iPad or iPhone App Negative emotion
4 @sxtxstate great stuff on Fri #SXSW: Marissa M... Google Positive emotion
# Rename columns for easier handling.

data.rename(columns={'tweet_text': 'tweet', 'emotion_in_tweet_is_directed_at': 'product', 'is_there_an_emotion_directed_at_a_brand_or_product':'emotion'}, inplace=True)
data.shape
(9093, 3)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
tweet      9092 non-null object
product    3291 non-null object
emotion    9093 non-null object
dtypes: object(3)
memory usage: 213.2+ KB
# Most of the null data is in the product column.

data.isnull().sum()
tweet         1
product    5802
emotion       0
dtype: int64
data.fillna('unknown', inplace=True)
# Rename the 'emotion' values for easier handling.

emotion_dict = dict({"No emotion toward brand or product":"Neutral",
                     "Positive emotion":"Positive", "Negative emotion":"Negative", 
                     "I can't tell":"I can't tell"})
data['emotion'] = data['emotion'].map(emotion_dict)
data['emotion'].value_counts()
Neutral         5389
Positive        2978
Negative         570
I can't tell     156
Name: emotion, dtype: int64
data.isnull().sum()
tweet      0
product    0
emotion    0
dtype: int64
# Convert "tweet" data into string format.

data["tweet"]= data["tweet"].astype(str) 
# Add "text length" feature.

data['text length'] = data['tweet'].apply(len)
# Add "token length" feature.

data['token_length'] = [len(x.split(" ")) for x in data.tweet]
max(data.token_length)
33
data.isnull().sum()
tweet           0
product         0
emotion         0
text length     0
token_length    0
dtype: int64
data.shape
(9093, 5)
# Identify target variables.

data['emotion'].value_counts()
Neutral         5389
Positive        2978
Negative         570
I can't tell     156
Name: emotion, dtype: int64
bad_rows = data['emotion']== "I can't tell"
# Eliminate the "I can't tell" rows.

data = data[~bad_rows]
data['emotion'].value_counts()
Neutral     5389
Positive    2978
Negative     570
Name: emotion, dtype: int64

Plot Numeric Features

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
%matplotlib inline
a = sns.FacetGrid(data,col='emotion')
a.map(plt.hist,'text length', density=True)
<seaborn.axisgrid.FacetGrid at 0x1c198c17b8>

png

The plots show a similar distribution of text length across the three emotion classes.

token_l = sns.FacetGrid(data,col='emotion')
token_l.map(plt.hist,'token_length', density=True)
<seaborn.axisgrid.FacetGrid at 0x1c1adfbef0>

png

The plots show that slightly more tokens are used in the negative emotion category than in the positive and neutral categories.

sns.barplot(x='emotion',y='text length',data=data,palette='rainbow')
<matplotlib.axes._subplots.AxesSubplot at 0x1c19e6e630>

png

The bar plot shows a very slight increase in average text length for negative tweets relative to the other sentiments.

sns.countplot(x='emotion',data=data,palette='rainbow')
<matplotlib.axes._subplots.AxesSubplot at 0x1c19eb4fd0>

png

The majority of the tweets indicated "neutral" or "no emotion" towards the brand and product. The negative emotion category is significantly lower than positive and neutral categories. Adjusting for the imbalance will likely be necessary for developing a predictive model.
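
As a minimal sketch (reusing the cleaned data DataFrame above), the imbalance can be quantified directly:

# Share of each sentiment class; "Negative" is only about 6% of the tweets.
class_share = data['emotion'].value_counts(normalize=True)
print(class_share)
print(f"Neutral-to-Negative ratio: {class_share['Neutral'] / class_share['Negative']:.1f} : 1")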

Train/Test/Split

from sklearn.model_selection import train_test_split

X = data['tweet']
y = data['emotion']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y,test_size=0.30,
                                                    random_state=123)
y_test.value_counts()
Neutral     1617
Positive     894
Negative     171
Name: emotion, dtype: int64
y_train.value_counts()
Neutral     3772
Positive    2084
Negative     399
Name: emotion, dtype: int64

LinearSVC

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from nltk import TweetTokenizer
# Initialize the TweetTokenizer

tokenizer = TweetTokenizer(preserve_case=False)
## Make a list of stopwords to remove
from nltk.corpus import stopwords
import string
# Get all the stop words in the English language
stopwords_list = stopwords.words('english')
stopwords_list
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're",
 ...
 'won', "won't", 'wouldn', "wouldn't"]

(output truncated: the full NLTK English stopword list)
additional_words = ['“','”','...','``',"''",'’',"#sxsw",'link',"@mention","}",
                    "{","rt","today","austin","SWSW","sxsw","quot","mention","Google",
                    "Apple","iPhone", "iPad"]
## Add punctuation to stopwords_list
stopwords_list+=string.punctuation
## Add additional_words to stopwords_list
stopwords_list.extend(additional_words)
stopwords_list
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're",
 ...
 'won', "won't", 'wouldn', "wouldn't",
 '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/',
 ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|',
 '}', '~',
 '“', '”', '...', '``', "''", '’', '#sxsw', 'link', '@mention', '}', '{', 'rt',
 'today', 'austin', 'SWSW', 'sxsw', 'quot', 'mention', 'Google', 'Apple',
 'iPhone', 'iPad']

(output truncated: the NLTK English stopwords plus punctuation and the additional domain-specific words)
linear_svc = Pipeline([('tfidf', TfidfVectorizer(lowercase=True, stop_words='english',
                                                 tokenizer=tokenizer.tokenize)),
                     ('clf', LinearSVC(class_weight='balanced'))])

linear_svc.fit(X_train, y_train)  
Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words='english', strip_accents=None,
                                 sublinear_tf=False,
                                 token_patt...
                                 tokenizer=<bound method TweetTokenizer.tokenize of <nltk.tokenize.casual.TweetTokenizer object at 0x1c1a09e908>>,
                                 use_idf=True, vocabulary=None)),
                ('clf',
                 LinearSVC(C=1.0, class_weight='balanced', dual=True,
                           fit_intercept=True, intercept_scaling=1,
                           loss='squared_hinge', max_iter=1000,
                           multi_class='ovr', penalty='l2', random_state=None,
                           tol=0.0001, verbose=0))],
         verbose=False)
predictions = linear_svc.predict(X_test)

Metrics

from sklearn import metrics

evaluate_model(linear_svc, y_train, y_test, predictions, X_train, X_test)
Train score= 0.9322142286171063
Test score= 0.6849366144668159

              precision    recall  f1-score   support

    Negative       0.46      0.37      0.41       171
     Neutral       0.75      0.77      0.76      1617
    Positive       0.60      0.58      0.59       894

    accuracy                           0.68      2682
   macro avg       0.60      0.58      0.59      2682
weighted avg       0.68      0.68      0.68      2682

png

LinearSVC produced a high training score and a reasonable test score. Recall for the "Negative" class is 0.37, which serves as the first benchmark.
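
As a quick sketch (using the y_test and predictions defined above), the single metric being tracked can be pulled out directly with recall_score; the 0.37 figure is the value reported for this LinearSVC pipeline.

from sklearn.metrics import recall_score

# Recall on the "Negative" class only: the metric used to compare all models here.
neg_recall = recall_score(y_test, predictions, labels=['Negative'], average=None)[0]
print(f"Negative-class recall: {neg_recall:.2f}")   # ~0.37 for this pipeline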

MultinomialNB

from sklearn.naive_bayes import MultinomialNB
text_mnb = Pipeline([('tfidf', TfidfVectorizer(lowercase=True, stop_words='english',
                                               tokenizer=tokenizer.tokenize)),
                     ('clf', MultinomialNB())])

# Feed the training data through the pipeline
text_mnb.fit(X_train, y_train)  
Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words='english', strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<bound method TweetTokenizer.tokenize of <nltk.tokenize.casual.TweetTokenizer object at 0x1c1a09e908>>,
                                 use_idf=True, vocabulary=None)),
                ('clf',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)
predictions = text_mnb.predict(X_test)

Metrics

evaluate_model(text_mnb, y_train, y_test, predictions, X_train, X_test)
Train score= 0.750599520383693
Test score= 0.6528709917971663

              precision    recall  f1-score   support

    Negative       1.00      0.01      0.01       171
     Neutral       0.64      0.97      0.77      1617
    Positive       0.74      0.21      0.33       894

    accuracy                           0.65      2682
   macro avg       0.79      0.39      0.37      2682
weighted avg       0.70      0.65      0.58      2682

png

MultinomialNB showed dismal results for the "Negative" class, with a recall of just 0.01.

SGDClassifier

from sklearn.linear_model import SGDClassifier
text_sgdc = Pipeline([('tfidf', TfidfVectorizer(lowercase=True, stop_words='english',
                                                tokenizer=tokenizer.tokenize)),
                     ('clf', SGDClassifier())])


text_sgdc.fit(X_train, y_train)  
Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words='english', strip_accents=None,
                                 sublinear_tf=False,
                                 token_patt...
                 SGDClassifier(alpha=0.0001, average=False, class_weight=None,
                               early_stopping=False, epsilon=0.1, eta0=0.0,
                               fit_intercept=True, l1_ratio=0.15,
                               learning_rate='optimal', loss='hinge',
                               max_iter=1000, n_iter_no_change=5, n_jobs=None,
                               penalty='l2', power_t=0.5, random_state=None,
                               shuffle=True, tol=0.001, validation_fraction=0.1,
                               verbose=0, warm_start=False))],
         verbose=False)
predictions = text_sgdc.predict(X_test)

Metrics

evaluate_model(text_sgdc, y_train, y_test, predictions, X_train, X_test)
Train score= 0.8954436450839328
Test score= 0.6976137211036539

              precision    recall  f1-score   support

    Negative       0.63      0.27      0.38       171
     Neutral       0.73      0.84      0.78      1617
    Positive       0.63      0.52      0.57       894

    accuracy                           0.70      2682
   macro avg       0.66      0.54      0.58      2682
weighted avg       0.69      0.70      0.68      2682

png

Recall for the "Negative" class is 0.27, lower than the LinearSVC benchmark.

LogisticRegression

from sklearn.linear_model import LogisticRegression
text_lr = Pipeline([('tfidf', TfidfVectorizer(lowercase=True, stop_words='english',
                                              tokenizer=tokenizer.tokenize)),
                     ('clf', LogisticRegression())])

text_lr.fit(X_train, y_train)  
Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words='english', strip_accents=None,
                                 sublinear_tf=False,
                                 token_patt...
                                 tokenizer=<bound method TweetTokenizer.tokenize of <nltk.tokenize.casual.TweetTokenizer object at 0x1c1a09e908>>,
                                 use_idf=True, vocabulary=None)),
                ('clf',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)
predictions = text_lr.predict(X_test)

Metrics

evaluate_model(text_lr, y_train, y_test, predictions, X_train, X_test)
Train score= 0.8203037569944045
Test score= 0.6987322893363161

              precision    recall  f1-score   support

    Negative       0.63      0.07      0.13       171
     Neutral       0.71      0.88      0.79      1617
    Positive       0.65      0.49      0.56       894

    accuracy                           0.70      2682
   macro avg       0.67      0.48      0.49      2682
weighted avg       0.69      0.70      0.67      2682

png

Logistic Regression also shows a low negative-class recall, at 0.07.

RandomForestClassifier

from sklearn.ensemble import RandomForestClassifier
text_rfc = Pipeline([('tfidf', TfidfVectorizer(lowercase=True, stop_words='english',
                                               tokenizer=tokenizer.tokenize)),
                     ('clf', RandomForestClassifier(class_weight='balanced'))])

text_rfc.fit(X_train, y_train)  
Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words='english', strip_accents=None,
                                 sublinear_tf=False,
                                 token_patt...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight='balanced',
                                        criterion='gini', max_depth=None,
                                        max_features='auto',
                                        max_leaf_nodes=None, max_samples=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        n_estimators=100, n_jobs=None,
                                        oob_score=False, random_state=None,
                                        verbose=0, warm_start=False))],
         verbose=False)
predictions = text_rfc.predict(X_test)

Metrics

evaluate_model(text_rfc, y_train, y_test, predictions, X_train, X_test)
Train score= 0.997761790567546
Test score= 0.6812080536912751

              precision    recall  f1-score   support

    Negative       0.75      0.16      0.26       171
     Neutral       0.69      0.89      0.78      1617
    Positive       0.65      0.41      0.50       894

    accuracy                           0.68      2682
   macro avg       0.70      0.48      0.51      2682
weighted avg       0.68      0.68      0.65      2682

png

Random Forest showed a negative-class recall near the middle of the range of all classifiers tried. This particular algorithm has proven successful in similar scenarios; I will further optimize it with over- and under-sampling techniques.
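
The sampler comparison that follows can be summarized in a single loop. This is only a sketch, assuming the tfidf vectorizer, rfc classifier, and train/test splits defined in this notebook, not the exact cells run below.

from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler, NearMiss
from sklearn.metrics import recall_score

samplers = {'RandomOverSampler': RandomOverSampler(random_state=123),
            'SMOTE': SMOTE(random_state=123),
            'ADASYN': ADASYN(random_state=123),
            'RandomUnderSampler': RandomUnderSampler(random_state=123),
            'NearMiss-1': NearMiss(version=1)}

# Drop each resampler into the same TF-IDF + Random Forest pipeline and
# compare the resulting negative-class recall on the test set.
for name, sampler in samplers.items():
    pipe = make_pipeline(tfidf, sampler, rfc)
    pipe.fit(X_train, y_train)
    preds = pipe.predict(X_test)
    neg_recall = recall_score(y_test, preds, labels=['Negative'], average=None)[0]
    print(f"{name:20s} negative recall = {neg_recall:.2f}")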

Oversampling

from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler
from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(lowercase=True, stop_words=stopwords_list,
                            tokenizer=tokenizer.tokenize)

rfc = RandomForestClassifier(class_weight='balanced')

As the results below show, the under-sampling models significantly outperformed the over-sampling models on negative-class recall.

RFC - RandomOverSampler

ROS_pipeline = make_pipeline(tfidf, RandomOverSampler(random_state=123), rfc)
ROS_pipeline.fit(X_train, y_train)  
Pipeline(memory=None,
         steps=[('tfidfvectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=['i', 'me', 'my', 'myself', 'we',
                                             'our', 'ours', 'ourse...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight='balanced',
                                        criterion='gini', max_depth=None,
                                        max_features='auto',
                                        max_leaf_nodes=None, max_samples=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        n_estimators=100, n_jobs=None,
                                        oob_score=False, random_state=None,
                                        verbose=0, warm_start=False))],
         verbose=False)
predictions = ROS_pipeline.predict(X_test)

Metrics

evaluate_model(ROS_pipeline, y_train, y_test, predictions, X_train, X_test)
Train score= 0.9646682653876898
Test score= 0.6756152125279642

              precision    recall  f1-score   support

    Negative       0.67      0.27      0.39       171
     Neutral       0.71      0.81      0.76      1617
    Positive       0.59      0.52      0.55       894

    accuracy                           0.68      2682
   macro avg       0.66      0.53      0.57      2682
weighted avg       0.67      0.68      0.66      2682

png

RFC - SMOTE

SMOTE_pipeline = make_pipeline(tfidf, SMOTE(random_state=123),rfc)
SMOTE_pipeline.fit(X_train, y_train)  
Pipeline(memory=None,
         steps=[('tfidfvectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=['i', 'me', 'my', 'myself', 'we',
                                             'our', 'ours', 'ourse...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight='balanced',
                                        criterion='gini', max_depth=None,
                                        max_features='auto',
                                        max_leaf_nodes=None, max_samples=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        n_estimators=100, n_jobs=None,
                                        oob_score=False, random_state=None,
                                        verbose=0, warm_start=False))],
         verbose=False)
predictions = SMOTE_pipeline.predict(X_test)

Metrics

evaluate_model(SMOTE_pipeline, y_train, y_test, predictions, X_train, X_test)
Train score= 0.9653077537969624
Test score= 0.6845637583892618

              precision    recall  f1-score   support

    Negative       0.73      0.26      0.39       171
     Neutral       0.71      0.83      0.77      1617
    Positive       0.61      0.50      0.55       894

    accuracy                           0.68      2682
   macro avg       0.68      0.53      0.57      2682
weighted avg       0.68      0.68      0.67      2682

png

RFC - ADASYN

ADASYN_pipeline = make_pipeline(tfidf, ADASYN(ratio='minority',random_state=123),rfc)
ADASYN_pipeline.fit(X_train, y_train)  
Pipeline(memory=None,
         steps=[('tfidfvectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=['i', 'me', 'my', 'myself', 'we',
                                             'our', 'ours', 'ourse...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight='balanced',
                                        criterion='gini', max_depth=None,
                                        max_features='auto',
                                        max_leaf_nodes=None, max_samples=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        n_estimators=100, n_jobs=None,
                                        oob_score=False, random_state=None,
                                        verbose=0, warm_start=False))],
         verbose=False)
predictions = ADASYN_pipeline.predict(X_test)

Metrics

evaluate_model(ADASYN_pipeline, y_train, y_test, predictions, X_train, X_test)
Train score= 0.964828137490008
Test score= 0.6838180462341537

              precision    recall  f1-score   support

    Negative       0.72      0.29      0.41       171
     Neutral       0.70      0.85      0.77      1617
    Positive       0.63      0.46      0.53       894

    accuracy                           0.68      2682
   macro avg       0.68      0.53      0.57      2682
weighted avg       0.68      0.68      0.67      2682

png

Undersampling

RFC Random Under Sampler

from imblearn.under_sampling import NearMiss, RandomUnderSampler

RUS_pipeline = make_pipeline(tfidf, RandomUnderSampler(random_state=123),rfc)
RUS_pipeline.fit(X_train, y_train)  
predictions = RUS_pipeline.predict(X_test)

Metrics

evaluate_model(RUS_pipeline, y_train, y_test, predictions, X_train, X_test)
Train score= 0.6083133493205436
Test score= 0.5395227442207308

              precision    recall  f1-score   support

    Negative       0.16      0.60      0.26       171
     Neutral       0.75      0.58      0.66      1617
    Positive       0.50      0.45      0.47       894

    accuracy                           0.54      2682
   macro avg       0.47      0.54      0.46      2682
weighted avg       0.63      0.54      0.57      2682

png

Random Under Sampler shows the best negative-class recall so far, at 0.60. I tried a few variations of Near Miss to exhaust the under-sampling possibilities.

RFC - Near Miss 1

NM1_pipeline = make_pipeline(tfidf, NearMiss(ratio='not minority',random_state=123, 
                                             version = 1),rfc)
NM1_pipeline.fit(X_train, y_train)  
Pipeline(memory=None,
         steps=[('tfidfvectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=['i', 'me', 'my', 'myself', 'we',
                                             'our', 'ours', 'ourse...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight='balanced',
                                        criterion='gini', max_depth=None,
                                        max_features='auto',
                                        max_leaf_nodes=None, max_samples=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        n_estimators=100, n_jobs=None,
                                        oob_score=False, random_state=None,
                                        verbose=0, warm_start=False))],
         verbose=False)
predictions = NM1_pipeline.predict(X_test)

Metrics

evaluate_model(NM1_pipeline, y_train, y_test, predictions, X_train, X_test)
Train score= 0.3317346123101519
Test score= 0.3053691275167785

              precision    recall  f1-score   support

    Negative       0.09      0.74      0.16       171
     Neutral       0.81      0.24      0.37      1617
    Positive       0.40      0.34      0.37       894

    accuracy                           0.31      2682
   macro avg       0.43      0.44      0.30      2682
weighted avg       0.63      0.31      0.35      2682

png

An excellent result of 0.74 recall on the negative class. Unsure whether it was possible to improve further, I tested a grid search of the most relevant parameters.

Extract Feature Importances

importances = NM1_pipeline.named_steps['randomforestclassifier'].feature_importances_
test = pd.DataFrame(importances)
fn = NM1_pipeline.named_steps['tfidfvectorizer'].get_feature_names()
len(fn)
8934
treedf = pd.Series(NM1_pipeline.named_steps['randomforestclassifier'].feature_importances_,
                      index=NM1_pipeline.named_steps['tfidfvectorizer'].get_feature_names())
top_words = treedf.sort_values(ascending=False).head(10).index#.plot(kind='bar')
# Loop through the top ten feature-importance words and, for each emotion class,
# record the normalized share of tweets in that class containing the word.
# The result is a DataFrame with the word as the index and one column per class.

word_dict={}
for word in top_words:
    
    word_df=data.copy()

    word_df["contains"] = word_df["tweet"].str.contains(word)

    emotions = ["Negative", "Neutral", "Positive"]

    emo_dict = {}
    for emo in emotions:
        emo_df = word_df.groupby("emotion").get_group(emo)
        emo_df["contains"].value_counts(normalize=True)
        emo_dict[emo] = emo_df["contains"].value_counts(normalize=True).loc[True]

    word_dict[word]= pd.Series(emo_dict, name=word)
pd.DataFrame(word_dict).T
Negative Neutral Positive
store 0.057895 0.115235 0.140363
new 0.085965 0.079050 0.095702
apple 0.033333 0.038597 0.065480
network 0.014035 0.023938 0.011081
google 0.040351 0.056040 0.050705
social 0.036842 0.039711 0.020819
major 0.005263 0.005196 0.003022
launch 0.033333 0.059751 0.045668
circles 0.003509 0.010763 0.004701
ipad 0.040351 0.046020 0.058093

The DataFrame above uses the top ten feature-importance words as its index, with the normalized share of tweets in each class that contain the word.
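
As an optional visual check (a sketch reusing the treedf Series from the cell above), the same importances can be plotted directly:

# Bar plot of the ten most important TF-IDF features in the NM1 Random Forest pipeline.
top10 = treedf.sort_values(ascending=False).head(10)
fig, ax = plt.subplots(figsize=(10, 4))
top10.plot(kind='barh', ax=ax, color='steelblue')
ax.set(title='Top 10 Feature Importances (NM1 Random Forest)', xlabel='Importance')
plt.tight_layout()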

Grid_RFC - Near Miss 1

params = {'randomforestclassifier__criterion':['gini','entropy'],
             'randomforestclassifier__max_depth':[None, 5, 3, 10],
             'randomforestclassifier__min_samples_leaf': [1,2,3],
         'randomforestclassifier__max_features':['auto','sqrt',3,5,10,30,70]}
NM1_pipeline = make_pipeline(tfidf, NearMiss(ratio='not minority',random_state=123, 
                                             version = 1),rfc)
 
grid = GridSearchCV(NM1_pipeline, cv=5, n_jobs=-1, param_grid=params ,
                    scoring='recall_macro')
grid.fit(X_train, y_train)
---------------------------------------------------------------------------

KeyboardInterrupt                         Traceback (most recent call last)

<ipython-input-127-3949096c802a> in <module>
----> 1 grid.fit(X_train, y_train)

...

KeyboardInterrupt: 

(traceback truncated: the grid search was interrupted before completing)
grid.score(X_test, y_test)
grid.best_params_

Research best estimator from grid

best_pipe = grid.best_estimator_
best_pipe
best_pipe.fit(X_train,y_train)
predictions = best_pipe.predict(X_test)

Metrics

evaluate_model(best_pipe, y_train, y_test, predictions, X_train, X_test)

Unfortunately, the grid search did not prove effective for further optimization, as evidenced by the 0.07 drop in negative-class recall.
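
One possible refinement, sketched here as an untested assumption rather than something run in this notebook, would be to score the grid search directly on negative-class recall instead of macro-averaged recall:

from sklearn.metrics import make_scorer, recall_score

# Custom scorer that evaluates recall on the "Negative" class only.
neg_recall_scorer = make_scorer(recall_score, labels=['Negative'], average='macro')

grid_neg = GridSearchCV(NM1_pipeline, param_grid=params, cv=5, n_jobs=-1,
                        scoring=neg_recall_scorer)
# grid_neg.fit(X_train, y_train)  # not run here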

RFC - Near Miss 2

NM2_pipeline = make_pipeline(tfidf, NearMiss(ratio='not minority',random_state=123,
                                             version = 2),rfc)
NM2_pipeline.fit(X_train, y_train)  
Pipeline(memory=None,
         steps=[('tfidfvectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=['i', 'me', 'my', 'myself', 'we',
                                             'our', 'ours', 'ourse...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight='balanced',
                                        criterion='gini', max_depth=None,
                                        max_features='auto',
                                        max_leaf_nodes=None, max_samples=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        n_estimators=100, n_jobs=None,
                                        oob_score=False, random_state=None,
                                        verbose=0, warm_start=False))],
         verbose=False)
predictions = NM2_pipeline.predict(X_test)

Metrics

evaluate_model(NM2_pipeline, y_train, y_test, predictions, X_train, X_test)
Train score= 0.6258992805755396
Test score= 0.551826994780015

              precision    recall  f1-score   support

    Negative       0.18      0.59      0.28       171
     Neutral       0.73      0.61      0.67      1617
    Positive       0.50      0.44      0.47       894

    accuracy                           0.55      2682
   macro avg       0.47      0.55      0.47      2682
weighted avg       0.62      0.55      0.57      2682

png

RFC - Near Miss 3

NM3_pipeline = make_pipeline(tfidf, NearMiss(ratio='not minority',random_state=123,
                                             version = 3, n_neighbors_ver3=4),rfc)
NM3_pipeline.fit(X_train, y_train)
Pipeline(memory=None,
         steps=[('tfidfvectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=['i', 'me', 'my', 'myself', 'we',
                                             'our', 'ours', 'ourse...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight='balanced',
                                        criterion='gini', max_depth=None,
                                        max_features='auto',
                                        max_leaf_nodes=None, max_samples=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        n_estimators=100, n_jobs=None,
                                        oob_score=False, random_state=None,
                                        verbose=0, warm_start=False))],
         verbose=False)
predictions = NM3_pipeline.predict(X_test)

Metrics

evaluate_model(NM3_pipeline, y_train, y_test, predictions, X_train, X_test)
Train score= 0.5918465227817746
Test score= 0.5432513049962714

              precision    recall  f1-score   support

    Negative       0.15      0.58      0.24       171
     Neutral       0.73      0.59      0.65      1617
    Positive       0.55      0.45      0.49       894

    accuracy                           0.54      2682
   macro avg       0.48      0.54      0.46      2682
weighted avg       0.63      0.54      0.57      2682

png

Visualizations

from wordcloud import WordCloud
wcdata = data['emotion']== "Negative"
wcdata_pos = data['emotion']== "Positive"
pd.set_option('display.max_colwidth', 10000)
data[wcdata_pos].head()
tweet product emotion text length token_length
1 @jessedee Know about @fludapp ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at #SXSW iPad or iPhone App Positive 139 22
2 @swonderlin Can not wait for #iPad 2 also. They should sale them down at #SXSW. iPad Positive 79 15
4 @sxtxstate great stuff on Fri #SXSW: Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) &amp; Matt Mullenweg (Wordpress) Google Positive 131 17
7 #SXSW is just starting, #CTIA is around the corner and #googleio is only a hop skip and a jump from there, good time to be an #android fan Android Positive 138 28
8 Beautifully smart and simple idea RT @madebymany @thenextweb wrote about our #hollergram iPad app for #sxsw! http://bit.ly/ieaVOB iPad or iPhone App Positive 129 17

An example of positive tweets.

wordcloud = WordCloud(stopwords=stopwords_list, background_color="white", max_words=100, 
                      contour_width=3, 
                      contour_color='steelblue')

wordcloud.generate(data[wcdata]['tweet'].to_string())

wordcloud.to_image()

png

The most popular words to appear in the negative tweets.

wordcloud.generate(data[wcdata_pos]['tweet'].to_string())

wordcloud.to_image()

png

The most popular words to appear in the positive tweets.

Conclusion

After exploring multiple classifiers and balancing techniques, the Random Forest model with the first variation of Near Miss performed best, with a negative-class recall of 0.74. The words identified as most important to this model were not necessarily words that action could be taken on, for example "social", "new", and "store". The model's ability to pick up slight differences in word presence among the three classes is notable given how close those presences were. For example, the word "launch" appeared in roughly 3.3% of negative tweets, 6.0% of neutral tweets, and 4.6% of positive tweets. This sensitivity was evident across the most important words contributing to the model.

Recommendations

Brand managers and product developers might want to review the negative tweets such as:

".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead! I need to upgrade. Plugin stations at #SXSW."

"@sxsw I hope this year's festival isn't as crashy as this year's iPhone app. #sxsw"

"@mention - False Alarm: Google Circles Not Coming Now�ÛÒand Probably Not Ever? - {link} #Google #Circles #Social #SXSW"

Addressing the concerns of product users will go a long way toward improving products and brand reputation, which in turn will make the business more successful.

Lastly, developing a campaign around the improvements made and promoting it across multiple channels will have a big impact on brand reputation and users' impressions.
