Identify the Sentiments - Analytics Vidhya Contest

This project is a Python implementation submitted to the Analytics Vidhya contest "Identify the Sentiments". I enjoyed taking part in the competition and its whole process. The submitted solution ranked 8th on the public leaderboard.

The contest URL: https://datahack.analyticsvidhya.com/contest/linguipedia-codefest-natural-language-processing-1/

Problem Statement

Sentiment analysis remains one of the key problems that has seen extensive application of natural language processing. This time around, given tweets from customers about various tech firms that manufacture and sell mobiles, computers, laptops, etc., the task is to identify whether the tweets have a negative sentiment towards such companies or products.

Implementation Approach

Dataset

The train set contains 7,920 tweets
The test set contains 1,953 tweets

Text Cleaning and Preprocessing

We applied the following text preprocessing steps to the training and test tweet sets:

URL removal: we used regular expressions (RegEx) to remove URLs.
Punctuation removal: remove all punctuation marks from the text.
Number removal: replace any digits in the tweets with a space.
Whitespace removal.
Lowercasing: convert the text to lowercase.
Text normalization: reduce each word to its base form using the spaCy library.
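The regex-based steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function name `clean_tweet`, the URL pattern, and the order of the steps are assumptions, and the spaCy normalization step is left out because it needs a downloaded language model.

```python
import re
import string

# Assumed URL pattern: anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text: str) -> str:
    """Apply the URL, punctuation, digit, whitespace and lowercase steps."""
    text = URL_PATTERN.sub(" ", text)         # remove URLs
    text = text.translate(PUNCT_TABLE)        # remove punctuation marks
    text = re.sub(r"\d+", " ", text)          # replace digits with a space
    text = re.sub(r"\s+", " ", text).strip()  # collapse extra whitespace
    return text.lower()                       # convert to lowercase


cleaned = clean_tweet("Loving my new phone!!! 100% worth it http://example.com  #Tech")
print(cleaned)
```

URLs are removed before punctuation so that the dots and slashes inside a link do not leave fragments behind.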

Tweets to ELMo Vectors

We imported the pretrained ELMo model from TensorFlow Hub and used it to extract ELMo vectors for the cleaned tweets in the train and test datasets. Each tweet is represented by an ELMo vector of length 1024, computed from the tweet's words/tokens.
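The text does not say how the per-token ELMo outputs are turned into a single tweet vector. A common choice is to average the token vectors, and the sketch below assumes that choice; `mean_pool` is a hypothetical helper, and the toy 4-dimensional vectors stand in for real 1024-dimensional ELMo outputs.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token vectors into one fixed-length tweet vector."""
    if not token_vectors:
        raise ValueError("a tweet needs at least one token vector")
    n_tokens = len(token_vectors)
    # zip(*...) walks the vectors one dimension at a time.
    return [sum(dim) / n_tokens for dim in zip(*token_vectors)]


# Toy stand-in: two tokens with 4 dimensions each instead of 1024.
toy_tokens = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]
tweet_vector = mean_pool(toy_tokens)
print(tweet_vector)
```

Averaging keeps the vector length independent of the number of tokens, which is what a downstream classifier with a fixed input size needs.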

Tweets to BERT Vectors

We imported the pretrained Google BERT model and used it to extract BERT vectors for the cleaned tweets in the train and test datasets. Each tweet is represented by a BERT vector of length 768, computed from the tweet's words/tokens.
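One of the models in this project uses the BERT and ELMo vectors together. How the two are combined is not stated, so the sketch below assumes plain concatenation, which would give 768 + 1024 = 1792 features per tweet. `combine_features` and the constant placeholder vectors are illustrative only.

```python
from typing import List, Sequence


def combine_features(bert_vec: Sequence[float], elmo_vec: Sequence[float]) -> List[float]:
    """Concatenate a BERT vector and an ELMo vector into one feature vector."""
    return list(bert_vec) + list(elmo_vec)


# Placeholder vectors with the lengths given in this README.
bert_vec = [0.1] * 768
elmo_vec = [0.2] * 1024
combined = combine_features(bert_vec, elmo_vec)
print(len(combined))
```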

Classification Model Building and Evaluation

We used the ELMo vectors and BERT vectors as features of the train dataset to build and train classification models. We evaluated each model with the F1 score, since this is the official evaluation metric of the contest. We trained the following classification models:

BERT and ELMo vectors with a Support Vector Machine (SVM) model, the evaluation score for this SVM model is: 0.8926634023
ELMo vectors with a Multi-layer Perceptron (MLP) neural network model, the evaluation score for this MLP model is: 0.881236842720449
ELMo vectors with a Support Vector Machine (SVM) model, the evaluation score for this SVM model is: 0.883783815908335
ELMo vectors with a simple Logistic Regression (LR) model, the evaluation score for this LR model is: 0.7761904761904763
BERT vectors with a Multi-layer Perceptron (MLP) neural network model, the evaluation score for this MLP model is: 0.6096144315591238
BERT vectors with a Support Vector Machine (SVM) model, the evaluation score for this SVM model is: 0.8851479845833099
BERT vectors with a simple Logistic Regression (LR) model, the evaluation score for this LR model is: 0.8781415572832524
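The scores listed above are F1 values. As a reminder of what that metric measures, here is a small standard-library sketch of the binary F1 score, the harmonic mean of precision and recall. It assumes that label 1 is the class of interest; the exact averaging used by the contest is not stated here and may differ. `f1_score_binary` and the toy labels are illustrative, not taken from the project.

```python
from typing import Sequence


def f1_score_binary(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Compute F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
score = f1_score_binary(y_true, y_pred)
print(round(score, 4))
```

Unlike plain accuracy, F1 does not reward a model for simply predicting the majority class, which matters when one sentiment class is much rarer than the other.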

Future Work

Try other classification models.
Try combining word2vec vectors with the ELMo vectors as features.
