Mickeyo0o/MachineLearningSentimentAnalysis

Sentiment Analysis of Google Playstore Reviews

This project explores several sentiment analysis methods for classifying Google Playstore text reviews into three categories (Positive, Neutral, and Negative). Specifically, we test a lexicon-based approach, Bayesian networks, k-NN, and fully connected neural networks.

Features

  • Data Preprocessing: We preprocess the review data by removing unnecessary columns, handling missing values, and transforming the text data through techniques such as lowercasing, tokenization, removal of stopwords, and lemmatization.

  • Visualization: We apply Principal Component Analysis (PCA) to the frequency vectors of the comments and plot the result for both the original and the preprocessed data to understand how sentiment is distributed in the dataset. We also check the proportion of each class.

  • Lexicon-based Approach: This approach utilizes predefined lists of words with associated sentiment polarities to classify the sentiment of each review. We analyze different thresholds for determining sentiment polarity and evaluate the accuracy of the classification.

  • Bayesian Network Approach: We employ a probabilistic graphical model that relates the words or features of a review to the sentiment it expresses. The model learns these probabilities from the data and classifies sentiment accordingly. We split the data into training and testing sets, use Multinomial and Bernoulli Naive Bayes classifiers, and evaluate their accuracy, recall, precision, and F1-score.

  • k-NN Approach: We use the k-NN algorithm to classify sentiment based on the similarity of reviews in a feature space. We explore how the number of nearest neighbors affects classification accuracy and recommend a value that balances computational cost against performance.

  • Fully Connected Neural Network Approach: We build the model from Dense layers to classify sentiment and consider additional changes to the dataset to address the class imbalance problem. We also use Dropout layers to reduce the overfitting caused by the low quality of the reviews, which are usually informal or mistyped. We propose other techniques for dealing with these problems and analyze how the model performs with only two categories, Negative and Positive.
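The lexicon-based approach from the list above can be illustrated with a minimal, standard-library-only sketch. It is not the notebook's code. The `LEXICON` and `STOPWORDS` contents, the function names, and the default threshold are all hypothetical and chosen only for illustration. Lemmatization is left out, so inflected forms such as "crashing" are not matched.

```python
import re

# Hypothetical toy lexicon (word -> polarity score); an assumption for this
# sketch, not the lexicon used in the project.
LEXICON = {
    "good": 1.0, "great": 2.0, "love": 2.0,
    "bad": -1.0, "terrible": -2.0, "crash": -1.5,
}

# Hypothetical, very small stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "is", "it", "this", "and", "but", "to", "of"}


def preprocess(text):
    """Lowercase the text, tokenize it into words, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def lexicon_sentiment(text, threshold=0.5):
    """Sum the polarity of known words and map the total to a class.

    Totals above `threshold` are Positive, totals below `-threshold`
    are Negative, and everything in between is Neutral.
    """
    score = sum(LEXICON.get(token, 0.0) for token in preprocess(text))
    if score > threshold:
        return "Positive"
    if score < -threshold:
        return "Negative"
    return "Neutral"


if __name__ == "__main__":
    for review in ["This app is great, I love it!",
                   "The update is terrible",
                   "It opens"]:
        print(review, "->", lexicon_sentiment(review))
```

Raising `threshold` widens the Neutral band, which is the kind of trade-off the threshold analysis examines.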
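The k-NN approach from the list above can likewise be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's implementation. The `TRAIN` reviews are invented toy data, the feature space is assumed to be raw word counts, and cosine similarity is assumed as the similarity measure.

```python
import math
from collections import Counter

# Invented toy training set (review text, label); not taken from the dataset.
TRAIN = [
    ("great app love it", "Positive"),
    ("love this great game", "Positive"),
    ("works fine nothing special", "Neutral"),
    ("it is an okay app", "Neutral"),
    ("terrible app keeps crashing", "Negative"),
    ("crashing all the time terrible", "Negative"),
]


def vectorize(text):
    """Turn a review into a bag-of-words frequency vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(train, text, k=3):
    """Label `text` by majority vote among its k most similar training reviews."""
    query = vectorize(text)
    ranked = sorted(
        train,
        key=lambda item: cosine_similarity(query, vectorize(item[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Varying k shows how the number of neighbors changes the prediction.
    for k in (1, 3, 5):
        print(k, knn_predict(TRAIN, "love this great app", k=k))
```

Each prediction compares the query against every training review, so the cost grows with the size of the training set regardless of `k`.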

Usage

To replicate the analysis and classification results, follow the steps outlined in the notebook and make sure all necessary dependencies are installed, including the nltk library data. An environment.yml file is provided for creating a conda environment that contains all the required packages (the nltk data corpus is not included and must be downloaded separately).

Note

This project was developed as a part of the Machine Learning course.

Contributors

  • Kajetan Sulwiński (siemieniuk)
  • Szymon Siemieniuk (ekohachi22)
  • Mikołaj Marmurowicz (Mickeyo0o)