Topic Modelling with NLP & Latent Dirichlet Allocation on Customer Reviews

The purpose of this project is to leverage reviews of major delivery companies operating in the UK and to perform NLP tasks that analyze different aspects of those reviews, such as the sentiment, the most common words, probability distributions across word sequences, and more.

Introduction

In this project we are going to explore the world of logistics companies and the issues that they might be facing. Specifically, we are going to focus on analyzing data about a few of the most well-known delivery companies in the UK, namely Deliveroo, UberEats, Just Eat and Stuart. To do that, we are going to use the reviews that can be found online across many different platforms, especially platforms that specialize in collecting customer reviews and opinions for a plethora of companies and services.

The first iteration of this project uses the reviews that can be found on the well-known consumer review website TrustPilot. Even though the website already provides some API functionality, we are going to write our own web-scraping tool to retrieve the data in the format that we want. We will attempt to collect as many reviews as possible and then use them to identify interesting findings in the text. For example, we will try to identify the sentiment across all reviews for a specific company, the most common words and bigrams (i.e. pairs of words that tend to appear next to each other) in the reviews, and more. Finally, we will implement a Latent Dirichlet Allocation (LDA) model to try to identify the topics that these reviews correspond to. Note that the LDA model is going to be implemented twice, once for the negative and once for the positive reviews.
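
As a rough illustration of the bigram analysis and the negative/positive split described above, the following minimal sketch counts adjacent word pairs separately for each group. It is not the project's implementation: the function names, the rating threshold of 3, the simple regex tokenizer, and the sample reviews are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def top_bigrams(texts, n=3):
    """Count adjacent word pairs across all texts and return the n most common."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)


def split_by_rating(reviews, threshold=3):
    """Split (rating, text) pairs into negative and positive texts.

    Assumption: ratings below the threshold are negative, ratings above it
    are positive, and ratings equal to it are left out of both groups.
    """
    negative = [text for rating, text in reviews if rating < threshold]
    positive = [text for rating, text in reviews if rating > threshold]
    return negative, positive


# Made-up sample data, for illustration only.
sample_reviews = [
    (1, "Food arrived cold and the driver was late"),
    (1, "Order arrived cold, customer service was useless"),
    (5, "Fast delivery and friendly driver"),
    (4, "Fast delivery, great food"),
]

negative, positive = split_by_rating(sample_reviews)
print(top_bigrams(negative, n=1))
print(top_bigrams(positive, n=1))
```

Each group could then be passed to its own topic model, which mirrors the idea of fitting LDA once on the negative and once on the positive reviews.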

Project Roadmap

graph   LR
    A[Build a tool to connect to web sources APIs] -->|Get reviews from web| B[Clean reviews]
    B --> D[Knowledge Graphs]
    B --> F[Unsupervised Clustering]
    B --> C(Sentiment Analysis)
    B --> |Identify topic of review| E[Topic Extraction]
    E -->  |Train Model| I[Assign Topic to new instances]
    C --> |Train Model| J[Sentiment Classifier]
    I --> K[Build UI]
    J --> K[Build UI]

Version 1.0: (Most recent version of the Notebook can be found here: V1.0 Notebook)

Implemented v1.0 of the web scraper and data collection API
Developed a standard LDA model for topic identification
Created first version of visualizations to present the results

Web-Scraping Tool and Data Retrieval

To collect the reviews directly from the TrustPilot website, we have created a web-scraping tool that automates this process across different companies and their corresponding reviews. The tool iterates over the pages of the website and collects the reviews and any other relevant information, storing the output in CSV files. Moreover, we have packaged the tool into a Python library. Hence, if you are thinking of working on a similar project where you need to retrieve data from TrustPilot, you can install the package that you can find here. As of January 2023, the package contains the main functionality needed to collect several pieces of information from the website, such as the review text, reviewer_id, date of the review, user rating, and more.

For the first iteration of the project, we have built the aforementioned package with the functionality to retrieve the following information - which will also be the features in our dataset:

Company: Name of the company that we are examining (e.g. Deliveroo, UberEats, Just Eat, Stuart)
Id: The unique identifier of the review
Reviewer_Id: Unique id of the reviewer/user
Title: Title of the review
Review: The text of the review, as submitted by the reviewer
Date: Day of review submission
Rating: The rating given to the company, as submitted by the reviewer

Input Schema

Column/Feature	Type	Description
Company	NVARCHAR	Name of the delivery company
Id	NVARCHAR	Id of the review
Reviewer_Id	NVARCHAR	Id of the reviewer
Title	NVARCHAR	Title of the review
Review	NVARCHAR	The review itself - free text field
Date	DATE	Day that the review was submitted
Rating	BIGINT	Rating (1-5)
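
To make the schema above concrete, here is a minimal sketch of how one CSV row could be parsed into a typed record. The `ReviewRecord` class and `parse_row` helper are hypothetical names introduced for this example, and the ISO `YYYY-MM-DD` date format and the sample row are assumptions; the actual files may differ.

```python
import csv
import datetime as dt
import io
from dataclasses import dataclass


@dataclass
class ReviewRecord:
    """One review, mirroring the columns of the input schema."""
    company: str
    id: str
    reviewer_id: str
    title: str
    review: str
    date: dt.date
    rating: int


def parse_row(row):
    """Convert a CSV row (dict of strings) into a ReviewRecord.

    Raises ValueError if the rating is outside the 1-5 range.
    """
    rating = int(row["Rating"])
    if not 1 <= rating <= 5:
        raise ValueError(f"Rating out of range: {rating}")
    return ReviewRecord(
        company=row["Company"],
        id=row["Id"],
        reviewer_id=row["Reviewer_Id"],
        title=row["Title"],
        review=row["Review"],
        date=dt.date.fromisoformat(row["Date"]),  # assumes YYYY-MM-DD
        rating=rating,
    )


# Made-up sample row, for illustration only.
sample_csv = io.StringIO(
    "Company,Id,Reviewer_Id,Title,Review,Date,Rating\n"
    "Deliveroo,r-001,u-042,Late order,Food arrived cold,2023-01-15,1\n"
)
records = [parse_row(row) for row in csv.DictReader(sample_csv)]
print(records[0])
```

Validating the rating range at load time keeps malformed rows out of the later sentiment and topic modelling steps.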

Data Retrieval API

To get reviews from the TrustPilot website, we use a custom-made web-scraping tool. The tool iterates over the pages of the website and collects the reviews and any other relevant information, storing the output in CSV files.
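
The page-by-page collection loop can be sketched roughly as follows. This is not the code in data_retriever.py: `collect_reviews` and `fake_fetch_page` are hypothetical names, and the fetch function is a stand-in that returns made-up rows so that the example runs without network access. A real scraper would download and parse each page instead.

```python
import csv
import io

FIELDS = ["Company", "Id", "Reviewer_Id", "Title", "Review", "Date", "Rating"]


def collect_reviews(fetch_page, first_page, num_pages, out_file):
    """Fetch consecutive pages and write their reviews to a CSV file object.

    fetch_page is any callable that takes a page number and returns a list
    of row dicts. Returns the number of rows written.
    """
    writer = csv.DictWriter(out_file, fieldnames=FIELDS)
    writer.writeheader()
    written = 0
    for page in range(first_page, first_page + num_pages):
        rows = fetch_page(page)
        if not rows:
            break  # stop early once a page has no reviews
        writer.writerows(rows)
        written += len(rows)
    return written


def fake_fetch_page(page):
    """Stand-in for a real page scraper: two pages of one made-up review each."""
    if page > 2:
        return []
    return [{
        "Company": "ExampleCo",
        "Id": f"r-{page:03d}",
        "Reviewer_Id": f"u-{page:03d}",
        "Title": "Sample title",
        "Review": f"Sample review text from page {page}",
        "Date": "2023-01-15",
        "Rating": 3,
    }]


buffer = io.StringIO()
total = collect_reviews(fake_fetch_page, first_page=1, num_pages=5, out_file=buffer)
print(total)
print(buffer.getvalue())
```

Passing the fetch function as an argument keeps the loop testable and lets the same loop be reused for different companies.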

Running Guide

Set up the appropriate configuration in config.json (example). The config needs to be populated with the following metadata:
- source_url: Main domain URL
- starting_page: Domain subpath to a specific reviews page
- steps: Defines number of pages to iterate over
- company: Company/Service of interest
Execute the Python retriever script:
python data_retriever.py
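
For orientation, the sketch below shows what a config with these four keys might look like and how it could be turned into a list of page URLs. Only the key names come from the guide above. The values are placeholders, `load_config` and `page_urls` are hypothetical helpers, and the `?page=N` query parameter is an assumption about how pages are addressed, not a description of what data_retriever.py does.

```python
import json
from urllib.parse import urljoin

REQUIRED_KEYS = ("source_url", "starting_page", "steps", "company")

# Placeholder values, for illustration only.
example_config = """
{
  "source_url": "https://www.example.com",
  "starting_page": "/review/some-company",
  "steps": 3,
  "company": "ExampleCo"
}
"""


def load_config(text):
    """Parse the JSON config and check that all required keys are present."""
    config = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"Missing config keys: {missing}")
    return config


def page_urls(config):
    """Build one URL per page to visit (assumes a ?page=N query parameter)."""
    base = urljoin(config["source_url"], config["starting_page"])
    return [f"{base}?page={n}" for n in range(1, int(config["steps"]) + 1)]


config = load_config(example_config)
for url in page_urls(config):
    print(url)
```

Checking for missing keys up front gives a clear error before any pages are requested.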

gpsyrou/Text_Analysis_of_Consumer_Reviews