Twitter Sentiment Analysis With Spark, MongoDB and Google Cloud

In this two-part blog post I go over the classic problem of Twitter sentiment analysis. I found labeled Twitter data with 1.6 million tweets on the Kaggle website here. Through this analysis I'll touch on a few different topics related to natural language processing and big data more generally. While 1.6 million tweets is not a substantial amount of data and does not require working with Spark, I wanted to use Spark for ETL as well as machine learning since I haven't seen too many examples of how to do so in the context of sentiment analysis.

Part 1: ETL With PySpark and MongoDB

In the first part I go over Extract-Transform-Load (ETL) operations on text data using PySpark and MongoDB, expanding on some details of Spark along the way. I then show how one can explore the data in the Mongo database using Compass and PyMongo. Spark is a great platform for performing batch ETL work on both structured and unstructured data. MongoDB is a document-based NoSQL database that is fast, easy to use, allows for flexible schemas, and is perfect for working with text data. PySpark and MongoDB work well together, allowing for fast, flexible ETL pipelines on large semi-structured data like tweets coming from Twitter. While Part 1 is presented as a Jupyter notebook, the ETL job was submitted as a script BasicETL.py in the directory ETL.
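For illustration, a minimal sketch of such a PySpark-to-MongoDB ETL job could look like the following. The file name, MongoDB URI, database/collection names, and column names are assumptions based on the Sentiment140 Kaggle dataset, and the mongo-spark-connector version would need to match your Spark/Scala build; see BasicETL.py for the actual script.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical MongoDB URI; database "twitter_db" and collection "tweets" are placeholders.
MONGO_URI = "mongodb://127.0.0.1:27017/twitter_db.tweets"

spark = (SparkSession.builder
         .appName("BasicETL")
         .config("spark.mongodb.output.uri", MONGO_URI)
         # MongoDB Spark connector; the artifact version must match your Spark/Scala build.
         .config("spark.jars.packages",
                 "org.mongodb.spark:mongo-spark-connector_2.11:2.4.1")
         .getOrCreate())

# Extract: read the raw CSV (file and column names assumed from the Kaggle Sentiment140 dataset).
df = (spark.read
      .csv("training.1600000.processed.noemoticon.csv", inferSchema=True)
      .toDF("target", "id", "date", "flag", "user", "text"))

# Transform: keep only the text and a 0/1 label (the raw target is 0 = negative, 4 = positive).
clean = (df.withColumn("sentiment", F.when(F.col("target") == 4, 1).otherwise(0))
           .select("sentiment", "text"))

# Load: write the cleaned tweets into MongoDB.
(clean.write
      .format("com.mongodb.spark.sql.DefaultSource")
      .mode("overwrite")
      .save())

spark.stop()
```

Once loaded, the collection can be browsed in Compass or queried directly with PyMongo, e.g. MongoClient()["twitter_db"]["tweets"].find_one().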

Part 2: Machine Learning With Spark On Google Cloud

In this second part I go over the actual machine learning aspect of sentiment analysis, using SparkML and ML Pipelines to build a basic linear classifier. After building a basic model for sentiment analysis, I'll introduce techniques to improve performance, such as removing stop words and using N-grams. I also introduce a custom Spark Transformer class that uses NLTK to perform stemming. Lastly, I'll review hyper-parameter tuning with cross-validation to optimize the model. Using PySpark on this dataset was a little too much for my personal laptop, so I used Spark on a Hadoop cluster with Google Cloud's Dataproc and Datalab. I'll touch on a few of the details of working on Hadoop and Google Cloud as well.
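As a rough illustration (not the exact code from the notebook), this kind of ML Pipeline with cross-validation might look like the sketch below. The column names ("text", "sentiment") and the parameter grid values are assumptions; the custom NLTK stemming Transformer would slot in as an additional stage after the stop-word removal.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, NGram, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("SentimentML").getOrCreate()

# Assumed columns: "text" holds the raw tweet, "sentiment" the 0/1 label.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
remover   = StopWordsRemover(inputCol="words", outputCol="filtered")
bigrams   = NGram(n=2, inputCol="filtered", outputCol="bigrams")
tf        = HashingTF(inputCol="bigrams", outputCol="tf")
idf       = IDF(inputCol="tf", outputCol="features")
lr        = LogisticRegression(labelCol="sentiment", featuresCol="features")

pipeline = Pipeline(stages=[tokenizer, remover, bigrams, tf, idf, lr])

# Hyper-parameter tuning with 3-fold cross-validation over an illustrative grid.
grid = (ParamGridBuilder()
        .addGrid(tf.numFeatures, [2 ** 16, 2 ** 18])
        .addGrid(lr.regParam, [0.01, 0.1])
        .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="sentiment"),
                    numFolds=3)

# `train` would be a DataFrame of labeled tweets, e.g. read back from MongoDB:
# train = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
# model = cv.fit(train)
```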

Requirements

Part 1

Part 1 was completed on my laptop and therefore all the dependencies were installed using miniconda. The required environment can be created with the command:

conda env create -n sparketl -f environment.yml

Part 2

Part 2 was completed on Google Cloud using the Dataproc image version 1.3. The commands to recreate this environment are in the GCP directory, and the Python dependencies to be loaded onto the Hadoop cluster are in the requirements.txt file.