Sentiment Analysis of Social Media Posts with Apache Spark

This repository contains the sample source code and presentation used in the ignite session I gave at JFall 2015. I also wrote a blog post on the subject, which you can find here.

Presentation

The presentation (as PDF) can be found here.

Spark Hello World

A small runnable example of a word-count analysis is shown in HelloSparkWorld.java.
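
For readers who want to see the shape of a word count before opening the file, here is a minimal sketch. It is not the content of HelloSparkWorld.java: it uses plain `java.util.stream` as a stand-in for Spark, so it runs without any dependencies. The class name `WordCountSketch` and the tokenisation rule (lowercase, split on non-word characters) are my own assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-in for a Spark word count, using only the JDK.
class WordCountSketch {

    // Splits every line into lowercase words and counts how often each word occurs.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                // "flat map" step: one line becomes many words
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                // split can yield an empty first token when a line starts with punctuation
                .filter(word -> !word.isEmpty())
                // "reduce by key" step: group identical words and count them
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // prints {hello=2, spark=1, world=2}
        System.out.println(countWords(Arrays.asList("Hello Spark world", "hello world")));
    }
}
```

The same split-then-count-per-key structure is what a distributed word count does, only spread over partitions of the input.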

Running the analysis

Downloading the data

The 5GB dataset can be downloaded with your favorite torrent client using this link.

You should end up with a RC_2015-01.bz2 file around 5GB in size.

The application.properties file has the default input set to /tmp/RC_2015-01.bz2. If you downloaded the file to a different location, please change the properties file accordingly.

Configuration

The application has two configuration settings, both contained in application.properties, that you need to change if their defaults do not match your setup.

The input property should point to the RC_2015-01.bz2 file you just downloaded. The output property should point to an empty directory; the application will create the full directory path if possible.
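
As an illustration of the two settings, the sketch below parses a properties text with `java.util.Properties` and fails early when a key is missing. The keys `input` and `output` and the `/tmp/RC_2015-01.bz2` default come from this README; the class `ConfigSketch`, its methods, and the `/tmp/sentiment-output` directory are hypothetical and only serve the example. How the application itself loads its configuration may differ.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper, not part of the repository.
class ConfigSketch {

    // Parses text in .properties syntax (key=value per line).
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Returns the value of a mandatory setting as a path, or fails with a clear message.
    static Path requiredPath(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("Missing property: " + key);
        }
        return Paths.get(value.trim());
    }

    public static void main(String[] args) {
        // The output directory below is an assumed example value.
        Properties props = parse("input=/tmp/RC_2015-01.bz2\noutput=/tmp/sentiment-output\n");
        System.out.println("input:  " + requiredPath(props, "input"));
        System.out.println("output: " + requiredPath(props, "output"));
    }
}
```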

Running the Analysis

You can run the analysis by running the Main class. It should start a Spark context and begin an analysis run. You can then connect to http://localhost:4040/ to see the progress. Keep in mind that this process takes quite some time: more than an hour on my machine.

First, it reads all the JSON, parses it into internal comment structures, and analyses these. The resulting data is stored in a temporary object store location. This isn't strictly needed, but since this part takes by far the most time, it's done for convenience: running new reduce operations on this dataset takes a lot less time than going through the entire deserialization again.
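
The parse-once, reuse-later idea can be sketched without Spark. The example below uses plain Java serialization to write the parsed records to a cache file on the first run and to read them back on later runs. The class `ParsedCacheSketch` and its `loadOrCompute` method are hypothetical; the project itself presumably relies on Spark's own object-file support rather than on code like this.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.function.Supplier;

// Hypothetical illustration of caching an expensive parse step on disk.
class ParsedCacheSketch {

    // Returns the cached records if the cache file exists; otherwise runs the
    // expensive parse once, stores the result in the cache file, and returns it.
    static <T extends Serializable> ArrayList<T> loadOrCompute(Path cacheFile, Supplier<ArrayList<T>> parse) {
        if (Files.exists(cacheFile)) {
            try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(cacheFile))) {
                @SuppressWarnings("unchecked")
                ArrayList<T> cached = (ArrayList<T>) in.readObject();
                return cached;
            } catch (IOException | ClassNotFoundException e) {
                throw new IllegalStateException("Could not read cache " + cacheFile, e);
            }
        }
        ArrayList<T> parsed = parse.get();
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(cacheFile))) {
            out.writeObject(parsed);
        } catch (IOException e) {
            throw new IllegalStateException("Could not write cache " + cacheFile, e);
        }
        return parsed;
    }

    public static void main(String[] args) {
        Path cache = Paths.get(System.getProperty("java.io.tmpdir"), "parsed-comments.bin");
        // The supplier stands in for the slow JSON parsing and analysis step.
        ArrayList<String> records = loadOrCompute(cache,
                () -> new ArrayList<>(Arrays.asList("first comment", "second comment")));
        System.out.println(records.size() + " records available");
    }
}
```

On a second run the supplier is never called, which is the time saving described above. Deleting the cache file forces a fresh parse.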

The object file is then used to do the count and sentiment reductions, which are then written to their corresponding files.
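
To make the two reductions concrete, here is a small sketch of a count and a mean-sentiment reduction over already analysed records. Everything in it is an assumption made for illustration: the `Analysed` record, its `group` key and integer `sentiment` score, and the choice of a mean as the sentiment aggregate. The actual keys and scoring used by the application are defined in the source code, and it runs these reductions in Spark rather than with Java streams.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration of per-key count and sentiment reductions.
class ReductionSketch {

    // Assumed shape of an analysed record: a grouping key plus a sentiment score.
    static final class Analysed {
        final String group;
        final int sentiment;

        Analysed(String group, int sentiment) {
            this.group = group;
            this.sentiment = sentiment;
        }
    }

    // Count reduction: number of records per group.
    static Map<String, Long> countPerGroup(List<Analysed> records) {
        return records.stream()
                .collect(Collectors.groupingBy(r -> r.group, TreeMap::new, Collectors.counting()));
    }

    // Sentiment reduction: mean sentiment score per group.
    static Map<String, Double> meanSentimentPerGroup(List<Analysed> records) {
        return records.stream()
                .collect(Collectors.groupingBy(r -> r.group, TreeMap::new,
                        Collectors.averagingInt(r -> r.sentiment)));
    }

    public static void main(String[] args) {
        List<Analysed> records = Arrays.asList(
                new Analysed("a", 1), new Analysed("a", -1), new Analysed("b", 2));
        System.out.println(countPerGroup(records));
        System.out.println(meanSentimentPerGroup(records));
    }
}
```

Both reductions read the same input, which is why keeping the analysed records around makes adding a new reduction cheap.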
