JFall Presentation: Sentiment Analysis of Social Media Posts with Apache Spark

nielsutrecht/jfall-sentiment

Sentiment Analysis of Social Media Posts with Apache Spark

This repository contains the sample source code and presentation from the ignite session I gave at JFall 2015. I also wrote a blog post on the subject, which you can find here.

Presentation

The presentation (as PDF) can be found here.

Spark Hello World

A small runnable example of how to do a word-count analysis is shown in HelloSparkWorld.java.
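
For readers who want to see the shape of a word count before opening HelloSparkWorld.java, here is a minimal sketch of the same split, group and count steps. It deliberately uses java.util.stream instead of Spark so that it runs without any dependencies. The class name WordCountSketch and the sample lines are made up for illustration and are not taken from the repository.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical stand-in for a Spark word count, written with plain Java streams.
class WordCountSketch {

    // Splits every line into lowercase words and counts how often each word occurs.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                // One line becomes many words (the "flatMap" step).
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                // split() can yield an empty first element when a line starts with punctuation.
                .filter(word -> !word.isEmpty())
                // Group equal words and count each group (the "reduce by key" step).
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Hello Spark world", "hello again");
        countWords(lines).forEach((word, count) -> System.out.println(word + ": " + count));
    }
}
```

In Spark the same three steps are distributed across partitions, but the logic per word stays the same.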

Running the analysis

Downloading the data

The 5GB dataset can be downloaded with your favorite torrent client using this link.

You should end up with a RC_2015-01.bz2 file around 5GB in size.

The application.properties file has the default input set to /tmp/RC_2015-01.bz2. If you downloaded the file to a different location, please change the properties file accordingly.

Configuration

The application has two configuration settings that you need to set if their defaults are incorrect. Both are contained in application.properties.

The input property should point to the RC_2015-01.bz2 file you just downloaded. The output property should point to an empty directory. The application will create the full directory path if possible.
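
To make the two settings concrete, the sketch below parses properties text containing an input and an output key with java.util.Properties. The input value is the default mentioned above. The output path /tmp/jfall-output and the class name ConfigSketch are assumptions for illustration only, and how the application itself loads application.properties is not shown here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper that reads the two settings described above.
class ConfigSketch {

    // Parses properties text and returns {input, output}; both keys are required.
    static String[] readPaths(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String input = props.getProperty("input");
        String output = props.getProperty("output");
        if (input == null || output == null) {
            throw new IllegalArgumentException("both 'input' and 'output' must be set");
        }
        return new String[] {input, output};
    }

    public static void main(String[] args) {
        // Example contents; the output directory is an assumed value.
        String example = "input=/tmp/RC_2015-01.bz2\noutput=/tmp/jfall-output\n";
        String[] paths = readPaths(example);
        System.out.println("input:  " + paths[0]);
        System.out.println("output: " + paths[1]);
    }
}
```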

Running the Analysis

You can run the analysis by running the Main class. It should start a Spark context and begin an analysis run. You can then connect to http://localhost:4040/ to follow the progress. Keep in mind that this process takes quite some time: more than one hour on my machine.

First it reads all the JSON, parses it into internal comment structures, and analyses these. The resulting data is stored in a temporary object store location. This isn't strictly needed, but since this part takes by far the most time it's done for convenience: running new reduce operations on this dataset takes a lot less time than going through the entire deserialization again.
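
The idea of parsing once and reusing the result can be sketched with plain Java serialization. This is only an illustration of the pattern, not the application's code: the ParsedComment class, its fields and the temporary file are hypothetical. Spark RDDs offer saveAsObjectFile for this kind of intermediate storage, but whether the application uses exactly that call is not shown here.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: store parsed comments once, reload them for later runs.
class ObjectStoreSketch {

    // Assumed shape of an already parsed and analysed comment.
    static final class ParsedComment implements Serializable {
        private static final long serialVersionUID = 1L;
        final String body;
        final int sentiment;

        ParsedComment(String body, int sentiment) {
            this.body = body;
            this.sentiment = sentiment;
        }
    }

    // Writes the parsed comments to a file so the expensive parsing is not repeated.
    static void save(List<ParsedComment> comments, Path file) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(new ArrayList<>(comments));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the parsed comments back without touching the original JSON.
    @SuppressWarnings("unchecked")
    static List<ParsedComment> load(Path file) {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (List<ParsedComment>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Saves to a temporary file, loads it again, and deletes the file afterwards.
    static List<ParsedComment> roundTrip(List<ParsedComment> comments) {
        try {
            Path file = Files.createTempFile("parsed-comments", ".bin");
            try {
                save(comments, file);
                return load(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<ParsedComment> parsed = new ArrayList<>();
        parsed.add(new ParsedComment("great talk", 1));
        parsed.add(new ParsedComment("terrible wifi", -1));
        for (ParsedComment c : roundTrip(parsed)) {
            System.out.println(c.sentiment + " " + c.body);
        }
    }
}
```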

The object file is then used to do the count and sentiment reductions, which are then written to their corresponding files.
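
As a rough illustration of what a count reduction and a sentiment reduction could look like, the sketch below runs both over the same list of scored comments. Everything specific in it is an assumption: the ScoredComment class, the grouping key, and the use of a summed integer score are hypothetical and not taken from the repository. Plain Java streams stand in for Spark so the example runs on its own.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch of two reductions over the same parsed data.
class ReductionSketch {

    // Assumed shape: a grouping key plus an integer sentiment score.
    static final class ScoredComment {
        final String key;
        final int sentiment;

        ScoredComment(String key, int sentiment) {
            this.key = key;
            this.sentiment = sentiment;
        }
    }

    // Count reduction: number of comments per key.
    static Map<String, Long> countByKey(List<ScoredComment> comments) {
        return comments.stream()
                .collect(Collectors.groupingBy(c -> c.key, TreeMap::new, Collectors.counting()));
    }

    // Sentiment reduction: summed sentiment score per key.
    static Map<String, Integer> sentimentByKey(List<ScoredComment> comments) {
        return comments.stream()
                .collect(Collectors.groupingBy(c -> c.key, TreeMap::new,
                        Collectors.summingInt(c -> c.sentiment)));
    }

    public static void main(String[] args) {
        List<ScoredComment> comments = Arrays.asList(
                new ScoredComment("java", 2),
                new ScoredComment("java", -1),
                new ScoredComment("spark", 3));
        System.out.println("counts:    " + countByKey(comments));
        System.out.println("sentiment: " + sentimentByKey(comments));
    }
}
```

Because both reductions read the same stored data, adding a third reduction later only costs one more pass over the parsed comments.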
