
Hadoop Tweet Wordcounter Job

What is it?

A tweet retriever & Hadoop wordcounter job, built to demonstrate Hadoop HDFS usage for a 'CSC338: Parallel & Distributed Processing' group project at Missouri State University.

Why use it?

For sentiment analysis of the most recent tweets containing a particular hashtag. For example, if you run the job on tweets containing the hashtag #food, you may be able to draw conclusions about the most-discussed foods within the window in which the tweets were retrieved.

How it works

There are two major steps in running the project (an end-to-end example follows the list):

  1. Retrieve tweets via the Twitter API by hashtag and write them to a text file
  2. Run a wordcounter script with Hadoop to count the number of word occurrences in the retrieved tweets file
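For example, a complete run for the hashtag #food looks like this, using the default file names described in the setup steps below:

    # Step 1: stream tweets containing #food into fetched_tweets.txt (press Ctrl-C to stop)
    ./get-tweets.sh '#food'    # quoted so the shell does not treat # as a comment

    # Step 2: copy the file to HDFS and run the Hadoop wordcount job on it
    ./job.sh fetched_tweets.txt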

Steps to set up & run:

  1. Make sure you have Hadoop 2.7.3 and Python 3 installed on the server or local machine that will host the project.

  2. Run the tweet retriever script with ./get-tweets.sh #hashtag-word, where #hashtag-word is your desired hashtag. To quit the script once the desired number of tweets has been retrieved, press Ctrl-C.

  3. To copy the text file to HDFS and run the Hadoop job on it, run ./job.sh [textfilename.txt]. NOTE: The default text file created by the tweet retriever in step 2 is 'fetched_tweets.txt', so this should be used unless you plan to run the Hadoop wordcounter on a different text file. A sketch of what such a job script boils down to follows this list.
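For reference, here is a minimal sketch of what a job script like job.sh typically boils down to on a single-node setup. The HDFS paths and the mapper.py/reducer.py file names are assumptions for illustration, not taken from this repository:

    # Copy the tweets file from the local working directory into HDFS
    hdfs dfs -mkdir -p /wordcount-input
    hdfs dfs -put fetched_tweets.txt /wordcount-input

    # Run the streaming wordcount; the wildcard avoids hard-coding the jar version
    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -input /wordcount-input/fetched_tweets.txt \
        -output /wordcount-output \
        -mapper "python3 $PWD/mapper.py" \
        -reducer "python3 $PWD/reducer.py"

    # Pull the results back into the local completed-wordcount folder
    hdfs dfs -get /wordcount-output completed-wordcount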

Viewing the Output:

  1. Assuming the job completes successfully, the output will be placed in the local folder where the repository files are located.
  2. Open and view the text file inside the '/completed-wordcount' directory (see the example below). Words are sorted from most common occurrences to least.
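For instance, to peek at the most frequent words from the command line:

    # Show the top 20 lines of the job output (the exact filename inside the
    # completed-wordcount folder may differ from this assumed Hadoop part file)
    head -n 20 completed-wordcount/part-*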

Notes:

  • There is a serial wordcount script that can be used in lieu of the Hadoop job script. Run python serial-wordcount.py [filename.txt] instead of the job.sh script.
  • Commands that manipulate HDFS begin with hdfs dfs, followed by the subcommand to execute and any further arguments (a sample session follows these notes)
  • hdfs dfs -ls [directory name] can be used to verify files were copied to HDFS
  • If you need to create a directory on HDFS, run hdfs dfs -mkdir /directory-name
  • You can type 'hadoop-streaming-*.jar' instead of remembering the exact version number when referencing the Hadoop streaming jar file
  • $PWD gives the current working directory
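Putting those notes together, a typical HDFS session looks like this (the directory name is an example only):

    hdfs dfs -mkdir /wordcount-input                      # create a directory on HDFS
    hdfs dfs -put fetched_tweets.txt /wordcount-input     # copy a local file into it
    hdfs dfs -ls /wordcount-input                         # verify the file was copied
    hdfs dfs -cat /wordcount-input/fetched_tweets.txt | head    # spot-check the contents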

External Resources & Documentation:

  1. The Mapper & Reducer are based on this Hadoop Application Walkthrough
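That walkthrough follows the classic Hadoop Streaming wordcount pattern, in which the mapper and reducer are small Python scripts that read stdin and write tab-separated key/value pairs to stdout. A minimal sketch of that pattern (an illustration of the technique, not the exact code in this repository):

    # ---- mapper.py: emit "word<TAB>1" for every word read from stdin ----
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # ---- reducer.py: streaming sorts by key, so equal words arrive adjacent ----
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.strip().rsplit("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

Note that the reducer's output is ordered by word; the most-to-least-common ordering described above is typically produced by a final pass such as sort -k2 -nr over the output file.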
