
In this project, we used both Hadoop/MapReduce and Spark for distributed computing. The first part performed a series of operations using Mapper and Reducer Java classes run on a Hadoop cluster. The second part performed similar operations on Spark.

Distributed-Systems-Project5-Hadoop-and-Spark

Project 5 - Distributed Systems

Project Objective:

  1. Use a Hadoop cluster and Mapper and Reducer Java classes to perform data processing operations.
  2. Use Spark to perform data processing operations.

Tasks:

  1. Complete a series of 7 different tasks on a Hadoop cluster, each requiring the Mapper and Reducer to be modified, compiled, packaged into a JAR, and executed on Hadoop. An example of one of these tasks is shown below.
  2. Use an Apache Spark cluster to perform a series of analytics on a large text file.

Topics/Skills covered:

  • Distributed computing
  • Cloud computing
  • MapReduce
  • Hadoop
  • Spark
  • JARs

Demonstration of completed tasks:

Hadoop tasks:

All 7 tasks in Hadoop followed the same format: modify the Java file to compute the desired metric, compile it, package it into a JAR, and run it on the cluster. The example below is Task 2, which counts the words that contain the string "fact":

Setup

mkdir Task2

cd Task2

Copy WordCount.java:

cp -R /home/student134/Project5/Part_1/Task0/WordCount.java /home/student134/Project5/Part_1/Task2

Changes to Java code:

    public static class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable>
    {
            private final static IntWritable one = new IntWritable(1);

            private Text word = new Text();	

            @Override
            //map method modified from original WordCount.java file provided by instructors
            //if statement testing for "fact" added

            public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
            {
                    //tokenize string
                    String line = value.toString();
                    StringTokenizer tokenizer = new StringTokenizer(line);

                    //parse through words
                    while(tokenizer.hasMoreTokens())
                    {
                            word.set(tokenizer.nextToken());

                            //Only major change of WordCount original code
                            //This tests if the word contains the "fact" string
                            if(word.toString().toLowerCase().contains("fact")){
                                    context.write(word, one);
                            }
                    }
            }
    }
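
The excerpt above shows only the modified Mapper. For reference, here is a minimal sketch of the rest of WordCount.java (the unchanged sum Reducer and the driver that registers both classes). The name WordCountReduce, the driver details, and the usual Hadoop imports (org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.Path, org.apache.hadoop.mapreduce.*, and the lib.input/lib.output FileInputFormat/FileOutputFormat) are assumptions based on the standard WordCount template, not a copy of the course-provided file:

    public static class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable>
    {
            @Override
            //reduce method as in the standard WordCount template (assumed unchanged for this task)
            public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
            {
                    //sum the 1s emitted by the Mapper for each word containing "fact"
                    int sum = 0;
                    for (IntWritable value : values)
                    {
                            sum += value.get();
                    }
                    context.write(key, new IntWritable(sum));
            }
    }

    //driver: registers the Mapper and Reducer and takes the input/output paths
    //from the command line (words.txt and the Task2 output directory)
    public static void main(String[] args) throws Exception
    {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "wordcount");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(WordCountMap.class);
            job.setReducerClass(WordCountReduce.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.waitForCompletion(true);
    }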

More setup

mkdir wordCount_classes

Compile Java

javac -classpath /usr/local/hadoop/hadoop-core-1.2.1.jar:./wordCount_classes -d wordCount_classes WordCount.java

Create Jar

jar -cvf wordcount.jar -C wordCount_classes/ .

Execute JAR

hadoop jar /home/student134/Project5/Part_1/Task2/wordcount.jar org.myorg.WordCount /user/student134/input/words.txt /user/student134/output/Task2

View Output:

hadoop dfs -cat /user/student134/output/Task2/part-r-00000

Merge output to local files:

hadoop dfs -getmerge /user/student134/output/Task2 Task2Output

Spark Demonstration:

Similarly, the Spark section of this project involved analyzing a large text file in a number of different ways on a Spark cluster. The images below demonstrate that the Spark cluster was running correctly and performing the requested analysis.

The text file is a .txt file of The Tempest:

(screenshot)

Spark library added:

(screenshot)

Screenshot of the Tempest Analytics file performing analytics on the Spark cluster:

(screenshot)
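
The Tempest Analytics code itself is not reproduced in this README. As a rough sketch of the kind of word-level analysis run on the cluster, a Spark word count written against the Java API might look like the following; the class name TempestWordCount, the file paths, and the use of the Spark 2.x Java API are assumptions, not the actual project code:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class TempestWordCount
    {
            public static void main(String[] args)
            {
                    //connect to the Spark cluster; the master URL comes from spark-submit
                    SparkConf conf = new SparkConf().setAppName("TempestWordCount");
                    JavaSparkContext sc = new JavaSparkContext(conf);

                    //load the text of The Tempest and split each line into words
                    JavaRDD<String> lines = sc.textFile("tempest.txt");
                    JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator());

                    //count occurrences of each word, mirroring the Hadoop WordCount logic
                    JavaPairRDD<String, Integer> counts = words
                            .mapToPair(word -> new Tuple2<>(word.toLowerCase(), 1))
                            .reduceByKey((a, b) -> a + b);

                    counts.saveAsTextFile("tempest_counts");
                    sc.stop();
            }
    }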
