Hadoop Examples

Some simple, kinda introductory projects based on Apache Hadoop, meant as guides to make the MapReduce model look a bit less weird or boring.

Preparations & Prerequisites

  • The latest stable version of Hadoop, or at least the one used here (3.3.0).
  • A single-node setup is enough. You can also run the applications on a local cluster or a cloud service, with the needed changes to the map splits and the number of reducers, of course (see the driver sketch after this list).
  • And, of course, a somewhat recent version of Java. I have openjdk 11.0.5 installed on my 32-bit Ubuntu 16.04 system, and if I can do it, so can you.
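
For reference, both of those knobs live in the job driver. Here's a minimal sketch, with placeholder class and job names rather than anything from this repo:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExampleDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // cap the size of a map split (and so raise the number of mappers)
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024);

        Job job = Job.getInstance(conf, "example job");
        job.setJarByClass(ExampleDriver.class);
        // job.setMapperClass(...) and job.setReducerClass(...) go here, per project
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(1); // bump this up on a real cluster

        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```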

Projects

Each project comes with its very own:

  • input data (.csv, .tsv, or plain text files in a folder, ready to be copied to HDFS).
  • an execution guide (found in the source code of each project, but heavily dependent on your setup of Java and environment variables, so if the guide doesn't work you can always google/yahoo/bing/altavista your way to execution).

The projects featured in this repo are:

Calculating the average price of houses for sale by zipcode.
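
The gist of the average-by-key pattern, as a rough sketch (the column layout below is an assumption, not the repo's actual schema):

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AveragePriceExample {
    // assumed CSV layout: zipcode in column 0, price in column 1
    public static class PriceMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] cols = value.toString().split(",");
            context.write(new Text(cols[0]), new DoubleWritable(Double.parseDouble(cols[1])));
        }
    }

    public static class AvgReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text zipcode, Iterable<DoubleWritable> prices, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            int count = 0;
            for (DoubleWritable p : prices) {
                sum += p.get();
                count++;
            }
            context.write(zipcode, new DoubleWritable(sum / count));
        }
    }
}
```

The same sum-and-count skeleton, minus the final division, covers the bank-transfer and medal-tally projects below.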

A typical "sum-it-up" example where for each bank we calculate the number and the sum of its transfers.

The typical case of finding the max recorded temperature for every city.
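
The reduce side boils down to a running max per key. A sketch, with the exact types being an assumption:

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTempReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text city, Iterable<DoubleWritable> temps, Context context)
            throws IOException, InterruptedException {
        double max = Double.NEGATIVE_INFINITY;
        for (DoubleWritable t : temps) {
            max = Math.max(max, t.get()); // keep the largest temperature seen for this city
        }
        context.write(city, new DoubleWritable(max));
    }
}
```

Swap the max for a min (comparing planting dates, say) and the same shape handles the oldest-tree project further down.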

An interesting application of working on Olympic game stats in order to count the total gold, silver, and bronze medal wins of every athlete.

Just a plain old normalization example for a bunch of students and their grades.

Finding the oldest tree per city district. Child's play.

A bit more challenging than the rest. Every key character (A-E) has 3 numbers as values: two negatives and one positive. We just calculate the score for every character based on the expression character_score = pos / (-1 * (neg_1 + neg_2)).
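
A sketch of how the reduce step could apply that expression (how the values are told apart is an assumption):

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ScoreReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text character, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double pos = 0;
        double negSum = 0;
        for (DoubleWritable v : values) {
            if (v.get() >= 0) {
                pos = v.get();     // the single positive value
            } else {
                negSum += v.get(); // accumulate the two negatives
            }
        }
        // character_score = pos / (-1 * (neg_1 + neg_2))
        context.write(character, new DoubleWritable(pos / (-1 * negSum)));
    }
}
```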

A simple way to calculate the symmetric difference between the records of two files, based on each record's ID.
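
One way to sketch it: key every record by its ID and keep only the records whose ID shows up in one of the two files but not both (this assumes an ID appears at most once per file):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SymmetricDifferenceExample {
    public static class IdMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String record = value.toString();
            String id = record.split(",")[0]; // assume the ID is the first column
            context.write(new Text(id), new Text(record));
        }
    }

    public static class DiffReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text id, Iterable<Text> records, Context context)
                throws IOException, InterruptedException {
            List<String> seen = new ArrayList<>();
            for (Text r : records) {
                seen.add(r.toString());
            }
            if (seen.size() == 1) { // present in only one of the two files
                context.write(id, new Text(seen.get(0)));
            }
        }
    }
}
```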

Filtering out patient records whose PatientCycleNum column equals 1 and whose Counseling column equals No.
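
As a rough sketch, this can be a map-only job (run with job.setNumReduceTasks(0)); the column positions below are made up for illustration:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PatientFilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] cols = value.toString().split(",");
        // assumed positions: PatientCycleNum in column 3, Counseling in column 4
        boolean filteredOut = cols[3].equals("1") && cols[4].equals("No");
        if (!filteredOut) {
            context.write(value, NullWritable.get()); // keep the record as-is
        }
    }
}
```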

Reading a number of files with multiple lines and converting them into key-value pairs, with each file's name as the key and each file's content as the value.
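
One way to get that effect without a custom RecordReader is to make the text input non-splittable, buffer each file's lines in the mapper, and emit a single pair in cleanup(). A sketch, not necessarily the approach used in the project:

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class FileToPairExample {
    // Prevent splitting so one map task handles exactly one whole file.
    public static class NonSplittableTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }
    }

    public static class FileToPairMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final StringBuilder content = new StringBuilder();

        @Override
        protected void map(LongWritable offset, Text line, Context context) {
            content.append(line.toString()).append('\n'); // buffer every line of the file
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            context.write(new Text(fileName), new Text(content.toString()));
        }
    }
}
```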

The most challenging yet. Term frequency (TF) is calculated from 5 input documents. The goal is to find the document with the max TF for each word, along with how many documents contain that word.

A simple merge of WordCount and TopN examples to find the 10 most used words in 5 input documents.
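
The TopN half usually hinges on a reducer that only keeps the best entries it has seen. A sketch, assuming a single reducer fed (word, count) pairs by a preceding WordCount pass:

```java
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TopTenReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // sorted by count; in this simplified sketch, words with equal counts overwrite each other
    private final TreeMap<Integer, String> top = new TreeMap<>();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context) {
        int total = 0;
        for (IntWritable c : counts) {
            total += c.get();
        }
        top.put(total, word.toString());
        if (top.size() > 10) {
            top.remove(top.firstKey()); // evict the smallest count
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // emit from highest to lowest count
        for (Map.Entry<Integer, String> e : top.descendingMap().entrySet()) {
            context.write(new Text(e.getValue()), new IntWritable(e.getKey()));
        }
    }
}
```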


Check out the equivalent Spark Examples here.
