hadoop-projects

Add the soc-LiveJournal1Adj.txt and userdata.txt files to HDFS, export JAR files from the projects, and run them using the commands listed at the end of this README.
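For example, the input files can be copied to HDFS like this (the paths match the example commands at the end of this README; adjust them for your cluster):

hdfs dfs -put soc-LiveJournal1Adj.txt /user/soc-LiveJournal1Adj.txt

hdfs dfs -put userdata.txt /user/userdata.txt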

Input files:

  1. soc-LiveJournal1Adj.txt
    The input contains the adjacency list and has multiple lines in the following format:

    <userid><TAB><comma-separated list of friends' userids>
    where <userid> is a unique integer ID corresponding to a unique user.

  2. userdata.txt
    The userdata.txt file contains dummy user data consisting of
    column1 : userid
    column2 : firstname
    column3 : lastname
    column4 : address
    column5 : city
    column6 : state
    column7 : zipcode
    column8 : country
    column9 : username
    column10 : date of birth

Program 1: A MapReduce program in Hadoop that implements a simple "Mutual/Common friend list of two friends". This program finds the mutual friends between two friends.

Logic :

Let's take an example with the friend lists of A, B, and C.

Friends of A are B, C, D, E, F.
Friends of B are A, C, F.
Friends of C are A, B, E
So A and B have C, F as their mutual friends. A and C have B, E as their mutual friends. B and C have only A as their mutual friend.

Map Phase :

In the map phase we split each user's friend list and create a pair of that user with each of their friends.

Let's process A's friend list
(Friends of A are B, C, D, E , F)
Key | Value
A,B | B, C, D, E, F
A,C | B, C, D, E, F
A,D | B, C, D, E, F
A,E | B, C, D, E, F
A,F | B, C, D, E, F

Let's process B's friend list
(Friends of B are A, C, F)
Key | Value
A,B | A, C, F
B,C | A, C, F
B,F | A, C, F
We have created a pair of B with each of its friends and sorted each pair alphabetically, so the key (B,A) becomes (A,B).
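
The mapper for this phase might look roughly like the sketch below (class and method names are illustrative, not the repository's actual MutualFriends code):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper for the pairing step described above.
public class FriendPairMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Input line assumed to be: <userid><TAB><comma-separated friend list>
        String[] parts = value.toString().split("\t");
        if (parts.length != 2 || parts[1].isEmpty()) {
            return; // skip users with no friends or malformed lines
        }
        String user = parts[0];
        String[] friends = parts[1].split(",");

        // Emit one record per (user, friend) pair, ordering the two IDs
        // consistently so that (B,A) and (A,B) land on the same reducer key.
        for (String friend : friends) {
            String pairKey = user.compareTo(friend) < 0
                    ? user + "," + friend
                    : friend + "," + user;
            context.write(new Text(pairKey), new Text(parts[1]));
        }
    }
}
```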

Reducer Phase :

After the map phase, the shuffle groups data items by key; records with the same key go to the same reducer.
A,B | B, C, D, E, F
A,B | A, C, F

These are shuffled into the {A,B} group and sent to the same reducer:
A,B | {B, C, D, E , F}, {A, C, F}

So, finally at the reducer we have 2 lists corresponding to 2 people. Now, we need to find the intersection to get the mutual friends.
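
A matching reducer could intersect the two lists with a hash set, as in the sketch below; names are again illustrative, and the merge-based variant described under Optimization is shown afterwards:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative reducer for the intersection step.
public class MutualFriendReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Each pair key receives exactly two friend lists, one per user.
        Set<String> firstList = null;
        StringBuilder mutual = new StringBuilder();

        for (Text value : values) {
            String[] friends = value.toString().split(",");
            if (firstList == null) {
                firstList = new HashSet<>(Arrays.asList(friends)); // remember the first list
            } else {
                for (String friend : friends) {                    // keep only the overlap
                    if (firstList.contains(friend)) {
                        if (mutual.length() > 0) {
                            mutual.append(",");
                        }
                        mutual.append(friend);
                    }
                }
            }
        }
        context.write(key, new Text(mutual.toString()));
    }
}
```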

Optimization

To make the intersection faster, I have used a concept similar to the merge operation in merge sort. The friend lists are sorted in the map phase, so the reducer receives two sorted lists; a merge-like pass can then pick out only the matching values instead of checking all possible combinations in O(N²).
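
A merge-style intersection of the two sorted lists could look roughly like this helper, which could replace the hash-set loop in the reducer sketch above (a sketch, assuming both lists arrive sorted the same way):

```java
// Merge-style intersection of two sorted friend lists, in the spirit of the
// merge step of merge sort: O(N + M) comparisons instead of all combinations.
public class SortedIntersection {

    static String intersectSorted(String[] a, String[] b) {
        StringBuilder mutual = new StringBuilder();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            // Both lists are assumed to be sorted the same way in the map phase.
            int cmp = a[i].compareTo(b[j]);
            if (cmp == 0) {
                if (mutual.length() > 0) {
                    mutual.append(",");
                }
                mutual.append(a[i]);
                i++;
                j++;
            } else if (cmp < 0) {
                i++;
            } else {
                j++;
            }
        }
        return mutual.toString();
    }
}
```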

Please make sure that the keys are sorted alphabetically so that both friend lists for a pair of users end up on the same reducer.

Output

The program will output the mutual friends for the following pairs.
(0,1), (20, 28193), (1, 29826), (6222, 19272), (28041, 28056)

The code can easily be changed to find mutual friends for all pairs by removing the loop that checks for the keys given above.

<User_A>,<User_B><Mutual/Common Friend List>
where <User_A> and <User_B> are the unique IDs corresponding to users A and B (A and B are friends), and
<Mutual/Common Friend List> is a comma-separated list of unique IDs corresponding to the mutual friends of users A and B.

Code : MutualFriends

Program 2: Find the friend pairs whose number of common (mutual) friends is within the top 10 across all pairs, and output them in decreasing order.

Two MapReduce jobs are used, chained together.
The first MapReduce job finds the mutual friends and outputs each friend pair together with its mutual friend list.
The second MapReduce job reads the previous job's output and sends every record to the same reducer by using a constant key. The mapper value in this phase keeps the format <friend pair><TAB><mutual friend list>, and we send this complete line to the reducer and process it there.
In the second reducer, we split the received value by the tab character first, which gives the friend pair as the first part and the comma-separated mutual friend list as the second.
We then split the mutual friend list again, store each pair's count in a Java map, and use a custom comparator to sort the map by count. Once the map is sorted in descending order, we take the top 10 values.
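
A rough sketch of this second job, assuming the first job's output is tab-separated (Hadoop's default text output) and with class names invented for illustration:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Second job: funnel every "<pair><TAB><mutual friend list>" line from the
// first job to a single reducer, then keep the ten largest counts.
public class TopTenJob {

    public static class TopTenMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final Text CONSTANT_KEY = new Text("all");

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(CONSTANT_KEY, value); // forward the complete line
        }
    }

    public static class TopTenReducer extends Reducer<Text, Text, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Count mutual friends per pair.
            Map<String, Integer> counts = new HashMap<>();
            for (Text value : values) {
                String[] parts = value.toString().split("\t");
                if (parts.length == 2 && !parts[1].isEmpty()) {
                    counts.put(parts[0], parts[1].split(",").length);
                }
            }
            // Sort pairs by count in descending order and emit the top 10.
            List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
            sorted.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
            for (int i = 0; i < Math.min(10, sorted.size()); i++) {
                context.write(new Text(sorted.get(i).getKey()),
                        new IntWritable(sorted.get(i).getValue()));
            }
        }
    }
}
```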

Output Format:

<User_A>, <User_B><Mutual/Common Friend Number>

Code : MutualFriendsCount

Program 3: Given any two users (who are friends) as input, output the names and cities of their mutual friends.

We need to use userdata.txt to get the extra user information, and an in-memory join to attach the required details. The idea is to load the userdata.txt dataset into memory in every mapper, using a hash map data structure to give random access to tuples by the join key (userid). For this purpose, you can override the setup method (mapper initialization) inside the Map class and load the hash map there.
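
A sketch of that setup-time load (the configuration key userdata.path, the field indices, and the class name are assumptions, not the repository's actual code):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Replicated (in-memory) join: load userdata.txt into a HashMap keyed by
// userid during setup(), then look details up during map().
public class MutualFriendInfoMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> userDetails = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        // e.g. set to /user/userdata.txt when the job is configured
        Path userDataPath = new Path(conf.get("userdata.path"));
        FileSystem fs = FileSystem.get(conf);
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(userDataPath)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split(",");
                if (cols.length >= 5) {
                    // userid -> "firstname: city", the fields needed in the output
                    userDetails.put(cols[0], cols[1] + ": " + cols[4]);
                }
            }
        }
    }

    // map() would build friend pairs exactly as in Program 1 and, when writing
    // the mutual friends, replace each ID with userDetails.get(id).
}
```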

Output format:

UserA id, UserB id, list of [name: city] of their mutual friends.

Sample Output:

0, 41 [Evangeline: Loveland, Agnes: Marietta]

Code : MutualFriendsInformation

Program 4: Calculate the average age of the direct friends of each user and output the 15 users with the lowest averages.

Step 1: Calculate the average age of the direct friends of each user.
Step 2: Sort the users by the average age from step 1 in descending order.
Step 3: Output the tail 15 users (the 15 lowest averages) from step 2 with their addresses and the calculated average ages.
We need to use a reduce-side join here; a rough sketch follows.
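
One common shape for such a reduce-side join is to tag records by source in two mappers (wired into one job with MultipleInputs) and combine them on userid in the reducer. The sketch below uses invented tags, field indices, and class names:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce-side join: pair the computed average age with the user's address
// on the shared userid key.
public class AverageAgeJoin {

    // Reads step 1 output, assumed to be "<userid><TAB><average age>".
    public static class AgeMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            if (parts.length == 2) {
                context.write(new Text(parts[0]), new Text("AGE\t" + parts[1]));
            }
        }
    }

    // Reads userdata.txt: userid,firstname,lastname,address,city,...
    public static class AddressMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] cols = value.toString().split(",");
            if (cols.length >= 5) {
                context.write(new Text(cols[0]),
                        new Text("ADDR\t" + cols[3] + ", " + cols[4]));
            }
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String age = null;
            String address = null;
            for (Text value : values) {
                String[] tagged = value.toString().split("\t", 2);
                if ("AGE".equals(tagged[0])) {
                    age = tagged[1];
                } else if ("ADDR".equals(tagged[0])) {
                    address = tagged[1];
                }
            }
            // Emit only users that appear in both inputs.
            if (age != null && address != null) {
                context.write(key, new Text(address + "\t" + age));
            }
        }
    }
}
```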

Code : MutualFriendsAverageAge

Use the following commands to run the jar files :

hadoop jar Part1.jar MutualFriends MutualFriends /user/soc-LiveJournal1Adj.txt /user/mfriendsout

hadoop jar Part2.jar MutualFriendsCount MutualFriendsCount /user/soc-LiveJournal1Adj.txt /user/mfc1 /user/mfc2

hadoop jar Part3.jar MutualFriendsInformation MutualFriendsInformation /user/soc-LiveJournal1Adj.txt /user/mfc /user/userdata.txt

hadoop jar Part4.jar MutualFriends MutualFriends /user/soc-LiveJournal1Adj.txt /user/mfc /user/userdata.txt /user/finaloutput

About

This repository contains programs that solve the given problems using Hadoop MapReduce.
