mainkoon81/DCU-project-03-BigData-DataWarehouse

DCU Cloud project

1. Large dataset Analysis

“Stack Exchange is a network of question and answer websites on diverse topics in many different fields, each site covering a specific topic, where questions, answers, and users are subject to a reputation award process. The sites are modeled after Stack Overflow, a forum for computer programming questions that was the original site in this network.”

Stack Exchange Data Explorer (SEDE) https://data.stackexchange.com/stackoverflow/query/new

[Task 1] Data Acquisition:

  • Acquire the top 200,000 posts by view count from the Stack Exchange site. The catch is that SEDE returns at most 50,000 records per query, so the data has to be downloaded in several batches.
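One way to work around the 50,000-row limit is to page through the results in four batches. The sketch below only builds the query strings to paste into SEDE; it assumes the public `Posts` table with its `ViewCount` and `Id` columns and uses T-SQL `OFFSET ... FETCH` paging. The helper name `build_chunk_queries` is hypothetical, and this is not necessarily how the batches were obtained in this project.

```python
def build_chunk_queries(total_rows=200_000, chunk_size=50_000):
    """Return one T-SQL query string per batch of posts, ordered by view count.

    Assumption: the target is the SEDE `Posts` table; `Id` is used as a
    tie-breaker so that the ordering is stable across batches.
    """
    queries = []
    for offset in range(0, total_rows, chunk_size):
        rows = min(chunk_size, total_rows - offset)
        queries.append(
            "SELECT * FROM Posts "
            "ORDER BY ViewCount DESC, Id "
            f"OFFSET {offset} ROWS FETCH NEXT {rows} ROWS ONLY"
        )
    return queries


if __name__ == "__main__":
    for query in build_chunk_queries():
        print(query)
```

Each printed query can be run separately and its result downloaded as CSV; the four files together cover the top 200,000 posts.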

[Task 2] Data Cleaning with PIG

  • Extract, transform and load the data as applicable
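The kind of transformation needed here can be illustrated in Python before writing it in Pig. The sketch below is an assumption about what the cleaning involves (post bodies exported from SEDE contain HTML markup and embedded newlines, which break line-oriented loading); the function name `clean_text` is hypothetical and the actual Pig script may do more or less than this.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")   # anything that looks like an HTML tag
WS_RE = re.compile(r"\s+")        # runs of spaces, tabs and newlines


def clean_text(raw):
    """Strip HTML tags, decode entities and collapse whitespace to one line."""
    no_tags = TAG_RE.sub(" ", raw)
    unescaped = html.unescape(no_tags)
    return WS_RE.sub(" ", unescaped).strip()


if __name__ == "__main__":
    print(clean_text("<p>Hello,\n<b>Hadoop</b> &amp; Hive</p>"))
```

In Pig the same idea would typically be expressed with a `FOREACH ... GENERATE` over the loaded records, applying a regex replace to the text fields.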

[Task 3] Querying with HIVE

    1. The top 10 posts by score?
    2. The top 10 users by score?
    3. The number of distinct users who used the word ‘hadoop’ in one of their posts?
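The logic of the three queries above can be sketched in plain Python on a few made-up rows, which is handy for checking Hive results on a small sample. The sample data and function names are hypothetical, and "users by score" is interpreted here as the sum of each user's post scores, which is an assumption.

```python
from collections import defaultdict

# Hypothetical sample rows: (post_id, owner_user_id, score, body)
posts = [
    (1, 10, 50, "How do I install Hadoop?"),
    (2, 11, 70, "What is a monad?"),
    (3, 10, 30, "hadoop streaming with Python"),
    (4, 12, 20, "Sorting a list in Java"),
]


def top_posts(rows, n=10):
    """Posts with the highest score, best first."""
    return sorted(rows, key=lambda r: r[2], reverse=True)[:n]


def top_users(rows, n=10):
    """Users ranked by the total score of their posts (assumed definition)."""
    totals = defaultdict(int)
    for _, user, score, _ in rows:
        totals[user] += score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


def distinct_users_mentioning(rows, word):
    """Number of distinct users with at least one post containing `word`."""
    needle = word.lower()
    return len({user for _, user, _, body in rows if needle in body.lower()})


if __name__ == "__main__":
    print(top_posts(posts))
    print(top_users(posts))
    print(distinct_users_mentioning(posts, "hadoop"))
```

In Hive these correspond roughly to an `ORDER BY ... LIMIT 10`, a `GROUP BY` on the user id with `SUM(score)`, and a `COUNT(DISTINCT ...)` with a `LIKE '%hadoop%'` filter.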

[Task 4] Calculate the per-user TF-IDF with HIVE

  • Find the top 10 terms used by each of the top 10 users, ranked by post score
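As a reference for what the Hive job should compute, here is a minimal Python sketch of per-user TF-IDF. It assumes that all posts of a user are concatenated into one "document", that tokens are split on whitespace, and that the weight is tf × ln(N / df) with no smoothing; other TF-IDF variants exist and the Hive implementation may use a different one. The function name `per_user_tfidf` is hypothetical.

```python
import math
from collections import Counter


def per_user_tfidf(docs_by_user, top_n=10):
    """Return, for each user, the `top_n` (term, tf-idf) pairs, best first.

    docs_by_user maps a user id to the concatenated text of that user's posts.
    """
    tokenized = {user: text.lower().split() for user, text in docs_by_user.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many users' documents each term appears.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    result = {}
    for user, tokens in tokenized.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
        # Sort by descending weight, then alphabetically to break ties.
        result[user] = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    return result


if __name__ == "__main__":
    sample = {"a": "hadoop hive hadoop", "b": "hive pig"}
    for user, terms in per_user_tfidf(sample).items():
        print(user, terms)
```

A term used by every user gets a weight of zero under this definition, so only terms that distinguish a user from the others rise to the top.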

2. ETL with AWS Data Warehousing
