DCU Cloud project

1. Large dataset Analysis

“Stack Exchange is a network of question and answer websites on diverse topics in many different fields, each site covering a specific topic, where questions, answers, and users are subject to a reputation award process. The sites are modeled after Stack Overflow, a forum for computer programming questions that was the original site in this network.”

Stack Exchange Data Explorer (SEDE) https://data.stackexchange.com/stackoverflow/query/new

[Task 1] Data Acquisition:

We are required to acquire the top 200,000 posts by view count from the Stack Exchange site. The problem is that we can only download 50,000 records at a time, so the data has to be fetched in at least four batches.
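One way to work around the download limit is to page through the result set in fixed-size chunks. The sketch below only builds the query strings; it assumes the SEDE `Posts` table with its `ViewCount` and `Id` columns and assumes that SEDE accepts T-SQL `OFFSET ... FETCH` paging. The function name `build_chunk_queries` is hypothetical, and this is not necessarily how the data in this project was actually downloaded.

```python
def build_chunk_queries(total=200_000, chunk=50_000):
    """Return one T-SQL query string per chunk of the top `total` posts.

    Assumption: the target is the SEDE `Posts` table, and OFFSET/FETCH
    paging is available. `Id` is added as a tiebreaker so that posts with
    equal ViewCount keep a stable order across chunks.
    """
    queries = []
    for offset in range(0, total, chunk):
        size = min(chunk, total - offset)  # last chunk may be smaller
        queries.append(
            "SELECT * FROM Posts "
            "ORDER BY ViewCount DESC, Id ASC "
            f"OFFSET {offset} ROWS FETCH NEXT {size} ROWS ONLY;"
        )
    return queries


if __name__ == "__main__":
    for query in build_chunk_queries():
        print(query)
```

Each query can be pasted into the Data Explorer and its result saved as a separate CSV file, and the files can be concatenated afterwards.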

[Task 2] Data Cleaning with PIG

Extract, transform and load the data as applicable
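As an illustration of the kind of transform this step may involve, here is a minimal Python sketch that normalises a post body before loading. It is not the Pig script itself. It assumes that post bodies contain HTML markup, embedded newlines, and commas that would interfere with a comma-delimited load; the helper name `clean_body` is hypothetical.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")   # naive HTML tag matcher, good enough for a sketch
WS_RE = re.compile(r"\s+")        # any run of whitespace, including newlines


def clean_body(raw):
    """Strip tags, decode HTML entities, drop commas, collapse whitespace."""
    text = TAG_RE.sub(" ", raw)        # remove markup first
    text = html.unescape(text)         # then decode entities such as &lt;
    text = text.replace(",", " ")      # assumption: comma is the field delimiter
    return WS_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_body("<p>Hello, world</p>\n<code>x &lt; y</code>"))
```

Tags are removed before entities are decoded so that an escaped `&lt;` in the text is not mistaken for the start of a tag.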

[Task 3] Querying with HIVE

1. What are the top 10 posts by score?
2. Who are the top 10 users by score?
3. How many distinct users used the word ‘hadoop’ in one of their posts?
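The three questions above can be prototyped on a small sample before running them on the cluster. The sketch below uses Python's built-in `sqlite3` module rather than Hive, so the SQL is only an approximation of the eventual HiveQL. The table layout (`id`, `owner_user_id`, `score`, `body`) is an assumption, as is the reading of "users by score" as the sum of each user's post scores.

```python
import sqlite3


def run_queries(rows):
    """rows: iterable of (id, owner_user_id, score, body) tuples."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE posts "
        "(id INTEGER, owner_user_id INTEGER, score INTEGER, body TEXT)"
    )
    con.executemany("INSERT INTO posts VALUES (?, ?, ?, ?)", rows)

    # 1. Top 10 posts by score (id breaks ties deterministically).
    top_posts = con.execute(
        "SELECT id, score FROM posts ORDER BY score DESC, id ASC LIMIT 10"
    ).fetchall()

    # 2. Top 10 users by score, taken here as the sum of their post scores.
    top_users = con.execute(
        "SELECT owner_user_id, SUM(score) AS total FROM posts "
        "GROUP BY owner_user_id ORDER BY total DESC, owner_user_id ASC LIMIT 10"
    ).fetchall()

    # 3. Number of distinct users with 'hadoop' anywhere in a post body.
    hadoop_users = con.execute(
        "SELECT COUNT(DISTINCT owner_user_id) FROM posts "
        "WHERE LOWER(body) LIKE '%hadoop%'"
    ).fetchone()[0]

    con.close()
    return top_posts, top_users, hadoop_users


# Made-up sample rows, for illustration only.
SAMPLE = [
    (1, 10, 50, "How do I install Hadoop?"),
    (2, 10, 5, "hadoop streaming question"),
    (3, 20, 70, "Python list comprehension"),
    (4, 30, 20, "Hive on HADOOP cluster"),
]

if __name__ == "__main__":
    print(run_queries(SAMPLE))
```

Note that the `LIKE '%hadoop%'` filter also matches longer strings that contain "hadoop" as a substring; a stricter word-level match would need tokenisation.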

[Task 4] Calculate the per-user TF-IDF with HIVE

Find the top 10 terms used by each of the top 10 users, ranked by post score.
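To make the calculation concrete, here is a minimal Python sketch of per-user TF-IDF. It is not the Hive implementation. It assumes that all of a user's posts are concatenated into one document per user, that terms are split on whitespace, and that the common weighting tf = count / total terms and idf = ln(N / df) is used; the function name `per_user_tfidf` is hypothetical.

```python
import math
from collections import Counter


def per_user_tfidf(user_docs, top_n=10):
    """user_docs: dict mapping user id -> concatenated text of that user's posts.

    Returns a dict mapping user id -> list of (term, tfidf), best first.
    """
    tokenized = {user: text.lower().split() for user, text in user_docs.items()}
    n_users = len(tokenized)

    # Document frequency: in how many users' documents each term appears.
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))

    result = {}
    for user, terms in tokenized.items():
        counts = Counter(terms)
        total = len(terms)
        scores = {
            term: (count / total) * math.log(n_users / df[term])
            for term, count in counts.items()
        }
        # Sort by score descending, then alphabetically for stable output.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[user] = ranked[:top_n]
    return result


if __name__ == "__main__":
    docs = {"a": "hadoop hadoop hive", "b": "hive pig"}
    for user, terms in per_user_tfidf(docs).items():
        print(user, terms)
```

With this weighting, a term used by every user gets an idf of zero, so only terms that distinguish one user from the others rise to the top.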

2. ETL with AWS Datawarehousing