atds-NTUA-2021

A project for the Advanced Topics in Database Systems course at ECE, NTUA, for the fall semester of the academic year 2020-2021.

The main purpose of the project is to familiarize students with Apache Spark and Apache Hadoop.

Team Members

Assignment

The assignment and the report are written in Greek, the main language of the course.

For that reason, the tasks are summarized in English in this file.

PART A

Part A is about writing queries on this dataset, using either the RDD API or Spark SQL.

The Spark SQL queries must work with both .csv and .parquet input files.
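
One way to keep a single Spark SQL query working over both input formats is to pick the reader from the file extension and then run the same query on the resulting DataFrame. The sketch below is only an illustration: the helper name `load_table` and its parameters are hypothetical and are not taken from the project code. It relies only on `spark.read.csv` and `spark.read.parquet` from the PySpark DataFrameReader.

```python
def load_table(spark, path, schema=None):
    """Return a DataFrame for `path`, choosing the reader by file extension.

    `spark` is expected to be a SparkSession. The helper name and its
    parameters are hypothetical, not part of the project code.
    """
    if path.endswith(".parquet"):
        # Parquet files carry their own schema, so none is passed.
        return spark.read.parquet(path)
    if path.endswith(".csv"):
        # CSV has no embedded schema; pass one explicitly if available.
        return spark.read.csv(path, schema=schema)
    raise ValueError("unsupported file type: " + path)
```

The returned DataFrame can then be registered with `createOrReplaceTempView` so that the query text stays identical for both formats.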

Query	Description
Q1	For each year from 2000 onward, find the movie with the highest profit. Profit is calculated with the formula: (income - cost)/cost. Movies with no release date, or with zero cost or income, are excluded.
Q2	Find the percentage of users whose average movie rating is greater than 3.
Q3	For each movie genre, find the average movie rating and the number of movies that belong to it.
Q4	For movies in the 'Drama' genre, find the average summary length (in words) for each of the four five-year periods ('2000-2004', '2005-2009', '2010-2014', '2015-2019').
Q5	For each movie genre, find the user who has given the most ratings to movies of that genre. Also find the number of those ratings and the user's most and least favorite movie of the genre, based on that user's ratings. If the user has given the same rating to more than one movie, the most popular of them is selected.
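
To make the filtering rules of Q1 concrete, here is a minimal plain-Python sketch of the logic, without Spark. It assumes that profit is taken as (income - cost) / cost and that each movie is a `(title, release_year, cost, income)` tuple; both the tuple layout and the function name are assumptions for illustration and do not reflect the actual dataset schema or the project code.

```python
def most_profitable_per_year(movies, first_year=2000):
    """Return {year: (title, profit)} with the top movie of each year.

    `movies` is an iterable of (title, release_year, cost, income) tuples.
    This layout is a hypothetical one chosen for the example.
    """
    best = {}
    for title, year, cost, income in movies:
        # Exclude movies with no release date or with zero cost or income.
        if year is None or not cost or not income:
            continue
        # Keep only movies released in `first_year` or later.
        if year < first_year:
            continue
        # Assumed profit formula: relative gain over the cost.
        profit = (income - cost) / cost
        if year not in best or profit > best[year][1]:
            best[year] = (title, profit)
    return best
```

In an RDD implementation, the same steps would roughly correspond to a `filter`, a `map` to `(year, (title, profit))` pairs, and a `reduceByKey` that keeps the pair with the larger profit.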

PART B

Part B is about implementing the repartition join and the broadcast join with the RDD API, based on the paper A Comparison of Join Algorithms for Log Processing in MapReduce.

The pseudocode for the repartition join is given in A.1 of the paper, and for the broadcast join in A.4.

Furthermore, we had to modify some given code to test the performance of Spark SQL queries on parquet files with and without allowing the query optimizer to use a broadcast join.
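
The general idea of the two joins can be sketched in plain Python on lists of `(key, value)` pairs. This is a simplified illustration of the techniques, not the code from this repository and not a transcription of the paper's pseudocode; the function names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def repartition_join(left, right):
    """Reduce-side join: tag records, group them by key, pair per key."""
    # Map phase: tag every record with the table it comes from.
    tagged = [(k, ("L", v)) for k, v in left]
    tagged += [(k, ("R", v)) for k, v in right]
    # Shuffle phase: bring all records with the same key together.
    groups = defaultdict(list)
    for key, tagged_value in tagged:
        groups[key].append(tagged_value)
    # Reduce phase: split each group by tag and emit the cross product.
    joined = []
    for key, values in groups.items():
        left_buf = [v for tag, v in values if tag == "L"]
        right_buf = [v for tag, v in values if tag == "R"]
        joined.extend((key, pair) for pair in product(left_buf, right_buf))
    return joined


def broadcast_join(small, large):
    """Map-side join: hash the small table, then probe it per large record."""
    # Build an in-memory hash table from the small (broadcast) table.
    table = defaultdict(list)
    for key, value in small:
        table[key].append(value)
    # Stream over the large table; no shuffle of the large table is needed.
    return [(k, (s, v)) for k, v in large for s in table.get(k, [])]
```

With the RDD API, the repartition join would typically map to tagging with `map`, combining with `union`, and grouping with `groupByKey`, while the broadcast join would ship the hash table with `sc.broadcast` and probe it inside a `flatMap` over the large RDD.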
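
In Spark SQL, the optimizer's automatic choice of a broadcast join is controlled by the `spark.sql.autoBroadcastJoinThreshold` setting, where a value of -1 disables it. The sketch below shows one way such a switch could look; it assumes that the given code toggles this setting, which is not confirmed here, and the helper name `set_broadcast_join` is hypothetical.

```python
# Spark's default threshold for automatic broadcast joins (10 MB).
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024


def set_broadcast_join(spark, enabled, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Allow or forbid automatic broadcast joins on a SparkSession.

    Hypothetical helper: `spark` is expected to be a SparkSession.
    """
    # A threshold of -1 tells the optimizer never to broadcast a table.
    value = threshold_bytes if enabled else -1
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(value))
```

Running the same query once with `enabled=True` and once with `enabled=False`, and comparing the execution times and the output of `explain()`, shows whether the optimizer picked a broadcast join.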


FayStatha/atds-project-NTUA-2021
