GitHub - seby-sbirna/Computational-Data-Processing-using-Spark-Pandas-and-Data-Streaming: This repository contains a collection of three Data Engineering capstone projects made for the DTU Data Engineering course 02807: Computational Tools for Data Science

Computational Data Processing using Spark, Pandas and Data Streaming

by Sebastian Sbirna, Yingrui Li and Aijie Shu

This repository contains a set of three full Data Science projects, created with a strong focus on tools and methods for working with data at scale.

The course focuses on tools for analyzing Big Data and other large-scale datasets with high computational demands, which are normally processed on a distributed cluster of machines or through statistical approximations.

Its objective is to enable us to develop and implement parallel and distributed algorithms for data science applications, and to apply database technologies, models and other relevant computational tools and techniques to massive datasets.

The Project Assignments presented here consolidate the skills we learned throughout the course by applying them to specific company case-study problems.

First, we evaluated a database of anonymous customer data using Pandas and SQL:

Project 1 - Big Data Analytics using Pandas upon company customer databases

Next, we analyzed a continuous stream of web traffic data using the HyperLogLog and Count-Min Sketch probabilistic data structures:

Project 2 - Web Traffic Analysis upon continuous streams of data
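
For readers unfamiliar with the two structures named above, here is a minimal, illustrative sketch of how they work. It is not the code from Project 2. The class names, the default parameters (`width`, `depth`, `p`) and the use of `hashlib` for hashing are all assumptions chosen for the example. Count-Min Sketch estimates how often an item appears in a stream (it can overestimate, but never underestimates), while HyperLogLog estimates how many distinct items the stream contains.

```python
import hashlib
import math


class CountMinSketch:
    """Approximate per-item frequencies in a stream using fixed memory."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One hash per row, derived by salting the item with the row index.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._cells(item))


class HyperLogLog:
    """Approximate the number of distinct items in a stream."""

    def __init__(self, p=10):
        self.p = p                      # number of bits used as register index
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, m >= 128

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        index = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    # Hypothetical stream of visitor IP addresses, made up for illustration.
    stream = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1"]
    cms, hll = CountMinSketch(), HyperLogLog()
    for ip in stream:
        cms.add(ip)
        hll.add(ip)
    print("estimated hits for 10.0.0.1:", cms.estimate("10.0.0.1"))
    print("estimated distinct visitors:", round(hll.count()))
```

Both structures trade exactness for a small, fixed memory footprint, which is what makes them suitable for unbounded streams. In this sketch a larger `width` or `depth` reduces Count-Min overestimation, and a larger `p` reduces HyperLogLog's relative error.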

Lastly, we worked together using Spark on Airbnb's massive dataset to assess the popularity of neighbourhoods in selected cities and the evolution of lodging prices over time, and to perform an overall sentiment analysis of the text reviews of lodgings in a given city:

Project 3 - Spark Data Processing and Trend Detection from Airbnb data
