
This repository contains a collection of three Data Engineering capstone projects made for the DTU Data Engineering course 02807: Computational Tools for Data Science.




Computational Data Processing using Spark, Pandas and Data Streaming

by Sebastian Sbirna, Yingrui Li and Aijie Shu


This repository contains a set of three full Data Science projects, created with a strong focus on tools and methods for working with data at scale.

The course focuses on mastering tools for analyzing Big Data: large-scale datasets with high computational demands that are normally processed on a distributed cluster of machines or through statistical approximations.

The course's objective is to enable us to develop and implement parallel and distributed algorithms for data science applications, and to apply database technologies, models, and other relevant tools and literature on computational techniques for massive datasets.

The Project Assignments presented here consolidate the skills we learned throughout the course by applying them to specific company case-study problems.

In particular, we carried out a database evaluation of anonymous customer data using Pandas and SQL:

Afterwards, we analyzed a continuous stream of web traffic data using the HyperLogLog and CountMin probabilistic data structures:
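
For readers unfamiliar with these structures, here is a minimal sketch of the Count-Min idea: approximate per-item frequencies over a stream using fixed memory. It is an illustration only, not the code from the project. The class name `CountMinSketch`, its `width` and `depth` parameters, and the sample URL stream are all hypothetical.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter with fixed memory (hypothetical helper)."""

    def __init__(self, width=1000, depth=5):
        self.width = width    # counters per row
        self.depth = depth    # number of rows, one hash function each
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash per row by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate counters, so the minimum across rows
        # is the tightest available upper bound on the true count.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


# Made-up stream of visited URLs, for illustration only.
stream = ["/home"] * 50 + ["/about"] * 7 + ["/contact"] * 3
sketch = CountMinSketch()
for url in stream:
    sketch.add(url)

print(sketch.estimate("/home"))
```

The estimate never falls below the true count and may exceed it when items collide. A larger `width` reduces the expected overestimate, and a larger `depth` makes a large error less likely.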

Lastly, we collaborated, using Spark on Airbnb's massive database, to assess the popularity of certain cities' neighbourhoods and lodging prices over time, and to perform an overall sentiment analysis of the text reviews of lodgings in a given city: