This repository contains the projects I completed in the Udacity Data Engineering Nanodegree.
In this project, we'll create a database schema and build an ETL pipeline using Python and SQL. The ETL pipeline will transfer data from JSON logs in local directories into Postgres tables.
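The core of that pipeline is parsing each JSON log line and shaping it into a row for insertion. A minimal sketch of that step, assuming a Sparkify-style event log (field names like `ts`, `userId`, and `sessionId` are illustrative and may differ from the actual schema):

```python
import json

def extract_songplay(line):
    """Parse one JSON log line and return the columns for a songplays row,
    or None if the event is not a song play."""
    record = json.loads(line)
    if record.get("page") != "NextSong":
        return None
    return (
        record["ts"],          # start_time (epoch milliseconds)
        record["userId"],      # user_id
        record["level"],       # level (free / paid)
        record["sessionId"],   # session_id
        record["location"],    # location
        record["userAgent"],   # user_agent
    )

# Insertion into Postgres would then use psycopg2, roughly:
# cur.execute("INSERT INTO songplays (start_time, user_id, level, session_id, "
#             "location, user_agent) VALUES (%s, %s, %s, %s, %s, %s)", row)
```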
In this project, we'll model data using Apache Cassandra and build an ETL pipeline using Python. The ETL pipeline will transfer data from a set of CSV files within a directory into Apache Cassandra tables.
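Because Cassandra tables are modelled around queries, the CSV rows are reshaped into tuples matching each table's primary key. A sketch of that parsing step, assuming hypothetical column names (`sessionId`, `itemInSession`, etc.) that may differ from the real event files:

```python
import csv
import io

def rows_for_session_table(csv_text):
    """Yield (session_id, item_in_session, artist, song, length) tuples
    ready to bind to a prepared Cassandra INSERT."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        yield (
            int(row["sessionId"]),
            int(row["itemInSession"]),
            row["artist"],
            row["song"],
            float(row["length"]),
        )

# With the DataStax Python driver, the insert would look roughly like:
# session.execute(
#     "INSERT INTO song_by_session (session_id, item_in_session, artist, song, length) "
#     "VALUES (%s, %s, %s, %s, %s)", values)
```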
In this project, we'll be building an ETL pipeline to load data from S3 to staging tables on Amazon Redshift. We'll also execute SQL statements that create the analytics tables from these staging tables.
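Loading from S3 into Redshift staging tables is done with the `COPY` command. A small helper that renders such a statement; the table name, bucket path, and IAM role ARN below are placeholders:

```python
def build_copy_sql(table, s3_path, iam_role, json_option="auto"):
    """Render a Redshift COPY statement that loads JSON data from S3
    into a staging table."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS JSON '{json_option}';"
    )
```

The analytics tables are then built with plain `INSERT INTO ... SELECT ...` statements against these staging tables.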
In this project, we'll build an ETL pipeline for a data lake hosted on S3. The ETL pipeline will load data from S3, process the data into analytics tables using Spark, and write them back to S3. This Spark process will be deployed on a cluster using AWS.
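When Spark writes the analytics tables back to S3, it typically partitions them, e.g. `df.write.partitionBy("year", "artist_id").parquet(path)`. A plain-Python sketch of the resulting directory layout (bucket name and field names are illustrative):

```python
from collections import defaultdict

def partition_songs(records, output_prefix="s3a://my-bucket/songs"):
    """Group song records into Spark-style partition paths
    (year=<year>/artist_id=<id>), mirroring what
    df.write.partitionBy("year", "artist_id").parquet(output_prefix)
    would produce on S3."""
    partitions = defaultdict(list)
    for rec in records:
        path = f"{output_prefix}/year={rec['year']}/artist_id={rec['artist_id']}"
        partitions[path].append(rec["song_id"])
    return dict(partitions)
```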
In this project, we'll build data pipelines using Apache Airflow to automate the data warehouse ETL process. For the ETL process, the source data resides in S3, and is transferred into a data warehouse hosted on Amazon Redshift.
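In Airflow, such a pipeline is declared as a DAG of tasks with explicit dependencies. A minimal configuration sketch; the DAG id, schedule, and task names are placeholders, and the real project would replace the `DummyOperator`s with custom staging and loading operators:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    "sparkify_etl",                  # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    start = DummyOperator(task_id="begin_execution")
    stage_events = DummyOperator(task_id="stage_events")      # would stage S3 data into Redshift
    load_fact = DummyOperator(task_id="load_songplays_fact")  # would build the fact table
    end = DummyOperator(task_id="stop_execution")

    # Dependency graph: staging runs before the fact-table load.
    start >> stage_events >> load_fact >> end
```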
In this project, we'll enhance the I94 immigration data with external data sources, such as world temperature and US city demographic data. This project will provide the foundation for future analysis of possible relationships between a country's immigration and arrival statistics and its temperature and population demographics.
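Conceptually, the enrichment is a left join of arrival records against the external datasets. A toy sketch of that join in plain Python (the field names and join key are illustrative; the real datasets use their own codes and would be joined with Spark or pandas):

```python
def enrich_arrivals(arrivals, demographics):
    """Left-join I94-style arrival counts with city demographics on city name.
    Arrivals with no matching city keep a None population."""
    demo_by_city = {d["city"]: d for d in demographics}
    enriched = []
    for a in arrivals:
        demo = demo_by_city.get(a["city"], {})
        enriched.append({**a, "population": demo.get("population")})
    return enriched
```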