This project is related to the Udacity Data Engineer Nanodegree program, submitted in October 2020. The goal of the project is to build an ELT (Extract, Load, Transform) process using S3 and Redshift, scheduled and run by an Airflow DAG.
The ELT process follows a fictitious company called Sparkify (hence the name of the DAG), which wants to create a relational database in Redshift based on a collection of songs it holds in JSON format (the Million Song Dataset), combined with logs of user activity data (i.e. users listening to specific songs). Ultimately, we want to build a star-schema database, which contains the following tables:
- FACT: songplays
- DIM: artists, songs, time, users
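As a sketch of what the fact table looks like, here is an illustrative DDL statement; the column names and types are assumptions based on the usual Sparkify star schema, not taken from this project's actual SQL:

```python
# Illustrative DDL for the songplays fact table. Column names/types are
# assumptions (hypothetical), not this project's confirmed schema.
SONGPLAYS_DDL = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id  VARCHAR(32)  NOT NULL PRIMARY KEY,
    start_time   TIMESTAMP    NOT NULL,  -- FK to the time dimension
    user_id      INTEGER      NOT NULL,  -- FK to the users dimension
    level        VARCHAR(10),
    song_id      VARCHAR(18),            -- FK to the songs dimension
    artist_id    VARCHAR(18),            -- FK to the artists dimension
    session_id   INTEGER,
    location     VARCHAR(256),
    user_agent   VARCHAR(512)
);
"""
```

Each dimension table (artists, songs, time, users) would follow the same pattern, keyed on the foreign keys referenced above.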
In this project, we are specifically using Airflow to schedule the execution of this ELT pipeline, broken down into specific tasks
in an Airflow DAG. This provides the following advantages over simply running a Python script at a specified interval:
- Increased visibility into each step of the ELT process through Airflow UI
- Speed up of processes via parallelization
- Fault tolerance for specific tasks failing -- they can rerun, without affecting the whole pipeline
- Increased accountability to stakeholders via clearly defined steps, definitions, schedules and SLAs
- Ability to create a reusable and easily maintainable code base via custom Operators and SubDAGs
- Easy parametrization of each run with context data (e.g. the time/date it was run)
- Simplified backfilling of past runs
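Several of the points above (fault tolerance via retries, backfilling from a start date, scheduling) are controlled by DAG-level settings. A minimal sketch of what those settings might look like; all names and values here are illustrative, not this project's actual configuration:

```python
from datetime import datetime, timedelta

# Illustrative DAG-level settings (values are assumptions, not the
# project's actual configuration). default_args are inherited by
# every task in the DAG.
default_args = {
    "owner": "sparkify",
    "start_date": datetime(2020, 10, 1),  # backfilling starts here
    "retries": 3,                         # rerun a failed task ...
    "retry_delay": timedelta(minutes=5),  # ... after a short wait
    "depends_on_past": False,             # runs are independent
}

# These would then be passed to the DAG constructor, e.g.:
#   dag = DAG("sparkify_elt", default_args=default_args,
#             schedule_interval="@hourly", catchup=True)
```

With `catchup=True` and a past `start_date`, Airflow schedules one run per missed interval, which is what enables the backfilling described above.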
To run this project, you need the following:
- AWS credentials (IAM user credentials for a role that has full access to Redshift and read access to S3)
- Python 3.7 (and an environment with `airflow` installed)
- A running Redshift cluster that allows incoming traffic
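To check that the cluster really does allow incoming traffic, you can sanity-check reachability of its endpoint on Redshift's default port (5439) with only the standard library; the hostname below is a placeholder:

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (placeholder endpoint -- substitute your own cluster):
# port_is_open("my-cluster.abc123.us-west-2.redshift.amazonaws.com", 5439)
```

If this returns False, check the cluster's security group inbound rules before debugging Airflow itself.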
You will need to configure Airflow to run on your device. I suggest following the steps outlined in the documentation (Quick Start).
Once you have accessed the Airflow browser UI, you will need to add your `aws_credentials` and `redshift` Connections under the Admin section. The DAG reads these Connections at runtime, so they must exist before you trigger it.
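For reference, the two Connections typically carry the fields below. This is only a sketch of their shape; every value is a placeholder, and the connection types reflect common choices in Airflow's UI rather than this project's confirmed setup:

```python
# Placeholder values throughout -- fill in your own credentials/endpoint.
aws_credentials = {
    "conn_id": "aws_credentials",
    "conn_type": "Amazon Web Services",
    "login": "YOUR_AWS_ACCESS_KEY_ID",        # IAM access key
    "password": "YOUR_AWS_SECRET_ACCESS_KEY",  # IAM secret key
}

redshift = {
    "conn_id": "redshift",
    "conn_type": "Postgres",  # Redshift speaks the Postgres protocol
    "host": "your-cluster.abc123.us-west-2.redshift.amazonaws.com",
    "schema": "dev",
    "login": "awsuser",
    "password": "YOUR_DB_PASSWORD",
    "port": 5439,
}
```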
Finally, once your scheduler is running and has picked up the DAG, you should be able to see the DAG in the web UI. All you need to do is turn it on via the toggle. From then on, it will first backfill all the past runs, and once it reaches the current time, it will run once every hour.
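To make the backfill behaviour concrete: with an hourly schedule, Airflow creates one run for every hourly interval between the DAG's start date and now. A standard-library sketch of which intervals would be run (the dates are illustrative):

```python
from datetime import datetime, timedelta

def hourly_runs(start: datetime, end: datetime):
    """Yield the start of every hourly interval in [start, end)."""
    current = start
    while current < end:
        yield current
        current += timedelta(hours=1)

# With a start date 6 hours in the past, turning the DAG on would
# trigger 6 backfill runs before settling into the hourly cadence.
runs = list(hourly_runs(datetime(2020, 10, 1, 0), datetime(2020, 10, 1, 6)))
```

This is why choosing the DAG's start date carefully matters: a start date far in the past means a long queue of backfill runs before the DAG catches up.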