
DWH-AWS-RedShift

This repository contains the program files needed to create the Sparkify database on AWS Redshift and the corresponding ETL pipeline that loads it.

Contents

  1. sql_queries.py
    • Contains all the SQL queries run against the database by etl.py.
  2. create_tables.py
    • Drops and recreates tables.
  3. etl.py
    • Reads and processes the files from song_data and log_data and loads them into the tables.
  4. etl.ipynb
    • Jupyter notebook used to build the ETL process step by step.
  5. README.md
  6. dwh.cfg
    • Configuration file used to connect to the Redshift cluster (a usage sketch follows this list).
  7. redshift.cfg
    • Configuration file used to create the Redshift cluster.
  8. environment.yaml
    • Conda environment file used to recreate the Python environment for the project.
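
For reference, here is a minimal sketch of how dwh.cfg might be consumed by the scripts. The section and key names (CLUSTER, HOST, DB_NAME, ...) are assumptions; check the actual file in this repository.

    # Sketch: read dwh.cfg and open a connection to the Redshift cluster.
    # Section/key names below are assumptions, not the repository's exact layout.
    import configparser
    import psycopg2

    config = configparser.ConfigParser()
    config.read('dwh.cfg')

    conn = psycopg2.connect(
        "host={} dbname={} user={} password={} port={}".format(
            config.get('CLUSTER', 'HOST'),
            config.get('CLUSTER', 'DB_NAME'),
            config.get('CLUSTER', 'DB_USER'),
            config.get('CLUSTER', 'DB_PASSWORD'),
            config.get('CLUSTER', 'DB_PORT'),
        )
    )
    cur = conn.cursor()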

Installation

  1. Use the following command to clone the project repository.

    git clone https://github.com/shilpamadini/DWH-AWS-RedShift.git
    
  2. Create the environment using the command below.

    conda env create -f environment.yaml
    
  3. Activate the conda environment

    source activate dand_py3
    
  4. Follow the instructions in etl.ipynb to create your own Redshift cluster using Infrastructure as Code (IaC); a minimal boto3 sketch is shown after this list.

  5. Navigate to the project directory and run the following to create the tables.

    python create_tables.py
    
  6. Run the following to load the fact and dimension tables.

    python etl.py

  7. Test the ETL load at any time using etl.ipynb. Run the following command to launch Jupyter Notebook.

    jupyter notebook
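
As referenced in step 4, the cluster can be created programmatically. Below is a minimal boto3 sketch of that IaC step, assuming values such as the region, node type, and identifiers come from redshift.cfg; the notebook in this repository is the authoritative version.

    # Sketch: create a Redshift cluster with boto3 (Infrastructure as Code).
    # All values below are illustrative placeholders; the real parameters are
    # read from redshift.cfg as described in etl.ipynb.
    import boto3

    redshift = boto3.client('redshift', region_name='us-west-2')

    redshift.create_cluster(
        ClusterType='multi-node',
        NodeType='dc2.large',
        NumberOfNodes=4,
        DBName='sparkify',
        ClusterIdentifier='sparkify-cluster',
        MasterUsername='awsuser',
        MasterUserPassword='<password>',            # placeholder
        IamRoles=['<redshift-s3-read-role-arn>'],   # role allowed to read the S3 data
    )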
    

Functionality

A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. The data resides in S3, in a directory of JSON logs on user activity in the app, as well as a directory of JSON metadata on the songs in the app.

This project builds an ETL pipeline that extracts the data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for the analytics team to continue finding insights in the data.
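
Staging from S3 into Redshift is typically done with COPY statements. A sketch of what such a statement might look like is shown below, written as a Python string in the style of sql_queries.py; the bucket paths, JSONPaths file, and IAM role ARN are placeholders, and the actual statements live in sql_queries.py.

    # Sketch: COPY the log JSON files from S3 into a staging table.
    # Paths and the role ARN are placeholders, not this project's exact values.
    staging_events_copy = ("""
        COPY staging_events
        FROM 's3://<bucket>/log_data'
        IAM_ROLE '<redshift-s3-read-role-arn>'
        REGION 'us-west-2'
        FORMAT AS JSON 's3://<bucket>/log_json_path.json';
    """)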

The analytics team is interested in knowing which songs users are listening to, and in ranking and aggregation queries such as which song is played the most, which song is the most popular, and which artist released the most popular songs. They may also be interested in looking at trends over a period of time.
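
As one example of the kind of query this supports, a sketch of a "most played songs" query against the star schema is shown below; table and column names are assumptions based on the schema described here.

    # Sketch: rank songs by play count, joining the fact table to a dimension.
    # Table/column names are assumptions based on the schema description.
    top_songs = ("""
        SELECT s.title, COUNT(*) AS plays
        FROM songplays sp
        JOIN songs s ON sp.song_id = s.song_id
        GROUP BY s.title
        ORDER BY plays DESC
        LIMIT 10;
    """)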

To support the required analytics, a star schema is used for the data warehouse design. The songplay table is the fact table, and the song, user, artist, and time tables are dimensions. Database integrity is maintained by using primary key and foreign key constraints in the table definitions.

Here is the ER diagram explaining the schema design.

[ER diagram: Sparkify star schema]

The users and time tables use the ALL distribution style, since they hold small datasets and are used in queries that aggregate data over a period of time; the time table also uses start_time as its sort key. Having these two tables on every node reduces shuffling and improves query performance. The artists and songs tables are distributed by artist_id. Distributing the songplay, song, and artist tables by artist_id also reduces data shuffling between nodes, because records from the three tables for the same artist_id reside on the same node.
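
A minimal sketch of how these choices might appear in the DDL is shown below, in the style of sql_queries.py; column lists are abridged and names are assumptions based on the schema diagram. Note that Redshift treats primary and foreign key constraints as informational rather than enforced.

    # Sketch: distribution and sort key choices described above.
    # Column lists are abridged; names are assumptions, not the exact DDL.
    time_table_create = ("""
        CREATE TABLE time (
            start_time  TIMESTAMP PRIMARY KEY SORTKEY,
            hour        SMALLINT,
            weekday     SMALLINT
        ) DISTSTYLE ALL;
    """)

    songplay_table_create = ("""
        CREATE TABLE songplays (
            songplay_id INT IDENTITY(0, 1) PRIMARY KEY,
            start_time  TIMESTAMP REFERENCES time(start_time),
            user_id     INT,
            song_id     VARCHAR,
            artist_id   VARCHAR DISTKEY
        );
    """)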