Sparkify ETL

Introduction

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app.

The analytics team is particularly interested in understanding what songs users are listening to.

From this database we can extract insights such as the following (an example query is sketched after the list):

  • The most played songs in the entire app
  • The top 5 artists in the app by number of song plays
  • The day of the week on which people listen to the most songs
  • and more...
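
For example, the "top 5 artists" question could be answered with a query like the following sketch (written in Python with psycopg2, as the project does elsewhere; the connection credentials are placeholders, while the table and column names follow the schema described below):

```python
import psycopg2

# placeholder credentials -- replace with your own
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=postgres password=postgres")
cur = conn.cursor()

# top 5 artists by number of song plays
cur.execute("""
    SELECT a.name, COUNT(*) AS plays
    FROM songplays sp
    JOIN artists a ON sp.artist_id = a.artist_id
    GROUP BY a.name
    ORDER BY plays DESC
    LIMIT 5;
""")
for name, plays in cur.fetchall():
    print(name, plays)

conn.close()
```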

The ETL extracts the data from the song and log files into a single read-only analytical database modeled as a star schema.

The database management system used in the ETL is PostgreSQL, which is a relational (SQL) database.

Tools

Python, NumPy, pandas, psycopg2, PostgreSQL

Usage

  • To run the ETL, first create the database and the tables by executing the create_tables.py file.

  • Then run etl.py to transfer the data from the song and log files into the database tables.

Important: update the database and account credentials in the files to your own credentials.
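
The connection strings in those files look roughly like the sketch below (the host, user, and password shown are placeholders, not the project's actual values):

```python
import psycopg2

# placeholder host/user/password -- substitute your own Postgres credentials
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=postgres password=postgres")
cur = conn.cursor()
```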

Congrats, you have executed your ETL successfully.

Database Schema Design

The Sparkify schema is a star schema built for analysis; it is updated only by the ETL, and users cannot edit it.

A star schema consists of facts and dimensions.

[Database schema diagram]

Facts

Songplays fact

  • Records from the log data associated with song plays
  • Only records with page NextSong are included

Members:

  • songplay_id
  • start_time
  • user_id
  • level
  • song_id
  • artist_id
  • session_id
  • location
  • user_agent

The fact table contains the business event being analyzed, which is the song-play transaction.

The fact table contains foreign keys referencing the surrounding dimensions.
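
A sketch of how the songplays table could be declared in sql_queries.py (the column names come from the list above; the data types and constraints are illustrative assumptions, not the project's exact DDL):

```python
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  TIMESTAMP REFERENCES time (start_time),
        user_id     INT REFERENCES users (user_id),
        level       VARCHAR,
        song_id     VARCHAR REFERENCES songs (song_id),
        artist_id   VARCHAR REFERENCES artists (artist_id),
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""
```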

Dimensions

Dimensions are the business entities surrounding the business event.

We have 4 dimensions: songs, artists, users, and time.

Time dimension

  • Timestamps of records in songplays broken down into specific units

Members:

  • start_time
  • hour
  • day
  • week
  • month
  • year
  • weekday

The time dimension helps keep the data consistent across the schema and removes redundancy.
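
A minimal pandas sketch of that breakdown, assuming (as is typical for such logs) that the timestamp column is named ts and holds epoch milliseconds; the file path below is hypothetical:

```python
import pandas as pd

# hypothetical log file path inside the data folder
df = pd.read_json("data/log_data/example_log.json", lines=True)
df = df[df["page"] == "NextSong"]          # keep only song-play records

t = pd.to_datetime(df["ts"], unit="ms")    # convert epoch milliseconds to timestamps
time_df = pd.DataFrame({
    "start_time": t,
    "hour": t.dt.hour,
    "day": t.dt.day,
    "week": t.dt.isocalendar().week,
    "month": t.dt.month,
    "year": t.dt.year,
    "weekday": t.dt.weekday,
})
```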

Songs dimension

  • Songs in the music database

Members:

  • song_id
  • title
  • artist_id
  • year
  • duration

Artists dimension

  • Artists in the music database

Members:

  • artist_id
  • name
  • location
  • latitude
  • longitude

Users dimension

  • Users in the app

Members:

  • user_id
  • first_name
  • last_name
  • gender
  • level
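
Each dimension is a small table of its own in sql_queries.py; a sketch for the users dimension (the column names come from the list above, while the data types are illustrative assumptions):

```python
user_table_create = """
    CREATE TABLE IF NOT EXISTS users (
        user_id    INT PRIMARY KEY,
        first_name VARCHAR,
        last_name  VARCHAR,
        gender     VARCHAR,
        level      VARCHAR
    );
"""
```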

Sparkify Project Files

sql_queries.py

The file contains all of the create, drop, insert, and select queries that create_tables.py and etl.py execute to build the database schema and integrate the data.
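
The file exposes the queries both as individual strings and as the two lists that create_tables.py iterates over; a trimmed illustration of that structure (only two of the five tables shown, and the column lists shortened):

```python
# one CREATE / DROP pair per table (full column definitions live in the real file)
user_table_create = "CREATE TABLE IF NOT EXISTS users (user_id INT PRIMARY KEY, level VARCHAR);"
user_table_drop = "DROP TABLE IF EXISTS users;"

time_table_create = "CREATE TABLE IF NOT EXISTS time (start_time TIMESTAMP PRIMARY KEY, hour INT);"
time_table_drop = "DROP TABLE IF EXISTS time;"

# the lists consumed by create_tables.py
create_table_queries = [user_table_create, time_table_create]
drop_table_queries = [user_table_drop, time_table_drop]
```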

create_tables.py

This Python file has 3 functions plus a main function that executes them; a sketch of these functions follows the list.

  • create_database()

    • Creates and connects to the sparkifydb
    • Returns the connection and cursor to sparkifydb
  • drop_tables(cur, conn)

    • Drops each table using the queries in the drop_table_queries list in sql_queries.py
  • create_tables(cur, conn)

    • Creates each table using the queries in the create_table_queries list in sql_queries.py
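
A sketch of these functions, assuming a local Postgres instance; the host, user, and password below are placeholders to be replaced with your own credentials:

```python
import psycopg2
from sql_queries import create_table_queries, drop_table_queries


def create_database():
    # connect to the default database (placeholder credentials)
    conn = psycopg2.connect("host=127.0.0.1 dbname=postgres user=postgres password=postgres")
    conn.set_session(autocommit=True)
    cur = conn.cursor()

    # create the sparkify database with UTF8 encoding
    cur.execute("DROP DATABASE IF EXISTS sparkifydb")
    cur.execute("CREATE DATABASE sparkifydb WITH ENCODING 'utf8' TEMPLATE template0")

    # close the connection to the default database and connect to sparkifydb
    conn.close()
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=postgres password=postgres")
    cur = conn.cursor()
    return cur, conn


def drop_tables(cur, conn):
    for query in drop_table_queries:
        cur.execute(query)
        conn.commit()


def create_tables(cur, conn):
    for query in create_table_queries:
        cur.execute(query)
        conn.commit()
```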

etl.py

The file processes all of the data files and runs the insert queries to load the fact and dimension tables. It consists of 3 functions; a sketch of process_data() follows the list.

  • process_song_file(cur, filepath)

    • Reads each song file from its path
    • Inserts the song and artist data into the songs and artists dimensions
  • process_log_file(cur, filepath)

    • Reads each log file from its path, filtered by the NextSong page
    • Inserts the time records from the log files into the time dimension
    • Inserts the user records from the log files into the users dimension
    • Inserts the song-play transactions into the songplays fact
  • process_data(cur, conn, filepath, func)

    • Gets all the files with the .json extension (the song files and log files) and applies the given processing function to each
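
A sketch of process_data(), assuming the usual os.walk/glob pattern for collecting the JSON files under the data folder:

```python
import glob
import os


def process_data(cur, conn, filepath, func):
    # collect every .json file under the given directory
    all_files = []
    for root, _, _ in os.walk(filepath):
        all_files.extend(os.path.abspath(f) for f in glob.glob(os.path.join(root, "*.json")))

    # apply the given function (process_song_file or process_log_file) to each file
    for datafile in all_files:
        func(cur, datafile)
        conn.commit()
```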

etl.ipynb

This Jupyter notebook was used during development to read the data from the first file only and verify that the data is valid for insertion.

test.ipynb

This notebook contains test queries and sanity checks for verifying the database schema and its quality. It uses SQL magic to execute its queries.
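
The SQL magic cells look roughly like this (the connection URL uses placeholder credentials):

```python
# Jupyter notebook cells using the ipython-sql extension
%load_ext sql
%sql postgresql://postgres:postgres@127.0.0.1/sparkifydb
%sql SELECT * FROM songplays LIMIT 5;
```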

data folder

Contains the log files and the song files in JSON format; each song file contains the data for one song.

Future work

Create a dashboard in Microsoft Power BI for analytic queries on the Sparkify database.
