
Sparkify - Data Lake with AWS

Sparkify - Data Lake with AWS - Udacity Data Engineering Expert Track.

In this project, I built an ETL pipeline that extracts Sparkify's data from S3, processes it with Spark, and loads it back into S3 as a set of dimensional tables.
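As a minimal sketch of the pipeline's entry point (assumed setup, not necessarily the exact code in etl.py), Spark is started with the hadoop-aws package so it can read from and write to S3:

```python
# Minimal SparkSession setup for S3 access. The hadoop-aws version is an
# assumption; use one that matches your Spark/Hadoop build.
from pyspark.sql import SparkSession

def create_spark_session():
    return (SparkSession.builder
            .appName("sparkify-data-lake")
            .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4")
            .getOrCreate())
```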

Project Details:

The project's purpose is to move the database and data warehouse of a music streaming startup called 'Sparkify' to a data lake, where data can be queried via PySpark.

Their data resides in S3: a directory of JSON logs of user activity on the app, and a directory of JSON metadata on the songs in the app.

These processes allow their analytics team to continue finding insights into what songs their users are listening to.

The data is modeled as a star schema with one fact table and four dimension tables:

  • Fact Table:

    • songplays: records of song plays, the metrics used for the song-play analytics (see the PySpark sketch after this list).
      • songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
  • Dimension Tables:

    • users: the users registered in the application.
      • user_id, first_name, last_name, gender, level
    • songs: the songs in the music database.
      • song_id, title, artist_id, year, duration
    • artists: the artists registered in the music database.
      • artist_id, name, location, latitude, longitude
    • time: timestamps of song-play records broken down into specific units.
      • start_time, hour, day, week, month, year, weekday
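The fact table is the only one that needs both datasets. Below is a hedged PySpark sketch of how songplays could be derived; the log field names (page, song, artist, length, ts, userId, sessionId, userAgent) are the usual ones in this dataset, but the repo's etl.py may differ:

```python
from pyspark.sql import functions as F

def build_songplays(log_df, song_df):
    # Keep only actual song-play events from the activity logs
    plays = log_df.filter(F.col("page") == "NextSong")
    # Match each play to a song on title, artist name, and duration
    return (plays
        .join(song_df,
              (plays["song"] == song_df["title"])
              & (plays["artist"] == song_df["artist_name"])
              & (plays["length"] == song_df["duration"]),
              "left")
        .select(
            F.monotonically_increasing_id().alias("songplay_id"),
            # ts is in epoch milliseconds; convert to a timestamp
            (F.col("ts") / 1000).cast("timestamp").alias("start_time"),
            F.col("userId").alias("user_id"),
            "level",
            "song_id",
            "artist_id",
            F.col("sessionId").alias("session_id"),
            "location",
            F.col("userAgent").alias("user_agent")))
```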

Project Datasets:

The project has two datasets that reside in S3.

  • Song data: s3://udacity-dend/song_data
  • Log data: s3://udacity-dend/log_data
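Continuing the sketch above, both datasets can be loaded with Spark's JSON reader; the wildcard depth is an assumption that matches the partition layouts described below:

```python
# Assumed layouts: song_data/<A>/<B>/<C>/<file>.json and
# log_data/<year>/<month>/<file>.json
spark = create_spark_session()
song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")
log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")
```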

Songs Dataset:

This dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID.
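An illustrative file path and record under this layout (the track ID and field values are examples, not data from this repo):

```
song_data/A/B/C/TRABCEI128F424C983.json
```

```json
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
```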

Log Dataset:

This dataset consists of JSON log files generated by an event simulator, based on the songs in the dataset above. They simulate user activity logs from a music streaming app according to specified configurations. The files are partitioned by year and month.
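An illustrative path under this partitioning (the file-name pattern is an assumption):

```
log_data/2018/11/2018-11-12-events.json
```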

Project files:

The project includes two files:

  • etl.py reads data from S3, processes it with Spark, and writes it back to S3.
  • dl.cfg contains the AWS credentials.
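A hedged sketch of how etl.py might pick up those credentials with configparser (the section and key names are assumptions; match them to the actual dl.cfg):

```python
# Expects dl.cfg to look like:
#   [AWS]
#   AWS_ACCESS_KEY_ID=...
#   AWS_SECRET_ACCESS_KEY=...
import configparser
import os

config = configparser.ConfigParser()
config.read("dl.cfg")

# Spark's S3A connector picks these up from the environment
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```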

Tools and Technologies:

  • PySpark.
  • AWS Services.
  • Amazon Elastic MapReduce (EMR) Clusters and Notebooks.
  • Python 3.
  • ETL (Extract, Transform, Load).
  • Data Warehouse and Data Lakes Concepts.
  • Cloud Computing Concepts.
  • Big Data and NoSQL concepts.

Project Steps:

  • 1- Design fact and dimension tables in a way that answers the queries given by Sparkify's analytics team.
  • 2- Launch an EMR Cluster and Notebook on AWS to deploy the Spark processes on it.
  • 3- Build an ETL pipeline for a data lake hosted on S3.
  • 4- Load Sparkify's data from S3.
  • 5- Process the data into analytics tables using Spark.
  • 6- Load the processed data back into S3 (see the sketch below).
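A hedged sketch of steps 5-6 for two of the tables; the output bucket placeholder and the partition columns are assumptions consistent with the table designs above:

```python
# Write analytics tables back to S3 as partitioned Parquet files.
def save_tables(songs_table, time_table, output="s3a://<your-output-bucket>/"):
    # Partitioning songs by year and artist_id keeps per-artist reads cheap
    songs_table.write.mode("overwrite") \
        .partitionBy("year", "artist_id") \
        .parquet(output + "songs/")
    # Partitioning time by year and month mirrors the log layout
    time_table.write.mode("overwrite") \
        .partitionBy("year", "month") \
        .parquet(output + "time/")
```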
