
Sparkify - Data Lake with AWS

Sparkify - Data Lake with AWS - Udacity Data Engineering Expert Track.

In this project, I built an ETL pipeline that extracts Sparkify's data from S3, processes it with Spark, and loads it back into S3 as a set of dimensional tables.
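As a minimal sketch of the pipeline's entry point (assumed setup, not necessarily the exact code in etl.py), Spark is started with the hadoop-aws package so it can read from and write to S3:

```python
# Minimal SparkSession setup for S3 access. The hadoop-aws version is an
# assumption; use one that matches your Spark/Hadoop build.
from pyspark.sql import SparkSession

def create_spark_session():
    return (SparkSession.builder
            .appName("sparkify-data-lake")
            .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4")
            .getOrCreate())
```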

Project Details:

The project's purpose is to move the database and data warehouse of a music streaming startup called 'Sparkify' to a data lake, where data can be queried via PySpark.

Their data resides in S3: a directory of JSON logs of user activity on the app, and a directory of JSON metadata on the songs in the app.

These processes allow their analytics team to continue finding insights into what songs their users are listening to.

The data is modeled as a star schema with one fact table and four dimension tables:

  • Fact Table:

    • songplays: records of song plays, the metrics used for the song-play analytics (see the PySpark sketch after this list).
      • songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
  • Dimension Tables:

    • users: the users registered in the application.
      • user_id, first_name, last_name, gender, level
    • songs: the songs in the music database.
      • song_id, title, artist_id, year, duration
    • artists: the artists registered in the music database.
      • artist_id, name, location, latitude, longitude
    • time: timestamps of song-play records broken down into specific units.
      • start_time, hour, day, week, month, year, weekday
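The fact table is the only one that needs both datasets. Below is a hedged PySpark sketch of how songplays could be derived; the log field names (page, song, artist, length, ts, userId, sessionId, userAgent) are the usual ones in this dataset, but the repo's etl.py may differ:

```python
from pyspark.sql import functions as F

def build_songplays(log_df, song_df):
    # Keep only actual song-play events from the activity logs
    plays = log_df.filter(F.col("page") == "NextSong")
    # Match each play to a song on title, artist name, and duration
    return (plays
        .join(song_df,
              (plays["song"] == song_df["title"])
              & (plays["artist"] == song_df["artist_name"])
              & (plays["length"] == song_df["duration"]),
              "left")
        .select(
            F.monotonically_increasing_id().alias("songplay_id"),
            # ts is in epoch milliseconds; convert to a timestamp
            (F.col("ts") / 1000).cast("timestamp").alias("start_time"),
            F.col("userId").alias("user_id"),
            "level",
            "song_id",
            "artist_id",
            F.col("sessionId").alias("session_id"),
            "location",
            F.col("userAgent").alias("user_agent")))
```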

Project Datasets:

The project has two datasets that reside in S3.

  • Song data: s3://udacity-dend/song_data
  • Log data: s3://udacity-dend/log_data
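Continuing the sketch above, both datasets can be loaded with Spark's JSON reader; the wildcard depth is an assumption that matches the partition layouts described below:

```python
# Assumed layouts: song_data/<A>/<B>/<C>/<file>.json and
# log_data/<year>/<month>/<file>.json
spark = create_spark_session()
song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")
log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")
```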

Songs Dataset:

This dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID.
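An illustrative file path and record under this layout (the track ID and field values are examples, not data from this repo):

```
song_data/A/B/C/TRABCEI128F424C983.json
```

```json
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
```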

Log Dataset:

This dataset consists of JSON log files generated by an event simulator, based on the songs in the dataset above. They simulate user activity logs from a music streaming app according to specified configurations. The files are partitioned by year and month.
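An illustrative path under this partitioning (the file-name pattern is an assumption):

```
log_data/2018/11/2018-11-12-events.json
```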

Project files:

The project includes two files:

  • etl.py reads data from S3, processes it with Spark, and writes it back to S3.
  • dl.cfg contains the AWS credentials.
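A hedged sketch of how etl.py might pick up those credentials with configparser (the section and key names are assumptions; match them to the actual dl.cfg):

```python
# Expects dl.cfg to look like:
#   [AWS]
#   AWS_ACCESS_KEY_ID=...
#   AWS_SECRET_ACCESS_KEY=...
import configparser
import os

config = configparser.ConfigParser()
config.read("dl.cfg")

# Spark's S3A connector picks these up from the environment
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```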

Tools and Technologies:

  • PySpark.
  • AWS Services.
  • Amazon Elastic MapReduce (EMR) Clusters and Notebooks.
  • Python 3.
  • ETL (Extract, Transform, Load).
  • Data Warehouse and Data Lakes Concepts.
  • Cloud Computing Concepts.
  • Big Data and NoSQL concepts.

Project Steps:

  • 1- Design fact and dimension tables in a way that answers the queries given by Sparkify's analytics team.
  • 2- Launch an EMR Cluster and Notebook on AWS to deploy the Spark processes on it.
  • 3- Build an ETL pipeline for a data lake hosted on S3.
  • 4- Load Sparkify's data from S3.
  • 5- Process the data into analytics tables using Spark.
  • 6- Load the processed data back into S3 (see the sketch below).
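A hedged sketch of steps 5-6 for two of the tables; the output bucket placeholder and the partition columns are assumptions consistent with the table designs above:

```python
# Write analytics tables back to S3 as partitioned Parquet files.
def save_tables(songs_table, time_table, output="s3a://<your-output-bucket>/"):
    # Partitioning songs by year and artist_id keeps per-artist reads cheap
    songs_table.write.mode("overwrite") \
        .partitionBy("year", "artist_id") \
        .parquet(output + "songs/")
    # Partitioning time by year and month mirrors the log layout
    time_table.write.mode("overwrite") \
        .partitionBy("year", "month") \
        .parquet(output + "time/")
```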
