Author: Tianlin He
Date: 12 Aug 2020
Tag: #Udacity #Data Engineering
A startup called Sparkify wants to analyse the data they have been collected on songs and user activity on their music streaming app, with a particular interest in the songs that are listened by the users. The tasks of this project include:
- Query the data in the form of JSON logs from two local directories
- Create a Postgres database schema and ETL pipeline based on the queries
- Test the database and pipeline by running queries and validate with the expected results
The JSON data reside in two local directories with structure:
The log data record user activities (user id, user level etc), while the song data contain information about songs (artist, year etc).
We will create a star schema optimised for queries on song play analysis, which include a fact table and four dimension tables:
* `songplays` Records in log data associated with song plays in the app
users
Users in the app- songs
Songs in music database
artists
Artists in music databasetime
Timestamps of records in log data broken down into specific time units (hour, day, week etc)
The project is developed in a local environment (MacOS) with detailed steps in step-by-step-data-modelling-locally. In addition, there are several tips:
- For a typical data engineering task, the datatypes columns were not predefined. Thus they have to be manually added by inspecting the data files
- Don't forget to define a primary key (PK) for each table
ON CONFLICT (PK) DO action
in the case of duplicated rows
Once we built the database, we can run some queries in test.ipynb:
%sql SELECT COUNT(*) FROM songplays;
There are songplays
.
%sql SELECT * FROM songplays WHERE song_id IS NOT NULL;
songplay_id | start_time | user_id | level | song_id | artist_id | session_id | location | user_agent |
---|---|---|---|---|---|---|---|---|
5537 | 2018-11-21 21:56:47.796000 | 15 | paid | SOZCTXZ12AB0182364 | AR5KOSW1187FB35FF4 | 818 | Chicago-Naperville-Elgin, IL-IN-WI | "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/36.0.1985.125 Chrome/36.0.1985.125 Safari/537.36" |
We could only match one song from music with log data.
%sql SELECT level, COUNT(*) FROM users GROUP BY level;
There are
%%sql SELECT a.name AS artist, s.title AS song, s.duration AS song_length
FROM songs AS s
JOIN artists AS a
ON s.artist_id=a.artist_id
ORDER BY s.duration DESC
LIMIT 1;
The longest song is:
artist | song | song_length |
---|---|---|
Faiz Ali Faiz | Sohna Nee Sohna Data | 599.24853 |
%%sql SELECT *
FROM time
WHERE start_time=(SELECT MAX(start_time) FROM time);
The most recent record is:
start_time | hour | day | week | month | year | weekday |
---|---|---|---|---|---|---|
2018-11-30 19:54:24.796000 | 19 | 30 | 48 | 11 | 2018 | 4 |