A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analysis team is particularly interested in understanding what songs users are listening to.
We want to answer those 3 queries
- Give me the artist, song title and song's length in the music app history that was heard during sessionId = 338, and itemInSession = 4
- Give me only the following: name of artist, song (sorted by itemInSession) and **user (first and last name)** for userid = 10, sessionid = 182
- Give me every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own'
The ETL extracts the data from event files and but it in a new file called event_datafile_new
then get the required data for creating Cassandra tables depending on the query.
The database used in the project is Apache Cassandra
To run the ETL open ETL.ipynb
and execute the code sections one by one to create the keyspace and the table and executing the queries
The keyspace contains 3 tables (one table according to each query)
ETL.ipynb
:
- the file contains the process of to collect the data from all files to one new
CSV
file - Connect to cassandra with a cluster then creating a session
- Creating the keyspace for the tables and setting the replication configurations
- Creating the tables for every query and test the select statement to ensure that result is right
- Dropping the tables and closing the connection