kshitijzutshi/DAMG7245-Final-Project

Content based Music Recommendation System

PyPI Tests

Python

CLaaT Document

Access the Claat document here 🚀

Open in Gitpod

Overview 📝

We use automated recommender systems everywhere in our day-to-day lives, be it the TV show to binge-watch on Netflix, playlists on YouTube, or job recommendations on LinkedIn. Music applications such as Spotify, FreeSound, and Songkick use recommender systems as well. As part of an academic project for the course Big Data Systems & Intelligence Analytics, we build a data pipeline based on a similar idea: we use inference from a pre-trained model to recommend music playlists that suit users' interests.

Goals 🎯

Our objective is to build an application where users can provide a link to their music playlist and get relevant content-based recommendations from our pre-trained machine learning model. The features planned around this recommendation pipeline are as follows:

💡 Users can get the most similar song recommendations based either on an input song of their choice or on one of their Spotify playlists. The results are generated using a K-Means clustering model.

💡 Using a user's recently searched songs, the system sends weekly email 📧 music recommendations.

💡 Implement a user analytics dashboard covering the following: word cloud, playlist recommendation feedback (like/dislike), most popular genres among users, last activity, etc.
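
The first feature above can be illustrated with a minimal, dependency-free sketch. It is not the project's deployed model: the feature names, the catalog values, the centroids (standing in for a fitted K-Means model), and the helper names `assign_cluster` and `recommend` are all assumptions made for this example.

```python
import math

# Hypothetical audio features per track: (danceability, energy, valence).
CATALOG = {
    "Track A": (0.80, 0.90, 0.70),
    "Track B": (0.78, 0.85, 0.65),
    "Track C": (0.20, 0.10, 0.30),
    "Track D": (0.25, 0.15, 0.35),
    "Track E": (0.82, 0.88, 0.75),
}

# Hypothetical centroids, standing in for a fitted K-Means model.
CENTROIDS = [(0.8, 0.9, 0.7), (0.2, 0.1, 0.3)]


def assign_cluster(features, centroids):
    """Return the index of the centroid closest to `features`."""
    distances = [math.dist(features, c) for c in centroids]
    return distances.index(min(distances))


def recommend(seed_features, catalog, centroids, top_n=3, exclude=()):
    """Rank tracks from the seed's cluster by distance to the seed."""
    seed_cluster = assign_cluster(seed_features, centroids)
    candidates = [
        (math.dist(seed_features, feats), name)
        for name, feats in catalog.items()
        if name not in exclude
        and assign_cluster(feats, centroids) == seed_cluster
    ]
    return [name for _, name in sorted(candidates)[:top_n]]


recommendations = recommend(
    CATALOG["Track A"], CATALOG, CENTROIDS, exclude={"Track A"}
)
print(recommendations)
```

Restricting candidates to the seed's cluster keeps the search small; ranking by distance inside the cluster then orders the candidates by similarity to the seed.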

Use cases 📑

By offering music listeners more variety and analyzing their listening patterns, a music recommendation system like ours could be integrated or shared with social media platforms such as Instagram and Snapchat to search for similar songs based on user input.

Dataset Source 🔦

Spotify Million Playlist Dataset: AIcrowd | Spotify Million Playlist Dataset Challenge | Challenges. The dataset contains 1,000,000 playlists, including playlist titles and track titles, created by users on the Spotify platform between January 2010 and October 2017. It includes over 2 million unique tracks by nearly 300,000 artists.
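
As a rough sketch of how one slice of the dataset might be read, the snippet below parses a tiny in-memory stand-in for a slice file. The field names (`playlists`, `tracks`, `track_uri`, `artist_name`) follow the public slice layout as we understand it and should be treated as assumptions; the sample values and the `iter_tracks` helper are hypothetical.

```python
import json
from collections import Counter

# A tiny stand-in for one slice file (hypothetical values).
SAMPLE_SLICE = json.dumps({
    "playlists": [
        {"name": "Road Trip", "pid": 0, "tracks": [
            {"track_uri": "spotify:track:aaa", "track_name": "Song One",
             "artist_name": "Artist X"},
            {"track_uri": "spotify:track:bbb", "track_name": "Song Two",
             "artist_name": "Artist Y"},
        ]},
        {"name": "Focus", "pid": 1, "tracks": [
            {"track_uri": "spotify:track:aaa", "track_name": "Song One",
             "artist_name": "Artist X"},
        ]},
    ]
})


def iter_tracks(slice_text):
    """Yield (playlist name, track URI, artist name) for every track."""
    data = json.loads(slice_text)
    for playlist in data["playlists"]:
        for track in playlist["tracks"]:
            yield playlist["name"], track["track_uri"], track["artist_name"]


rows = list(iter_tracks(SAMPLE_SLICE))
unique_tracks = {uri for _, uri, _ in rows}
artist_counts = Counter(artist for _, _, artist in rows)

print(len(rows), "playlist entries,", len(unique_tracks), "unique tracks")
print(artist_counts.most_common(1))
```

For a real slice, the same function could be fed the text of the file instead of `SAMPLE_SLICE`.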

Process Outline

  1. Data Preprocessing
  2. Understanding the Dataset - Exploratory Data Analysis(EDA)
  3. Building a pipeline using a pre-trained model and various tools (AWS SageMaker, Streamlit, GCP)
  4. Deploying the model on AWS
  5. Build a web application using streamlit for showcasing the results.

Milestones

Time Frame | Tasks
Day 1 - 5 | Data processing, EDA, Model selection
Day 5 - 10 | Deployment of Models, Setup of Data pipeline, Streamlit Integration
Day 10 - 15 | System integration, App enhancements, Testing and documentation

Project Setup

The project consists of 4 major components -

  • The model training Pipeline
  • The model Inference Pipeline
  • Music recommendation system Pipeline
  • Github Actions Workflow & Testing

Final_Project_pipeline

First, let's look at the requirements for this project:

Requirements

🐍 Python ➡ 3.9.7
altair==4.1.0
📊 matplotlib==3.5.0
🔢 numpy==1.19.5
⚒ openTSNE==0.6.1
📄 pandas==1.2.5
pip==21.3.1
plotly==5.4.0
🔀requests==2.25.1
scikit-learn==0.24.2
scipy==1.7.3
🎶spotipy==2.19.0
🖼streamlit==1.2.0
seaborn==0.11.2
tqdm==4.62.3
urllib3==1.26.7
wordcloud==1.8.1
python-dotenv
streamlit-aggrid
streamlit-option-menu
python-decouple
tk
opencv-python
Pillow

1. Setting up Model Training and Inference Pipeline

The following YouTube video, created by us, walks through these stages in detail:

Starting an AWS SageMaker Notebook Instance ➡ Training in the instance ➡ Deploying the model ➡ Using Lambda & API Gateway to expose the model as an API
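
The last stage can be sketched as a Lambda handler that forwards a feature vector to the deployed endpoint. This is an illustration under stated assumptions, not the project's actual handler: the endpoint name, the `SAGEMAKER_ENDPOINT_NAME` environment variable, the `features` request field, and the CSV payload format are hypothetical, and the optional `runtime_client` parameter exists only so the function can be exercised without AWS credentials.

```python
import json
import os

# Hypothetical endpoint name; in Lambda it would come from an environment variable.
ENDPOINT_NAME = os.environ.get("SAGEMAKER_ENDPOINT_NAME", "kmeans-endpoint")


def lambda_handler(event, context, runtime_client=None):
    """Forward a feature vector from API Gateway to a SageMaker endpoint."""
    if runtime_client is None:
        import boto3  # imported lazily so the handler can be tested offline
        runtime_client = boto3.client("sagemaker-runtime")

    # Validate the request body before calling the endpoint.
    try:
        body = json.loads(event.get("body") or "{}")
        features = [float(x) for x in body["features"]]
    except (KeyError, TypeError, ValueError):
        error = {"error": "expected JSON with a numeric 'features' list"}
        return {"statusCode": 400, "body": json.dumps(error)}

    # Assumption: the model accepts one CSV row of features.
    payload = ",".join(str(x) for x in features)
    response = runtime_client.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="text/csv",
        Body=payload,
    )
    prediction = json.loads(response["Body"].read().decode("utf-8"))
    return {"statusCode": 200, "body": json.dumps(prediction)}
```

With a Lambda proxy integration, API Gateway passes the request body as a string in `event["body"]`, which is why the handler decodes it with `json.loads` before building the payload.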

2. Music Recommendation system deployment on GCP

2.1 Before you begin

  • In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
  • Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.
  • Enable the Compute Engine API.

2.2 Create a Linux VM instance

  • In the Cloud console, go to the Create an instance page.
  • In the Boot disk section, click Change to begin configuring your boot disk.
  • On the Public images tab, choose Ubuntu from the Operating system list.
  • Choose Ubuntu 20.04 LTS from the Version list.
  • In the Firewall section, select Allow HTTP traffic.
  • To create the VM, click Create.

Allow a short period of time for the instance to start. After the instance is ready, it's listed on the VM instances page with a ✅.

Compute Engine grants the roles/compute.instanceAdmin role to the user who creates the VM. Compute Engine also adds that user to the sudo group.

2.3 Connect to the VM instance

Connect to an instance by using the Google Cloud console and completing the following steps. You're connected to the VM as the user you used to access the VM instances page.

  • In the Cloud console, go to the VM instances page.
  • In the list of virtual machine instances, click SSH in the row of the instance that you want to connect to.

2.4 Connect to GCS Bucket by Mounting using GCSFuse

Instead of relying on large chunks of code to read from and write to the GCS bucket, we used gcsfuse to mount the bucket as a local directory on our VM instance. The steps are:

  1. Add the gcsfuse distribution URL as a package source and import its public key:

    export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
    echo "deb http://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
    curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
    
  2. Update the list of packages available and install gcsfuse.

    sudo apt-get update
    sudo apt-get install gcsfuse
    
  3. (Ubuntu before wily only) Add yourself to the fuse group, then log out and back in:

    sudo usermod -a -G fuse $USER
    exit
    

2.5 Start the Streamlit Server

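  • After completing step 2.4, change into the mounted GCS bucket directory, go to the streamlit folder, and start the Streamlit server:
    streamlit run streamlit.py
    
  • This starts the server on the VM. The app is then reachable at the VM's external IP address, provided the firewall allows traffic on the port Streamlit listens on (8501 by default).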

3. Github Actions Workflow & Testing

Testing the application with pytest cases helped us fix many flaws in the application. To achieve this, we used the following packages:

3.1 Code coverage and Pytest unit testing

First, install the pytest and coverage packages:

pip install pytest
pip install coverage

With these packages installed, we can run the tests and collect coverage data by running the following command in the project root directory:

coverage run -m pytest
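
For illustration, a test file that pytest would collect might look like the sketch below. The file name `test_example.py` and the helper `clean_song_name` are hypothetical and are not taken from the project's `test_songname.py`.

```python
# test_example.py (hypothetical file name; pytest collects files named test_*.py)


def clean_song_name(raw):
    """Hypothetical helper: trim and collapse whitespace in a song title."""
    return " ".join(raw.split())


def test_strips_surrounding_whitespace():
    assert clean_song_name("  Blinding Lights  ") == "Blinding Lights"


def test_collapses_inner_whitespace():
    assert clean_song_name("Shape   of  You") == "Shape of You"


def test_empty_input_stays_empty():
    assert clean_song_name("   ") == ""
```

After the run, `coverage report -m` prints a per-file summary with missed lines, and `coverage html` writes a browsable report to the `htmlcov` folder.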

3.2 Application Load testing using Locust Server

Locust is an open source load testing tool. It defines user behaviour with Python code and swarms your system with millions of simultaneous users.

dependencies: locust

pip install locust

Run Locust with command:

    locust -f locust_test.py

3.3 Github Action Workflows

For the scope of the project we have incorporated the following Github Actions workflows -

  1. Close an issue if a comment says "close issue"
  2. Run pytest and coverage tests on multiple Python versions and operating systems on each push/pull request
  3. Send weekly Spotify recommendations based on the user's last searched song (cron workflow)
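
The weekly recommendation email could be assembled along the lines of the sketch below, using only the Python standard library. It is an assumption-laden illustration rather than the contents of the project's `send_email.py`: the function names, the message wording, and the SMTP-over-SSL transport are hypothetical choices.

```python
import smtplib
from email.message import EmailMessage


def build_weekly_email(sender, recipient, last_song, recommendations):
    """Build a plain-text email listing recommendations for `last_song`."""
    message = EmailMessage()
    message["Subject"] = f"Your weekly recommendations based on {last_song}"
    message["From"] = sender
    message["To"] = recipient
    lines = [f"{i}. {title}" for i, title in enumerate(recommendations, start=1)]
    message.set_content(
        f"Because you searched for {last_song}:\n\n" + "\n".join(lines)
    )
    return message


def send(message, host, port, username, password):
    """Send the message over SMTP with SSL (credentials are placeholders)."""
    with smtplib.SMTP_SSL(host, port) as server:
        server.login(username, password)
        server.send_message(message)


email = build_weekly_email(
    "noreply@example.com", "listener@example.com",
    "Track A", ["Track E", "Track B"],
)
print(email["Subject"])
```

Keeping message construction separate from sending makes the content easy to test without a mail server; a scheduled workflow would call `send` with credentials supplied as secrets.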

Project Folder Structure

📦workflows  ➡ This folder contains all the workflows related to github actions
 ┣ 📜close-issue.yml
 ┣ 📜get-weekly-top10.yaml
 ┣ 📜hello.yaml
 ┣ 📜new_top50.yml
 ┣ 📜send-email.yml
 ┗ 📜testing_workflow.yml
 📦code
 ┣ 📜Data_Preprocessing.ipynb
 ┣ 📜Exploratory Data Analysis.ipynb
 ┣ 📜Get_MPD_Data.ipynb
 ┣ 📜kmeans-sagemaker.ipynb   ➡ This notebook runs on the AWS SageMaker Notebook instance to deploy KMeans
 ┣ 📜model_selection_visualization.ipynb
 ┣ 📜Playlist_Recommendation.ipynb
 ┣ 📜read_spotify_million_playlists.py
 ┗ 📜test.ipynb
 📦data   ➡ This folder contains the data preprocessing results from the original dataset
 ┣ 📂smp_data
 ┣ 📜2022_19_19_23_29_24_MPD_Extended.csv
 ┣ 📜2022_19_19_23_29_40_playlists_20000.json
 ┣ 📜2022_20_20_00_58_35_Playlist_Feats_20000.csv
 ┣ 📜spotify_20K_playlists.db
 ┗ 📜MPD.csv
 📦htmlcov   ➡ Extensive Pytest and code coverage report
 ┣ 📜coverage_html.js
 ┣ 📜d_36f028580bb02cc8_locust_test_py.html
 ┣ 📜d_36f028580bb02cc8_test_songname_py.html
 ┣ 📜favicon_32.png
 ┣ 📜index.html
 ┣ 📜keybd_closed.png
 ┣ 📜keybd_open.png
 ┣ 📜status.json
 ┗ 📜style.css
 📦images
 📦model
 ┣ 📜openTSNETransformer.sav
 ┗ 📜StdScaler.sav
 📦streamlit
 ┣ 📂assets
 ┃ ┣ 📂images
 ┃ ┃ ┣ 📜login.gif
 ┃ ┃ ┣ 📜logo.png
 ┃ ┃ ┣ 📜settings.png
 ┃ ┃ ┣ 📜spotify.jpg
 ┃ ┃ ┣ 📜spotify.png
 ┃ ┃ ┣ 📜spotify_get_playlist_uri.png
 ┃ ┃ ┗ 📜twitter-logo.png
 ┃ ┣ 📜.DS_Store
 ┃ ┗ 📜styles.css
 ┣ 📜new.csv
 ┣ 📜spotipy_client.py   ➡ Contains all the helper functions needed in the Streamlit application
 ┗ 📜streamlit.py ➡ This is where it all comes together, the final streamlit application code is here
 📦test   ➡ This folder contains the testing scripts - PyTest, coverage and Locust Load testing
 ┣ 📜locust_test.py
 ┣ 📜test_songname.py
 ┗ 📜tracks_10.csv
 ┣ 📜requirements.txt
 ┣ 📜requirements_dev.txt
 ┗ 📜send_email.py

Project Demo Walk through Video

Contributions

Contributor | GitHub Issues | Status | % Contribution
Kshitij Zutshi | Issue #4, Issue #10, Issue #9, Issue #6, Issue #6, Issue #17, Issue #23, Issue #21, Issue #26, Issue #27 | ✅ Complete | 40%
Priyanka Dilip Shinde | Issue #1, Issue #5, Issue #6, Issue #12, Issue #24, Issue #7, Issue #20 | ✅ Complete | 35%
Dwithika Shetty | Issue #2, Issue #6, Issue #11, Issue #22 | ✅ Complete | 25%

Reference

How to build a music recommender system. | Towards Data Science

Music Recommender System (iitk.ac.in)

Music APIs - A List of Free and Public APIs (the-api-collective.com)

AIcrowd | Spotify Million Playlist Dataset Challenge | Challenges