Content-based Music Recommendation System

CLaaT Document

Overview 📝

We use automated recommender systems everywhere in our day-to-day lives, be it the TV show to binge-watch on Netflix, playlists on YouTube, or job recommendations on LinkedIn. Music applications have recommender systems as well: Spotify, FreeSound, and Songkick, to name a few. As part of an academic project for the course Big Data Systems & Intelligence Analytics, we build a data pipeline based on a similar idea, in which we use pre-trained model inference to recommend music playlists that suit users' interests.

Goals 🎯

Our objective is to build an application where users can provide a link to their music playlist and get relevant content-based recommendations from our pre-trained machine learning model. The features planned around this recommendation pipeline are as follows -

💡 Users can get recommendations for the most similar songs based either on an input song of their choice or on their Spotify playlist. The results are generated using the K-Means clustering model.

💡 Using each user's recently searched songs, users receive music recommendations by email 📧 on a weekly basis.

💡 Implement a user analytics dashboard covering the following - word cloud, playlist recommendation feedback (like/dislike), most popular genres among users, last activity, etc.
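The K-Means-based recommendation in the first feature can be illustrated with a minimal sketch. It is not the project's SageMaker model: it runs a plain Lloyd's-algorithm K-Means in pure Python, and the song names and two-dimensional feature vectors (think of normalized audio features) are hypothetical. The idea shown is to find the cluster of the input song and rank the other songs in that cluster by distance.

```python
import math

def nearest(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def recommend(song, catalog, centroids, top_n=2):
    """Rank the other songs in the input song's cluster by distance to it."""
    target = nearest(catalog[song], centroids)
    candidates = [
        name for name, feats in catalog.items()
        if name != song and nearest(feats, centroids) == target
    ]
    return sorted(candidates, key=lambda n: math.dist(catalog[n], catalog[song]))[:top_n]

# Hypothetical catalog: song name -> (feature_1, feature_2), both scaled to 0..1.
catalog = {
    "calm_a": (0.20, 0.10),
    "dance_a": (0.90, 0.80),
    "calm_b": (0.25, 0.15),
    "dance_b": (0.85, 0.90),
    "calm_c": (0.10, 0.20),
}
centroids = kmeans(list(catalog.values()), k=2)
print(recommend("calm_a", catalog, centroids))  # ['calm_b', 'calm_c']
```

In practice a library implementation such as scikit-learn's `KMeans` (already listed in the requirements) would replace the hand-written `kmeans` function; the ranking step stays the same.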

Use cases 📑

By creating variety and options for music listeners and analyzing their listening patterns, a music recommendation system like ours can be integrated or shared with social media platforms like Instagram and Snapchat to search for similar songs based on user input.

Dataset Source 🔦

Spotify Million Playlist Dataset: AIcrowd | Spotify Million Playlist Dataset Challenge | Challenges. The dataset contains 1,000,000 playlists, including playlist titles and track titles, created by users on the Spotify platform between January 2010 and October 2017. It covers over 2 million unique tracks and nearly 300,000 artists.
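To get a feel for the data, the sketch below counts playlists, unique tracks, and unique artists in one slice of the dataset. It assumes the slice layout of a top-level `playlists` list whose entries carry a `tracks` list with `track_uri` and `artist_name` fields; the sample rows are hand-made placeholders, not real dataset entries.

```python
import json

def summarize_slice(slice_data):
    """Count playlists, unique tracks, and unique artists in one slice."""
    track_uris = set()
    artists = set()
    for playlist in slice_data["playlists"]:
        for track in playlist["tracks"]:
            track_uris.add(track["track_uri"])
            artists.add(track["artist_name"])
    return {
        "playlists": len(slice_data["playlists"]),
        "unique_tracks": len(track_uris),
        "unique_artists": len(artists),
    }

# Hand-made sample in the assumed slice layout (placeholder values only).
sample = json.loads("""
{
  "playlists": [
    {"name": "Morning", "tracks": [
      {"track_uri": "spotify:track:A", "artist_name": "Artist X"},
      {"track_uri": "spotify:track:B", "artist_name": "Artist Y"}
    ]},
    {"name": "Evening", "tracks": [
      {"track_uri": "spotify:track:A", "artist_name": "Artist X"},
      {"track_uri": "spotify:track:C", "artist_name": "Artist X"}
    ]}
  ]
}
""")
summary = summarize_slice(sample)
print(summary)
```

For a real slice file, the same function can be applied to the result of `json.load` on the opened file.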

Process Outline

Data Preprocessing
Understanding the dataset - Exploratory Data Analysis (EDA)
Building a pipeline around a pre-trained model using various tools (AWS SageMaker, Streamlit, GCP)
Deploying the model on AWS
Building a web application using Streamlit to showcase the results

Milestones

Time Frame	Tasks
Day 1 - 5	Data processing, EDA, Model selection
Day 5 - 10	Deployment of Models, Setup of Data pipeline, Streamlit Integration
Day 10 - 15	System integration, App enhancements, Testing and documentation

Project Setup

The project consists of 4 major components -

The model training Pipeline
The model Inference Pipeline
Music recommendation system Pipeline
Github Actions Workflow & Testing

First, let's look at the requirements for this project -

Requirements

🐍 Python ➡ 3.9.7
altair==4.1.0
📊 matplotlib==3.5.0
🔢 numpy==1.19.5
⚒ openTSNE==0.6.1
📄 pandas==1.2.5
pip==21.3.1
plotly==5.4.0
🔀requests==2.25.1
scikit-learn==0.24.2
scipy==1.7.3
🎶spotipy==2.19.0
🖼streamlit==1.2.0
seaborn==0.11.2
tqdm==4.62.3
urllib3==1.26.7
wordcloud==1.8.1
python-dotenv
streamlit-aggrid
streamlit-option-menu
python-decouple
tk
opencv-python
Pillow

1. Setting up Model Training and Inference Pipeline

The following YouTube video, created by us, gives a detailed walkthrough of these stages:

Starting an AWS SageMaker Notebook Instance ➡ Training in the instance ➡ Deploying the model ➡ Using Lambda & API Gateway to expose the model as an API
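The last stage, exposing the deployed model through Lambda and API Gateway, could look roughly like the handler below. This is a sketch under assumptions, not the project's actual Lambda code: the endpoint name, the `KMEANS_ENDPOINT` environment variable, and the `{"features": [...]}` request body are hypothetical, and the endpoint is assumed to accept `text/csv` rows and return JSON. The optional `runtime` parameter exists only so the function can be exercised without AWS access.

```python
import json
import os

# Hypothetical endpoint name; use whatever name the deployment step created.
ENDPOINT_NAME = os.environ.get("KMEANS_ENDPOINT", "kmeans-endpoint")

def lambda_handler(event, context, runtime=None):
    """Forward feature rows from an API Gateway request to a SageMaker endpoint."""
    if runtime is None:
        import boto3  # provided by the AWS Lambda Python runtime
        runtime = boto3.client("sagemaker-runtime")

    # Assumed request body: {"features": [[0.2, 0.1], [0.9, 0.8]]}
    rows = json.loads(event["body"])["features"]
    csv_payload = "\n".join(",".join(str(v) for v in row) for row in rows)

    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="text/csv",
        Body=csv_payload,
    )
    result = json.loads(response["Body"].read())

    # Response shape expected by an API Gateway Lambda proxy integration.
    return {"statusCode": 200, "body": json.dumps(result)}
```

Lambda itself calls the handler with only `event` and `context`, so the real `boto3` client is created in that case.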

2. Music Recommendation system deployment on GCP

2.1 Before you begin

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.
Enable the Compute Engine API.

2.2 Create a Linux VM instance

In the Cloud console, go to the Create an instance page.
In the Boot disk section, click Change to begin configuring your boot disk.
On the Public images tab, choose Ubuntu from the Operating system list.
Choose Ubuntu 20.04 LTS from the Version list.
In the Firewall section, select Allow HTTP traffic.
To create the VM, click Create.

Allow a short period of time for the instance to start. After the instance is ready, it's listed on the VM instances page with a ✅.

Compute Engine grants the user who creates the VM the roles/compute.instanceAdmin role. Compute Engine also adds that user to the sudo group.

2.3 Connect to the VM instance

Connect to an instance by using the Google Cloud console and completing the following steps. You're connected to the VM as the user you used to access the VM instances page.

In the Cloud console, go to the VM instances page.
In the list of virtual machine instances, click SSH in the row of the instance that you want to connect to.

2.4 Connect to GCS Bucket by Mounting using GCSFuse

Instead of relying on a large chunk of code to read from and write to the GCS Bucket, we used GCSFuse to mount the 1 PetaByte GCS Bucket as a file system on our VM instance. The steps are -

Add the gcsfuse distribution URL as a package source and import its public key:

```
export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
echo "deb http://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
```

Update the list of packages available and install gcsfuse.
```
sudo apt-get update
sudo apt-get install gcsfuse
```
(Ubuntu before wily only) Add yourself to the fuse group, then log out and back in:
```
sudo usermod -a -G fuse $USER
exit
```
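Once the bucket is mounted, application code can treat it like any local directory, which is the point of using GCSFuse in step 2.4. The sketch below reads a CSV through ordinary file I/O; the function name, file name, and columns are hypothetical, and a temporary directory stands in for the mount point so the example runs anywhere.

```python
import csv
import tempfile
from pathlib import Path

def load_rows(mount_point, file_name):
    """Read a CSV from the mounted bucket exactly as if it were a local file."""
    csv_path = Path(mount_point) / file_name
    with csv_path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))

# Demo: a temporary directory stands in for the mounted bucket.
with tempfile.TemporaryDirectory() as fake_mount:
    sample_file = Path(fake_mount) / "tracks.csv"
    sample_file.write_text("track_name,artist_name\nSong A,Artist X\n", encoding="utf-8")
    rows = load_rows(fake_mount, "tracks.csv")

print(rows)
```

On the VM, `mount_point` would be the directory the bucket was mounted to.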

2.5 Start the Streamlit Server

After completing Step 2.4, change directory into the mounted GCS Bucket, go to the streamlit folder, and start the Streamlit server:
```
streamlit run streamlit.py
```
This starts the server on the VM, and the application becomes accessible to anyone online.

3. Github Actions Workflow & Testing

Testing the application with PyTest cases helped us fix many flaws in the application. To achieve this, we used the following packages:

3.1 Code coverage and Pytest unit testing

First, install the PyTest and Coverage packages -

```
pip install pytest
pip install coverage
```

With these packages installed, we can run the test suite under coverage from the project root directory -

```
coverage run -m pytest
```

Running `coverage html` afterwards generates the HTML coverage report in the htmlcov folder.
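For readers new to PyTest, a test file is just a module of `test_*` functions containing plain `assert` statements. The sketch below shows the shape of such a file; `clean_song_name` is a hypothetical helper written for this illustration and is not taken from the project's `test_songname.py`.

```python
def clean_song_name(raw_name):
    """Hypothetical helper: collapse whitespace and lowercase a song title."""
    return " ".join(raw_name.split()).lower()

def test_clean_song_name_trims_and_lowercases():
    assert clean_song_name("  Blinding   Lights ") == "blinding lights"

def test_clean_song_name_handles_empty_input():
    assert clean_song_name("") == ""
```

Saved as a file whose name starts with `test_`, these functions are discovered and run automatically by the `pytest` command.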

3.2 Application Load testing using Locust Server

Locust is an open source load testing tool. You define user behaviour in Python code and swarm your system with millions of simultaneous users.

dependencies: locust

```
pip install locust
```

Run Locust with the command:

```
locust -f locust_test.py
```

3.3 Github Action Workflows

For the scope of the project we have incorporated the following Github Actions workflows -

Close an Issue if the comment says "close issue"

Pytest and coverage tests on multiple versions of Python and OS - on each push/pull request

Weekly Spotify Recommendations based on last searched song by user - CRON workflow
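The weekly CRON workflow ultimately has to turn a list of recommendations into an email. The sketch below shows one way to compose such a message with the standard library; it is an illustration, not the contents of the project's `send_email.py`, and the function name, addresses, subject line, and song titles are all hypothetical.

```python
from email.message import EmailMessage

def build_weekly_email(recipient, last_song, recommendations,
                       sender="noreply@example.com"):
    """Compose a plain-text email listing this week's recommended songs."""
    message = EmailMessage()
    message["Subject"] = f"Your weekly picks based on '{last_song}'"
    message["From"] = sender
    message["To"] = recipient
    lines = [f"{i}. {title}" for i, title in enumerate(recommendations, start=1)]
    message.set_content("Songs you might like this week:\n\n" + "\n".join(lines))
    return message

email = build_weekly_email("listener@example.com", "Song A", ["Song B", "Song C"])
print(email["Subject"])
print(email.get_content())
```

Actually sending the message would additionally need an SMTP connection (for example via `smtplib`) with credentials supplied to the workflow as secrets.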

Project Folder Structure

📦workflows  ➡ This folder contains all the workflows related to github actions
 ┣ 📜close-issue.yml
 ┣ 📜get-weekly-top10.yaml
 ┣ 📜hello.yaml
 ┣ 📜new_top50.yml
 ┣ 📜send-email.yml
 ┗ 📜testing_workflow.yml
 📦code
 ┣ 📜Data_Preprocessing.ipynb
 ┣ 📜Exploratory Data Analysis.ipynb
 ┣ 📜Get_MPD_Data.ipynb
 ┣ 📜kmeans-sagemaker.ipynb   ➡ This notebook runs on the AWS SageMaker Notebook instance to deploy KMeans
 ┣ 📜model_selection_visualization.ipynb
 ┣ 📜Playlist_Recommendation.ipynb
 ┣ 📜read_spotify_million_playlists.py
 ┗ 📜test.ipynb
 📦data   ➡ This folder contains the data preprocessing results from the original dataset
 ┣ 📂smp_data
 ┣ 📜2022_19_19_23_29_24_MPD_Extended.csv
 ┣ 📜2022_19_19_23_29_40_playlists_20000.json
 ┣ 📜2022_20_20_00_58_35_Playlist_Feats_20000.csv
 ┣ 📜spotify_20K_playlists.db
 ┗ 📜MPD.csv
 📦htmlcov   ➡ Extensive Pytest and code coverage report
 ┣ 📜coverage_html.js
 ┣ 📜d_36f028580bb02cc8_locust_test_py.html
 ┣ 📜d_36f028580bb02cc8_test_songname_py.html
 ┣ 📜favicon_32.png
 ┣ 📜index.html
 ┣ 📜keybd_closed.png
 ┣ 📜keybd_open.png
 ┣ 📜status.json
 ┗ 📜style.css
 📦images
 📦model
 ┣ 📜openTSNETransformer.sav
 ┗ 📜StdScaler.sav
 📦streamlit
 ┣ 📂assets
 ┃ ┣ 📂images
 ┃ ┃ ┣ 📜login.gif
 ┃ ┃ ┣ 📜logo.png
 ┃ ┃ ┣ 📜settings.png
 ┃ ┃ ┣ 📜spotify.jpg
 ┃ ┃ ┣ 📜spotify.png
 ┃ ┃ ┣ 📜spotify_get_playlist_uri.png
 ┃ ┃ ┗ 📜twitter-logo.png
 ┃ ┣ 📜.DS_Store
 ┃ ┗ 📜styles.css
 ┣ 📜new.csv
 ┣ 📜spotipy_client.py   ➡ Contains all the helper functions needed in the Streamlit application
 ┗ 📜streamlit.py ➡ This is where it all comes together, the final streamlit application code is here
 📦test   ➡ This folder contains the testing scripts - PyTest, coverage and Locust Load testing
 ┣ 📜locust_test.py
 ┣ 📜test_songname.py
 ┗ 📜tracks_10.csv
 ┣ 📜requirements.txt
 ┣ 📜requirements_dev.txt
 ┗ 📜send_email.py

Project Demo Walk through Video

Contributions

Contributor	GitHub Issues	Status	% Contribution
Kshitij Zutshi	, , , , , , , , ,	✅ Complete	40%
Priyanka Dilip Shinde	, , , , , ,	✅ Complete	35%
Dwithika Shetty	, , ,	✅ Complete	25%

Reference

How to build a music recommender system. | Towards Data Science

Music Recommender System (iitk.ac.in)

Music APIs - A List of Free and Public APIs (the-api-collective.com)

AIcrowd | Spotify Million Playlist Dataset Challenge | Challenges

kshitijzutshi/DAMG7245-Final-Project