song success predictor

The objective is to predict the success of a music track from its acoustic features (tempo, key) and metadata features (genre, song duration).
Success is measured by a song's "hotttnesss" score. The API providers assign this metric to tracks based on mentions in news, play counts, radio airtime, Billboard rankings, and reviews on popular music websites. The score is directly correlated with the track's revenue on the market.

The project focuses heavily on ontological modeling, feature engineering, and model selection.

Ontology describing feature space

Findings

Metadata features are stronger indicators of hotttnesss than acoustic features.

Raw acoustic features are poor indicators of hotttnesss, but features derived from them have more predictive power.
- We decided that some of the acoustic features could be combined into energy and danceability.
  - The ontologies represent these measures as values derived from other features:
    - energy: function of (loudness, segment data)
    - danceability: function of (tempo, time_signature)

Combining a few diverse features performs better:
- Combination of different energy calculations
- Combination of different metadata features
- Combination of different acoustic features

The raw acoustic features perform fine on the training set.
- They actually perform better than the energy measures on the training set.
- The energy measures generalize better: they perform better on the test set.

To run

Make sure the following files/folders are in the same directory:

tutorials/
MSongsDB/
MillionSongSubset/
swagmaster.db
create_track_metadata_db_custom.py
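
A quick way to confirm the layout before running anything is a small check like the one below. This is only a sketch: the names come from the list above, while the `missing_entries` helper and the choice of the current directory as the base are assumptions made for illustration.

```python
from pathlib import Path

# Names copied from the list above; the base directory is an assumption.
REQUIRED = [
    "tutorials",
    "MSongsDB",
    "MillionSongSubset",
    "swagmaster.db",
    "create_track_metadata_db_custom.py",
]


def missing_entries(base_dir="."):
    """Return the required names that do not exist under base_dir."""
    base = Path(base_dir)
    return [name for name in REQUIRED if not (base / name).exists()]


if __name__ == "__main__":
    missing = missing_entries()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required files and folders are present.")
```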

master plan

write script to build sample dataset
build another structure (pandas DataFrame?) to hold relevant fields for learning
try to predict song_hotttnesss using other features
- acoustic
  - key int,
  - tempo real,
  - loudness real,
  - time_signature int,
- metadata
  - duration real,
  - artist_familiarity real,
  - artist_hotttnesss real,
- What learning models should we try?
  - Logistic regression
  - SVM
  - kNN
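
As a rough illustration of the last candidate model, here is a minimal kNN sketch in plain Python. It treats song_hotttnesss as a continuous target and averages the targets of the nearest neighbours. The `knn_predict` helper and the sample rows are made up for illustration and are not taken from the project's notebooks. Features are left unscaled here, so tempo dominates the distance; a real run would likely normalize them first.

```python
import math


def knn_predict(train_X, train_y, query, k=3):
    """Predict a target as the mean of the k nearest training targets."""
    # Pair each Euclidean distance with its target, then sort by distance.
    distances = sorted(
        (math.dist(row, query), target) for row, target in zip(train_X, train_y)
    )
    nearest = distances[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Made-up rows: (tempo, loudness, artist_familiarity) -> song_hotttnesss
train_X = [
    (120.0, -5.0, 0.8),
    (90.0, -12.0, 0.3),
    (125.0, -6.0, 0.7),
    (60.0, -20.0, 0.1),
]
train_y = [0.7, 0.2, 0.6, 0.1]

print(knn_predict(train_X, train_y, (118.0, -5.5, 0.75), k=2))
```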

building our dataset

CREATE TABLE songs (
    track_id            text PRIMARY KEY,
    title               text,
    song_id             text,
    release             text,
    artist_id           text,
    artist_mbid         text,
    artist_name         text,
    duration            real,
    artist_familiarity  real,
    artist_hotttnesss   real,
    year                int,
    track_7digitalid    int,
    shs_perf            int,  -- ??? (possibly the SecondHandSongs performance ID)
    shs_work            int,  -- ??? (possibly the SecondHandSongs work ID)
    -- new ones below
    song_hotttnesss     real,
    danceability        real,
    energy              real,
    key                 int,
    tempo               real, 
    loudness            real, 
    time_signature      int
);
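
To experiment with this table without touching the real database, something like the following sketch can build a throwaway copy with Python's built-in sqlite3 module. It uses a trimmed subset of the columns above. The `build_sample_db` helper, the in-memory path, and the sample rows are assumptions for illustration, not the project's actual build script.

```python
import sqlite3

# Trimmed version of the schema above: only columns relevant for learning.
# "key" is quoted because KEY is also an SQL keyword.
SCHEMA = """
CREATE TABLE songs (
    track_id            text PRIMARY KEY,
    title               text,
    duration            real,
    artist_familiarity  real,
    artist_hotttnesss   real,
    song_hotttnesss     real,
    "key"               int,
    tempo               real,
    loudness            real,
    time_signature      int
);
"""


def build_sample_db(rows, path=":memory:"):
    """Create the songs table at path and insert the given rows."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO songs VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", rows
    )
    conn.commit()
    return conn


# Made-up rows, one value per column in schema order.
sample_rows = [
    ("TR0001", "Example A", 210.5, 0.8, 0.6, 0.7, 5, 120.0, -5.0, 4),
    ("TR0002", "Example B", 185.0, 0.3, 0.2, 0.1, 9, 90.0, -12.0, 3),
]

conn = build_sample_db(sample_rows)
print(conn.execute("SELECT COUNT(*) FROM songs").fetchone()[0])
```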

Energy

energy: We compute energy from a mix of features that includes loudness and segment durations.
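
The exact formula is not spelled out here, so the following is only a hypothetical sketch of how loudness and segment durations could be blended into one score. The `energy_score` function, the -60 dB floor, the 10 segments per second cap, and the equal weights are all assumptions chosen for illustration.

```python
def energy_score(loudness_db, segment_durations,
                 min_db=-60.0, max_segments_per_sec=10.0):
    """Hypothetical energy proxy in [0, 1]; not the project's actual formula."""
    # Map loudness (in dB, usually negative) onto [0, 1].
    loud = min(max((loudness_db - min_db) / (0.0 - min_db), 0.0), 1.0)
    # Shorter segments mean more sonic events per second.
    total = sum(segment_durations)
    rate = len(segment_durations) / total if total > 0 else 0.0
    busy = min(rate / max_segments_per_sec, 1.0)
    # Equal weights are an arbitrary choice for this sketch.
    return 0.5 * loud + 0.5 * busy


# A loud, busy track should score higher than a quiet, sparse one.
print(energy_score(-5.0, [0.2] * 20))
print(energy_score(-30.0, [1.0] * 4))
```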

Danceability

danceability: We compute danceability from a mix of features, including beat strength, tempo stability, overall tempo, and more.
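
As with energy, the actual combination is not given here. The sketch below shows one hypothetical way to turn two of the listed ingredients, tempo stability and overall tempo, into a score. Beat strength is left out. The function names, the 120 BPM reference tempo, and the equal weights are assumptions for illustration only.

```python
from statistics import mean, pstdev


def tempo_stability(beat_times):
    """1.0 for perfectly even beats, falling toward 0.0 as intervals vary."""
    intervals = [b - a for a, b in zip(beat_times, beat_times[1:])]
    if len(intervals) < 2 or mean(intervals) <= 0:
        return 0.0
    # Coefficient of variation of the inter-beat intervals.
    cv = pstdev(intervals) / mean(intervals)
    return max(0.0, 1.0 - cv)


def danceability_score(beat_times, tempo_bpm, ideal_bpm=120.0):
    """Hypothetical danceability proxy in [0, 1]; weights are arbitrary."""
    # Closeness of the overall tempo to an assumed reference tempo.
    tempo_fit = max(0.0, 1.0 - abs(tempo_bpm - ideal_bpm) / ideal_bpm)
    return 0.5 * tempo_stability(beat_times) + 0.5 * tempo_fit


steady = [0.0, 0.5, 1.0, 1.5, 2.0]   # evenly spaced beat timestamps (seconds)
uneven = [0.0, 0.2, 1.0, 1.1, 2.0]   # irregular beat timestamps
print(danceability_score(steady, 120.0))
print(danceability_score(uneven, 120.0))
```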

Notes

Tutorial notebooks

MSD link to tutorials

tutorial_1

Shows how to iterate over the files within the MillionSongSubset
AdditionalFiles contains SQLite databases that index the contents of the /data folder
Runs through an exercise to find out which artist has the most songs in the dataset (by artist_id)
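
The shape of that exercise can be sketched with the standard library alone. The snippet below walks a folder for .h5 track files and counts the most frequent artist_id. Reading artist_id out of each file is left to the dataset's own HDF5 helper code and is not shown, and the `iter_h5_files` and `most_common_artist` helpers are hypothetical names, not functions from the tutorial.

```python
import os
from collections import Counter


def iter_h5_files(root):
    """Yield the path of every .h5 file found under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".h5"):
                yield os.path.join(dirpath, name)


def most_common_artist(artist_ids):
    """Return (artist_id, song_count) for the most frequent id, or None."""
    counts = Counter(artist_ids)
    return counts.most_common(1)[0] if counts else None


# Made-up artist ids standing in for values read from the track files.
print(most_common_artist(["AR1", "AR2", "AR1"]))
```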

tutorial_3_track-metadata

Shows how to interface with the dataset (in db form) using sqlite.
- There are .db files in AdditionalFiles. This one uses track_metadata (subset_track_metadata.db)
subset_track_metadata.db
- Contains one table named 'songs'
- Contains the following columns
  - track_id text PRIMARY KEY,
  - title text,
  - song_id text,
  - release text,
  - artist_id text,
  - artist_mbid text,
  - artist_name text,
  - duration real,
  - artist_familiarity real,
  - artist_hotttnesss real,
  - year int
Some useful queries:
- Get all songs without MusicBrainz IDs: SELECT artist_id, artist_mbid FROM songs WHERE artist_mbid=''
- Get all distinct artists: SELECT DISTINCT artist_id, artist_name FROM songs
- Get all artists whose float field exceeds a value: SELECT DISTINCT artist_name, artist_familiarity FROM songs WHERE artist_familiarity>.8
  - The same pattern can filter out tracks where hotttnesss is 0 (empty data): WHERE NOT artist_hotttnesss=0
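
These queries can be tried from Python with the built-in sqlite3 module. The sketch below runs them against a small in-memory table with made-up rows and only a few of the columns. To use the real data, connect to subset_track_metadata.db instead of ":memory:" and skip the CREATE and INSERT steps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Only the columns the queries touch; the real table has more.
conn.execute(
    "CREATE TABLE songs (track_id text PRIMARY KEY, artist_id text, "
    "artist_mbid text, artist_name text, artist_familiarity real, "
    "artist_hotttnesss real)"
)
# Made-up rows for illustration.
conn.executemany(
    "INSERT INTO songs VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("TR1", "AR1", "", "Artist One", 0.9, 0.5),
        ("TR2", "AR1", "", "Artist One", 0.9, 0.5),
        ("TR3", "AR2", "mbid-2", "Artist Two", 0.4, 0.0),
    ],
)

# Songs whose artist has no MusicBrainz ID.
no_mbid = conn.execute(
    "SELECT artist_id, artist_mbid FROM songs WHERE artist_mbid=''"
).fetchall()

# Distinct artists.
artists = conn.execute(
    "SELECT DISTINCT artist_id, artist_name FROM songs"
).fetchall()

# Artists above a familiarity threshold, passed as a query parameter.
familiar = conn.execute(
    "SELECT DISTINCT artist_name, artist_familiarity FROM songs "
    "WHERE artist_familiarity > ?",
    (0.8,),
).fetchall()

# Tracks whose artist hotttnesss is not 0 (0 marks empty data).
nonzero = conn.execute(
    "SELECT track_id FROM songs WHERE NOT artist_hotttnesss=0"
).fetchall()

print(len(no_mbid), len(artists), len(familiar), len(nonzero))
```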

rshnn/song-success-predictor
