Table of Contents
- Goal
- Questions
- Methods
- Hypothesis
- Similarity in Track Sets
- Similarity in Genres
- Similarity in Features
- Distribution of Features
- Null Hypothesis Test
- Stones Left Unturned
- Path Forward
Explore Spotify's datasets to understand the features that its apps use to classify audio tracks and tailor music recommendations to users.
Analyze the current "Top 50" Tracks of the United States, Canada, Mexico, the United Kingdom, and the Globe. Calculate the similarities using the following metrics:
- Similarity in Popular Tracks
- Similarity in Popular Genres
- Similarity in the Features of Popular Music (aka the essential musical/audio characteristics of Popular Tracks)
Hypothesis: The USA is the country whose "Top 50" tracks are the most similar to those of the Global "Top 50".
Use scikit-learn's vectorization module to take the list of genres for each playlist and count the frequency of each genre. Then calculate the cosine similarity between every pair of playlist genre vectors and collect the results in a similarity matrix. Finally, plot the matrix as a heatmap to visualize which playlists are most similar in their genres.
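The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `playlist_genres` dictionary and the `genre_similarity` helper are hypothetical names, and the genre lists are made-up placeholders rather than real "Top 50" data. It assumes `CountVectorizer` and `cosine_similarity` from scikit-learn are the pieces being used.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical genre lists, one per playlist (placeholder data only).
playlist_genres = {
    "global": ["pop", "latin", "pop", "hip hop"],
    "usa": ["hip hop", "pop", "hip hop", "country"],
    "mex": ["latin", "latin", "reggaeton", "pop"],
}


def genre_similarity(playlist_genres):
    """Return a playlist-by-playlist cosine-similarity DataFrame of genre counts."""
    names = list(playlist_genres)
    # A callable analyzer treats each list element as one token,
    # so a multi-word genre such as "hip hop" stays a single feature.
    vectorizer = CountVectorizer(analyzer=lambda genres: genres)
    counts = vectorizer.fit_transform([playlist_genres[n] for n in names])
    sim = cosine_similarity(counts)
    return pd.DataFrame(sim, index=names, columns=names)


similarity = genre_similarity(playlist_genres)
print(similarity.round(3))
```

The resulting DataFrame can be handed to a heatmap function (for example `seaborn.heatmap`) to produce the visualization.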
Similarity Matrix:
| | global | usa | uk | mex | can |
|---|---|---|---|---|---|
| global | 1.000000 | 0.208407 | -0.863271 | -0.220740 | 0.488657 |
| usa | 0.208407 | 1.000000 | -0.363109 | -0.707902 | 0.220003 |
| uk | -0.863271 | -0.363109 | 1.000000 | 0.192726 | -0.562035 |
| mex | -0.220740 | -0.707902 | 0.192726 | 1.000000 | -0.660480 |
| can | 0.488657 | 0.220003 | -0.562035 | -0.660480 | 1.000000 |
The heatmap shows that the two most similar playlists (whose intersection is the darkest shade of blue) are USA and Canada. However, contrary to my prediction, the playlist most similar to the global playlist is Canada's.
Take the mean value of every audio feature across the tracks in a playlist, giving one feature vector per playlist. Then use these vectors to once again calculate the cosine similarity between each pair of playlists.
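A minimal sketch of this computation, assuming each playlist is a pandas DataFrame with one row per track and one column per audio feature. The `mean_feature_similarity` helper, the `FEATURES` list, and the random placeholder data are illustrative assumptions. The optional `center` flag is also an assumption: the source does not say whether the mean vectors were centered or scaled before comparison, but centering is one way cosine similarities can come out negative.

```python
import numpy as np
import pandas as pd

# Feature names taken from the hypothesis tests in this write-up.
FEATURES = ["acousticness", "danceability", "energy", "loudness", "speechiness"]


def mean_feature_similarity(playlists, features=FEATURES, center=False):
    """Cosine similarity between the mean feature vectors of each playlist."""
    # One row per playlist: the mean of each feature over its tracks.
    means = pd.DataFrame(
        {name: df[features].mean() for name, df in playlists.items()}
    ).T
    if center:
        # Assumption: subtract the across-playlist mean of each feature.
        means = means - means.mean()
    # Scale every row to unit length; the dot products are then cosines.
    unit = means.div(np.linalg.norm(means, axis=1), axis=0)
    return unit @ unit.T


# Random placeholder data standing in for the real playlists.
rng = np.random.default_rng(0)
playlists = {
    name: pd.DataFrame(rng.random((50, len(FEATURES))), columns=FEATURES)
    for name in ["global", "usa", "can"]
}

sim = mean_feature_similarity(playlists)
sim_centered = mean_feature_similarity(playlists, center=True)
print(sim.round(3))
```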
Cosine Similarity Matrix
| | global | usa | uk | mex | can |
|---|---|---|---|---|---|
| global | 1.000000 | 0.208407 | -0.863271 | -0.220740 | 0.488657 |
| usa | 0.208407 | 1.000000 | -0.363109 | -0.707902 | 0.220003 |
| uk | -0.863271 | -0.363109 | 1.000000 | 0.192726 | -0.562035 |
| mex | -0.220740 | -0.707902 | 0.192726 | 1.000000 | -0.660480 |
| can | 0.488657 | 0.220003 | -0.562035 | -0.660480 | 1.000000 |
two_tailed_test(global_df, usa_df, label1='Global', label2='USA', feature='acousticness')
pval = 0.612864976409074
fail to reject null hypothesis
two_tailed_test(global_df, usa_df, label1='Global', label2='USA', feature='danceability')
pval = 0.9898624889912536
fail to reject null hypothesis
two_tailed_test(global_df, usa_df, label1='Global', label2='USA', feature='energy')
pval = 0.9966979993191145
fail to reject null hypothesis
two_tailed_test(global_df, usa_df, label1='Global', label2='USA', feature='loudness')
pval = 0.6585050655009175
fail to reject null hypothesis
two_tailed_test(global_df, usa_df, label1='Global', label2='USA', feature='speechiness')
pval = 0.7483005130667619
fail to reject null hypothesis
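The definition of `two_tailed_test` is not shown in this write-up. One plausible shape for it, matching the call signature and printed output above, is sketched here. The choice of Welch's two-sample t-test (`scipy.stats.ttest_ind` with `equal_var=False`) and the `alpha=0.05` threshold are assumptions, not confirmed details of the original function.

```python
import numpy as np
import pandas as pd
from scipy import stats


def two_tailed_test(df1, df2, label1, label2, feature, alpha=0.05):
    """Two-tailed test of H0: both playlists share the same mean for `feature`.

    label1 and label2 are kept only to match the calls above; the original
    presumably uses them for plot labels, and they are unused in this sketch.
    """
    # Assumption: Welch's t-test, which does not require equal variances.
    # ttest_ind is two-sided by default.
    _, pval = stats.ttest_ind(df1[feature], df2[feature], equal_var=False)
    print(f"pval = {pval}")
    if pval < alpha:
        print("reject null hypothesis")
    else:
        print("fail to reject null hypothesis")
    return pval


# Placeholder data: the second frame is shifted far away from the first.
a = pd.DataFrame({"energy": np.linspace(0.0, 1.0, 50)})
b = pd.DataFrame({"energy": np.linspace(0.0, 1.0, 50) + 5.0})

p_same = two_tailed_test(a, a, label1="A", label2="A", feature="energy")
p_diff = two_tailed_test(a, b, label1="A", label2="B", feature="energy")
```

A large p-value, as in every Global vs. USA comparison above, means the data give no evidence that the feature means differ. It does not by itself show that the two playlists are the same.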
- Which country most INFLUENCES the top 50?
- Which features most INFLUENCE the top 50?
- Expand the datasets and use machine learning to predict the popularity/ranking of a track.