Folk Song Lyric Clustering for Roud Index Number Prediction

Background

English-language folk songs can't be identified by name alone; their lyrics vary over space and time, and even influence each other's content. The Roud Folk Song Index number has become the standard for grouping together different versions of the same song. Steve Roud began indexing in the 1970s, and is still indexing as of 2023.

I wanted to see if an ensemble of machine learning algorithms could match his skill. Given only the lyrics, would it choose the same groupings of songs, even where the line between "same" and "different" is fuzzy? Could it help with future indexing?

Example lyrics clustering

Examining the clusters' Roud numbers in the t-SNE space shows good results, even with lyrics clusters of varying shapes and densities.
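
This README does not say how the match between clusters and Roud numbers was judged. One simple way to put a number on it is per-cluster purity: the share of songs in each cluster that carry the cluster's most common Roud number. The sketch below is an assumption about how that check could look, not the method used in the notebooks; the function name `cluster_purity` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_labels, roud_numbers, noise_label=-1):
    """Map each cluster to (majority Roud number, share of members carrying it).

    Points labelled `noise_label` (HDBSCAN's convention for noise) are skipped.
    """
    members = defaultdict(list)
    for cluster, roud in zip(cluster_labels, roud_numbers):
        if cluster != noise_label:
            members[cluster].append(roud)

    purity = {}
    for cluster, rouds in members.items():
        top_roud, count = Counter(rouds).most_common(1)[0]
        purity[cluster] = (top_roud, count / len(rouds))
    return purity


# Hypothetical toy data: six song versions, two clusters and one noise point.
# The Roud numbers here are illustrative only.
clusters = [0, 0, 0, 1, 1, -1]
rouds = [1, 1, 12, 45, 45, 1]
result = cluster_purity(clusters, rouds)
for cluster, (roud, share) in sorted(result.items()):
    print(f"cluster {cluster}: majority Roud {roud}, purity {share:.2f}")
```

A purity near 1.0 means the cluster lines up with a single Roud number; lower values flag clusters that mix several songs, which may be exactly the fuzzy cases worth a closer look.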

Steps, tools and sources

Data sources: Ballad Index and Digitrad (Mudcat) databases, in defunct formats
Exploration: Jupyter Notebooks
Extraction and transformation: Regex and custom Python functions
Embeddings: instructor-large via huggingface/SentenceTransformer
Dimensionality reduction: t-SNE model from Scikit-Learn
Soft clustering: HDBSCAN module by McInnes, Healy & Astels
Visualisation: plotly.express
...Probabilistic classification, web deployment: TBC

