Binding Text, Images, Graphs, and Audio for Music Representation Learning

This repo contains the training, inference, and evaluation code for the paper Binding Text, Images, Graphs, and Audio for Music Representation Learning.

To help you navigate the checkpoints and inference code, please refer to the following sheet.

The code for embedding Text and Images is available in the scripts folder. For Audio Embeddings, the code is available here; for Graph Embeddings, the code is available here.

We also provide a simple demo that showcases the model's predictions with explanations. The demo is available here and its code is available here. You can also run the demo locally by following the instructions provided in the repo.

Repo Structure:

  • scripts/ contains the code for setting up embedding APIs for Text and Images and the LLM API, as well as code for downloading model weights.
  • data/ contains JSON files with tracks and their metadata, as well as our positives and negatives for training.
  • modelling/ contains the code for the multimodal model; to use the modules, refer to the sheet mentioned above. Each architecture is defined in its own script (a minimal loading sketch follows this list).
  • embeddings/ contains JSON files with embeddings for each modality, as well as the multimodal embeddings.
  • checkpoints/ contains the model weights for the multimodal model.
  • notebooks/ contains notebooks for evaluation and inference.
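As a rough orientation, the embeddings and checkpoints above might be loaded along the following lines. This is a minimal sketch, not the repo's exact API: the file names, the modelling.fusion module, and the FusionModel class are assumptions made for illustration.

```python
# Hypothetical loading sketch: file names and the FusionModel class are
# placeholders, not the repo's actual identifiers.
import json
import torch

# Precomputed per-modality embeddings are stored as JSON, e.g. keyed by track ID.
with open("embeddings/text_embeddings.json") as f:  # assumed file name
    text_embeddings = json.load(f)  # e.g. {"track_id": [0.12, -0.34, ...], ...}

# One of the architectures defined in modelling/ (assumed module and class names).
from modelling.fusion import FusionModel
model = FusionModel()
model.load_state_dict(torch.load("checkpoints/fusion.pt", map_location="cpu"))
model.eval()
```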

Abstract

In the field of Information Retrieval and Natural Language Processing, text embeddings play a significant role in tasks such as classification, clustering, and topic modeling. However, extending these embeddings to abstract concepts such as music, which involves multiple modalities, presents a unique challenge. Our work addresses this challenge by integrating rich multi-modal data into a unified joint embedding space. This space includes textual, visual, acoustic, and graph-based modality features. By doing so, we mirror cognitive processes associated with music interaction and overcome the disjoint nature of individual modalities. The resulting joint low-dimensional vector space facilitates retrieval, clustering, embedding space arithmetic, and cross-modal retrieval tasks. Importantly, our approach carries implications for music information retrieval and recommendation systems. Furthermore, we propose a novel multi-modal model that integrates various data types—text, images, graphs, and audio—for music representation learning. Our model aims to capture the complex relationships between different modalities, enhancing the overall understanding of music. By combining textual descriptions, visual imagery, graph-based structures, and audio signals, we create a comprehensive representation that can be leveraged for a wide range of music-related tasks. Notably, our model demonstrates promising results in music classification and recommendation systems.
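To make the retrieval and cross-modal retrieval use cases mentioned in the abstract concrete, the sketch below ranks tracks against a query by cosine similarity in a shared embedding space. The embedding dimension, array shapes, and variable names are illustrative assumptions, not values taken from the paper or repo.

```python
# Minimal cross-modal retrieval sketch: given a query embedding (e.g. from text),
# rank tracks by cosine similarity of their joint multimodal embeddings.
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Normalize rows, then take dot products to get cosine similarities.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

query = np.random.rand(1, 512)      # placeholder query embedding (assumed dim 512)
tracks = np.random.rand(1000, 512)  # placeholder joint embeddings for 1000 tracks

scores = cosine_sim(query, tracks)[0]
top_k = np.argsort(-scores)[:10]    # indices of the 10 most similar tracks
```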

Nomic Maps

Text Embedding Maps

Image Embedding Maps

Graph Embedding Maps

Audio Embedding Maps

Multimodal Embedding Maps
