
Music Tagging

This is the source code for the paper "A Multi-scale Convolutional Neural Network Architecture for Music Auto-Tagging" by Dabral T.S., Deshmukh A.S., and Malapati A. Our work aims to automatically tag the music clips in the MagnaTagATune dataset using a CNN architecture that takes into account the multiple temporal scales at which musical features express themselves.

Requirements

  • Python 2
  • Theano
  • Librosa
  • Numpy
  • tqdm

We recommend the AWS AMI ami-0231c1de0d92fe7a2. Once the AMI is set up and the repository has been cloned, run the following commands to set up the environment:

source activate theano_p27
pip install tqdm
pip install librosa
sudo apt-get install libav-tools

The last command installs the codecs required to read the music files.

We make use of Totem, a library with a Theano backend that facilitates easy creation of feed-forward neural networks. The library is a submodule of this git repository, so there is no need to install it separately.
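
If the repository was not cloned with the --recursive flag, the submodule can be fetched after cloning with the standard git command:

git submodule update --init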

Getting the data

The MagnaTagATune dataset can be downloaded using the following command (in the src directory):

python get_data.py

This will download the data into the data folder and verify the downloads using their MD5 checksums.
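
get_data.py handles the verification automatically; as a rough illustration of what such a check involves (the file name and checksum below are placeholders, not the values used by the script), it could look like:

import hashlib

def md5_matches(path, expected_md5, chunk_size=2 ** 20):
    """Compute the MD5 of a downloaded file and compare it to the expected hash."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5

# Hypothetical usage; the real file names and checksums live in get_data.py.
print(md5_matches('data/some_download.zip', '<expected-md5-from-get_data.py>'))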

Preprocessing

We use the librosa library to preprocess the audio files into log-scaled mel-spectrograms, with an FFT window size of 2048 and a hop length of 512. The audio files are sampled at 11025 Hz. This extraction can be performed by running the following command:

python gen_spectrograms.py

This launches 8 workers to convert the audio files into spectrograms, which are dumped into the data folder.
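
Conceptually, the per-file processing resembles the following sketch (the number of mel bands, the exact log scaling, and the file paths are assumptions; gen_spectrograms.py implements the actual pipeline):

import librosa
import numpy as np

def extract_log_mel(path, sr=11025, n_fft=2048, hop_length=512):
    """Load a clip at 11025 Hz and compute its log-scaled mel-spectrogram."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop_length)
    return librosa.power_to_db(mel)  # log scaling; the repo's exact scaling may differ

# Hypothetical usage; gen_spectrograms.py runs this over the whole dataset with 8 workers.
spectrogram = extract_log_mel('data/some_clip.mp3')
np.save('data/some_clip.npy', spectrogram)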

Model Overview

Our model makes use of three subsampled versions of the spectrograms. A series of convolutions is run on all three versions of the spectrogram, and the three resultant tensors are concatenated depthwise before further convolutions and final prediction. The exact model can be found in the get_model function in trainer.py.
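 
The snippet below is only a shape-level NumPy sketch of this idea, not the actual Theano/Totem model: the subsampled spectrograms and the per-scale feature-map shapes are invented stand-ins, used solely to show the subsampling and the depthwise (channel-axis) concatenation.

import numpy as np

def subsample_time(spec, factor):
    """Average-pool a (mel, frames) spectrogram along the time axis by the given factor."""
    frames = (spec.shape[1] // factor) * factor
    return spec[:, :frames].reshape(spec.shape[0], -1, factor).mean(axis=2)

# Toy log-mel spectrogram; the real shape depends on the preprocessing step above.
spec = np.random.randn(128, 512).astype('float32')
scales = [subsample_time(spec, f) for f in (1, 2, 4)]    # three temporal resolutions

# Stand-in for the per-scale convolution stacks: assume each branch ends in a feature
# map with identical spatial dimensions (channels, freq, time).
branch_maps = [np.random.randn(32, 16, 64).astype('float32') for _ in scales]

# Depthwise (channel-axis) concatenation before the shared convolutions and prediction.
merged = np.concatenate(branch_maps, axis=0)             # shape (96, 16, 64)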

Training

We first merge the synonymous tags as suggested here. In particular, the list of synonymous tags is:

synonyms = [['beat', 'beats'],
            ['chant', 'chanting'],
            ['choir', 'choral'],
            ['classical', 'clasical', 'classic'],
            ['drum', 'drums'],
            ['electro', 'electronic', 'electronica', 'electric'],
            ['fast', 'fast beat', 'quick'],
            ['female', 'female singer', 'female singing', 'female vocals', 'female voice', 'woman', 'woman singing',
            'women'],
            ['flute', 'flutes'],
            ['guitar', 'guitars'],
            ['hard', 'hard rock'],
            ['harpsichord', 'harpsicord'],
            ['heavy', 'heavy metal', 'metal'],
            ['horn', 'horns'],
            ['india', 'indian'],
            ['jazz', 'jazzy'],
            ['male', 'male singer', 'male vocal', 'male vocals', 'male voice', 'man', 'man singing', 'men'],
            ['no beat', 'no drums'],
            ['no singer', 'no singing', 'no vocal', 'no vocals', 'no voice', 'no voices', 'instrumental'],
            ['opera', 'operatic'],
            ['orchestra', 'orchestral'],
            ['quiet', 'silence'],
            ['singer', 'singing'],
            ['space', 'spacey'],
            ['string', 'strings'],
            ['synth', 'synthesizer'],
            ['violin', 'violins'],
            ['vocal', 'vocals', 'voice', 'voices'],
            ['strange', 'weird']]
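
A minimal sketch of how such a merge can be applied to a binary clip-by-tag matrix follows; the tag_matrix and tag_names arguments are assumptions about the data layout, and the repository's actual merging code may differ.

import numpy as np

def merge_synonyms(tag_matrix, tag_names, synonyms):
    """OR together the columns of each synonym group, keeping the group's first tag."""
    merged = tag_matrix.copy()
    keep = list(range(len(tag_names)))
    for group in synonyms:
        cols = [tag_names.index(t) for t in group if t in tag_names]
        if len(cols) < 2:
            continue
        merged[:, cols[0]] = merged[:, cols].max(axis=1)  # logical OR over the group
        for col in cols[1:]:
            keep.remove(col)
    return merged[:, keep], [tag_names[i] for i in keep]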

Our training set and validation set have 18000 and 2000 samples respectively. The remaining ~5800 samples are used as the test set.

We use the Adam optimizer to optimize the weights of the neural network and train it for 40 epochs. We start with a learning rate of 0.001 and decay it by a factor of ten at the 20th, 30th and 35th epochs. Finally, we report the test AUC-ROC score corresponding to the best validation score. The entire model is trained on the top 50 tags by frequency.
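
The schedule amounts to a simple step decay. A small sketch of it is below; the function name and the epoch indexing are assumptions, and trainer.py contains the actual logic.

def learning_rate(epoch, base_lr=1e-3):
    """Step decay: the rate drops by a factor of ten at the 20th, 30th and 35th epochs."""
    drops = sum(1 for boundary in (20, 30, 35) if epoch >= boundary)
    return base_lr * (0.1 ** drops)

for epoch in (0, 20, 30, 35):
    print("epoch %d: lr %g" % (epoch, learning_rate(epoch)))  # 0.001, 0.0001, 1e-05, 1e-06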

To run the training routine, run the command:

python trainer.py

This will train the model with the given hyperparameters and will also save the best model in the experiments directory.

Results

Best Validation AUC-ROC score: 0.904

Corresponding test AUC-ROC score: 0.900

PyTorch implementation

For a recent PyTorch reimplementation of the same model by Amala, check here.

Authors

  • Tanmaya Shekhar Dabral
  • Amala Sanjay Deshmukh
  • Aruna Malapati

Citation

@incollection{Dabral2018,
  doi = {10.1007/978-981-13-1592-3_60},
  url = {https://doi.org/10.1007/978-981-13-1592-3_60},
  year = {2018},
  month = dec,
  publisher = {Springer Singapore},
  pages = {757--764},
  author = {Tanmaya Shekhar Dabral and Amala Sanjay Deshmukh and Aruna Malapati},
  title = {A Multi-scale Convolutional Neural Network Architecture for Music Auto-Tagging},
  booktitle = {Advances in Intelligent Systems and Computing}
}
