
Sound Classification on Rap Vocals and Speech

Overview

Many studies have leveraged the harmonic patterns in music to achieve high accuracy on music/speech classification. However, the rap genre, whose vocal style closely resembles spoken words, blurs this line. This project investigates the efficacy of 4 existing pre-trained models combined with an LSTM, plus 1 CNN+FC (fully connected layers) model, in discriminating between rap vocals and speech.

Our data is self-collected audio of speech and rap vocals, scraped from YouTube via yt-dlp and then processed with Demucs (htdemucs_ft model) to separate the target vocals from their music tracks.
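For illustration, the sketch below shows this collection step, assuming the yt-dlp and Demucs command-line tools are installed; the URL list and folder names are placeholders, not the exact ones used for the dataset.

import subprocess
from pathlib import Path

URLS = ["https://www.youtube.com/watch?v=..."]   # placeholder video URLs
RAW_DIR = Path("raw_audio")                      # placeholder output folders
SEP_DIR = Path("separated")
RAW_DIR.mkdir(exist_ok=True)

# 1) Download audio (plus JSON metadata) with yt-dlp.
for url in URLS:
    subprocess.run(
        ["yt-dlp", "-x", "--audio-format", "wav", "--write-info-json",
         "-o", str(RAW_DIR / "%(title)s.%(ext)s"), url],
        check=True,
    )

# 2) Separate vocals from accompaniment with the htdemucs_ft model;
#    --two-stems=vocals keeps only the vocals / no_vocals outputs.
for wav in RAW_DIR.glob("*.wav"):
    subprocess.run(
        ["demucs", "-n", "htdemucs_ft", "--two-stems=vocals",
         "-o", str(SEP_DIR), str(wav)],
        check=True,
    )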

Dataset

This project uses self-collected data.
The Ultimate_Rap_Dataset_Cleaned contains 207 rap songs totaling 48,109 sec ≈ 13.36 hr;
the Ultimate_Speech_Dataset_Cleaned contains 172 speech audio files totaling 76,362 sec ≈ 21.21 hr.

Data collection

Data preparation, pre-processing, and cleaning are time-consuming. After downloading the audio data along with its JSON metadata, we perform vocal separation to extract rap vocals, as well as speech, from their music tracks. We then remove or replace problematic characters in the file names, ensuring compatibility across different systems and software and preventing errors, to form our Ultimate datasets (a minimal sketch of this cleanup follows the bullets below).

  • Rap

A comprehensive list of rap music was curated to ensure a diverse and representative dataset, spanning a wide range of rap from its late-1970s origins to contemporary innovations. In addition, a conscious effort was made to include more songs by female rappers to achieve a more balanced gender distribution.

  • Speech

We specifically target speech audio that contains background music and apply Demucs to isolate the speech, keeping the isolated speech consistent with the isolated rap vocals in our dataset.
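As a rough illustration of that file-name cleanup (the exact character set we replaced may differ), a minimal sketch:

import re
from pathlib import Path

def sanitize_name(name: str) -> str:
    # Replace characters that are problematic across file systems with "_".
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", name)
    # Collapse runs of underscores/whitespace and trim trailing dots/spaces.
    return re.sub(r"[_\s]+", "_", cleaned).strip("._ ")

for path in Path("separated").rglob("*.wav"):    # placeholder folder
    path.rename(path.with_name(sanitize_name(path.stem) + path.suffix))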

Models

We compare 5 models on this task (a minimal sketch of the embedding + LSTM pattern follows the list):

  (1) CNN+FC: a window slicer feeding a CNN with fully connected layers for classification.
  (2) YAMNet+LSTM: YAMNet extracts embeddings that are fed to an LSTM for classification.
  (3) VGGish+LSTM: VGGish extracts embeddings that are fed to an LSTM for classification.
  (4) OpenL3+LSTM: OpenL3 extracts embeddings that are fed to an LSTM for classification.
  (5) PANNs+LSTM: PANNs extracts embeddings that are fed to an LSTM for classification.
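For illustration, here is a minimal sketch of the embedding + LSTM pattern using YAMNet from TensorFlow Hub; the layer sizes and training setup are placeholders rather than the exact configuration used in the notebook.

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

def embed(waveform_16k_mono: np.ndarray) -> np.ndarray:
    # YAMNet returns (scores, embeddings, spectrogram); keep the 1024-dim
    # embedding per ~0.48 s frame as a time sequence for the LSTM.
    _, embeddings, _ = yamnet(waveform_16k_mono)
    return embeddings.numpy()                    # shape: (num_frames, 1024)

# A small LSTM classifier over the embedding sequence (binary: rap vs. speech).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 1024)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(padded_embedding_sequences, labels, ...) once the data is prepared.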

How to use this repository

Requirements

If using conda:

conda env create -f environment.yml

If using pip:

pip install -r requirements.txt

Run the Notebook Cells

To progress through model training effectively, it is crucial to run the cells in your Jupyter notebook sequentially. Each cell in BetterNotebook.ipynb builds upon the previous ones, from data loading and preprocessing to the final stages of model training. Keep the following points in mind:

  • Data Reshaping: Different pre-trained models require input tensors of different shapes. Pay attention to the reshaping steps in the notebook to ensure that your data conforms to the required dimensions for each model.
  • Variable and File Names: In the notebook, variables that store temporary data might share names with the .npy or .npz files where data is saved. Although they share names, their contents at any given point may differ because of ongoing data-processing steps.
  • Saving and Loading Data: Throughout the notebook, data is frequently saved to and loaded from .npy (NumPy array) or .npz (compressed NumPy archive) files. Make sure to change the paths to your own (see the sketch below).
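For example, a typical save/load round trip for pre-computed embeddings looks like this; the file name and array shapes are illustrative, not the exact ones used in the notebook.

import numpy as np

# Illustrative arrays standing in for extracted embeddings and labels.
X = np.random.rand(10, 96, 1024).astype(np.float32)   # (clips, frames, features)
y = np.random.randint(0, 2, size=10)                  # 0 = rap, 1 = speech

# Save once after the (slow) embedding-extraction cells...
np.savez_compressed("dataset.npz", X=X, y=y)          # adjust the path to your own

# ...and reload in later sessions without re-running those cells.
data = np.load("dataset.npz")
X, y = data["X"], data["y"]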

Demo

Check out our Colab demo to see how the model classifies three raw rap vocal clips. The chosen model, PANNs+LSTM, is our best-performing one. It outputs a probability between 0 and 1, with 0 indicating rap and 1 indicating speech.
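As a rough sketch of how that output can be interpreted, the untrained stand-in model and random input below are placeholders for the trained PANNs+LSTM classifier and a real embedding sequence.

import numpy as np
import tensorflow as tf

# Stand-in for the trained classifier; in practice, load the model trained
# in the notebook instead of building an untrained one.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 2048)),   # PANNs embeddings are 2048-dim
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

clip_embeddings = np.random.rand(1, 10, 2048).astype(np.float32)  # one fake clip
prob = float(model.predict(clip_embeddings)[0, 0])
print("P(speech) =", round(prob, 2), "->", "speech" if prob >= 0.5 else "rap")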

For a bit of fun, try recording your own rap vocals and testing them with the model! Use your own audio and see how our classification system handles your unique style.

Result

The four pre-trained embedding extractors with LSTM, as well as the simple CNN+FC model, achieved rather similar test accuracy, with PANNs+LSTM and VGGish+LSTM delivering the best performance. Interestingly, the naive CNN+FC model proved competitive on this task. All models reached roughly 80%-90% accuracy.

Results comparison between the 5 models.

Acknowledgments

Special thanks to my teammates, Junzhe Liu and Nick Lin, for their contributions to debugging and creating the demo. Their collaboration and support have been invaluable to this project.

Citation

Please cite this repo if you find this project helpful for your project/paper:

Chung, F. (2024). Sound Classification on Rap Vocals and Speech. GitHub repository, https://github.com/Vio-Chung/Rap-Speech-Classification.
cff-version: 1.2.0
message: "Please cite it as below if used."
authors:
  - family-names: Chung
    given-names: Fang-Chi (Vio)
    orcid: https://orcid.org/0009-0004-0857-5252
title: "Sound Classification on Rap Vocals and Speech"
version: 1.0.0
date-released: 2024-05-02

References