
Sound Classification on Rap Vocals and Speech

Overview

Many studies have leveraged the harmonic patterns in music to achieve high accuracy on music/speech classification. However, the rap genre, whose vocal style closely resembles spoken words, blurs this line. This project investigates the efficacy of 4 existing pre-trained models combined with an LSTM, plus 1 CNN+FC (fully connected layers) model, in discriminating between rap vocals and speech.

Our data is self-collected audio of speech and rap vocals, scraped from YouTube via yt-dlp and then processed with Demucs (htdemucs_ft model) to separate the target vocals from their music tracks.
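For illustration, the sketch below shows this collection step, assuming the yt-dlp and Demucs command-line tools are installed; the URL list and folder names are placeholders, not the exact ones used for the dataset.

import subprocess
from pathlib import Path

URLS = ["https://www.youtube.com/watch?v=..."]   # placeholder video URLs
RAW_DIR = Path("raw_audio")                      # placeholder output folders
SEP_DIR = Path("separated")
RAW_DIR.mkdir(exist_ok=True)

# 1) Download audio (plus JSON metadata) with yt-dlp.
for url in URLS:
    subprocess.run(
        ["yt-dlp", "-x", "--audio-format", "wav", "--write-info-json",
         "-o", str(RAW_DIR / "%(title)s.%(ext)s"), url],
        check=True,
    )

# 2) Separate vocals from accompaniment with the htdemucs_ft model;
#    --two-stems=vocals keeps only the vocals / no_vocals outputs.
for wav in RAW_DIR.glob("*.wav"):
    subprocess.run(
        ["demucs", "-n", "htdemucs_ft", "--two-stems=vocals",
         "-o", str(SEP_DIR), str(wav)],
        check=True,
    )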

Dataset

This project uses self-collected data.
The Ultimate_Rap_Dataset_Cleaned contains 207 rap songs totaling 48,109 sec ≈ 13.36 hr;
the Ultimate_Speech_Dataset_Cleaned contains 172 speech audio files totaling 76,362 sec ≈ 21.21 hr.

Data collection

Data preparation, pre-processing, and cleaning are time-consuming. After downloading the audio data along with its JSON metadata, we perform vocal separation to extract rap vocals, as well as speech, from their music tracks. We then remove or replace problematic characters in the file names, ensuring compatibility across different systems and software and preventing errors, to form our Ultimate datasets (a minimal sketch of this cleanup follows the bullets below).

  • Rap

A comprehensive list of rap music was curated to ensure a diverse and representative dataset, spanning a wide range of rap from its late-1970s origins to contemporary innovations. In addition, a conscious effort was made to include more songs by female rappers to achieve a more balanced gender distribution.

  • Speech

We specifically target speech audio that contains background music and apply Demucs to isolate the speech, keeping the isolated speech consistent with the isolated rap vocals in our dataset.
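As a rough illustration of that file-name cleanup (the exact character set we replaced may differ), a minimal sketch:

import re
from pathlib import Path

def sanitize_name(name: str) -> str:
    # Replace characters that are problematic across file systems with "_".
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", name)
    # Collapse runs of underscores/whitespace and trim trailing dots/spaces.
    return re.sub(r"[_\s]+", "_", cleaned).strip("._ ")

for path in Path("separated").rglob("*.wav"):    # placeholder folder
    path.rename(path.with_name(sanitize_name(path.stem) + path.suffix))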

Models

We compare 5 models on this task (a minimal sketch of the embedding + LSTM pattern follows the list):

  (1) CNN+FC: a window slicer feeding a CNN with fully connected layers for classification.
  (2) YAMNet+LSTM: YAMNet extracts embeddings that are fed to an LSTM for classification.
  (3) VGGish+LSTM: VGGish extracts embeddings that are fed to an LSTM for classification.
  (4) OpenL3+LSTM: OpenL3 extracts embeddings that are fed to an LSTM for classification.
  (5) PANNs+LSTM: PANNs extracts embeddings that are fed to an LSTM for classification.
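For illustration, here is a minimal sketch of the embedding + LSTM pattern using YAMNet from TensorFlow Hub; the layer sizes and training setup are placeholders rather than the exact configuration used in the notebook.

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

def embed(waveform_16k_mono: np.ndarray) -> np.ndarray:
    # YAMNet returns (scores, embeddings, spectrogram); keep the 1024-dim
    # embedding per ~0.48 s frame as a time sequence for the LSTM.
    _, embeddings, _ = yamnet(waveform_16k_mono)
    return embeddings.numpy()                    # shape: (num_frames, 1024)

# A small LSTM classifier over the embedding sequence (binary: rap vs. speech).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 1024)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(padded_embedding_sequences, labels, ...) once the data is prepared.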

How to use this repository

Requirements

If using conda:

conda env create -f environment.yml

If using pip:

pip install -r requirements.txt

Run the Notebook Cells

To progress through model training effectively, it is crucial to run the cells in your Jupyter notebook sequentially. Each cell in BetterNotebook.ipynb builds upon the previous ones, from data loading and preprocessing to the final stages of model training. Keep the following points in mind:

  • Data Reshaping: Different pre-trained models require input tensors of different shapes. Pay attention to the reshaping steps in the notebook to ensure that your data conforms to the required dimensions for each model.
  • Variable and File Names: In the notebook, variables that store temporary data might share names with the .npy or .npz files where data is saved. Although they share names, their contents at any given point may differ because of ongoing data-processing steps.
  • Saving and Loading Data: Throughout the notebook, data is frequently saved to and loaded from .npy (NumPy array) or .npz (compressed NumPy archive) files. Make sure to change the paths to your own (see the sketch below).
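For example, a typical save/load round trip for pre-computed embeddings looks like this; the file name and array shapes are illustrative, not the exact ones used in the notebook.

import numpy as np

# Illustrative arrays standing in for extracted embeddings and labels.
X = np.random.rand(10, 96, 1024).astype(np.float32)   # (clips, frames, features)
y = np.random.randint(0, 2, size=10)                  # 0 = rap, 1 = speech

# Save once after the (slow) embedding-extraction cells...
np.savez_compressed("dataset.npz", X=X, y=y)          # adjust the path to your own

# ...and reload in later sessions without re-running those cells.
data = np.load("dataset.npz")
X, y = data["X"], data["y"]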

Demo

Check out our Colab demo to see how the model classifies three raw rap vocal clips. The chosen model, PANNs+LSTM, is our best-performing one. It outputs a probability between 0 and 1, with 0 indicating rap and 1 indicating speech.
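As a rough sketch of how that output can be interpreted, the untrained stand-in model and random input below are placeholders for the trained PANNs+LSTM classifier and a real embedding sequence.

import numpy as np
import tensorflow as tf

# Stand-in for the trained classifier; in practice, load the model trained
# in the notebook instead of building an untrained one.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 2048)),   # PANNs embeddings are 2048-dim
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

clip_embeddings = np.random.rand(1, 10, 2048).astype(np.float32)  # one fake clip
prob = float(model.predict(clip_embeddings)[0, 0])
print("P(speech) =", round(prob, 2), "->", "speech" if prob >= 0.5 else "rap")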

For a bit of fun, try recording your own rap vocals and testing them with the model! Use your own audio and see how our classification system handles your unique style.

Result

The four pre-trained embedding extractors with LSTM, as well as the simple CNN+FC model, achieved rather similar test accuracy, with PANNs+LSTM and VGGish+LSTM delivering the best performance. Interestingly, the naive CNN+FC model proved competitive on this task. All models reached roughly 80%-90% accuracy.

Results comparison between the 5 models.

Acknowledgments

Special thanks to my teammates, Junzhe Liu and Nick Lin, for their contributions to debugging and creating the demo. Their collaboration and support have been invaluable to this project.

Citation

Please cite this repo if you find this project helpful for your project/paper:

Chung, F. (2024). Sound Classification on Rap Vocals and Speech. GitHub repository, https://github.com/Vio-Chung/Rap-Speech-Classification.
cff-version: 1.2.0
message: "Please cite it as below if used."
authors:
  - family-names: Chung
    given-names: Fang-Chi (Vio)
    orcid: https://orcid.org/0009-0004-0857-5252
title: "Sound Classification on Rap Vocals and Speech"
version: 1.0.0
date-released: 2024-05-02

References