Acoustic Classification & Segmentation

A simple audio segmenter for isolating speech portions from audio streams. It uses a simple feedforward MLP for classification (implemented using tensorflow) and heuristic smoothing methods to increase the recall of speech segments.

This version is modified from the brandeis-llc repository to use applause, speech, music, noise, and silence as possible labels, and to support binary classification of applause (rather than speech).
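
For illustration only, here is a minimal tf.keras sketch of the kind of feedforward MLP classifier described above. The layer sizes and feature dimension are hypothetical assumptions, not the repository's actual architecture:

import tensorflow as tf

# Hypothetical sketch of a small feedforward MLP over per-clip audio
# features (e.g., averaged MFCCs). Sizes are illustrative only.
N_FEATURES = 40  # assumed length of the feature vector
N_CLASSES = 5    # applause, speech, music, noise, silence

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(N_FEATURES,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])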

Requirements

  • System packages: ffmpeg
  • Python packages:
    • librosa
    • tensorflow or tensorflow-gpu >=2.0.0
    • numpy
    • scipy
    • scikit-learn
    • ffmpeg-python

Environment Setup

I recommend using conda to install the requirements on Windows (some of the packages are annoying to install using pip alone!). I prefer Miniconda, a lightweight version of Anaconda.
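
As a sketch, one possible setup (the environment name is arbitrary, and you may prefer different channels or versions):

conda create -n acoustic-seg python=3.8
conda activate acoustic-seg
conda install -c conda-forge ffmpeg librosa numpy scipy scikit-learn
pip install "tensorflow>=2.0.0" ffmpeg-python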

If you don't want to mess with creating an environment on your local machine, you can run this process in Google Colab, the Google Drive version of Jupyter Notebooks. I've created a sample notebook, available here, for you to use. To make a copy for yourself, go to File > Save a Copy in Drive.

Training

Pretrained model

We provide two pretrained models. Both are trained on 3-second clips from the MUSAN corpus, HIPSTAS applause samples, and sound from Indiana University collections, using the labels applause, speech, music, noise, and silence. The models are then serialized using the tensorflow SavedModel format. The applause-binary-xxxxxxxx model is trained to predict applause vs. non-applause; the non-binary-xxxxxxxx model uses all of the above labels. Because of class imbalance in the corpus (far fewer noise and silence samples in the training data), we randomly upsampled the minority classes.
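
As an illustration of that last step, here is a minimal sketch of random upsampling with scikit-learn (already a dependency). The function and variable names are hypothetical, not the repository's actual code:

from sklearn.utils import resample

# Hypothetical sketch: grow every minority class to the size of the
# largest class by sampling with replacement.
# `samples` is a list of (feature_vector, label) pairs.
def upsample_minority_classes(samples, random_state=42):
    by_label = {}
    for x, y in samples:
        by_label.setdefault(y, []).append((x, y))
    largest = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(resample(group, replace=True,
                                 n_samples=largest,
                                 random_state=random_state))
    return balanced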

Training pipeline

To train your own model, invoke run.py with the -t flag and pass the directory where your training data is stored. Each file in the training set should have its label at the start of the file name, followed by a -; for example, applause-mysound124.wav (see the extract_all function in feature.py).
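
For example (the exact argument list may differ; see run.py):

python run.py -t /path/to/training_data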

Segmentation

To run the segmenter over audio files, invoke run.py with the -s flag and pass 1) the model path (feel free to use a pretrained model if needed) and 2) the directory where the audio files are stored. Currently it processes all mp3 and wav files in the target directory. To process other types of audio files, add to or change the file_ext list near the bottom of run.py, as shown below.
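
For instance, a hypothetical edit to also pick up FLAC files (the exact formatting in run.py may differ):

file_ext = ['mp3', 'wav', 'flac']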

To use binary classification, include the -b flag. To enforce a minimum segment length, use the -T flag and specify a number of milliseconds; shorter segments will be merged with the previous one (short segments at the very beginning will be omitted).

For example:

python run.py -s /path/to/pretrained/applause-binary-20210203 /path/to/audio_dir -o /path/to/output_folder -T 1000 -b
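
To make the -T behavior concrete, here is a minimal sketch of that merging heuristic (not the repository's actual code); it operates on segment dicts like those in the output below:

# Hypothetical sketch of the -T merging heuristic.
def merge_short_segments(segments, min_len_ms):
    out = []
    for seg in segments:
        too_short = (seg["end"] - seg["start"]) * 1000 < min_len_ms
        if too_short and out:
            out[-1]["end"] = seg["end"]  # merge into the previous segment
        elif too_short:
            continue  # omit a short segment at the very beginning
        else:
            out.append(dict(seg))
    return out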

The processed results are stored as JSON files in the target directory, named after the audio input. Each JSON file is a list of segments, each with a label and start & end times in seconds. For example:

[
    {
        "label": "non-applause",
        "start": 0.0,
        "end": 0.64
    },
    {
        "label": "applause",
        "start": 0.65,
        "end": 6.78
    },
    {
        "label": "non-applause",
        "start": 6.79,
        "end": 373.83
    },
    {
        "label": "applause",
        "start": 373.84,
        "end": 379.55
    }
]
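
The output can be read back with standard JSON tooling; for example, to list just the applause spans (the file name here is hypothetical):

import json

# Load one segmenter output file and print the applause spans.
with open("mysound124.json") as f:
    segments = json.load(f)

for seg in segments:
    if seg["label"] == "applause":
        print(f'applause from {seg["start"]:.2f}s to {seg["end"]:.2f}s')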
