Simple audio segmenter to isolate speech portions from audio streams. Uses a simple feedforward MLP for classification (implemented using tensorflow) and heuristic smoothing methods to increase the recall of speech segments.
This version is modified from the brandeis-llc repository to use applause, speech, music, noise, and silence as possible labels, and to handle binary classification of applause (rather than speech).
- System packages:
  - ffmpeg
- Python packages:
  - librosa
  - tensorflow or tensorflow-gpu >=2.0.0
  - numpy
  - scipy
  - scikit-learn
  - ffmpeg-python
I recommend using conda to install the requirements on Windows (some of the packages are annoying to install on Windows using pip alone!). I prefer Miniconda, a lightweight version of Anaconda.
If you don't want to mess with creating an environment on your local machine, you can run this process in Google Colab, Google's hosted version of Jupyter Notebooks. I've created a sample notebook, available here, for you to use. To make a copy for yourself, go to File > Save a copy in Drive.
We provide two pretrained models. Both models are trained on 3-second clips from the MUSAN corpus, HIPSTAS applause samples, and sound from Indiana University collections, using the labels applause, speech, music, noise, and silence. The models are then serialized in the tensorflow::SavedModel format. The applause-binary-xxxxxxxx model is trained to predict applause vs. non-applause; the non-binary-xxxxxxxx model uses all of the above labels. Because of the distribution bias in the corpus (far fewer noise and silence samples in the training data), we randomly upsampled the minority classes.
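Random upsampling of minority classes can be sketched roughly as follows. This is an illustration only, not the code used to train the pretrained models: the function name upsample_minority, the choice to match every class to the largest class, and the fixed seed are all assumptions made for the example.

```python
import numpy as np

def upsample_minority(features, labels, seed=0):
    """Resample with replacement so every class is as large as the biggest one.

    Hypothetical helper for illustration; not part of this repository.
    """
    rng = np.random.default_rng(seed)
    features = np.asarray(features)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()

    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(labels == cls)
        # Keep every original sample, then draw the shortfall with replacement.
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))

    keep = np.concatenate(keep)
    rng.shuffle(keep)  # avoid long runs of a single class
    return features[keep], labels[keep]
```

Upsampling should be applied to the training split only, so that duplicated samples do not leak into evaluation data.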
To train your own model, invoke run.py with the -t flag and pass the name of the directory where the training data is stored. Each file in your training set should have its label at the start of the file name, followed by a -; for example, applause-mysound124.wav (see the extract_all function in feature.py).
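The naming convention above can be checked before training with a small helper like the one below. It is a sketch under the assumption that the label is everything before the first - in the file name; label_from_filename is a hypothetical name, and extract_all in feature.py may parse names differently.

```python
from pathlib import Path

def label_from_filename(path):
    """Return the text before the first '-' in the file name.

    Hypothetical helper for illustration; not part of this repository.
    """
    name = Path(path).name
    label, sep, _rest = name.partition("-")
    if not sep or not label:
        raise ValueError(f"no '-'-separated label prefix in {name!r}")
    return label
```

For instance, label_from_filename("training/applause-mysound124.wav") yields "applause", while a file with no - in its name raises a ValueError, which makes mislabeled files easy to spot.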
To run the segmenter over audio files, invoke run.py with the -s flag, and pass 1) the model path (feel free to use a pretrained model if needed) and 2) the directory where the audio files are stored. Currently it will process all mp3 and wav files in the target directory. If you want to process other types of audio files, add to or change the file_ext list near the bottom of run.py.
If you want to use binary classification, include the -b flag.
If you want to specify a minimum segment length, use the -T flag and specify a number of milliseconds. Shorter segments will be merged with the previous one (short segments at the beginning will be omitted).
For example:
python run.py -s /path/to/pretrained/applause-binary-20210203 /path/to/audio_dir -o /path/to/output_folder -T 1000 -b
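The minimum-length rule behind -T can be sketched as below. This is an illustration of the behavior described above, not the code in run.py: the function name merge_short_segments is hypothetical, segments are assumed to be dicts with label, start, and end in seconds, and joining neighbours that end up with the same label is an extra assumption.

```python
def merge_short_segments(segments, min_ms):
    """Fold segments shorter than min_ms into the preceding segment.

    Hypothetical helper for illustration; not part of this repository.
    """
    min_s = min_ms / 1000.0
    merged = []
    for seg in segments:
        too_short = (seg["end"] - seg["start"]) < min_s
        if too_short:
            if merged:
                # Absorb the short segment by extending the previous one.
                merged[-1]["end"] = seg["end"]
            # A short segment with nothing before it is dropped.
            continue
        if merged and merged[-1]["label"] == seg["label"]:
            # Assumption: adjacent segments with the same label are joined.
            merged[-1]["end"] = seg["end"]
        else:
            merged.append(dict(seg))  # copy so the input is not mutated
    return merged
```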
The processed results are stored as a JSON file, named after the audio input, in the target directory. The JSON contains a list of segments, each with a label and start and end times in seconds. For example:
[
  {
    "label": "non-applause",
    "start": 0.0,
    "end": 0.64
  },
  {
    "label": "applause",
    "start": 0.65,
    "end": 6.78
  },
  {
    "label": "non-applause",
    "start": 6.79,
    "end": 373.83
  },
  {
    "label": "applause",
    "start": 373.84,
    "end": 379.55
  }
]
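Output in this shape is easy to post-process with the standard json module. The sketch below sums the time covered by one label; total_duration is a hypothetical helper, and the inline string stands in for a result file that would normally be read with json.load.

```python
import json

def total_duration(segments, label):
    """Sum (end - start) in seconds over all segments with the given label."""
    return sum(s["end"] - s["start"] for s in segments if s["label"] == label)

# Stand-in for the contents of a result file written by the segmenter.
raw = """
[
  {"label": "non-applause", "start": 0.0, "end": 0.64},
  {"label": "applause", "start": 0.65, "end": 6.78},
  {"label": "non-applause", "start": 6.79, "end": 373.83},
  {"label": "applause", "start": 373.84, "end": 379.55}
]
"""
segments = json.loads(raw)
applause_seconds = total_duration(segments, "applause")
print(f"applause: {applause_seconds:.2f} s")
```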