
Speech/Music Discrimination using SampleCNN

Experimenting with sample-level CNNs for Speech/Music discrimination.

Motivation

German public broadcasters such as RBB radio1 offer a myriad of interesting podcasts and programs. One of them is Schöne Töne, a two-hour broadcast dedicated to modern ambient, artcore, and orchestral music.

radio1 allows you to listen to the broadcast later over the internet. However, they don't bother removing the news/service block in the middle of the audio track. Of course, I could rip it out by hand, but maybe we can try a semi-automated way to do this with PyTorch.

Implementation

We train a SampleCNN on the GTZAN Music-Speech discrimination dataset. The reasoning is that the news segment consists almost entirely of human speech, so a speech/music discrimination model should be a good fit.

SampleCNN is trained with the recommended dropout value of 0.5. A cyclic learning rate scheduler (CyclicLR) is used to cycle through a range of learning rates and momentum values for SGD.
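As a rough sketch, the optimizer/scheduler setup could look like the following in PyTorch. The learning-rate bounds, step size, and the stand-in model below are illustrative assumptions, not the repository's actual hyperparameters:

import torch
import torch.nn as nn

# Stand-in for the SampleCNN; the real architecture lives in this repo.
model = nn.Sequential(nn.Conv1d(1, 16, 3), nn.AdaptiveAvgPool1d(1),
                      nn.Flatten(), nn.Linear(16, 1))
criterion = nn.BCEWithLogitsLoss()

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer,
    base_lr=1e-4,         # assumed lower bound of the learning-rate cycle
    max_lr=1e-2,          # assumed upper bound
    step_size_up=2000,    # assumed number of batches per half cycle
    cycle_momentum=True,  # also cycle SGD momentum between the bounds
    base_momentum=0.8,
    max_momentum=0.9,
)

for step in range(100):  # the actual run trains for 20000 batches
    x = torch.randn(32, 1, 59049)             # batch of sample windows
    y = torch.randint(0, 2, (32, 1)).float()  # 0 = music, 1 = speech
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()  # CyclicLR advances once per batch, not per epoch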

Training samples are continuously produced through randomized window sampling. The model receives 59049 samples (about 3.7 s at 16 kHz) as input. Each track in GTZAN is 30 s long, so with 120 tracks the total number of distinct example windows is ((16000*30)-59049+1)*120 = 50514240. However, training for 20000 batches (with e.g. 32 sample windows per batch, about 640000 examples in total) is sufficient to produce a usable model.
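A minimal sketch of this window sampling, assuming tracks are loaded as 16 kHz mono arrays via librosa (the function name is hypothetical):

import numpy as np
import librosa

WINDOW = 59049  # model input length in samples
SR = 16000      # sampling rate used for training

def sample_window(path, rng=np.random.default_rng()):
    # Load the track and cut a random WINDOW-sized slice out of it.
    audio, _ = librosa.load(path, sr=SR, mono=True)
    start = rng.integers(0, len(audio) - WINDOW + 1)
    return audio[start:start + WINDOW]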

Test

Does it work? Seemingly, yes. Using the 10.10.2019 recording of Schöne Töne as test material, the discriminator correctly recognizes the ~5-minute news block:

$ python3 -B discriminator st_20191010.wav 2>/dev/null
00:00:00 🎶 (0.000)
00:00:01 🎶 (0.000)
00:00:02 🎶 (0.000)
00:00:03 🎶 (0.000)
00:00:04 🎶 (0.000)
00:00:05 🎶 (0.000)
…
00:56:21 🎤 (0.964)
00:56:22 🎤 (1.000)
00:56:23 🎤 (1.000)
00:56:24 🎤 (1.000)
00:56:25 🎤 (1.000)
00:56:26 🎤 (1.000)
00:56:27 🎤 (1.000)
00:56:28 🎤 (1.000)
00:56:29 🎤 (1.000)
…
01:00:04 🎤 (0.723)
01:00:05 🎤 (0.909)
01:00:06 🎤 (0.933)
01:00:07 🎶 (0.043)
01:00:08 🎶 (0.009)
01:00:09 🎶 (0.074)
01:00:10 🎶 (0.007)
01:00:11 🎶 (0.003)

Notes

The resulting model overfits to human speech (which is not a disadvantage in this case). Speech with music in the background is therefore classified as music (which may well be the correct decision, depending on who you ask). The decision threshold can be adjusted via the discriminator's --alpha option, as shown below.
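For example, a stricter notion of what counts as speech might look like this (the value 0.8, and the exact semantics and placement of --alpha, are assumptions here):

$ python3 -B discriminator --alpha 0.8 st_20191010.wav 2>/dev/null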

A trained model is included. In addition, a Makefile that runs the training workflow and writes the model is provided.

Dependencies

  • torch
  • librosa
  • ffmpeg-python

The last of these (ffmpeg-python) is used for audio file conversion and resampling in the discriminator. Note that you must have an ffmpeg binary somewhere in your $PATH for it to work.
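A minimal sketch of the kind of decode-and-resample step this enables, assuming the goal is 16 kHz mono float32 samples (the function name and exact options are illustrative, not the discriminator's actual code):

import numpy as np
import ffmpeg

def load_audio(path, sr=16000):
    # Decode any ffmpeg-readable file to 16 kHz mono float32 samples.
    out, _ = (
        ffmpeg.input(path)
        .output('pipe:', format='f32le', acodec='pcm_f32le', ac=1, ar=sr)
        .run(capture_stdout=True, capture_stderr=True)
    )
    return np.frombuffer(out, dtype=np.float32)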
