Voice Filter

This is a TensorFlow/Keras implementation of Google AI's VoiceFilter.

Our work is inspired by the academic paper: https://arxiv.org/abs/1810.04826

The implementation is based on this work: https://github.com/mindslab-ai/voicefilter

Team Members

Introduction

We aim to improve the accuracy of automatic speech recognition (ASR) by separating the primary speaker's speech from the rest of the audio. Applications include chatbots, voice assistants, and video conferencing.

Who is our primary speaker?

Every user of a service records a voice print during enrolment. The voice print associated with the account is then used to identify the primary speaker.

How is the voice print recorded?

An audio clip is processed by a separately trained deep neural network to generate a speaker-discriminative embedding. As a result, every speaker is represented by a vector of length 256.

How to prepare the dataset?

We use the publicly available LibriSpeech speech dataset. Each sample is built as follows. A primary and a secondary speaker are selected at random. For the primary speaker, one random speech clip is chosen as the reference and one as the input. For the secondary speaker, one random speech clip is chosen. The input clips of the primary and secondary speakers are mixed, and the mixture serves as one of the model inputs. The reference clip is passed through a pre-trained model (source: https://github.com/mindslab-ai/voicefilter) to create an embedding, which serves as the other input. The target output is the input clip of the primary speaker.

The speech clips are not used directly. Instead, they are converted into magnitude spectrograms before being fed into the deep neural network. We used Python's librosa library for all audio-related functions.
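The mixing and spectrogram steps described above can be sketched as follows. This is a minimal illustration, not the project's code: the project uses librosa, while this sketch uses plain NumPy, and the FFT size, hop length, sample rate, and synthetic sine tones standing in for speech are all assumptions chosen for the example.

```python
import numpy as np

def mix_waveforms(primary, secondary):
    """Sum two mono waveforms after trimming both to the shorter length."""
    n = min(len(primary), len(secondary))
    return primary[:n] + secondary[:n]

def magnitude_spectrogram(wave, n_fft=512, hop=128):
    """Return |STFT| with shape (frames, n_fft // 2 + 1), using a Hann window.

    n_fft and hop are illustrative values, not the project's settings.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# Synthetic stand-ins for the two speech clips (one second each, assumed 16 kHz).
sr = 16000
t = np.arange(sr) / sr
primary = np.sin(2 * np.pi * 440 * t)          # plays the role of the primary speaker
secondary = 0.5 * np.sin(2 * np.pi * 1000 * t)  # plays the role of the secondary speaker

mixed = mix_waveforms(primary, secondary)
mixed_mag = magnitude_spectrogram(mixed)     # model input
target_mag = magnitude_spectrogram(primary)  # training target
```

In the real pipeline the waveforms would come from LibriSpeech files, and the embedding of the reference clip would be supplied as the second model input.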

We created a dataset of 29351 samples, divided into 8 parts for ease of use with limited RAM. Link to the Kaggle dataset: https://www.kaggle.com/abhinavjain02/speech-separation
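One way to consume a dataset split into parts without exceeding RAM is a generator that loads a single part at a time. The sketch below is a hypothetical illustration of that idea: `batch_generator`, the loader callables, and the toy shard contents are invented for the example and are not taken from the project.

```python
import random

def batch_generator(shard_loaders, batch_size, shuffle=True, seed=None):
    """Yield batches while holding only one shard in memory at a time.

    shard_loaders: list of zero-argument callables, each returning the
    samples of one dataset part (hypothetical interface).
    """
    rng = random.Random(seed)
    order = list(range(len(shard_loaders)))
    if shuffle:
        rng.shuffle(order)  # visit the parts in random order
    for idx in order:
        samples = list(shard_loaders[idx]())  # load exactly one part
        if shuffle:
            rng.shuffle(samples)  # shuffle within the part
        for start in range(0, len(samples), batch_size):
            yield samples[start:start + batch_size]

# Toy example: 8 parts of 10 samples each, tagged with their part index.
shards = [lambda i=i: [(i, j) for j in range(10)] for i in range(8)]
batches = list(batch_generator(shards, batch_size=4, seed=0))
```

This generator makes a single pass over the data. A training loop that expects an endless generator would need it wrapped in an outer loop that restarts it after each pass.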

Stats on Prepared Data

It took around 11 hours to prepare the dataset on Google Colab. The code is present in the dataset folder.

Note: All ordered pairs of primary and secondary speakers are unique.
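Uniqueness of ordered (primary, secondary) pairs can be enforced with simple rejection sampling. The following is a sketch under assumptions: the function name, the speaker IDs, and the sampling strategy are hypothetical and may differ from the code in the dataset folder.

```python
import random

def sample_unique_ordered_pairs(speakers, n_pairs, seed=None):
    """Draw n_pairs distinct ordered (primary, secondary) speaker pairs.

    (A, B) and (B, A) count as different pairs; a speaker is never
    paired with itself. Assumes the speaker IDs are unique.
    """
    speakers = list(speakers)
    max_pairs = len(speakers) * (len(speakers) - 1)
    if n_pairs > max_pairs:
        raise ValueError(f"only {max_pairs} ordered pairs are possible")
    rng = random.Random(seed)
    seen = set()
    pairs = []
    while len(pairs) < n_pairs:
        primary, secondary = rng.sample(speakers, 2)  # two distinct speakers
        if (primary, secondary) not in seen:          # reject repeated pairs
            seen.add((primary, secondary))
            pairs.append((primary, secondary))
    return pairs

# Hypothetical speaker IDs, for illustration only.
pairs = sample_unique_ordered_pairs(["s1", "s2", "s3", "s4"], 12, seed=1)
```

Rejection sampling is simple but slows down as the requested count approaches the maximum; enumerating all pairs and shuffling them would be an alternative for small speaker sets.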

| Stat/Dataset | Train | Dev | Test |
| --- | --- | --- | --- |
| Total no. of unique speeches available in LibriSpeech Dataset | 28539 | 2703 | 2620 |
| No. of unique speeches used | 26869 | 1878 | 1838 |
| Percentage of total speeches used | 94.15 % | 69.48 % | 70.15 % |
| Total no. of samples prepared | 29351 | 934 | 964 |
| No. of samples with same primary and reference speech | 376 (1.28 %) | 10 (1.07 %) | 11 (1.14 %) |

Proposed System Architecture

Requirements

This code was tested on Python 3.6.9 with Google Colab.

The required packages can be installed with:
```
pip install -r requirements.txt
```

Model

The model architecture follows the academic paper mentioned above. The model takes the input spectrogram and a d-vector (speaker embedding) as inputs and produces a soft mask which, when superimposed on the input spectrogram, yields the output spectrogram. The output spectrogram is combined with the input phase to recreate the primary speaker's audio from the mixed input speech.
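The masking and phase-recombination step can be sketched in NumPy. This assumes that "superimposed" means element-wise multiplication of the mask with the magnitude spectrogram, as in the paper; the function name and the random test data are illustrative only.

```python
import numpy as np

def apply_soft_mask(mixed_stft, mask):
    """Scale the mixed magnitude by the mask and reuse the mixed phase.

    mixed_stft: complex STFT of the mixed audio.
    mask: real-valued array of the same shape (the model's output).
    Returns a complex STFT estimate of the primary speaker.
    """
    mixed_mag = np.abs(mixed_stft)
    mixed_phase = np.angle(mixed_stft)
    est_mag = mask * mixed_mag                  # element-wise soft masking
    return est_mag * np.exp(1j * mixed_phase)   # recombine with input phase

# Random stand-in for a mixed STFT and a predicted mask.
rng = np.random.default_rng(0)
mixed_stft = rng.normal(size=(5, 9)) + 1j * rng.normal(size=(5, 9))
mask = rng.uniform(0.0, 1.0, size=(5, 9))
estimate = apply_soft_mask(mixed_stft, mask)
```

An inverse STFT (for example `librosa.istft`) would then convert the estimate back to a waveform.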

| Loss Function | Optimizer | Metrics |
| --- | --- | --- |
| Mean Squared Error (MSE) | Adam | Source to Distortion Ratio (SDR) |
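For intuition about the SDR metric, the sketch below computes a simplified signal-to-distortion ratio in decibels, treating everything in the estimate that differs from the reference as distortion. This is an assumption made for illustration; the metric used in the project may be computed differently (for example on spectrograms or with a fuller decomposition).

```python
import math

def sdr_db(reference, estimate, eps=1e-12):
    """Simplified SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    eps guards against division by zero and log of zero.
    """
    signal = sum(r * r for r in reference)
    distortion = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal + eps) / (distortion + eps))

# An estimate that is 90 % of the reference leaves 10 % residual amplitude,
# i.e. 1 % residual energy, which corresponds to about 20 dB.
reference = [1.0, 0.0, -1.0, 0.0]
estimate = [0.9 * r for r in reference]
score = sdr_db(reference, estimate)
```

Higher values mean the estimate is closer to the clean reference.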

Training

The model was trained on Google Colab for 30 epochs.
Training took about 37 hours on an NVIDIA Tesla P100 GPU.

Results

Loss

Validation SDR

Test

Note: The following results are based on the model weights after the 29th epoch (peak SDR on validation).

| Loss | SDR |
| --- | --- |
| 0.0104 | 5.3250 |

Audio Samples

Listen to the sample audio from the assets/audio_samples folder.

Key learnings:

Processing audio data using librosa
Creating flexible architectures using the Keras functional API
Using a custom generator in Keras
Using custom callbacks in Keras
Multiprocessing in Python
