End-to-End Synthetic Speech Detection

Important Notice (Oct. 2021)

The results reported in our paper were obtained on Windows. We recently found that running the same repo on the same dataset under Linux, using the pretrained models, yields different results:

Res-TSSDNet ASVspoof2019 eval EER: 1.6590%;
Inc-TSSDNet ASVspoof2019 eval EER: 4.0384%.

We have identified issues with the soundfile package on Windows when writing and reading flac files; the same package does not show this problem on Linux. A similar problem has been pointed out here.

About

We present two lightweight neural network models, termed time-domain synthetic speech detection nets (TSSDNet), with classic ResNet-style and Inception-style structures (Res-TSSDNet and Inc-TSSDNet, respectively), for end-to-end synthetic speech detection. They achieve state-of-the-art performance in terms of equal error rate (EER) on the ASVspoof 2019 challenge and also show promising generalization capability when tested on ASVspoof 2015.

Dataset

ASVspoof 2019 LA partition. link
ASVspoof 2015. link

ASVspoof 2019 train set is used for training;
ASVspoof 2019 dev set is used for model selection;
ASVspoof 2019 eval set is used for testing;
ASVspoof 2015 eval set is used for cross-dataset testing.
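
The split roles above imply a simple selection rule: pick the epoch with the lowest dev EER, then report the eval EER of that same epoch. The following is a minimal sketch of that rule, not code from this repo. The function name select_epoch and the per-epoch numbers are hypothetical and serve only as an illustration.

```python
def select_epoch(dev_eers, eval_eers):
    """Return (epoch, dev_eer, eval_eer) for the epoch with the lowest dev EER.

    Epochs are numbered from 1. On ties, the earliest epoch wins.
    """
    if not dev_eers or len(dev_eers) != len(eval_eers):
        raise ValueError("need two non-empty lists of equal length")
    best = min(range(len(dev_eers)), key=lambda i: dev_eers[i])
    return best + 1, dev_eers[best], eval_eers[best]


# Made-up EERs in percent, for illustration only (not results from the paper).
dev_eers = [2.10, 0.95, 0.74, 0.81]
eval_eers = [5.30, 2.40, 1.90, 1.70]

epoch, dev_eer, eval_eer = select_epoch(dev_eers, eval_eers)
print(f"epoch {epoch}: dev EER {dev_eer:.2f}%, eval EER {eval_eer:.2f}%")
```

Note that the selected epoch need not have the lowest eval EER; the eval set plays no part in the choice.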

Model Architecture

Main Results

The two models, with 1.64% and 4.04% eval EER (below), are provided together with their training logs in the pretrained folder.

Fixing all hyperparameters, the distribution of the lowest dev (and the corresponding eval) EERs among 100 epochs, trained from scratch (below):

Usage

Data Preparation

ASVspoof15&19_LA_Data_Preparation.py

It generates

equal-duration time domain raw waveform
2D log power of constant Q transform

from each of the official ASVspoof 2019 and ASVspoof 2015 datasets. The CQT computation is adopted from Li et al., ICASSP 2021.
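
One common way to obtain equal-duration waveforms is to truncate long utterances and repeat short ones until they reach a fixed length. The sketch below illustrates that idea with NumPy. It is an assumption about the general technique, not the preparation script itself: the helper name fix_duration and the default target of 96000 samples (6 s at 16 kHz) are illustrative choices, so check the script for the values actually used.

```python
import numpy as np


def fix_duration(waveform, target_len=96000):
    """Truncate or repeat a 1D waveform so it has exactly target_len samples.

    target_len=96000 is an assumed default (6 s at 16 kHz), not a value
    taken from the repo.
    """
    waveform = np.asarray(waveform, dtype=np.float32)
    if waveform.ndim != 1 or waveform.size == 0:
        raise ValueError("expected a non-empty 1D waveform")
    if waveform.size >= target_len:
        return waveform[:target_len]
    repeats = -(-target_len // waveform.size)  # ceiling division
    return np.tile(waveform, repeats)[:target_len]


# Usage: a 1 s and a 10 s dummy signal both end up with the same length.
short = np.linspace(-1.0, 1.0, 16000)
long = np.zeros(160000)
print(fix_duration(short).shape, fix_duration(long).shape)
```

Repeating the signal, rather than zero-padding, avoids introducing silent segments into short utterances; zero-padding is an equally simple alternative.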

Training

train.py

It supports training using

standard cross-entropy vs weighted cross-entropy
standard train loader vs mixup regularization
1D raw waveforms vs 2D CQT feature
ASVspoof 2019 training set vs ASVspoof 2015 training set
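
To make the first two options concrete, here is a minimal NumPy sketch of weighted cross-entropy combined with mixup. It is written from the standard definitions of both techniques and is not taken from train.py. All function names, the class weights, and the mixup alpha are hypothetical. The weighted mean follows the same normalization convention as torch.nn.CrossEntropyLoss with a weight argument (division by the sum of the per-sample weights).

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def weighted_cross_entropy(logits, labels, class_weights):
    """Cross-entropy where each sample is weighted by the weight of its class."""
    logp = log_softmax(np.asarray(logits, dtype=np.float64))
    w = np.asarray(class_weights, dtype=np.float64)[labels]
    nll = -logp[np.arange(len(labels)), labels]
    return float((w * nll).sum() / w.sum())


def mixup_batch(x, labels, alpha, rng):
    """Blend each sample with a randomly chosen partner from the same batch."""
    lam = float(rng.beta(alpha, alpha))
    perm = rng.permutation(len(x))
    mixed = lam * x + (1.0 - lam) * x[perm]
    return mixed, labels, labels[perm], lam


def mixup_loss(logits, labels_a, labels_b, lam, class_weights):
    """Mixup loss: the same convex combination applied to the two label sets."""
    loss_a = weighted_cross_entropy(logits, labels_a, class_weights)
    loss_b = weighted_cross_entropy(logits, labels_b, class_weights)
    return lam * loss_a + (1.0 - lam) * loss_b


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))           # dummy batch of 4 short "waveforms"
labels = np.array([0, 1, 1, 1])           # assumed encoding of the two classes
class_weights = np.array([0.9, 0.1])      # hypothetical weights for an imbalanced set

mixed, y_a, y_b, lam = mixup_batch(x, labels, alpha=0.5, rng=rng)
logits = rng.standard_normal((4, 2))      # stand-in for model(mixed)
print(mixup_loss(logits, y_a, y_b, lam, class_weights))
```

Setting both class weights to 1 recovers standard cross-entropy, and skipping mixup_batch (or fixing lam to 1) recovers the standard train loader.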

A training log is generated, and the trained model is saved after every epoch.

Testing

test.py

It generates softmax accuracy, ROC curve, and EER.
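
For readers unfamiliar with the metric, the EER is the error rate at the threshold where the false acceptance rate equals the false rejection rate. The sketch below computes it from scores and labels with a plain threshold sweep. It is an illustration of the definition and not the code in test.py; the function name compute_eer and the label convention (1 for the positive class, higher score meaning more likely positive) are assumptions.

```python
import numpy as np


def compute_eer(scores, labels):
    """Return (eer, threshold) from a threshold sweep over the observed scores.

    Assumed convention: labels are 1 (positive) or 0 (negative), and a sample
    is accepted as positive when score >= threshold.
    """
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("both classes must be present")

    thresholds = np.unique(scores)  # sorted, one candidate per distinct score
    far = np.array([(neg >= t).mean() for t in thresholds])  # false acceptance
    frr = np.array([(pos < t).mean() for t in thresholds])   # false rejection

    # With finitely many samples the two curves may not cross exactly,
    # so take the closest point and average the two rates there.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0), float(thresholds[idx])


# Usage on a tiny toy example.
eer, threshold = compute_eer([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(f"EER = {100 * eer:.2f}% at threshold {threshold}")
```

The sweep is quadratic in the number of trials and is meant for clarity only; for a full evaluation set, a sort-based computation or the ROC points from sklearn.metrics.roc_curve would be a more practical starting point.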

Citation Information

G. Hua, A. B. J. Teoh, and H. Zhang, “Towards end-to-end synthetic speech detection,” IEEE Signal Processing Letters, vol. 28, pp. 1265–1269, 2021. arXiv | IEEE Xplore
