Gluon Audio Toolkit

Gluon Audio is a toolkit providing deep-learning-based audio recognition algorithms. The project is still under development.

GluonAR Introduction:

GluonAR is based on MXNet Gluon. If you are new to it, please check out the dmlc 60-minute crash course.

Although the project is named GluonAR, for now and for the foreseeable future it only covers Text-Independent Speaker Recognition.

Implemented features:

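Audio data loading with av (a Pythonic binding for ffmpeg) and librosa.
Modules support hybridize(). The forward pass does not use pysound, librosa, or scipy, which makes it more efficient and enables fast training and end-to-end deployment. This includes:
- A short-time Fourier transform (STFTBlock) based on nd.contrib.fft, and a z-score block. Training is 12% more efficient than preprocessing with numpy and scipy and then loading the data onto the GPU.
- MelSpectrogram, DCT1D, MFCC, PowerToDB
- SincBlock, as proposed in 1808.00158
Gluon-style loading of the VOX dataset.
Speaker Verification, similar to face verification.
An example of training voiceprint features from spectrograms, with a 1:1 verification acc on VOX1 of 0.941152+-0.004926.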
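
To make the preprocessing chain above concrete, here is a minimal NumPy reference sketch of a framed STFT followed by a power-to-dB conversion and z-score normalization. It is not the gluonar implementation: the function names (stft_power, power_to_db, z_score), the lack of padding, and the whole-array normalization are assumptions chosen for illustration only.

```python
import numpy as np


def stft_power(signal, win_length=400, hop_length=160):
    """Power spectrogram from Hamming-windowed frames (no padding)."""
    n_frames = 1 + (len(signal) - win_length) // hop_length
    # Index matrix of shape (n_frames, win_length) selecting each frame.
    idx = np.arange(win_length)[None, :] + hop_length * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(win_length)
    # rfft keeps the win_length // 2 + 1 non-negative frequency bins.
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


def power_to_db(power, eps=1e-10):
    """Convert power to decibels, flooring at eps to avoid log(0)."""
    return 10.0 * np.log10(np.maximum(power, eps))


def z_score(x, eps=1e-8):
    """Normalize to zero mean and unit variance over the whole array."""
    return (x - x.mean()) / (x.std() + eps)


sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440.0 * t)  # 1 s, 440 Hz test tone
spec = stft_power(wave)               # shape: (frames, bins)
feat = z_score(power_to_db(spec))
```

With a 400-sample window at 16 kHz each frequency bin is 40 Hz wide, so the 440 Hz test tone should peak in bin 11.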

Example:

import numpy as np
import mxnet as mx
import librosa as rosa
from gluonar.utils.viz import view_spec
from gluonar.nn.basic_blocks import STFTBlock

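# Load the clip resampled to 16 kHz and keep the first 35840 samples
data = rosa.load(r"resources/speaker_recognition/speaker0_0.m4a", sr=16000)[0][:35840]
# Add a batch dimension and place the data on the GPU
nd_data = mx.nd.array([data], ctx=mx.gpu())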

stft = STFTBlock(35840, hop_length=160, win_length=400)
stft.initialize(ctx=mx.gpu())

# stft block forward
ret = stft(nd_data).asnumpy()[0][0]
spec = np.transpose(ret, (1, 0)) ** 2
view_spec(spec)

# stft in librosa 
spec = rosa.stft(data, hop_length=160, win_length=400, window="hamming")
spec = np.abs(spec) ** 2
view_spec(spec)

Output:

STFTBlock	STFT in librosa

For more examples, see examples/.

Requirements

mxnet-1.5.0+, gluonfr, av, librosa, ...

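The audio libraries were chosen mainly for data loading speed. During training, decoding audio takes more time than decoding images. In practice, loading a short AAC-encoded audio clip from disk with librosa took about 8 times as long as with pyav.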
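
A comparison like this could be reproduced with a small timing harness. The sketch below is an assumption about how such a measurement might be set up, not the script used for the figure above: the helper names (load_with_pyav, load_with_librosa, mean_seconds) are hypothetical, and librosa is called with sr=None so that resampling is excluded from the timing.

```python
import time


def load_with_pyav(path):
    """Decode every frame of the first audio stream with PyAV."""
    import av  # imported lazily so the timing helper works without PyAV
    import numpy as np
    with av.open(path) as container:
        frames = [frame.to_ndarray() for frame in container.decode(audio=0)]
    # Each frame is a 2-D array; join them along the sample axis.
    return np.concatenate(frames, axis=1)


def load_with_librosa(path):
    """Load the file with librosa at its native sample rate."""
    import librosa
    return librosa.load(path, sr=None)[0]


def mean_seconds(fn, arg, repeats=10):
    """Average wall-clock time of fn(arg) over several runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(arg)
    return (time.perf_counter() - start) / repeats


# Example usage (requires av, librosa, and an audio file on disk):
# path = "resources/speaker_recognition/speaker0_0.m4a"
# ratio = mean_seconds(load_with_librosa, path) / mean_seconds(load_with_pyav, path)
# print("librosa / pyav time ratio:", ratio)
```

The measured ratio will depend on the codec, the clip length, and disk caching, so results may differ from the figure quoted above.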

librosa
pip install librosa

ffmpeg

# Download the ffmpeg source code and enter its root directory
./configure --extra-cflags=-fPIC --enable-shared
make -j
sudo make install

pyav (requires ffmpeg to be installed first)
pip install av
gluonfr
pip install git+https://github.com/THUFutureLab/gluon-face.git@master
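
After installing, a quick check that the required modules are importable can save debugging time later. The following is a minimal sketch using only the standard library. The module names are taken from the requirements above, and missing_modules is a hypothetical helper, not part of gluonar.

```python
import importlib.util

# Import names of the packages listed under Requirements.
REQUIRED = ["mxnet", "gluonfr", "av", "librosa"]


def missing_modules(names):
    """Return the names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
print("missing:", ", ".join(missing) if missing else "none")
```

This only confirms that each module can be located. It does not verify versions (for example mxnet 1.5.0 or later) or that av was built against a working ffmpeg.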

Datasets

TIMIT

The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT) Training and Test Data. Before using this dataset, please follow the instructions on the linked page.

A copy of this was uploaded to Google Drive by @philipperemy here.

VoxCeleb

VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube.

For more information, check out this page.

Pretrained Models

Speaker Recognition

ResNet18 training with VoxCeleb

Download: Baidu, Google Drive

I followed the ideas in the VoxCeleb2 paper (1806.05622) to train this model. The differences between the two are:

-	Res18 in this repo	Res34 in paper
Trained on	VoxCeleb2	VoxCeleb2
Input spec size	224x224	512x300
Eval on	Random 9500+ sample pairs from the VoxCeleb1 train and test sets	Original VoxCeleb1 test set
Metric	Accuracy: 0.932656+-0.005187	EER: 0.0504
Framework	MXNet Gluon	MatConvNet
ROC		-
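
The two columns report different metrics (accuracy versus EER), so the numbers are not directly comparable. The sketch below shows, under simplifying assumptions, how both could be computed from the same set of pair similarity scores. It does not reproduce the evaluation protocol of this repo or of the paper. The function names and the toy scores are hypothetical, and the EER is approximated by scanning the observed scores as thresholds.

```python
def accuracy_at(scores, labels, threshold):
    """Fraction of pairs classified correctly when a score at or above
    the threshold is taken to mean 'same speaker'."""
    hits = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return hits / len(scores)


def equal_error_rate(scores, labels):
    """Approximate EER: try each observed score as a threshold and return
    the mean of FAR and FRR at the point where they are closest."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    best_gap, best_eer = None, None
    for t in sorted(set(scores)):
        far = sum(s >= t for s in neg) / len(neg)  # false accept rate
        frr = sum(s < t for s in pos) / len(pos)   # false reject rate
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Toy similarity scores: label 1 = same speaker, 0 = different speakers.
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(accuracy_at(scores, labels, 0.5))  # 0.75
print(equal_error_rate(scores, labels))  # 0.25
```

Accuracy depends on the chosen threshold, whereas EER is defined at the operating point where the false accept and false reject rates coincide.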

TODO

Next, the toolchain for training speaker recognition with mxnet gluon will be filled in gradually. This is expected to take a very long time.

Docs

Please check out http://gluon-audio.readthedocs.io/ .

Authors

{ haoxintong }

Discussion

If you have any suggestions, please open an issue.

Contributing

The final goal of this project is to provide an easy-to-use, deep-learning-based audio algorithm library like pytorch-kaldi.

Contributions are welcome.

References

MXNet Documentation and Tutorials https://zh.diveintodeeplearning.org/
