Gluon Audio is a toolkit providing deep-learning-based audio recognition algorithms. The project is still under development, and its introduction was originally written in Chinese.
GluonAR is based on MXNet Gluon. If you are new to it, please check out the DMLC 60-minute crash course.
Although the project is named GluonAR, for now and for the foreseeable future it only covers Text-Independent Speaker Recognition.
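Implemented features:
- Audio data loading with av (a Pythonic binding for ffmpeg) and librosa.
- Modules support hybridize(). The forward pass does not use pysound, librosa, or scipy, which makes it more efficient and enables fast training and end-to-end deployment. These modules include:
  - A short-time Fourier transform (STFTBlock) based on nd.contrib.fft, and a z-score block. Compared with preprocessing in numpy and scipy and then loading the data onto the GPU, training efficiency improves by 12%.
  - MelSpectrogram, DCT1D, MFCC, PowerToDB
  - SincBlock, as proposed in 1808.00158
- Gluon-style loading of the VOX dataset
- Speaker Verification, similar to face verification
- An example of training voiceprint features on spectrograms; 1:1 verification accuracy on VOX1: 0.941152+-0.004926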
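The 1:1 verification setting mentioned above can be sketched roughly as follows. This is a minimal NumPy illustration, not the project's evaluation code. It assumes each utterance has already been mapped to an embedding vector, scores a pair with cosine similarity, and accepts the pair when the score exceeds a threshold. The function names, the toy embeddings, and the threshold value are all hypothetical.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def verification_accuracy(pairs, labels, threshold):
    """Fraction of pairs classified correctly at a fixed threshold.

    pairs  : list of (embedding_a, embedding_b) tuples
    labels : 1 if both embeddings belong to the same speaker, else 0
    """
    scores = np.array([cosine_similarity(a, b) for a, b in pairs])
    predictions = (scores > threshold).astype(int)
    return float(np.mean(predictions == np.asarray(labels)))


# Toy embeddings standing in for the output of a trained network (made up).
spk_a1 = np.array([1.0, 0.0, 0.2])
spk_a2 = np.array([0.9, 0.1, 0.1])
spk_b1 = np.array([0.0, 1.0, 0.0])

pairs = [(spk_a1, spk_a2), (spk_a1, spk_b1)]
labels = [1, 0]
acc = verification_accuracy(pairs, labels, threshold=0.5)
```

In face-verification-style evaluation the threshold is usually selected by cross-validation over several folds, which yields a mean accuracy and a deviation. Whether the figure reported above was obtained this way is an assumption.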
example:
import numpy as np
import mxnet as mx
import librosa as rosa
from gluonar.utils.viz import view_spec
from gluonar.nn.basic_blocks import STFTBlock
data = rosa.load(r"resources/speaker_recognition/speaker0_0.m4a", sr=16000)[0][:35840]
nd_data = mx.nd.array([data], ctx=mx.gpu())
stft = STFTBlock(35840, hop_length=160, win_length=400)
stft.initialize(ctx=mx.gpu())
# stft block forward
ret = stft(nd_data).asnumpy()[0][0]
spec = np.transpose(ret, (1, 0)) ** 2
view_spec(spec)
# stft in librosa
spec = rosa.stft(data, hop_length=160, win_length=400, window="hamming")
spec = np.abs(spec) ** 2
view_spec(spec)
Output:
STFTBlock | STFT in librosa |
---|---|
For more examples, see examples/.
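As a rough illustration of what the PowerToDB and z-score blocks listed among the features do to a power spectrogram such as the one computed in the example, here is a small NumPy sketch. It is not the gluonar implementation: the function names and the defaults `amin` and `eps` are assumptions, and the input is random data of an arbitrary shape rather than a real spectrogram.

```python
import numpy as np


def power_to_db(power_spec, amin=1e-10):
    """Convert a power spectrogram to decibels: 10 * log10(max(S, amin))."""
    return 10.0 * np.log10(np.maximum(power_spec, amin))


def z_score(features, eps=1e-8):
    """Normalise to zero mean and unit variance over the whole array."""
    return (features - features.mean()) / (features.std() + eps)


# Random non-negative values standing in for a (freq_bins, frames) power spectrogram.
rng = np.random.default_rng(0)
power_spec = rng.random((201, 225)) ** 2

db_spec = power_to_db(power_spec)
normalised = z_score(db_spec)
```

Clamping with `amin` avoids taking the logarithm of zero in silent frames, and the small `eps` guards against division by zero when the input is constant.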
Requirements: mxnet 1.5.0+, gluonfr, av, librosa, ...
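The choice of audio library is driven mainly by data loading speed. During training, decoding audio takes more time than decoding images. In our tests, librosa took about 8 times as long as pyav to load a short AAC-encoded audio clip from disk.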
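A comparison like this could be reproduced with a small timing helper along the following lines. This is only a sketch: the helper and loader names are hypothetical, the file path is a placeholder, and the two loaders do not return identical arrays (librosa returns mono floating-point samples, while PyAV returns frames in the stream's own sample format), so it measures decoding time only.

```python
import time


def load_with_librosa(path):
    """Decode an audio file with librosa at its native sample rate."""
    import librosa  # imported lazily so the timing helper works without it
    samples, _sample_rate = librosa.load(path, sr=None)
    return samples


def load_with_pyav(path):
    """Decode the first audio stream of a file with PyAV."""
    import av
    import numpy as np
    with av.open(path) as container:
        frames = [frame.to_ndarray() for frame in container.decode(audio=0)]
    return np.concatenate(frames, axis=1)


def average_load_time(loader, path, repeats=10):
    """Average wall-clock seconds per call of loader(path)."""
    start = time.perf_counter()
    for _ in range(repeats):
        loader(path)
    return (time.perf_counter() - start) / repeats


# Example usage (the file path is a placeholder):
# t_librosa = average_load_time(load_with_librosa, "clip.m4a")
# t_pyav = average_load_time(load_with_pyav, "clip.m4a")
# print(t_librosa / t_pyav)
```

Passing `sr=None` keeps librosa from resampling, so the measurement is closer to pure decoding cost. The ratio will vary with the codec, file length, and disk cache state.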
- librosa
pip install librosa
- ffmpeg
# Download the ffmpeg source code and enter its root directory
./configure --extra-cflags=-fPIC --enable-shared
make -j
sudo make install
- pyav (requires ffmpeg to be installed first)
pip install av
- gluonfr
pip install git+https://github.com/THUFutureLab/gluon-face.git@master
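After installation, a quick check like the one below can confirm that the packages are importable. It is a small sketch using only the standard library. The module names are assumed to be the import names of the packages listed above, and the helper name is hypothetical.

```python
import importlib.util

# Import names of the packages listed above (assumed, not verified here).
REQUIRED_MODULES = ["mxnet", "gluonfr", "av", "librosa"]


def missing_modules(names):
    """Return the module names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```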
The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT) Training and Test Data. Before using this dataset, please follow the instructions at the link.
A copy of this was uploaded to Google Drive by @philipperemy here.
VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube.
For more information, check out this page.
Download: Baidu, Google Drive
I followed the ideas in the VoxCeleb2 paper (1806.05622) to train this model. The differences between them:
The toolchain for training speaker recognition with MXNet Gluon will be filled in gradually; this is expected to take a very long time.
Please check out http://gluon-audio.readthedocs.io/.
{ haoxintong }
If you have any suggestions, please open an issue.
The final goal of this project is to provide an easy-to-use, deep-learning-based audio algorithm library like pytorch-kaldi.
Contributions are welcome.
- MXNet Documentation and Tutorials https://zh.diveintodeeplearning.org/