GitHub - yangdongchao/SoundStorm: The reproduced code for Google's SoundStorm

SoundStorm: Efficient Parallel Audio Generation (wip)

Unofficial PyTorch implementation of SoundStorm, a parallel audio generation model from Google Research.

This first version is our own implementation. It uses mask-based discrete diffusion, which follows the same process as Google's paper. For model details, please refer to our paper, InstructTTS: https://arxiv.org/pdf/2301.13662.pdf
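To make the idea of mask-based corruption concrete, here is a minimal sketch of the masking step, assuming the absorbing-state ([MASK]) form of discrete diffusion. It is not the repository's code: `MASK_ID`, `mask_tokens`, and the token values are hypothetical names and data chosen for illustration, and plain Python is used so the example has no dependencies.

```python
import random

MASK_ID = -1  # hypothetical id reserved for the [MASK] token (assumption)


def mask_tokens(tokens, mask_ratio, rng):
    """Replace each token with MASK_ID independently with probability mask_ratio."""
    return [MASK_ID if rng.random() < mask_ratio else t for t in tokens]


rng = random.Random(0)
clean = [12, 7, 401, 33, 7, 90, 256, 5]  # made-up acoustic token ids
noisy = mask_tokens(clean, mask_ratio=0.5, rng=rng)

# A model would then be trained to recover `clean` at the masked positions of
# `noisy`; unmasked positions are passed through unchanged.
masked_positions = [i for i, t in enumerate(noisy) if t == MASK_ID]
```

A higher `mask_ratio` corresponds to a later (noisier) diffusion step; at a ratio of 1.0 every token is masked.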

We will soon release a second version based on MaskGIT, which is the approach SoundStorm itself uses.
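For readers unfamiliar with MaskGIT-style decoding, the sketch below shows only its cosine unmasking schedule: how many tokens remain masked after each parallel decoding step. It is an illustration under stated assumptions, not the planned second version; `masked_count_schedule` and its parameters are hypothetical names.

```python
import math


def masked_count_schedule(num_tokens, num_steps):
    """Return how many tokens are still masked after each decoding step.

    Uses a cosine schedule: few tokens are committed in the early steps,
    more in the later ones, and none remain masked after the last step.
    """
    counts = []
    for step in range(1, num_steps + 1):
        ratio = math.cos(0.5 * math.pi * step / num_steps)
        counts.append(int(math.floor(ratio * num_tokens)))
    return counts


# Example: 100 tokens decoded in 8 parallel steps.
counts = masked_count_schedule(num_tokens=100, num_steps=8)
```

At each step a MaskGIT-style decoder predicts all masked tokens at once, keeps the most confident predictions, and re-masks the rest so that the number of masked tokens follows this schedule.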

Overview

Following the paper, we use HuBERT to extract semantic tokens and then use the semantic tokens as the condition to predict all of the acoustic tokens in parallel. Unlike SoundStorm, which uses a sum operation to combine the multiple codebooks, we use a shallow U-Net to combine the different codebooks. For the audio codec, we use the open-source AcademiCodec: https://github.com/yangdongchao/AcademiCodec
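As a point of comparison, the sum-based combination can be sketched as follows: each codebook has its own embedding table, and the embeddings of the tokens belonging to one frame are added element-wise. This is a dependency-free illustration with hypothetical names (`make_table`, `sum_codebook_embeddings`) and random tables; it is not code from this repository, which uses a shallow U-Net for this step instead.

```python
import random


def make_table(vocab_size, dim, rng):
    """Build a random embedding table of shape (vocab_size, dim)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def sum_codebook_embeddings(frame_tokens, tables):
    """Combine one frame's tokens by summing their per-codebook embeddings.

    frame_tokens[q] is the token id drawn from codebook q for this frame;
    tables[q] is the embedding table belonging to codebook q.
    """
    dim = len(tables[0][0])
    combined = [0.0] * dim
    for token_id, table in zip(frame_tokens, tables):
        for d in range(dim):
            combined[d] += table[token_id][d]
    return combined


rng = random.Random(0)
num_codebooks, vocab_size, dim = 4, 16, 8  # illustrative sizes (assumption)
tables = [make_table(vocab_size, dim, rng) for _ in range(num_codebooks)]
frame = [3, 15, 0, 7]  # one token id per codebook for a single frame
vector = sum_codebook_embeddings(frame, tables)
```

The result is a single vector per frame regardless of how many codebooks there are, which is what lets the model treat a frame as one input position.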

Prepare dataset

Please refer to the data_sample folder to understand how to prepare the dataset.

Training

First, prepare your data. Then start training:
bash start/start.sh

Inference

First, revise evaluation/generate_samples_batch.py to match your model. Then run:
python generate_samples_batch.py

Reference

@article{yang2023instructtts,
  title={InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt},
  author={Yang, Dongchao and Liu, Songxiang and Huang, Rongjie and Lei, Guangzhi and Weng, Chao and Meng, Helen and Yu, Dong},
  journal={arXiv preprint arXiv:2301.13662},
  year={2023}
}

@article{google_soundstorm,
  title={SoundStorm: Efficient Parallel Audio Generation},
  author={Borsos, Zal{\'a}n and Sharifi, Matt and Vincent, Damien and Kharitonov, Eugene and Zeghidour, Neil and Tagliasacchi, Marco},
  journal={arXiv preprint arXiv:2305},
  year={2023}
}

@article{yang2023hifi,
  title={HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec},
  author={Yang, Dongchao and Liu, Songxiang and Huang, Rongjie and Tian, Jinchuan and Weng, Chao and Zou, Yuexian},
  journal={arXiv preprint arXiv:2305.02765},
  year={2023}
}
