[ICLR'25] SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation
Koichi Saito, Dongjun Kim, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong, Yuhta Takida, Yuki Mitsufuji
Sony AI, Stanford University, and Sony Group Corporation
ICLR 2025
- Paper: OpenReview
- Checkpoints: Hugging Face
Install Docker and go through docker/README.md to build the Docker container.
Or you can install with pip:
```bash
cd soundctm_dit_iclr
pip install -e .
```
Run inference with `demo.py`:
```bash
python demo.py --prompt "your prompt" --variants "ac_v2" --num_steps 1 --cfg 5.0 --nu 1.0 --output_dir "your output directory"
```
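If you want to generate several prompts in one go, here is a minimal sketch that invokes the `demo.py` CLI once per prompt. It uses only the flags documented above; the prompt list and output paths are illustrative assumptions.

```python
# Hedged sketch: batch generation via the demo.py CLI shown above.
# Only documented flags are used; prompts and output paths are examples.
import subprocess

prompts = ["a dog barking in the distance", "rain hitting a tin roof"]
for i, prompt in enumerate(prompts):
    subprocess.run([
        "python", "demo.py",
        "--prompt", prompt,
        "--variants", "ac_v2",
        "--num_steps", "1",
        "--cfg", "5.0",
        "--nu", "1.0",
        "--output_dir", f"outputs/sample_{i:02d}",
    ], check=True)
```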
Two variants are currently available: `ac_v1_iclr` and `ac_v2`. Checkpoints are automatically downloaded from Hugging Face.
- `ac` means the model was trained on the AudioCaps dataset.
- `v1` uses text embeddings from the MLP projection layer (the last layer) of the CLAP text branch. This is the same setup as the ICLR'25 publication.
- `v2` uses text embeddings from the second-to-last layer of the CLAP text branch. This conditioning setup is similar to the Stable Audio series (see the sketch below).
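For intuition, here is a minimal sketch of how the two conditioning signals could be extracted with Hugging Face `transformers`. The checkpoint name `laion/clap-htsat-unfused` and the exact layer indexing are assumptions for illustration; the repo's actual loading code may differ.

```python
# Hedged sketch of the v1 vs. v2 CLAP text conditioning, using Hugging Face
# transformers. Checkpoint and layer indexing are illustrative assumptions.
import torch
from transformers import AutoTokenizer, ClapTextModelWithProjection

ckpt = "laion/clap-htsat-unfused"  # assumption: an off-the-shelf CLAP checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = ClapTextModelWithProjection.from_pretrained(ckpt)

inputs = tokenizer(["a woman talks nearby as water pours"], return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# v1-style: embedding after the MLP projection head (last layer)
# shape: (batch, projection_dim)
v1_cond = out.text_embeds

# v2-style: hidden states from the second-to-last transformer layer,
# similar to Stable Audio-style conditioning; shape: (batch, seq_len, hidden_dim)
v2_cond = out.hidden_states[-2]
```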
See `demo.py` for more options. You can also try inference with `soundctm_generate.ipynb`.
To generate samples for the test dataset:
- Update the config file `configs/hyperparameter_configs/inference/dit.yaml` with the corresponding paths and environment.
- Update `gen_samples.sh` with the dataset folder and config file path.
- Generate the samples with the command below:
```bash
bash gen_samples.sh
```
Follow the instructions given in the AudioCaps repository for downloading the dataset.
Dataset locations need to be specified in the config files and in `train_script.sh`.
The `train.csv` from the AudioCaps repo has `audiocap_id`, `youtube_id`, `start_time`, and `caption` as columns. Convert this CSV to `data/train.csv` by naming the audio files with their YouTube IDs.
Example: convert this format

| audiocap_id | youtube_id | start_time | caption |
| --- | --- | --- | --- |
| 91139 | r1nicOVtvkQ | 130 | A woman talks nearby as water pours |

to this:

| file_name | caption |
| --- | --- |
| Yr1nicOVtvkQ | A woman talks nearby as water pours |
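A minimal pandas sketch of this conversion, assuming the input columns shown above. The `Y` prefix follows the example row; adjust it if your audio files are named differently, and the file paths are illustrative.

```python
# Hedged sketch: convert the AudioCaps train.csv (audiocap_id, youtube_id,
# start_time, caption) into the data/train.csv layout (file_name, caption).
# The "Y" prefix follows the example above; adjust to match your file names.
import pandas as pd

df = pd.read_csv("train.csv")  # the CSV from the AudioCaps repo
converted = pd.DataFrame({
    "file_name": "Y" + df["youtube_id"].astype(str),
    "caption": df["caption"],
})
converted.to_csv("data/train.csv", index=False)
```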
The training code also requires a Weights & Biases account to log the training outputs and demos. Create an account and log in with:
```bash
$ wandb login
```
Or you can pass an API key via the environment variable `WANDB_API_KEY`. (You can obtain the API key from https://wandb.ai/authorize after logging in to your account.)
```bash
$ export WANDB_API_KEY="12345x6789y..."
```
Or provide the key in `ctm_train.py` at line number 51:
```python
os.environ['WANDB_API_KEY'] = "12345x6789y..."
```
- Update the config file `configs/hyperparameter_configs/training/dit.yml` with the corresponding paths and environment.
- To select between SoundCTM DiT v1 and SoundCTM DiT v2, use the `--version` config parameter.
- The primary difference between the v1 and v2 versions is which CLAP text embeddings are used for training:
  - `v1` uses text embeddings from the MLP projection layer (the last layer) of the CLAP text branch.
  - `v2` uses text embeddings from the second-to-last layer of the CLAP text branch.
- Set `clap_text_branch_projection: True` in the config file to select v1 embeddings; set it to `False` for v2 embeddings. The default value is `False` (see the config sketch after this list).
- Update `train_script.sh` with the dataset folder and config file path.
- Start the training with the command below:
```bash
bash train_script.sh
```
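As referenced in the list above, here is a sketch of the v1/v2-related fields in `configs/hyperparameter_configs/training/dit.yml`. Only `version` and `clap_text_branch_projection` are documented in this README; the exact value format of `version` is an assumption.

```yaml
# Hedged sketch of the v1/v2-related keys in
# configs/hyperparameter_configs/training/dit.yml. Only these two keys are
# documented above; leave the rest of the real file unchanged.
version: "v1"                       # select SoundCTM DiT v1 or v2
clap_text_branch_projection: True   # True -> v1 embeddings; False (default) -> v2
```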
- Update `evaluate.sh` and provide the paths for the ground-truth samples, the generated samples, the CLAP model, and the eval Singularity image.
- Execute the `evaluate.sh` script to evaluate the generated samples:
```bash
bash evaluate.sh
```
```bibtex
@inproceedings{saito2025soundctm,
  title={Sound{CTM}: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation},
  author={Koichi Saito and Dongjun Kim and Takashi Shibuya and Chieh-Hsin Lai and Zhi Zhong and Yuhta Takida and Yuki Mitsufuji},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=KrK6zXbjfO}
}
```