[ICLR'25] SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation
Koichi Saito, Dongjun Kim, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong, Yuhta Takida, Yuki Mitsufuji
Sony AI, Stanford University, and Sony Group Corporation
ICLR 2025
- Paper: OpenReview
- Checkpoints: Hugging Face
Install Docker and go through docker/README.md to build the Docker container.
Or you can install with pip:
```bash
cd soundctm_dit_iclr
pip install -e .
```
Run inference with `demo.py`:
```bash
python demo.py --prompt "your prompt" --variants "ac_v2" --num_steps 1 --cfg 5.0 --nu 1.0 --output_dir "your output directory"
```
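If you want to generate several prompts in one go, here is a minimal sketch that invokes the `demo.py` CLI once per prompt. It uses only the flags documented above; the prompt list and output paths are illustrative assumptions.

```python
# Hedged sketch: batch generation via the demo.py CLI shown above.
# Only documented flags are used; prompts and output paths are examples.
import subprocess

prompts = ["a dog barking in the distance", "rain hitting a tin roof"]
for i, prompt in enumerate(prompts):
    subprocess.run([
        "python", "demo.py",
        "--prompt", prompt,
        "--variants", "ac_v2",
        "--num_steps", "1",
        "--cfg", "5.0",
        "--nu", "1.0",
        "--output_dir", f"outputs/sample_{i:02d}",
    ], check=True)
```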
Two variants are currently available: `ac_v1_iclr` and `ac_v2`. Checkpoints are automatically downloaded from Hugging Face.
- `ac` means the model was trained on the AudioCaps dataset.
- `v1` uses text embeddings from the MLP projection layer (the last layer) of the CLAP text branch. This is the same setup as the ICLR'25 publication.
- `v2` uses text embeddings from the second-to-last layer of the CLAP text branch. This conditioning setup is similar to the Stable Audio series (see the sketch below).
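For intuition, here is a minimal sketch of how the two conditioning signals could be extracted with Hugging Face `transformers`. The checkpoint name `laion/clap-htsat-unfused` and the exact layer indexing are assumptions for illustration; the repo's actual loading code may differ.

```python
# Hedged sketch of the v1 vs. v2 CLAP text conditioning, using Hugging Face
# transformers. Checkpoint and layer indexing are illustrative assumptions.
import torch
from transformers import AutoTokenizer, ClapTextModelWithProjection

ckpt = "laion/clap-htsat-unfused"  # assumption: an off-the-shelf CLAP checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = ClapTextModelWithProjection.from_pretrained(ckpt)

inputs = tokenizer(["a woman talks nearby as water pours"], return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# v1-style: embedding after the MLP projection head (last layer)
# shape: (batch, projection_dim)
v1_cond = out.text_embeds

# v2-style: hidden states from the second-to-last transformer layer,
# similar to Stable Audio-style conditioning; shape: (batch, seq_len, hidden_dim)
v2_cond = out.hidden_states[-2]
```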
See `demo.py` for more options. You can also try inference with `soundctm_generate.ipynb`.
To generate samples for the test dataset:
- Update the config file `configs/hyperparameter_configs/inference/dit.yaml` with the corresponding paths and environment.
- Update `gen_samples.sh` with the dataset folder and config file path.
- Generate the samples with the command below:
```bash
bash gen_samples.sh
```
Follow the instructions given in the AudioCaps repository for downloading the dataset.
Dataset locations need to be specified in the config files and in `train_script.sh`.
The `train.csv` from the AudioCaps repo has `audiocap_id`, `youtube_id`, `start_time`, and `caption` as columns. Convert this CSV to `data/train.csv` by naming the audio files with their YouTube IDs.
Example: convert this format

| audiocap_id | youtube_id | start_time | caption |
| --- | --- | --- | --- |
| 91139 | r1nicOVtvkQ | 130 | A woman talks nearby as water pours |

to this:

| file_name | caption |
| --- | --- |
| Yr1nicOVtvkQ | A woman talks nearby as water pours |
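A minimal pandas sketch of this conversion, assuming the input columns shown above. The `Y` prefix follows the example row; adjust it if your audio files are named differently, and the file paths are illustrative.

```python
# Hedged sketch: convert the AudioCaps train.csv (audiocap_id, youtube_id,
# start_time, caption) into the data/train.csv layout (file_name, caption).
# The "Y" prefix follows the example above; adjust to match your file names.
import pandas as pd

df = pd.read_csv("train.csv")  # the CSV from the AudioCaps repo
converted = pd.DataFrame({
    "file_name": "Y" + df["youtube_id"].astype(str),
    "caption": df["caption"],
})
converted.to_csv("data/train.csv", index=False)
```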
The training code also requires a Weights & Biases account to log the training outputs and demos. Create an account and log in with:
```bash
$ wandb login
```
Or you can pass an API key via the environment variable `WANDB_API_KEY`. (You can obtain the API key from https://wandb.ai/authorize after logging in to your account.)
```bash
$ export WANDB_API_KEY="12345x6789y..."
```
Or provide the key in `ctm_train.py` at line number 51:
```python
os.environ['WANDB_API_KEY'] = "12345x6789y..."
```
- Update the config file `configs/hyperparameter_configs/training/dit.yml` with the corresponding paths and environment.
- To select between SoundCTM DiT v1 and SoundCTM DiT v2, use the `--version` config parameter.
- The primary difference between the v1 and v2 versions is which CLAP text embeddings are used for training:
  - `v1` uses text embeddings from the MLP projection layer (the last layer) of the CLAP text branch.
  - `v2` uses text embeddings from the second-to-last layer of the CLAP text branch.
- Set `clap_text_branch_projection: True` in the config file to select v1 embeddings; set it to `False` for v2 embeddings. The default value is `False` (see the config sketch after this list).
- Update `train_script.sh` with the dataset folder and config file path.
- Start the training with the command below:
```bash
bash train_script.sh
```
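As referenced in the list above, here is a sketch of the v1/v2-related fields in `configs/hyperparameter_configs/training/dit.yml`. Only `version` and `clap_text_branch_projection` are documented in this README; the exact value format of `version` is an assumption.

```yaml
# Hedged sketch of the v1/v2-related keys in
# configs/hyperparameter_configs/training/dit.yml. Only these two keys are
# documented above; leave the rest of the real file unchanged.
version: "v1"                       # select SoundCTM DiT v1 or v2
clap_text_branch_projection: True   # True -> v1 embeddings; False (default) -> v2
```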
- Update `evaluate.sh` and provide the paths for the ground-truth samples, the generated samples, the CLAP model, and the eval Singularity image.
- Execute the `evaluate.sh` script to evaluate the generated samples:
```bash
bash evaluate.sh
```
```bibtex
@inproceedings{saito2025soundctm,
  title={Sound{CTM}: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation},
  author={Koichi Saito and Dongjun Kim and Takashi Shibuya and Chieh-Hsin Lai and Zhi Zhong and Yuhta Takida and Yuki Mitsufuji},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=KrK6zXbjfO}
}
```