SoundCTM-DiT

PyTorch implementation of SoundCTM-DiT.

Koichi Saito, Dongjun Kim, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong, Yuhta Takida, Yuki Mitsufuji

Sony AI, Stanford University, and Sony Group Corporation

ICLR 2025

Prerequisites

Install Docker and go through docker/README.md to build the Docker container.

Alternatively, install with pip:

cd soundctm_dit_iclr
pip install -e .
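After installation, a quick sanity check that PyTorch can see a GPU can save time before running the demo or training. This is only a generic check, not part of the repository:

# Sanity check: confirm PyTorch is importable and a CUDA device is visible.
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Device:          {torch.cuda.get_device_name(0)}")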

Demo

Command-line interface

With demo.py

python demo.py --prompt "your prompt" --variants "ac_v2" --num_steps 1 --cfg 5.0 --nu 1.0 --output_dir "your output directory"

Two variants are currently available: "ac_v1_iclr" and "ac_v2". Checkpoints are downloaded automatically from Hugging Face.

  • ac means the model was trained on the AudioCaps dataset.
  • v1 uses text embeddings from the MLP projection layer (last layer) of the CLAP text branch. This is the same setup as in the ICLR'25 publication.
  • v2 uses text embeddings from the second-to-last layer of the CLAP text branch. This conditioning setup is similar to the Stable Audio series.

See demo.py for more options.
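If you want to script the CLI, here is a minimal sketch that sweeps both variants using the flags documented above. The flag names come from the command shown earlier; their exact semantics are defined in demo.py, and the output directory below is just a placeholder:

# Sketch: call demo.py for both variants via the documented CLI flags.
import subprocess

for variant in ["ac_v1_iclr", "ac_v2"]:
    subprocess.run(
        [
            "python", "demo.py",
            "--prompt", "a dog barking in the distance",
            "--variants", variant,
            "--num_steps", "1",
            "--cfg", "5.0",
            "--nu", "1.0",
            "--output_dir", f"outputs/{variant}",  # hypothetical output location
        ],
        check=True,
    )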

Notebook

You can try inference with soundctm_generate.ipynb as well.

Inference with the AudioCaps test set

  • Update the config file configs/hyperparameter_configs/inference/dit.yaml with the corresponding paths and environment.
  • Update gen_samples.sh with the dataset folder and config file path.
  • Generate samples for the test dataset with the command below:
bash gen_samples.sh
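As a quick check after generation, you can compare the number of generated files with the number of captions in the test CSV. A minimal sketch; the output directory and test CSV paths below are placeholders, so substitute the paths you configured in dit.yaml and gen_samples.sh:

# Sketch: verify that one audio file was generated per test caption.
import csv
from pathlib import Path

output_dir = Path("outputs/audiocaps_test")   # hypothetical generated-sample folder
test_csv = Path("data/test.csv")              # hypothetical test-set CSV

with open(test_csv, newline="") as f:
    num_captions = sum(1 for _ in csv.DictReader(f))

num_generated = len(list(output_dir.glob("*.wav")))
print(f"{num_generated} generated files for {num_captions} test captions")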

Dataset

Follow the instructions in the AudioCaps repository to download the dataset. Dataset locations must be specified in the config files and in train_script.sh. The train.csv from the AudioCaps repo has audiocap_id, youtube_id, start_time, and caption as columns. Convert it to data/train.csv by naming audio files with the YouTube ID.

Example: convert rows of the form

audiocap_id  youtube_id   start_time  caption
91139        r1nicOVtvkQ  130         A woman talks nearby as water pours

into

file_name     caption
Yr1nicOVtvkQ  A woman talks nearby as water pours
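A minimal conversion sketch in the spirit of the example above. It assumes the AudioCaps train.csv columns audiocap_id, youtube_id, start_time, and caption, and that audio files are named with a Y prefix before the YouTube ID as in the example; the input path is a placeholder:

# Sketch: convert the AudioCaps train.csv into the file_name/caption format of data/train.csv.
import pandas as pd

src = pd.read_csv("audiocaps/train.csv")   # hypothetical path; columns: audiocap_id, youtube_id, start_time, caption
out = pd.DataFrame(
    {
        "file_name": "Y" + src["youtube_id"],   # audio files named "Y<youtube_id>", as in the example above
        "caption": src["caption"],
    }
)
out.to_csv("data/train.csv", index=False)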

WandB for logging

The training code also requires a Weights & Biases account to log the training outputs and demos. Create an account and log in with:

$ wandb login

Alternatively, pass the API key via the WANDB_API_KEY environment variable. (You can obtain the API key from https://wandb.ai/authorize after logging in to your account.)

$ export WANDB_API_KEY="12345x6789y..."

Or set the key directly in ctm_train.py (line 51):

os.environ['WANDB_API_KEY'] = "12345x6789y..."

Training

  • Update the config file configs/hyperparameter_configs/training/dit.yml with the corresponding paths and environment.
  • To select between SoundCTM DiT v1 and SoundCTM DiT v2, use the --version config parameter.
  • The primary difference between v1 and v2 is which CLAP text embeddings are used for training (see the sketch after this list):
    • v1 uses text embeddings from the MLP projection layer (last layer) of the CLAP text branch.
    • v2 uses text embeddings from the second-to-last layer of the CLAP text branch.
      • Set clap_text_branch_projection: True in the config file to select v1, or False for v2 embeddings. The default value is False.
  • Update train_script.sh with the dataset folder and config file path.
  • Start training with the command below:
bash train_script.sh
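To make the v1/v2 distinction concrete, here is an illustrative sketch of the two CLAP text-embedding choices using the Hugging Face transformers CLAP port. This is not the repository's training code, which may load CLAP differently and use another checkpoint; it only shows "projection-layer output" versus "second-to-last hidden layer":

# Illustration only: the two CLAP text-embedding choices behind the clap_text_branch_projection flag.
from transformers import AutoTokenizer, ClapTextModelWithProjection

tokenizer = AutoTokenizer.from_pretrained("laion/clap-htsat-unfused")
model = ClapTextModelWithProjection.from_pretrained("laion/clap-htsat-unfused")

inputs = tokenizer(["a woman talks nearby as water pours"], padding=True, return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True)

# v1 (clap_text_branch_projection: True): embedding after the MLP projection (last layer).
v1_embedding = outputs.text_embeds          # shape: (batch, projection_dim)

# v2 (clap_text_branch_projection: False): hidden states of the second-to-last layer.
v2_embedding = outputs.hidden_states[-2]    # shape: (batch, seq_len, hidden_dim)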

Evaluation of generated samples

  • Update evaluate.sh and provide the paths for the ground-truth samples, the generated samples, the CLAP model, and the evaluation Singularity image.
  • Execute the evaluate.sh script to evaluate the generated samples:
bash evaluate.sh
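evaluate.sh runs the full evaluation pipeline inside the Singularity image. If you just want a rough CLAP text-audio similarity spot check on a single generated file outside that pipeline, here is a hedged sketch using the transformers CLAP port; the checkpoint, file path, and metric choice are assumptions and not what the official evaluation script necessarily uses:

# Rough spot check only: CLAP text-audio cosine similarity for one generated sample.
import librosa
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

audio, _ = librosa.load("outputs/Yr1nicOVtvkQ.wav", sr=48000)   # hypothetical generated file
inputs = processor(
    text=["A woman talks nearby as water pours"],
    audios=[audio],
    sampling_rate=48000,
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    outputs = model(**inputs)

similarity = torch.nn.functional.cosine_similarity(outputs.text_embeds, outputs.audio_embeds)
print(f"CLAP text-audio cosine similarity: {similarity.item():.3f}")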

Citation

@inproceedings{saito2025soundctm,
  title={Sound{CTM}: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation},
  author={Koichi Saito and Dongjun Kim and Takashi Shibuya and Chieh-Hsin Lai and Zhi Zhong and Yuhta Takida and Yuki Mitsufuji},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=KrK6zXbjfO}
}
