
Tiny Audio Diffusion


This is a repository for generating short audio samples and training waveform diffusion models on a consumer-grade GPU with less than 2GB VRAM.

Motivation

The purpose of this project is to provide access to stereo, high-resolution (44.1kHz), conditional and unconditional audio waveform (1D U-Net) diffusion code for those interested in exploration but who have limited resources. There are many methods for audio generation on low-end hardware, but fewer specifically for waveform-based diffusion.

The repository is built by heavily adapting code from Archinet's audio-diffusion-pytorch library. A huge thank you to Flavio Schneider for his incredible open-source work in this field!

Background

Direct waveform diffusion is inherently computationally intensive. For example, at the industry-standard 44.1kHz sampling rate, just 1 second of audio requires 44,100 samples - and twice that for a stereo file. However, waveform diffusion has a significant advantage over the many methods that reduce audio to spectrograms or downsample it: the network retains and learns from phase information. Phase is challenging to represent on its own in visual representations such as spectrograms, as it appears similar to random noise. Because of this, many generative methods discard phase information and then estimate and regenerate it afterward. However, phase plays a key role in defining the timbral qualities of sounds and should not be dispensed with so easily.

Waveform diffusion retains this important feature because it does not perform any transforms on the audio before feeding it into the network. This mirrors how humans perceive sound, with amplitude and phase information bundled together in a single signal. As mentioned previously, this comes at the expense of computational requirements, and training is often reserved for clusters of fast GPUs with plenty of memory. Because of this, it is hard to begin experimenting with waveform diffusion on limited resources.

This repository seeks to offer some base code to those looking to experiment with and learn more about waveform diffusion on their own computer without having to purchase cloud resources or upgrade hardware. This goes for not only inference, but training your own models as well!

To make this feasible, however, there must be a tradeoff between quality, speed, and sample length. Because of this, I have focused on training base models for one-shot drum samples, as they are inherently short.

The current configuration is set up to train on ~0.75-second stereo samples at 44.1kHz, allowing the generation of high-quality one-shot audio samples. The network configuration can be adjusted to improve the resolution, sample rate, training and inference speed, sample length, etc., but, of course, more hardware resources will be required.
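As a quick sanity check of what this means in tensor terms, here is a small sketch; the power-of-two length is an illustrative assumption, not necessarily the exact value set in the exp/*.yaml configs.

# Rough size of one training example at ~0.75 seconds of stereo audio, 44.1kHz.
# The power-of-two length is an illustrative assumption; the real value lives
# in the exp/*.yaml configs.
sample_rate = 44100
channels = 2
length = 2 ** 15                  # 32768 samples, roughly 0.74 s

print(length / sample_rate)       # ~0.743 seconds per example
print(channels * length)          # 65,536 waveform values per training example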

Other diffusion approaches, such as diffusion in the latent space (Stable Diffusion's secret sauce), offer different tradeoffs between quality, memory requirements, and speed compared to this repo's raw waveform diffusion. To stay up to date with the latest research in generative audio, I recommend this repo: https://github.com/archinetai/audio-ai-timeline

Also recommended is Harmonai's community project, Dance Diffusion, which implements functionality similar to this repo's on a larger scale, with several pre-trained models. A Colab notebook is available.

April 2024 update:

Some additional useful generative audio tools/repos:


Setup

Follow these steps to set up an environment for both generating audio samples and training models.

NOTE: To use this repo with a GPU, you must have a CUDA-capable GPU and have the CUDA toolkit installed specific to your system (ex. Linux, x86_64, WSL-Ubuntu). More information can be found here.

1. Create a Virtual Environment:

Ensure that Anaconda (or Miniconda) is installed and activated. From the command line, cd into the setup/ folder and run the following lines:

conda env create -f environment.yml
conda activate tiny-audio-diffusion

This will create and activate a conda environment from the setup/environment.yml file and install the dependencies in setup/requirements.txt.

2. Install Python Kernel For Jupyter Notebook

Run the following command to register a Jupyter kernel for the current environment; this kernel is needed to run the inference notebook.

python -m ipykernel install --user --name tiny-audio-diffusion --display-name "tiny-audio-diffusion (Python 3.10)"

3. Define Environment Variables

Rename .env.tmp to .env and replace the entries with your own variables (example values are random).

DIR_LOGS=/logs
DIR_DATA=/data

# Required if using Weights & Biases (W&B) logger
WANDB_PROJECT=tiny_drum_diffusion # Custom W&B name for current project
WANDB_ENTITY=johnsmith # W&B username
WANDB_API_KEY=a21dzbqlybbzccqla4txa21dzbqlybbzccqla4tx # W&B API key

NOTE: Sign up for a Weights & Biases account to log audio samples, spectrograms, and other metrics while training (it's free!).

W&B logging example for this repo here.
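For reference, here is a minimal sketch of how these variables can be read in Python with python-dotenv; it assumes the repo's scripts load them this way.

import os
from dotenv import load_dotenv

load_dotenv()                         # reads the .env file in the working directory
print(os.getenv("WANDB_PROJECT"))     # -> tiny_drum_diffusion
print(os.getenv("DIR_DATA"))          # -> /data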


Pre-trained Models

Pre-trained models can be found on Hugging Face (each model contains a .ckpt and .yaml file):

Model                         Link
Kicks                         crlandsc/tiny-audio-diffusion-kicks
Snares                        crlandsc/tiny-audio-diffusion-snares
Hi-hats                       crlandsc/tiny-audio-diffusion-hihats
Percussion (all drum types)   crlandsc/tiny-audio-diffusion-percussion

See W&B model training metrics here.

Pre-trained models can be downloaded to generate samples via the inference notebook. They can also be used as a base model to fine-tune on custom data. It is recommended to create subfolders within the saved_models folder to store each model's .ckpt and .yaml files.
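For example, here is a hedged sketch of fetching a checkpoint and config with the huggingface_hub client; the checkpoint filename below is a placeholder, so check the actual file names listed on each model page.

from huggingface_hub import hf_hub_download

repo_id = "crlandsc/tiny-audio-diffusion-kicks"

# Filenames are placeholders -- use the file names shown on the model page.
ckpt_path = hf_hub_download(repo_id=repo_id, filename="model.ckpt", local_dir="saved_models/kicks")
yaml_path = hf_hub_download(repo_id=repo_id, filename="config.yaml", local_dir="saved_models/kicks")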


Inference

Hugging Face Spaces

Generate samples without code on 🤗 Hugging Face Spaces!

Jupyter Notebook

Audio Sample Generation

Current Capabilities:

  • Unconditional Generation
  • Conditional "Style-transfer" Generation

Open Inference.ipynb in Jupyter Notebook and follow the instructions to generate new audio samples. Ensure that the "tiny-audio-diffusion (Python 3.10)" kernel is active in Jupyter and that you have downloaded the pre-trained model of interest from Hugging Face.
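As an illustration of the expected output format, here is a minimal sketch that writes a stereo tensor to disk with torchaudio; the random noise below simply stands in for a sample generated by the notebook.

import torch
import torchaudio

sample_rate = 44100
# (channels, samples) - random noise standing in for a generated ~0.75 s sample
audio = torch.randn(2, int(0.75 * sample_rate))

torchaudio.save("generated_sample.wav", audio, sample_rate)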


Train

The model architecture has been constructed with the PyTorch Lightning and Hydra frameworks. All model configurations are contained within .yaml files and should be edited there rather than hardcoded.

exp/drum_diffusion.yaml contains the default model configuration. Additional custom model configurations can be added to the exp folder.
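For context, here is a generic illustration (not the repo's actual train.py) of how a Hydra entry point composes the root config.yaml with an exp/*.yaml experiment file; this is what lets command-line overrides such as exp=drum_diffusion trainer.gpus=1 map onto keys in those files.

import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path=".", config_name="config")
def main(cfg: DictConfig) -> None:
    # Print the fully resolved configuration after command-line overrides.
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()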

Custom models can be trained or fine-tuned on custom datasets. Datasets should consist of a folder of .wav audio files with a 44.1kHz sampling rate.
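If your source files are not already 44.1kHz stereo, a small helper like the following (not part of the repo; file paths are illustrative) can convert them with torchaudio.

import torchaudio
import torchaudio.functional as F

def prepare_wav(in_path: str, out_path: str, target_sr: int = 44100) -> None:
    audio, sr = torchaudio.load(in_path)           # (channels, samples)
    if sr != target_sr:
        audio = F.resample(audio, orig_freq=sr, new_freq=target_sr)
    if audio.shape[0] == 1:                        # duplicate mono to stereo
        audio = audio.repeat(2, 1)
    torchaudio.save(out_path, audio, target_sr)

prepare_wav("kick_raw.wav", "data/wav_dataset/kick.wav")   # illustrative paths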

To train or fine-tune models, run one of the following commands in the terminal from the repo's root folder, replacing <path/to/your/train/data> with the path to your custom training set.

Train model from scratch (on CPU): (not recommended)

python train.py exp=drum_diffusion datamodule.dataset.path=<path/to/your/train/data>

Train model from scratch (on GPU):

python train.py exp=drum_diffusion trainer.gpus=1 datamodule.dataset.path=<path/to/your/train/data>

NOTE: To use this repo with a GPU, you must have a CUDA-capable GPU and have the CUDA toolkit installed specific to your system (ex. Linux, x86_64, WSL-Ubuntu). More information can be found here.

Resume run from a checkpoint (with GPU):

python train.py exp=drum_diffusion trainer.gpus=1 +ckpt=</path/to/checkpoint.ckpt> datamodule.dataset.path=<path/to/your/train/data>

Dataset

The data used to train the checkpoints listed above can be found on 🤗 Hugging Face.

Note: This is a small and unbalanced dataset consisting of free samples that I had from my music production. These samples are not covered under the MIT license of this repository and cannot be used to train any commercial models, but can be used in personal and research contexts.

Note: For appropriately diverse models, larger datasets should be used to avoid memorization of training data.


Repository Structure

The structure of this repository is as follows:

├── main
│   ├── diffusion_module.py     - contains pl model, data loading, and logging functionalities for training
│   └── utils.py                - contains utility functions for training
├── exp
│   └── *.yaml                  - Hydra configuration files
├── setup
│   ├── environment.yml         - file to set up conda environment
│   └── requirements.txt        - contains repo dependencies
├── images                      - directory containing images for README.md
│   └── *.png
├── samples                     - directory containing sample outputs from tiny-audio-diffusion models
│   └── *.wav
├── .env.tmp                    - temporary environment variables (rename to .env)
├── .gitignore
├── README.md
├── Inference.ipynb             - Jupyter notebook for running inference to generate new samples
├── config.yaml                 - Hydra base configs
├── train.py                    - script for training
├── data                        - directory to host custom training data
│   └── wav_dataset
│       └── (*.wav)
└── saved_models                - directory to host model checkpoints and hyper-parameters for inference
    └── (kicks/snare/etc.)
        ├── (*.ckpt)            - pl model checkpoint file
        └── (config.yaml)       - pl model hydra hyperparameters (required for inference)
