Configuration Schemas

The configuration system

DiffSinger uses a cascading configuration system based on YAML files. All configuration files ultimately inherit from and override configs/base.yaml, and each file can directly override another file by setting its base_config attribute. The overriding rules are:

  • Configuration keys with the same path and the same name are replaced by the overriding file; keys at other paths or with other names are merged.
  • All configurations in the inheritance chain are squashed (via the rule above) into the final configuration.
  • The trainer saves the final configuration in the experiment directory; this saved copy is detached from the chain and independent of other configuration files.
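For example, a minimal sketch of how overriding works in a user configuration (the file name and the chosen values are illustrative, not part of the repository):

```yaml
# my_config.yaml — an illustrative user configuration
base_config: configs/acoustic.yaml  # inherits configs/acoustic.yaml, which itself inherits configs/base.yaml

# Keys with the same path and name replace the inherited values:
max_batch_size: 24                  # overrides the value from the inheritance chain
augmentation_args:
  random_pitch_shifting:
    enabled: false                  # replaces only this leaf; sibling keys are merged as usual

# Keys that do not appear here keep the values from the inheritance chain.
# The trainer saves the fully squashed result into the experiment directory.
```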

Configurable parameters

The following are the meanings and usages of all editable keys in a configuration file.

Each configuration key (including nested keys) is described with a brief explanation and several attributes, listed as follows:

  • visibility — Represents what kind(s) of models and tasks this configuration belongs to.
  • scope — The scope of effect of the configuration, indicating what it can influence within the whole pipeline. Possible values are:
      • nn — This configuration is related to how the neural networks are formed and initialized. Modifying it will result in failure when loading or resuming from checkpoints.
      • preprocessing — This configuration controls how raw data pieces or inference inputs are converted to inputs of the neural networks. Binarizers should be re-run if this configuration is modified.
      • training — This configuration describes the training procedures. Most training configurations can affect training performance, memory consumption, device utilization and loss calculation. Modifying training-only configurations will not cause severe inconsistency or errors in most situations.
      • inference — This configuration describes the calculation logic through the model graph. Changing it can lead to inconsistent or wrong outputs of inference or validation.
      • others — Other configurations not discussed above. These have different effects according to their descriptions.
  • customizability — The level of customizability of the configuration. Possible values are:
      • required — This configuration must be set or modified according to the actual situation or condition; otherwise, errors can be raised.
      • recommended — It is recommended to adjust this configuration according to the dataset, requirements, environment and hardware. Most functionality-related and feature-related configurations are at this level, and all configurations at this level are widely tested with different values. However, leaving it unchanged will not cause problems.
      • normal — There is no need to modify it, as the default value is carefully tuned and widely validated. However, one can still use another value if there are special requirements or situations.
      • not recommended — No values other than the default have been tested. Modifying it will not cause errors, but may have unpredictable or significant impacts on the pipelines.
      • reserved — This configuration must not be modified. It appears in the configuration file only for future scalability, and currently changing it will result in errors.
  • type — Value type of the configuration. Follows the syntax of Python type hints.
  • constraints — Value constraints of the configuration.
  • default — Default value of the configuration. Uses YAML value syntax.

accumulate_grad_batches

Indicates how many training steps' gradients are accumulated before each optimizer.step() call. 1 means no gradient accumulation.

visibility: all · scope: training · customizability: recommended · type: int · default: 1

audio_num_mel_bins

Number of mel channels for the mel-spectrogram.

visibility: acoustic · scope: nn, preprocessing, inference · customizability: reserved · type: int · default: 128

audio_sample_rate

Sampling rate of waveforms.

visibility: acoustic, variance · scope: preprocessing · customizability: reserved · type: int · default: 44100

augmentation_args

Arguments for data augmentation.

type: dict

augmentation_args.fixed_pitch_shifting

Arguments for fixed pitch shifting augmentation.

type: dict

augmentation_args.fixed_pitch_shifting.enabled

Whether to apply fixed pitch shifting augmentation.

visibility: acoustic · scope: preprocessing · customizability: recommended · type: bool · default: false
constraints: Must be false if augmentation_args.random_pitch_shifting.enabled is set to true.

augmentation_args.fixed_pitch_shifting.scale

Scale ratio of each target in fixed pitch shifting augmentation.

visibility: acoustic · scope: preprocessing · customizability: recommended · type: tuple · default: 0.5

augmentation_args.fixed_pitch_shifting.targets

Targets (in semitones) of fixed pitch shifting augmentation.

visibility: acoustic · scope: preprocessing · customizability: not recommended · type: tuple · default: [-5.0, 5.0]

augmentation_args.random_pitch_shifting

Arguments for random pitch shifting augmentation.

type: dict

augmentation_args.random_pitch_shifting.enabled

Whether to apply random pitch shifting augmentation.

visibility: acoustic · scope: preprocessing · customizability: recommended · type: bool · default: true
constraints: Must be false if augmentation_args.fixed_pitch_shifting.enabled is set to true.

augmentation_args.random_pitch_shifting.range

Range of random pitch shifting (in semitones).

visibility: acoustic · scope: preprocessing · customizability: not recommended · type: tuple · default: [-5.0, 5.0]

augmentation_args.random_pitch_shifting.scale

Scale ratio of the random pitch shifting augmentation.

visibility: acoustic · scope: preprocessing · customizability: recommended · type: float · default: 0.75

augmentation_args.random_time_stretching

Arguments for random time stretching augmentation.

type: dict

augmentation_args.random_time_stretching.enabled

Whether to apply random time stretching augmentation.

visibility: acoustic · scope: preprocessing · customizability: recommended · type: bool · default: true

augmentation_args.random_time_stretching.range

Range of random time stretching factors.

visibility: acoustic · scope: preprocessing · customizability: not recommended · type: tuple · default: [0.5, 2]

augmentation_args.random_time_stretching.scale

Scale ratio of random time stretching augmentation.

visibility: acoustic · scope: preprocessing · customizability: recommended · type: float · default: 0.75
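For reference, here is a hedged sketch of an augmentation_args block that enables random pitch shifting and random time stretching; the values shown are simply the defaults documented above:

```yaml
augmentation_args:
  random_pitch_shifting:
    enabled: true         # mutually exclusive with fixed_pitch_shifting.enabled
    range: [-5.0, 5.0]    # in semitones
    scale: 0.75
  fixed_pitch_shifting:
    enabled: false        # must remain false while random pitch shifting is enabled
  random_time_stretching:
    enabled: true
    range: [0.5, 2]       # stretching factors
    scale: 0.75
# Note: use_key_shift_embed and use_speed_embed (documented below) must be true
# when the corresponding augmentations are enabled.
```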

backbone_type

Backbone type of the main decoder/predictor module.

visibility: acoustic, variance · scope: nn · customizability: reserved · type: str · default: wavenet

base_config

Path(s) of other config files that the current config is based on and will override.

scope: others · type: Union[str, list]

binarization_args

Arguments for binarizers.

type: dict

binarization_args.num_workers

Number of worker subprocesses when running binarizers. More workers can speed up the preprocessing but will consume more memory. 0 means the main process does everything.

visibility: all · scope: preprocessing · customizability: recommended · type: int · default: 1

binarization_args.prefer_ds

Whether to prefer loading attributes and parameters from DS files.

visibility: variance · scope: preprocessing · customizability: recommended · type: bool · default: false

binarization_args.shuffle

Whether the binarized dataset will be shuffled.

visibility: all · scope: preprocessing · customizability: normal · type: bool · default: true

binarizer_cls

Binarizer class name.

visibility: all · scope: preprocessing · customizability: reserved · type: str

binary_data_dir

Path to the binarized dataset.

visibility: all · scope: preprocessing, training · customizability: required · type: str

breathiness_db_max

Maximum breathiness value in dB used for normalization to [-1, 1].

visibility: variance · scope: inference · customizability: recommended · type: float · default: -20.0

breathiness_db_min

Minimum breathiness value in dB used for normalization to [-1, 1].

visibility: acoustic, variance · scope: inference · customizability: recommended · type: float · default: -96.0

breathiness_smooth_width

Length of sinusoidal smoothing convolution kernel (in seconds) on extracted breathiness curve.

visibility: acoustic, variance · scope: preprocessing · customizability: normal · type: float · default: 0.12

clip_grad_norm

The value at which to clip gradients. Equivalent to gradient_clip_val in lightning.pytorch.Trainer.

visibility: all · scope: training · customizability: not recommended · type: float · default: 1

dataloader_prefetch_factor

Number of batches loaded in advance by each torch.utils.data.DataLoader worker.

visibility: all · scope: training · customizability: normal · type: bool · default: true

dataset_size_key

The key that indexes the binarized metadata to be used as the sizes when batching by size.

visibility: all · scope: training · customizability: not recommended · type: str · default: lengths

dictionary

Path to the word-phoneme mapping dictionary file. Training data must fully cover phonemes in the dictionary.

visibility: acoustic, variance · scope: preprocessing · customizability: normal · type: str

diff_accelerator

DDPM sampling acceleration method. The currently available methods are listed in the constraints below.

visibility: acoustic, variance · scope: inference · customizability: normal · type: str · default: dpm-solver
constraints: Choose from 'ddim', 'pndm', 'dpm-solver', 'unipc'.

diff_speedup

DDPM sampling speed-up ratio. 1 means no speeding up.

visibility: acoustic, variance · scope: inference · customizability: normal · type: int · default: 10
constraints: Must be a factor of K_step.

diffusion_type

The type of ODE-based generative model algorithm. The currently available models are listed in the constraints below.

visibility: acoustic, variance · scope: nn · customizability: normal · type: str · default: reflow
constraints: Choose from 'ddpm', 'reflow'.

dilation_cycle_length

Length k of the cycle $2^0, 2^1, \dots, 2^k$ of convolution dilation factors through WaveNet residual blocks.

visibility: acoustic · scope: nn · customizability: not recommended · type: int · default: 4

dropout

Dropout rate in some FastSpeech2 modules.

visibility: acoustic, variance · scope: nn · customizability: not recommended · type: float · default: 0.1

ds_workers

Number of workers of torch.utils.data.DataLoader.

visibility: all · scope: training · customizability: normal · type: int · default: 4

dur_prediction_args

Arguments for phoneme duration prediction.

type: dict

dur_prediction_args.arch

Architecture of duration predictor.

visibility: variance · scope: nn · customizability: reserved · type: str · default: fs2
constraints: Choose from 'fs2'.

dur_prediction_args.dropout

Dropout rate in duration predictor of FastSpeech2.

visibility: variance · scope: nn · customizability: not recommended · type: float · default: 0.1

dur_prediction_args.hidden_size

Dimensions of hidden layers in duration predictor of FastSpeech2.

visibility: variance · scope: nn · customizability: normal · type: int · default: 512

dur_prediction_args.kernel_size

Kernel size of convolution layers of duration predictor of FastSpeech2.

visibility: variance · scope: nn · customizability: normal · type: int · default: 3

dur_prediction_args.lambda_pdur_loss

Coefficient of single phone duration loss when calculating joint duration loss.

visibility: variance · scope: training · customizability: normal · type: float · default: 0.3

dur_prediction_args.lambda_sdur_loss

Coefficient of sentence duration loss when calculating joint duration loss.

visibility: variance · scope: training · customizability: normal · type: float · default: 3.0

dur_prediction_args.lambda_wdur_loss

Coefficient of word duration loss when calculating joint duration loss.

visibility: variance · scope: training · customizability: normal · type: float · default: 1.0

dur_prediction_args.log_offset

Offset for log domain duration loss calculation, where the following transformation is applied: $$ D' = \ln{(D+d)} $$ with the offset value $d$.

visibility: variance · scope: training · customizability: not recommended · type: float · default: 1.0

dur_prediction_args.loss_type

Underlying loss type of duration loss.

visibility: variance · scope: training · customizability: normal · type: str · default: mse
constraints: Choose from 'mse', 'huber'.

dur_prediction_args.num_layers

Number of duration predictor layers.

visibility: variance · scope: nn · customizability: normal · type: int · default: 5

enc_ffn_kernel_size

Convolution kernel size of TransformerFFNLayer in the FastSpeech2 encoder.

visibility: acoustic, variance · scope: nn · customizability: not recommended · type: int · default: 9

enc_layers

Number of FastSpeech2 encoder layers.

visibility: acoustic, variance · scope: nn · customizability: normal · type: int · default: 4

energy_db_max

Maximum energy value in dB used for normalization to [-1, 1].

visibility: variance · scope: inference · customizability: recommended · type: float · default: -12.0

energy_db_min

Minimum energy value in dB used for normalization to [-1, 1].

visibility: variance · scope: inference · customizability: recommended · type: float · default: -96.0

energy_smooth_width

Length of sinusoidal smoothing convolution kernel (in seconds) on extracted energy curve.

visibility: acoustic, variance · scope: preprocessing · customizability: normal · type: float · default: 0.12

f0_max

Maximum fundamental frequency (F0) in Hz for pitch extraction.

visibility: acoustic, variance · scope: preprocessing · customizability: normal · type: int · default: 1100

f0_min

Minimum fundamental frequency (F0) in Hz for pitch extraction.

visibility: acoustic, variance · scope: preprocessing · customizability: normal · type: int · default: 65

ffn_act

Activation function of TransformerFFNLayer in FastSpeech2 encoder:

  • torch.nn.ReLU if 'relu'
  • torch.nn.GELU if 'gelu'
  • torch.nn.SiLU if 'swish'

visibility: acoustic, variance · scope: nn · customizability: not recommended · type: str · default: gelu
constraints: Choose from 'relu', 'gelu', 'swish'.

fft_size

Size of the Fast Fourier Transform (in number of sample points) used for mel extraction.

visibility: acoustic, variance · scope: preprocessing · customizability: reserved · type: int · default: 2048

finetune_enabled

Whether to finetune from a pretrained model.

visibility: all · scope: training · customizability: normal · type: bool · default: false

finetune_ckpt_path

Path to the pretrained model for finetuning.

visibility: all · scope: training · customizability: normal · type: str · default: null

finetune_ignored_params

Prefixes of parameter key names in the state dict of the pretrained model that need to be dropped before finetuning.

visibility: all · scope: training · customizability: normal · type: list

finetune_strict_shapes

Whether to raise an error if the tensor shape of any parameter mismatches between the pretrained model and the target model. If set to false, parameters with mismatching shapes will be skipped.

visibility: all · scope: training · customizability: normal · type: bool · default: true
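As an illustration, a hedged sketch of how the finetuning keys fit together; the checkpoint path and parameter prefixes below are hypothetical:

```yaml
finetune_enabled: true
finetune_ckpt_path: checkpoints/pretrained_acoustic/model_ckpt_steps_320000.ckpt  # hypothetical path
finetune_ignored_params:               # state-dict key prefixes dropped before loading
  - model.fs2.encoder.embed_tokens     # hypothetical prefix (e.g. a different phoneme set)
  - model.fs2.spk_embed                # hypothetical prefix (e.g. a different speaker count)
finetune_strict_shapes: false          # skip parameters with mismatching shapes instead of raising an error
```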

fmax

Maximum frequency of mel extraction.

visibility: acoustic · scope: preprocessing · customizability: reserved · type: int · default: 16000

fmin

Minimum frequency of mel extraction.

visibility: acoustic · scope: preprocessing · customizability: reserved · type: int · default: 40

freezing_enabled

Whether to enable parameter freezing during training.

visibility: all · scope: training · customizability: normal · type: bool · default: false

frozen_params

Parameter name prefixes to freeze during training.

visibility: all · scope: training · customizability: normal · type: list · default: []

glide_embed_scale

The scale factor to be multiplied on the glide embedding values for melody encoder.

visibility: variance · scope: nn · customizability: not recommended · type: float · default: 11.313708498984760

glide_types

Type names of glide notes.

visibility: variance · scope: preprocessing · customizability: normal · type: list · default: [up, down]

hidden_size

Dimension of hidden layers of FastSpeech2, token and parameter embeddings, and diffusion condition.

visibility: acoustic, variance · scope: nn · customizability: normal · type: int · default: 256

hop_size

Hop size or step length (in number of waveform samples) of mel and feature extraction.

visibility: acoustic, variance · scope: preprocessing · customizability: reserved · type: int · default: 512

lambda_aux_mel_loss

Coefficient of aux mel loss when calculating total loss of acoustic model with shallow diffusion.

visibility: acoustic · scope: training · customizability: normal · type: float · default: 0.2

lambda_dur_loss

Coefficient of duration loss when calculating total loss of variance model.

visibility: variance · scope: training · customizability: normal · type: float · default: 1.0

lambda_pitch_loss

Coefficient of pitch loss when calculating total loss of variance model.

visibility: variance · scope: training · customizability: normal · type: float · default: 1.0

lambda_var_loss

Coefficient of variance loss (all variance parameters other than pitch, like energy, breathiness, etc.) when calculating total loss of variance model.

visibility: variance · scope: training · customizability: normal · type: float · default: 1.0

K_step

Maximum number of DDPM steps used by shallow diffusion.

visibility: acoustic · scope: training · customizability: recommended · type: int · default: 400

K_step_infer

Number of DDPM steps used during shallow diffusion inference. Normally set to the same value as K_step.

visibility: acoustic · scope: inference · customizability: recommended · type: int · default: 400
constraints: Should be no larger than K_step.

log_interval

Controls how often to log within training steps. Equivalent to log_every_n_steps in lightning.pytorch.Trainer.

visibility: all · scope: training · customizability: normal · type: int · default: 100

lr_scheduler_args

Arguments of learning rate scheduler. Keys will be used as keyword arguments of the __init__() method of lr_scheduler_args.scheduler_cls.

type: dict

lr_scheduler_args.scheduler_cls

Learning rate scheduler class name.

visibility: all · scope: training · customizability: not recommended · type: str · default: torch.optim.lr_scheduler.StepLR
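For example, a hedged sketch of passing keyword arguments through lr_scheduler_args to the default torch.optim.lr_scheduler.StepLR; the step_size and gamma values are illustrative:

```yaml
lr_scheduler_args:
  scheduler_cls: torch.optim.lr_scheduler.StepLR
  step_size: 50000   # passed to StepLR.__init__() as a keyword argument
  gamma: 0.5         # passed to StepLR.__init__() as a keyword argument
```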

main_loss_log_norm

Whether to use log-normalized weight for the main loss. This is similar to the method in the Stable Diffusion 3 paper Scaling Rectified Flow Transformers for High-Resolution Image Synthesis.

visibility: acoustic, variance · scope: training · customizability: normal · type: bool

main_loss_type

Loss type of the main decoder/predictor.

visibility: acoustic, variance · scope: training · customizability: not recommended · type: str · default: l2
constraints: Choose from 'l1', 'l2'.

max_batch_frames

Maximum number of data frames in each training batch. Used to dynamically control the batch size.

visibility: acoustic, variance · scope: training · customizability: recommended · type: int · default: 80000

max_batch_size

The maximum training batch size.

visibility: all · scope: training · customizability: recommended · type: int · default: 48

max_beta

Max beta of the DDPM noise schedule.

visibility: acoustic, variance · scope: nn, inference · customizability: normal · type: float · default: 0.02

max_updates

Stop training after this number of steps. Equivalent to max_steps in lightning.pytorch.Trainer.

visibility: all · scope: training · customizability: recommended · type: int · default: 320000

max_val_batch_frames

Maximum number of data frames in each validation batch.

visibility: acoustic, variance · scope: training · customizability: normal · type: int · default: 60000

max_val_batch_size

The maximum validation batch size.

visibility: all · scope: training · customizability: normal · type: int · default: 1

mel_base

The logarithmic base of mel spectrogram calculation.

visibility: acoustic · scope: preprocessing · customizability: not recommended · type: str · default: e

mel_vmax

Maximum mel spectrogram heatmap value for TensorBoard plotting.

visibility: acoustic · scope: training · customizability: not recommended · type: float · default: 1.5

mel_vmin

Minimum mel spectrogram heatmap value for TensorBoard plotting.

visibility: acoustic · scope: training · customizability: not recommended · type: float · default: -6.0

melody_encoder_args

Arguments for the melody encoder. Available sub-keys: hidden_size, enc_layers, enc_ffn_kernel_size, ffn_act, dropout, num_heads, use_pos_embed, rel_pos. If any of these parameters is not set under this configuration key, its value is inherited from the linguistic encoder.

type: dict

midi_smooth_width

Length of sinusoidal smoothing convolution kernel (in seconds) on the step function representing MIDI sequence for base pitch calculation.

visibility: variance · scope: preprocessing · customizability: normal · type: float · default: 0.06

nccl_p2p

Whether to enable P2P when using NCCL as the backend. Set it to false if the training process gets stuck at the beginning.

visibility: all · scope: training · customizability: normal · type: bool · default: true

num_ckpt_keep

Number of newest checkpoints kept during training.

visibility: all · scope: training · customizability: normal · type: int · default: 5

num_heads

The number of attention heads of torch.nn.MultiheadAttention in FastSpeech2 encoder.

visibility: acoustic, variance · scope: nn · customizability: not recommended · type: int · default: 2

num_sanity_val_steps

Number of sanity validation steps at the beginning.

visibility: all · scope: training · customizability: reserved · type: int · default: 1

num_spk

Maximum number of speakers in multi-speaker models.

visibility: acoustic, variance · scope: nn · customizability: required · type: int · default: 1

num_valid_plots

Number of validation plots in each validation. Plots will be chosen from the start of the validation set.

visibility: acoustic, variance · scope: training · customizability: recommended · type: int · default: 10

optimizer_args

Arguments of optimizer. Keys will be used as keyword arguments of the __init__() method of optimizer_args.optimizer_cls.

type: dict

optimizer_args.optimizer_cls

Optimizer class name.

visibility: all · scope: training · customizability: reserved · type: str · default: torch.optim.AdamW
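Similarly, a hedged sketch of optimizer_args feeding keyword arguments into the default torch.optim.AdamW; the lr and weight_decay values are illustrative:

```yaml
optimizer_args:
  optimizer_cls: torch.optim.AdamW
  lr: 0.0004           # passed to AdamW.__init__() as a keyword argument
  weight_decay: 0.0    # passed to AdamW.__init__() as a keyword argument
```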

pe

Pitch extractor type.

visibility: all · scope: preprocessing · customizability: normal · type: str · default: parselmouth
constraints: Choose from 'parselmouth', 'rmvpe', 'harvest'.

pe_ckpt

Checkpoint or model path of NN-based pitch extractor.

visibility: all · scope: preprocessing · customizability: normal · type: str

permanent_ckpt_interval

The interval (in number of training steps) of permanent checkpoints. Permanent checkpoints will not be removed even if they are not the newest ones.

visibility: all · scope: training · type: int · default: 40000

permanent_ckpt_start

Checkpoints will be marked as permanent every permanent_ckpt_interval training steps after this number of training steps.

visibility: all · scope: training · type: int · default: 120000

pitch_prediction_args

Arguments for pitch prediction.

type: dict

pitch_prediction_args.dilation_cycle_length

Equivalent to dilation_cycle_length but only for the pitch predictor model.

visibility: variance · default: 5

pitch_prediction_args.pitd_clip_max

Maximum clipping value (in semitones) of pitch delta between actual pitch and base pitch.

visibility: variance · scope: inference · type: float · default: 12.0

pitch_prediction_args.pitd_clip_min

Minimum clipping value (in semitones) of pitch delta between actual pitch and base pitch.

visibility: variance · scope: inference · type: float · default: -12.0

pitch_prediction_args.pitd_norm_max

Maximum pitch delta value in semitones used for normalization to [-1, 1].

visibility: variance · scope: inference · customizability: recommended · type: float · default: 8.0

pitch_prediction_args.pitd_norm_min

Minimum pitch delta value in semitones used for normalization to [-1, 1].

visibility: variance · scope: inference · customizability: recommended · type: float · default: -8.0

pitch_prediction_args.repeat_bins

Number of repeating bins in the pitch predictor.

visibility: variance · scope: nn, inference · customizability: recommended · type: int · default: 64

pitch_prediction_args.residual_channels

Equivalent to residual_channels but only for the pitch predictor.

visibility: variance · default: 256

pitch_prediction_args.residual_layers

Equivalent to residual_layers but only for the pitch predictor.

visibility: variance · default: 20

pl_trainer_accelerator

Type of Lightning trainer hardware accelerator.

visibility: all · scope: training · customizability: not recommended · type: str · default: auto
constraints: See Accelerator — PyTorch Lightning 2.X.X documentation for available values.

pl_trainer_devices

Determines on which device(s) the model should be trained.

'auto' will utilize all visible devices defined with the CUDA_VISIBLE_DEVICES environment variable, or utilize all available devices if that variable is not set. Otherwise, it behaves like CUDA_VISIBLE_DEVICES and can filter out visible devices.

visibility: all · scope: training · customizability: not recommended · type: str · default: auto

pl_trainer_precision

The computation precision of training.

visibility: all · scope: training · customizability: normal · type: str · default: 16-mixed
constraints: Choose from '32-true', 'bf16-mixed', '16-mixed'. See more possible values at Trainer — PyTorch Lightning 2.X.X documentation.

pl_trainer_num_nodes

Number of nodes in the training cluster of Lightning trainer.

visibility: all · scope: training · customizability: reserved · type: int · default: 1

pl_trainer_strategy

Arguments of Lightning Strategy. Values will be used as keyword arguments when constructing the Strategy object.

type: dict

pl_trainer_strategy.name

Strategy name for the Lightning trainer.

visibility: all · scope: training · customizability: reserved · type: str · default: auto

predict_breathiness

Whether to enable breathiness prediction.

visibility: variance · scope: nn, preprocessing, training, inference · customizability: recommended · type: bool · default: false

predict_dur

Whether to enable phoneme duration prediction.

visibility: variance · scope: nn, preprocessing, training, inference · customizability: recommended · type: bool · default: true

predict_energy

Whether to enable energy prediction.

visibility: variance · scope: nn, preprocessing, training, inference · customizability: recommended · type: bool · default: false

predict_pitch

Whether to enable pitch prediction.

visibility: variance · scope: nn, preprocessing, training, inference · customizability: recommended · type: bool · default: true

predict_tension

Whether to enable tension prediction.

visibility: variance · scope: nn, preprocessing, training, inference · customizability: recommended · type: bool · default: true

predict_voicing

Whether to enable voicing prediction.

visibility: variance · scope: nn, preprocessing, training, inference · customizability: recommended · type: bool · default: true

raw_data_dir

Path(s) to the raw dataset including wave files, transcriptions, etc.

visibility: all · scope: preprocessing · customizability: required · type: str, List[str]

rel_pos

Whether to use relative positional encoding in FastSpeech2 module.

visibility: acoustic, variance · scope: nn · customizability: not recommended · type: bool · default: true

residual_channels

Number of dilated convolution channels in residual blocks in WaveNet.

visibility: acoustic · scope: nn · customizability: normal · type: int · default: 512

residual_layers

Number of residual blocks in WaveNet.

visibility: acoustic · scope: nn · customizability: normal · type: int · default: 20

sampler_frame_count_grid

The batch sampler applies an algorithm called sorting by similar length when collecting batches. Data samples are first grouped by their approximate lengths before they get shuffled within each group. Assume this value is set to $L_{grid}$, the approximate length of a data sample with length $L_{real}$ can be calculated through the following expression:

$$ L_{approx} = \lfloor\frac{L_{real}}{L_{grid}}\rfloor\cdot L_{grid} $$

Training performance on some datasets may be very sensitive to this value. Change it to 1 (completely sorted by length without shuffling) to get the best performance in theory.

visibility: acoustic, variance · scope: training · customizability: normal · type: int · default: 6

sampling_algorithm

The algorithm to solve the ODE of Rectified Flow. The following methods are currently available:

  • Euler: The Euler method.
  • Runge-Kutta (order 2): The 2nd-order Runge-Kutta method.
  • Runge-Kutta (order 4): The 4th-order Runge-Kutta method.
  • Runge-Kutta (order 5): The 5th-order Runge-Kutta method.

visibility: acoustic, variance · scope: inference · customizability: normal · type: str · default: euler
constraints: Choose from 'euler', 'rk2', 'rk4', 'rk5'.

sampling_steps

The total number of sampling steps to solve the ODE of Rectified Flow. Note that this value may not equal the NFE (Number of Function Evaluations), because some methods may require more than one function evaluation per step.

visibility: acoustic, variance · scope: inference · customizability: normal · type: int · default: 20

schedule_type

The DDPM schedule type.

visibility: acoustic, variance · scope: nn · customizability: not recommended · type: str · default: linear
constraints: Choose from 'linear', 'cosine'.

shallow_diffusion_args

Arguments for shallow diffusion.

type: dict

shallow_diffusion_args.aux_decoder_arch

Architecture type of the auxiliary decoder.

visibility: acoustic · scope: nn · customizability: reserved · type: str · default: convnext
constraints: Choose from 'convnext'.

shallow_diffusion_args.aux_decoder_args

Keyword arguments for dynamically constructing the auxiliary decoder.

visibility: acoustic · scope: nn · type: dict

shallow_diffusion_args.aux_decoder_grad

Scale factor of the gradients from the auxiliary decoder to the encoder.

visibility: acoustic · scope: training · customizability: normal · type: float · default: 0.1

shallow_diffusion_args.train_aux_decoder

Whether to forward and backward the auxiliary decoder during training. If set to false, the auxiliary decoder stays in memory but does not get any updates.

visibility: acoustic · scope: training · customizability: normal · type: bool · default: true

shallow_diffusion_args.train_diffusion

Whether to forward and backward the diffusion (main) decoder during training. If set to false, the diffusion decoder stays in memory but does not get any updates.

visibility: acoustic · scope: training · customizability: normal · type: bool · default: true

shallow_diffusion_args.val_gt_start

Whether to use the ground truth as x_start in the shallow diffusion validation process. If set to true, Gaussian noise is added to the ground truth before shallow diffusion is performed; otherwise, the noise is added to the output of the auxiliary decoder. This option is useful when the auxiliary decoder has not been trained yet.

visibility: acoustic · scope: training · customizability: normal · type: bool · default: false
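To show how the shallow diffusion keys relate to each other, here is a hedged sketch of an acoustic configuration fragment; all values are the defaults documented in this file and are shown only for illustration:

```yaml
use_shallow_diffusion: true
K_step: 400              # maximum diffusion depth used by shallow diffusion during training
K_step_infer: 400        # depth used at inference time; must not exceed K_step
shallow_diffusion_args:
  aux_decoder_arch: convnext
  aux_decoder_grad: 0.1
  train_aux_decoder: true
  train_diffusion: true
  val_gt_start: false
```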

sort_by_len

Whether to apply the sorting by similar length algorithm described in sampler_frame_count_grid. Turning off this option may slow down training because sorting by length can better utilize the computing resources.

visibility: acoustic, variance · scope: training · customizability: not recommended · type: bool · default: true

speakers

The names of speakers in a multi-speaker model. Speaker names are mapped to speaker indices and stored in spk_map.json during preprocessing.

visibility: acoustic, variance · scope: preprocessing · customizability: required · type: list

spk_ids

The IDs of speakers in a multi-speaker model. If an empty list is given, speaker IDs will be automatically generated as $0, 1, 2, ..., N_{spk}-1$. IDs can be duplicated or discontinuous.

visibility: acoustic, variance · scope: preprocessing · customizability: required · type: List[int] · default: []
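As an illustration, a hedged sketch of a multi-speaker setup in which two raw datasets share one model; the directory and speaker names are hypothetical:

```yaml
raw_data_dir:
  - data/singer_a/raw    # hypothetical path
  - data/singer_b/raw    # hypothetical path
speakers: [singer_a, singer_b]
spk_ids: []              # empty list: IDs are generated automatically as 0, 1
num_spk: 2
use_spk_id: true
```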

spec_min

Minimum mel spectrogram value used for normalization to [-1, 1]. Different mel bins can have different minimum values.

visibility: acoustic · scope: inference · customizability: not recommended · type: List[float] · default: [-5.0]

spec_max

Maximum mel spectrogram value used for normalization to [-1, 1]. Different mel bins can have different maximum values.

visibility: acoustic · scope: inference · customizability: not recommended · type: List[float] · default: [0.0]

T_start

The starting value of time $t$ in the Rectified Flow ODE which applies on $t \in (T_{start}, 1)$.

visibility: acoustic · scope: training · customizability: recommended · type: float · default: 0.4

T_start_infer

The starting value of time $t$ in the ODE during shallow Rectified Flow inference. Normally set to the same value as T_start.

visibility: acoustic · scope: inference · customizability: recommended · type: float · default: 0.4
constraints: Should be no less than T_start.

task_cls

Task trainer class name.

visibility: all · scope: training · customizability: reserved · type: str

tension_logit_max

Maximum tension logit value used for normalization to [-1, 1]. The logit function is the inverse of the sigmoid function:

$$ f(x) = \ln\frac{x}{1-x} $$

visibility: variance · scope: inference · customizability: recommended · type: float · default: 10.0

tension_logit_min

Minimum tension logit value used for normalization to [-1, 1]. The logit function is the inverse of the sigmoid function:

$$ f(x) = \ln\frac{x}{1-x} $$

visibility: variance · scope: inference · customizability: recommended · type: float · default: -10.0

tension_smooth_width

Length of sinusoidal smoothing convolution kernel (in seconds) on extracted tension curve.

visibility: acoustic, variance · scope: preprocessing · customizability: normal · type: float · default: 0.12

test_prefixes

List of data item names or name prefixes for the validation set. For each string s in the list:

  • If s equals an actual item name, that item is added to the validation set.
  • If s does not equal any item name, all items whose names start with s are added to the validation set.

For multi-speaker combined datasets, "ds_id:name_prefix" can be used to apply the rules above within data from a specific sub-dataset, where ds_id represents the dataset index.

visibility: all · scope: preprocessing · customizability: required · type: list
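For example, a hedged sketch of a test_prefixes list for a combined two-dataset setup; the item names are hypothetical:

```yaml
test_prefixes:
  - item_0001       # an exact item name
  - chorus_         # a prefix: every item whose name starts with "chorus_" is chosen
  - 0:intro_take1   # an exact name (or prefix) restricted to sub-dataset 0
  - 1:verse_        # a prefix restricted to sub-dataset 1
```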

time_scale_factor

The scale factor that will be multiplied on the time $t$ of Rectified Flow before embedding into the model.

visibility: acoustic, variance · scope: nn · customizability: not recommended · type: float · default: 1000

timesteps

Total number of DDPM steps.

visibility: acoustic, variance · scope: nn · customizability: not recommended · type: int · default: 1000

use_breathiness_embed

Whether to accept and embed breathiness values into the model.

visibility: acoustic · scope: nn, preprocessing, inference · customizability: recommended · type: bool · default: false

use_energy_embed

Whether to accept and embed energy values into the model.

visibility: acoustic · scope: nn, preprocessing, inference · customizability: recommended · type: bool · default: false

use_glide_embed

Whether to accept and embed glide types in melody encoder.

visibility: variance · scope: nn, preprocessing, inference · customizability: recommended · type: bool · default: false
constraints: Only takes effect when the melody encoder is enabled.

use_key_shift_embed

Whether to embed key shifting values introduced by random pitch shifting augmentation.

visibility: acoustic · scope: nn, preprocessing, inference · customizability: recommended · type: bool · default: false
constraints: Must be true if random pitch shifting is enabled.

use_melody_encoder

Whether to enable melody encoder for the pitch predictor.

visibility: variance · scope: nn · customizability: recommended · type: bool · default: false

use_pos_embed

Whether to use SinusoidalPositionalEmbedding in FastSpeech2 encoder.

visibility: acoustic, variance · scope: nn · customizability: not recommended · type: bool · default: true

use_shallow_diffusion

Whether to use shallow diffusion.

visibility: acoustic · scope: nn, inference · customizability: recommended · type: bool · default: false

use_speed_embed

Whether to embed speed values introduced by random time stretching augmentation.

visibility: acoustic · scope: nn, preprocessing, inference · type: bool · default: false
constraints: Must be true if random time stretching is enabled.

use_spk_id

Whether to embed the speaker ID from a multi-speaker dataset.

visibility: acoustic, variance · scope: nn, preprocessing, inference · customizability: recommended · type: bool · default: false

use_tension_embed

Whether to accept and embed tension values into the model.

visibility: acoustic · scope: nn, preprocessing, inference · customizability: recommended · type: bool · default: false

use_voicing_embed

Whether to accept and embed voicing values into the model.

visibility: acoustic · scope: nn, preprocessing, inference · customizability: recommended · type: bool · default: false

val_check_interval

Interval (in number of training steps) between validation checks.

visibility: all · scope: training · customizability: recommended · type: int · default: 2000

val_with_vocoder

Whether to load and use the vocoder to generate audio during validation. Validation audio will not be available if this option is disabled.

visibility: acoustic · scope: training · customizability: normal · type: bool · default: true

variances_prediction_args

Arguments for prediction of variance parameters other than pitch, like energy, breathiness, etc.

type: dict

variances_prediction_args.dilation_cycle_length

Equivalent to dilation_cycle_length but only for the multi-variance predictor model.

visibility: variance · default: 4

variances_prediction_args.total_repeat_bins

Total number of repeating bins in the multi-variance predictor. Repeating bins are distributed evenly to each variance parameter.

visibility: variance · scope: nn, inference · customizability: recommended · type: int · default: 48

variances_prediction_args.residual_channels

Equivalent to residual_channels but only for the multi-variance predictor.

visibility: variance · default: 192

variances_prediction_args.residual_layers

Equivalent to residual_layers but only for the multi-variance predictor.

visibility: variance · default: 10

vocoder

The vocoder class name.

visibility: acoustic · scope: preprocessing, training, inference · customizability: normal · type: str · default: NsfHifiGAN

vocoder_ckpt

Path of the vocoder model.

visibility: acoustic · scope: preprocessing, training, inference · customizability: normal · type: str · default: checkpoints/nsf_hifigan/model

voicing_db_max

Maximum voicing value in dB used for normalization to [-1, 1].

visibility: variance · scope: inference · customizability: recommended · type: float · default: -20.0

voicing_db_min

Minimum voicing value in dB used for normalization to [-1, 1].

visibility: acoustic, variance · scope: inference · customizability: recommended · type: float · default: -96.0

voicing_smooth_width

Length of sinusoidal smoothing convolution kernel (in seconds) on extracted voicing curve.

visibility: acoustic, variance · scope: preprocessing · customizability: normal · type: float · default: 0.12

win_size

Window size for mel or feature extraction.

visibility: acoustic, variance · scope: preprocessing · customizability: reserved · type: int · default: 2048