Help extending to MAILabs data - Warbly speech - MoL, 1000k steps #183

Open
adhamel opened this issue Mar 18, 2020 · 6 comments

adhamel commented Mar 18, 2020

Dear @r9y9,
I've trained a MoL WaveNet to 1000k steps on ~30,000 audio samples from the M-AI Labs data. I am using a pre-trained Transformer from @kan-bayashi.

The resulting audio has rather intelligible speech, but has a bit of a warble to it that I would like to clear up. Happy to share generated samples or configurations to help diagnose. Do you have any experience training on that data set or recommendations on what might move me in the right direction?

Best,
Andy

adhamel changed the title from "Warbly speech - MoL, 1000k steps" to "Help extending to MAILabs data - Warbly speech - MoL, 1000k steps" on Mar 18, 2020

r9y9 commented Mar 24, 2020

Hi, sorry for the late reply. If I remember correctly, the samples in M-AI Labs have a rather low signal-to-noise ratio, so WaveNet may struggle to learn the distribution of clean speech. To diagnose the cause, could you share some generated audio samples and your training configuration?


adhamel commented Mar 24, 2020

Hey, no worries. I trained with the mixture-of-logistics configuration, using data from a single male Spanish speaker. Following your recommendations elsewhere, I decreased the allowed log_scale_min as training progressed.
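
Roughly, the schedule was along the lines of the sketch below (illustrative values only; the helper name and the exact numbers are not the ones I actually used):

# Sketch of a log_scale_min annealing schedule: start with a loose variance
# bound and tighten it as training progresses. Values are illustrative.
def log_scale_min_at(step, start=-7.0, end=-16.0, anneal_steps=600000):
    """Linearly anneal log_scale_min from `start` to `end` over `anneal_steps`."""
    t = min(step / float(anneal_steps), 1.0)
    return start + t * (end - start)

for step in (0, 200000, 600000, 1000000):
    print(step, log_scale_min_at(step))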

Here is a sample after ~1.6M steps: https://github.com/adhamel/samples/blob/master/response.wav

For evaluation, I'm using the .npy features generated by this Transformer (https://github.com/espnet/espnet/blob/master/egs/m_ailabs/tts1/RESULTS.md):

v.0.5.3 / Transformer
Silence trimming
FFT in points: 1024
Shift in points: 256
Frequency limit: 80-7600
Fast-GL 64 iters
Environments
date: Sun Sep 29 21:20:05 JST 2019
python version: 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]
espnet version: espnet 0.5.1
chainer version: chainer 6.0.0
pytorch version: pytorch 1.0.1.post2
Git hash: 6b2ff45d1e2c624691f197014b8fe71a5e70bae9
Commit date: Sat Sep 28 14:33:32 2019 +0900


r9y9 commented Mar 25, 2020

Could you also share the config file(s) for WaveNet?

For the generated sample, it seems that the signal gain is too high. I suspect there is a mismatch between the acoustic features used at training time and those used at evaluation. Did you carefully normalize the acoustic features? Did you make sure you used the same acoustic feature pipeline for both Transformer and WaveNet training?
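
For example, a quick check like the following (a rough sketch; the .npy paths are placeholders, assuming both pipelines dump mel features as .npy arrays) would reveal a scale or normalization mismatch:

import numpy as np

# Compare mel-feature statistics from the two pipelines. Point the paths at
# one feature file used for WaveNet training and one dumped by the
# Transformer at evaluation time (the paths below are placeholders).
def describe(name, path):
    feats = np.load(path)  # expected shape: (num_frames, num_mels)
    print(f"{name}: shape={feats.shape}, min={feats.min():.3f}, "
          f"max={feats.max():.3f}, mean={feats.mean():.3f}, std={feats.std():.3f}")

describe("train (WaveNet)", "dump/train/utt0001-feats.npy")
describe("eval (Transformer)", "dump/eval/utt0001-feats.npy")

If the value ranges differ noticeably (e.g., one pipeline emits log-mels normalized to [0, 1] and the other does not), that alone can produce this kind of gain problem.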


adhamel commented Mar 25, 2020

Absolutely. Here are the overwritten hparams. I also tried an fmin value of 125. I did not take care to normalize acoustic features; however, the WaveNet is trained on the same data subset as the Transformer.

{
  "name": "wavenet_vocoder",
  "input_type": "raw",
  "quantize_channels": 65536,
  "preprocess": "preemphasis",
  "postprocess": "inv_preemphasis",
  "global_gain_scale": 0.55,
  "sample_rate": 16000,
  "silence_threshold": 2,
  "num_mels": 80,
  "fmin": 80,
  "fmax": 7600,
  "fft_size": 1024,
  "hop_size": 256,
  "frame_shift_ms": null,
  "win_length": 1024,
  "win_length_ms": -1.0,
  "window": "hann",
  "highpass_cutoff": 70.0,
  "output_distribution": "Logistic",
  "log_scale_min": -32.23619130191664,
  "out_channels": 30,
  "layers": 24,
  "stacks": 4,
  "residual_channels": 128,
  "gate_channels": 256,
  "skip_out_channels": 128,
  "dropout": 0.0,
  "kernel_size": 3,
  "cin_channels": 80,
  "cin_pad": 2,
  "upsample_conditional_features": true,
  "upsample_net": "ConvInUpsampleNetwork",
  "upsample_params": {
    "upsample_scales": [4, 4, 4, 4]
  },
  "gin_channels": -1,
  "n_speakers": 7,
  "pin_memory": true,
  "num_workers": 2,
  "batch_size": 8,
  "optimizer": "Adam",
  "optimizer_params": {
    "lr": 0.001,
    "eps": 1e-08,
    "weight_decay": 0.0
  },
  "lr_schedule": "step_learning_rate_decay",
  "lr_schedule_kwargs": {
    "anneal_rate": 0.5,
    "anneal_interval": 200000
  },
  "max_train_steps": 1000000,
  "nepochs": 2000,
  "clip_thresh": -1,
  "max_time_sec": null,
  "max_time_steps": 10240,
  "exponential_moving_average": true,
  "ema_decay": 0.9999,
  "checkpoint_interval": 100000,
  "train_eval_interval": 100000,
  "test_eval_epoch_interval": 50,
  "save_optimizer_state": true
}


r9y9 commented Mar 30, 2020

The hparams look okay. I'd recommend double-checking acoustic feature normalization differences (if any), and also checking analysis/synthesis quality (not TTS).

Pre-emphasis at the data preprocessing stage changes the signal gain, so you might want to tune global_gain_scale. 0.55 was chosen for LJSpeech, if I remember correctly.

Another suggestion is to use a higher log_scale_min (e.g., -9 or -11). As suggested in the ClariNet paper, a smaller variance bound requires more training iterations and can make training unstable.
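
As a rough way to see both effects, something like this can help (a sketch only; the wav path and the 0.97 pre-emphasis coefficient are assumptions):

import numpy as np
import librosa

# Load a training sample and apply pre-emphasis as typically done at
# preprocessing: y[n] = x[n] - 0.97 * x[n-1]
x, sr = librosa.load("path/to/sample.wav", sr=16000)
y = np.append(x[0], x[1:] - 0.97 * x[:-1])

# Peak amplitude with and without global_gain_scale applied.
for gain in (1.0, 0.55):
    print(f"gain={gain}: peak before={np.abs(x).max():.3f}, "
          f"peak after pre-emphasis={(gain * np.abs(y)).max():.3f}")

# Minimum logistic scale implied by different log_scale_min values.
for log_scale_min in (-32.236, -11.0, -9.0):
    print(f"log_scale_min={log_scale_min}: min scale = {np.exp(log_scale_min):.3e}")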


adhamel commented Apr 2, 2020

Thank you, you are correct. I will test a higher log_scale_min. (As a strange aside, I found significant drops in loss at intervals of ~53 epochs.) I hope y'all are staying safe over there.
