
Some improvements / bug fixes #476

Open

wants to merge 6 commits into master

Conversation

@mitchelldehaven

mitchelldehaven commented Apr 24, 2024

Thanks for the repo, it's really awesome to use!

While using it I noticed a few problems and made a few changes (some of them personal preferences) to improve it. I thought I would create a PR with these changes:

  • 379bee1: The way the callbacks are being set results in the PyTorch Lightning progress bar not showing during training (not sure if this is intentional).
  • 555e8aa: This is just personal preference, but having early stopping is nice: when starting from a checkpoint, I found that training plateaus after roughly 100-200 epochs, so stopping early saves compute.
  • d8655a1: This is also just personal preference, but since early stopping isn't used and the model could start to overfit substantially after hundreds of epochs, saving the checkpoint with the best validation result is nice.
  • e6af4c6: Having the progress bar clued me into what was causing training to be so slow in my case. I was using 200 test samples, and as currently set up, the code iterates one by one through the test samples and generates the audio for every validation batch; with 10 validation batches per test run, this runs 10 times. This commit simply moves the generation to on_validation_end so it only runs once (see the first sketch below this list). Looking at the code, the test audio generation could likely also be batched, but this already saves quite a bit of time as is.
  • de98f64: abs needs to be taken before max so that the maximum negative amplitude is taken into consideration; currently, abs is essentially a no-op (see the second sketch below this list). This is the cause of the warnings about amplitudes being clipped.
  • 8eda759: This allows 16 to be used as a precision. From what I can tell, it's only marginally faster for some reason (usually I'd expect roughly a 2x improvement in speed), so there may be more going on here. Also, PyTorch 1.13 didn't have the best support for bf16, so bf16 is actually slower than full precision.
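
For reference, here is a minimal sketch of the kind of change e6af4c6 makes, assuming a standard PyTorch Lightning module (the helper names are illustrative, not the PR's exact code):

import pytorch_lightning as pl

class VitsModule(pl.LightningModule):
    def validation_step(self, batch, batch_idx):
        # Only compute and log the validation loss here; no per-sample
        # test audio generation inside every validation batch.
        loss = self._compute_loss(batch)  # hypothetical helper
        self.log("val_loss", loss)
        return loss

    def on_validation_end(self):
        # This hook runs once after all validation batches have finished,
        # so the test audio is generated a single time per validation pass.
        for utt_id, test_sample in enumerate(self._test_dataset):  # hypothetical attribute
            audio = self(test_sample)
            self._write_test_audio(utt_id, audio)  # hypothetical helper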

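A minimal, self-contained illustration of the de98f64 normalization bug (made-up values, not the repo's exact code):

import numpy as np

audio = np.array([0.2, -0.9, 0.1], dtype=np.float32)

# Buggy order: taking max() before abs() ignores the largest negative sample,
# so the scale factor is computed from 0.2 instead of the true peak of 0.9 ...
bad_peak = abs(audio.max())         # 0.2
print(audio / bad_peak)             # roughly [ 1.  -4.5  0.5] -- exceeds [-1, 1], gets clipped

# Fixed order: abs() before max() uses the true peak amplitude.
good_peak = np.abs(audio).max()     # 0.9
print(audio / good_peak)            # stays within [-1, 1]
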
If you want some of these changes pulled in but not others, I can drop the commits you don't want (assuming you want any of these changes at all). The only actual bug-fix commits are 379bee1, e6af4c6, and de98f64.

Below is the reported time difference (likely all due to e6af4c6) on a small 10-epoch finetuning run with 200 test samples and ~200 validation samples:

# 10 epoch finetune
time python -m piper_train \
    --dataset-dir path/to/my/data \
    --accelerator 'gpu' \
    --devices 1 \
    --batch-size 16 \
    --validation-split 0.1 \
    --num-test-examples 200 \
    --max_epochs 6689 \
    --resume_from_checkpoint "path/to/amy-epoch=6679-step=1554200.ckpt" \
    --checkpoint-epochs 1 \
    --precision 32 \
    --max-phoneme-ids 400
...
real    23m31.791s
user    21m30.965s
sys     1m35.897s


# Fixed problem finetune 10 epochs
time python -m piper_train \
    --dataset-dir path/to/my/data \
    --accelerator 'gpu' \
    --devices 1 \
    --batch-size 16 \
    --validation-split 0.1 \
    --num-test-examples 200 \
    --max_epochs 6689 \
    --resume_from_checkpoint "path/to/amy-epoch=6679-step=1554200.ckpt" \
    --checkpoint-epochs 1 \
    --precision 32 \
    --max-phoneme-ids 400
...
real    9m31.955s
user    8m33.749s
sys     1m1.055s

EDIT: I don't have a multi-GPU setup, so I didn't test any of this on DP or DDP; in particular, 8eda759 could cause issues on multi-GPU setups. I also haven't tested training without a validation set, but I can test to make sure that still works fine.

@bzp83

bzp83 commented Apr 24, 2024

This should be merged ASAP!
Edit: it breaks preprocessing, see #476 (comment)

I'm already using this PR for local training.

@mitchelldehaven could you explain more about the usage of --patience please? I didn't know about EarlyStopping... Still learning :)

@bzp83

bzp83 commented Apr 24, 2024

I don't know where this would be fixed, but it would look better with a space or line break between '2' and 'Metric'; the logs are being printed right after the progress bar, like:

8/48 [00:23<00:00, 2.06it/s, loss=26.1, v_num=2Metric val_loss improved by 1.927 >= min_delta = 0.0. New best score: 51.278

I also don't see a matching ']' to close the '[' after '8/48'. Not sure if this is intended.

@bzp83

bzp83 commented Apr 24, 2024

One more update: this PR breaks preprocessing. It failed with:
RuntimeError: Currently, AutocastCPU only support Bfloat16 as the autocast_cpu_dtype

@mitchelldehaven
Author

mitchelldehaven commented Apr 24, 2024

@bzp83

The --patience flag essentially controls how many epochs can elapse without an improvement in the best validation loss before training terminates. I was using a patience of 10, so if your best validation loss occurs at epoch 50, training will continue until at least epoch 60; if none of the last 10 epochs achieves a better loss than epoch 50, training terminates.

The idea is essentially that if you reach a low validation loss and cannot beat it for several successive epochs, your validation loss has likely plateaued, and further training will not yield improvements. You could try higher values, maybe 20 or so; however, I only have ~2000 training samples, so overfitting happens fairly quickly when starting from a checkpoint.
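
For reference, this is roughly how that maps onto PyTorch Lightning's EarlyStopping callback; a minimal sketch, not this repo's exact wiring:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

# Stop training once "val_loss" has gone 10 consecutive validation epochs
# without improving on its best value so far.
early_stopping = EarlyStopping(monitor="val_loss", mode="min", patience=10)
trainer = Trainer(callbacks=[early_stopping])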

I can take a look at the preprocessing issue after my workday today; likely the fix is simply changing enabled=True to enabled=False in the autocast contexts.
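
For context, the error comes from PyTorch 1.13's CPU autocast supporting only bfloat16. A minimal sketch of one way to avoid entering a float16 autocast context on the CPU path (illustrative only, not the repo's actual preprocessing code):

import contextlib
import torch

# Use float16 autocast only where it is supported (e.g. on CUDA); on the CPU
# preprocessing path, fall back to an ordinary full-precision context.
use_fp16_autocast = torch.cuda.is_available()  # illustrative condition

context = (
    torch.autocast(device_type="cuda", dtype=torch.float16)
    if use_fp16_autocast
    else contextlib.nullcontext()
)

with context:
    ...  # run inference / preprocessing here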

@rmcpantoja
Contributor


Hi,
Good job, these seem like significant improvements!
In my fork I also made improvements to the trainer; see the audio results and the improved checkpointing. These improvements were updated over time, but what caught my attention was the audio generation during validation.

rmcpantoja added a commit to rmcpantoja/piper that referenced this pull request Apr 25, 2024