
Some improvements / bug fixes #476

Open

wants to merge 6 commits into master

Conversation

@mitchelldehaven

mitchelldehaven commented Apr 24, 2024

Thanks for the repo, it's really awesome to use!

While using it I noticed a few problems and made a few changes (some of them personal preferences) to improve it. I thought I would create a PR with these changes:

  • 379bee1: The way the callbacks are being set results in the PyTorch Lightning progress bar not showing during training (not sure if this is intentional).
  • 555e8aa: This is just personal preference, but having early stopping is nice: when starting from a checkpoint, I found that training plateaus after roughly 100-200 epochs, so stopping early saves compute.
  • d8655a1: This is also just personal preference, but since early stopping isn't used and the model could start to overfit substantially after hundreds of epochs, saving the checkpoint with the best validation result is nice.
  • e6af4c6: Having the progress bar clued me into what was causing training to be so slow in my case. I was using 200 test samples, and as currently set up, the code iterates one by one through the test samples and generates the audio for every validation batch; with 10 validation batches per test run, this runs 10 times. This commit simply moves the generation to on_validation_end so it only runs once (see the first sketch below this list). Looking at the code, the test audio generation could likely also be batched, but this already saves quite a bit of time as is.
  • de98f64: abs needs to be taken before max so that the maximum negative amplitude is taken into consideration; currently, abs is essentially a no-op (see the second sketch below this list). This is the cause of the warnings about amplitudes being clipped.
  • 8eda759: This allows 16 to be used as a precision. From what I can tell, it's only marginally faster for some reason (usually I'd expect roughly a 2x improvement in speed), so there may be more going on here. Also, PyTorch 1.13 didn't have the best support for bf16, so bf16 is actually slower than full precision.
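
For reference, here is a minimal sketch of the kind of change e6af4c6 makes, assuming a standard PyTorch Lightning module (the helper names are illustrative, not the PR's exact code):

import pytorch_lightning as pl

class VitsModule(pl.LightningModule):
    def validation_step(self, batch, batch_idx):
        # Only compute and log the validation loss here; no per-sample
        # test audio generation inside every validation batch.
        loss = self._compute_loss(batch)  # hypothetical helper
        self.log("val_loss", loss)
        return loss

    def on_validation_end(self):
        # This hook runs once after all validation batches have finished,
        # so the test audio is generated a single time per validation pass.
        for utt_id, test_sample in enumerate(self._test_dataset):  # hypothetical attribute
            audio = self(test_sample)
            self._write_test_audio(utt_id, audio)  # hypothetical helper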

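A minimal, self-contained illustration of the de98f64 normalization bug (made-up values, not the repo's exact code):

import numpy as np

audio = np.array([0.2, -0.9, 0.1], dtype=np.float32)

# Buggy order: taking max() before abs() ignores the largest negative sample,
# so the scale factor is computed from 0.2 instead of the true peak of 0.9 ...
bad_peak = abs(audio.max())         # 0.2
print(audio / bad_peak)             # roughly [ 1.  -4.5  0.5] -- exceeds [-1, 1], gets clipped

# Fixed order: abs() before max() uses the true peak amplitude.
good_peak = np.abs(audio).max()     # 0.9
print(audio / good_peak)            # stays within [-1, 1]
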
If you want some of these changes pulled in but not others, I can drop the commits you don't want (assuming you want any of these changes at all). The only actual bug-fix commits are 379bee1, e6af4c6, and de98f64.

Below is the reported time difference (likely all due to e6af4c6) on a small 10-epoch finetuning run with 200 test samples and ~200 validation samples:

# 10 epoch finetune
time python -m piper_train \
    --dataset-dir path/to/my/data \
    --accelerator 'gpu' \
    --devices 1 \
    --batch-size 16 \
    --validation-split 0.1 \
    --num-test-examples 200 \
    --max_epochs 6689 \
    --resume_from_checkpoint "path/to/amy-epoch=6679-step=1554200.ckpt" \
    --checkpoint-epochs 1 \
    --precision 32 \
    --max-phoneme-ids 400
...
real    23m31.791s
user    21m30.965s
sys     1m35.897s


# Fixed problem finetune 10 epochs
time python -m piper_train \
    --dataset-dir path/to/my/data \
    --accelerator 'gpu' \
    --devices 1 \
    --batch-size 16 \
    --validation-split 0.1 \
    --num-test-examples 200 \
    --max_epochs 6689 \
    --resume_from_checkpoint "path/to/amy-epoch=6679-step=1554200.ckpt" \
    --checkpoint-epochs 1 \
    --precision 32 \
    --max-phoneme-ids 400
...
real    9m31.955s
user    8m33.749s
sys     1m1.055s

EDIT: I don't have a multi-GPU setup, so I didn't test any of this on DP or DDP; in particular, 8eda759 could cause issues on multi-GPU setups. I also haven't tested training without a validation set, but I can test to make sure that still works fine.

@bzp83

bzp83 commented Apr 24, 2024

This should be merged ASAP!
Edit: it breaks preprocessing, see #476 (comment)

I'm already using this PR for local training.

@mitchelldehaven could you explain more about the usage of --patience please? I didn't know about EarlyStopping... Still learning :)

@bzp83

bzp83 commented Apr 24, 2024

I don't know where this would be fixed, but it would look better with a space or line break between '2' and 'Metric'; the logs are being printed right after the progress bar, like:

8/48 [00:23<00:00, 2.06it/s, loss=26.1, v_num=2Metric val_loss improved by 1.927 >= min_delta = 0.0. New best score: 51.278

I also don't see a matching ']' to close the '[' after '8/48'. Not sure if this is intended.

@bzp83

bzp83 commented Apr 24, 2024

One more update: this PR breaks preprocessing. It failed with:
RuntimeError: Currently, AutocastCPU only support Bfloat16 as the autocast_cpu_dtype

@mitchelldehaven
Author

mitchelldehaven commented Apr 24, 2024

@bzp83

The --patience flag essentially controls how many epochs can elapse without an improvement in the best validation loss before training terminates. I was using a patience of 10, so if your best validation loss occurs at epoch 50, training will continue until at least epoch 60; if none of the last 10 epochs achieves a better loss than epoch 50, training terminates.

The idea is essentially that if you reach a low validation loss and cannot beat it for several successive epochs, your validation loss has likely plateaued, and further training will not yield improvements. You could try higher values, maybe 20 or so; however, I only have ~2000 training samples, so overfitting happens fairly quickly when starting from a checkpoint.
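
For reference, this is roughly how that maps onto PyTorch Lightning's EarlyStopping callback; a minimal sketch, not this repo's exact wiring:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

# Stop training once "val_loss" has gone 10 consecutive validation epochs
# without improving on its best value so far.
early_stopping = EarlyStopping(monitor="val_loss", mode="min", patience=10)
trainer = Trainer(callbacks=[early_stopping])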

I can take a look at the preprocessing issue after my workday today; likely the fix is simply changing enabled=True to enabled=False in the autocast contexts.
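
For context, the error comes from PyTorch 1.13's CPU autocast supporting only bfloat16. A minimal sketch of one way to avoid entering a float16 autocast context on the CPU path (illustrative only, not the repo's actual preprocessing code):

import contextlib
import torch

# Use float16 autocast only where it is supported (e.g. on CUDA); on the CPU
# preprocessing path, fall back to an ordinary full-precision context.
use_fp16_autocast = torch.cuda.is_available()  # illustrative condition

context = (
    torch.autocast(device_type="cuda", dtype=torch.float16)
    if use_fp16_autocast
    else contextlib.nullcontext()
)

with context:
    ...  # run inference / preprocessing here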

@rmcpantoja
Contributor


Hi,
Good job, these seem like significant improvements!
In my fork I also made improvements to the trainer; see the audio results and the improved checkpointing. These improvements were updated over time, but what caught my attention was the audio generation during validation.

rmcpantoja added a commit to rmcpantoja/piper that referenced this pull request Apr 25, 2024