Resuming training is unintuitive #147

Open
free-mana opened this issue Sep 21, 2022 · 0 comments

Labels
bug Something isn't working

Describe the bug
The process to resume training a previously trained model is unintuitive.

To Reproduce
Steps to reproduce the behavior:

  1. Package Versions: torchgan 0.1.0, torchvision 0.13.1, pytorch 1.12.1
  2. Logging Configurations:
    print(torchgan.logging.backends.CONSOLE_LOGGING)
    1
    print(torchgan.logging.backends.VISDOM_LOGGING)
    0
    print(torchgan.logging.backends.TENSORBOARD_LOGGING)
    0
  3. Minimal Working Example for the error
    The issue can be encountered by slightly modifying Tutorial 1. Follow the tutorial normally until you reach the "Visualizing the Samples" section. Immediately before that section, add the following code cell:
    trainer.load_model("./model/gan4.model")
    trainer(dataloader)
    Now execute the new cell.

Expected behavior
This should have continued training the model for an additional 10 epochs on CUDA, or 5 epochs otherwise. However, because the Trainer's epochs parameter represents the total number of epochs to train, the call returns without doing any further training. To achieve the expected behavior, the user must instead create a new Trainer object and pass (current_epochs + desired_additional_epochs) as the value of the epochs parameter. This is unintuitive, and it forces the user to manually keep track of how many epochs have been completed whenever they end a training session they plan to continue later; a sketch of the workaround follows.
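
For reference, a minimal sketch of the current workaround, written as if continuing from the Tutorial 1 notebook (where Trainer, dataloader, and device are already defined, and the tutorial trained for 10 epochs on CUDA). The network_config and losses names below are placeholders for whatever was passed to the original Trainer, not exact tutorial code:

    # Workaround sketch: build a fresh Trainer whose epochs value is the
    # running total (epochs already completed + additional epochs desired).
    # network_config and losses are placeholders for the original arguments.
    resume_trainer = Trainer(
        network_config, losses,
        sample_size=64,
        epochs=10 + 10,          # total epochs, not additional epochs
        device=device,
    )
    resume_trainer.load_model("./model/gan4.model")
    resume_trainer(dataloader)   # now actually trains for 10 more epochs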

Desktop (please complete the following information):

  • OS: Windows 10 Pro, version 21H2

Installation

  • Pip

Additional context
Fixing this would involve rewriting significant portions of the BaseTrainer class. I would suggest letting the user pass the number of epochs to train for to the __call__() function, rather than having it fixed at object creation, along the lines of the sketch below.
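
A rough illustration of the suggested behavior, written as a subclass rather than a patch. It assumes BaseTrainer records the epoch reached by a loaded checkpoint in start_epoch and trains from start_epoch up to epochs via train(); those attribute names are assumptions about the internals, not guaranteed to match the actual implementation:

    # Illustrative only -- not current torchgan behavior.
    from torchgan.trainer import Trainer

    class ResumableTrainer(Trainer):
        def __call__(self, data_loader, epochs=None, **kwargs):
            # Interpret `epochs` as *additional* epochs on top of whatever the
            # loaded checkpoint has already completed; fall back to the value
            # given at construction when omitted.
            if epochs is not None:
                self.epochs = self.start_epoch + epochs
            self.train(data_loader, **kwargs)

With something like this, trainer.load_model(...) followed by trainer(dataloader, epochs=10) would always train for ten more epochs, regardless of how many had already been completed.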

free-mana added the bug label on Sep 21, 2022