
How to recover if training is interrupted #538

Open
xyt000-xjj opened this issue Aug 17, 2023 · 3 comments

Comments

@xyt000-xjj

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

  1. In '...' directory, run command '...'
  2. See error (copy&paste full log, including exceptions and stacktraces).

Please copy&paste text instead of screenshots for better searchability.

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

  • OS [e.g., Linux Ubuntu 20.04, Windows 10]
  • PyTorch version [e.g., 1.9.0]
  • CUDA toolkit version [e.g., CUDA 11.4]
  • NVIDIA driver version
  • GPU [e.g., Titan V, RTX 3090]
  • Docker: did you use Docker? If yes, specify docker image URL (e.g., nvcr.io/nvidia/pytorch:21.08-py3)

Additional context
Add any other context about the problem here.

@PDillis

PDillis commented Oct 27, 2023

Just point the --resume argument in train.py to the last .pkl that was saved and that you want to resume from. Note that the resumed run's counter starts from 0 again, so account for that when setting how many images to train for with --kimg.
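Since the resumed run counts images from zero again, the --kimg budget should cover only the images still remaining toward the original goal. A minimal sketch of that arithmetic (the helper name and the example numbers are illustrative, not part of train.py):

```python
def remaining_kimg(target_kimg: int, trained_kimg: int) -> int:
    """Compute the --kimg value to pass when resuming, given that the
    resumed run restarts its image counter at zero."""
    if trained_kimg >= target_kimg:
        return 0  # target already reached; nothing left to train
    return target_kimg - trained_kimg

# Example: original goal was 25000 kimg, the run crashed after 9200 kimg.
print(remaining_kimg(25000, 9200))  # -> 15800
```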

@therealjr

@PDillis I understand that training resumes from that point. However, when I save a snapshot at tick 0, it shows nothing but blurred images. Why are the images resetting entirely? Shouldn't it generate images like the ones it was trained on, picking up from where it left off?

@dookiethedog

My GAN crashed and I was extremely annoyed, since I was experiencing the exact same issue, so I decided to read into the code. You can set the initial augmentation strength and kimg in the training_loop.py file. This helps, but it does not actually continue the training from where it last ran; it only gives it an idea of where to start off again. The devs don't seem to care about crashes, as there is no proper resume code. I was actually able to modify the code and create a proper resume function. However, I won't be able to resume my first GAN, since my code wasn't in place yet and there is no way to pull the settings it needs. But at least in the future I'll be all good, with everything stored in the pickle file.
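One way to get the self-describing snapshots this comment describes is to bundle the training counters (e.g. the current image count and augmentation probability) into the snapshot pickle and restore them on resume. A sketch under the assumption that you are free to add fields to the snapshot dict; the field and function names here are invented, not StyleGAN's:

```python
import pickle

def save_snapshot(path, networks, cur_nimg, augment_p):
    """Save a snapshot that carries its own training state (hypothetical fields)."""
    data = dict(networks=networks, cur_nimg=cur_nimg, augment_p=augment_p)
    with open(path, 'wb') as f:
        pickle.dump(data, f)

def load_snapshot(path):
    """Restore networks plus the counters, so training need not restart at zero."""
    with open(path, 'rb') as f:
        data = pickle.load(f)
    return data['networks'], data['cur_nimg'], data['augment_p']

# Round-trip example with placeholder networks:
save_snapshot('snap.pkl', {'G': None, 'D': None}, cur_nimg=9_200_000, augment_p=0.42)
nets, cur_nimg, augment_p = load_snapshot('snap.pkl')
print(cur_nimg, augment_p)  # -> 9200000 0.42
```

The training loop would then initialize its counters from the loaded values instead of zero, which is the missing piece the stock --resume path does not handle.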
