Resume ELMo training after crash #217

Open
pjox opened this issue Aug 8, 2019 · 1 comment

Comments

@pjox

pjox commented Aug 8, 2019

Hello,

I'm currently trying to train ELMo on my own data, but the process crashed (a cluster problem, nothing to do with the code). Since I have the checkpoints, I don't want to lose days of training. However, when I tried restart.py, the perplexity jumped way up, and it seems to me that it simply started reading the data from the beginning again. If I understood correctly, restart.py is intended for fine-tuning, not for resuming training after a crash. Then I saw that in bilm/training.py, at line 675, where the training function is defined, one can pass a checkpoint:

def train(options, data, n_gpus, tf_save_dir, tf_log_dir,
          restart_ckpt_file=None):

and at line 770 of the same file, the checkpoint appears to be loaded (provided it is passed to the function):

if restart_ckpt_file is not None:
    loader = tf.train.Saver()
    loader.restore(sess, restart_ckpt_file)

However, in bin/train_elmo.py, where the train function is called on line 63, the checkpoint file is not specified:

train(options, data, n_gpus, tf_save_dir, tf_log_dir)

Can I resume my training just by passing the checkpoint there as the last argument? Do I have to do anything else to resume training? Is it even possible to resume training without affecting perplexity?

Thank you in advance.

@acriptis

acriptis commented Dec 6, 2019

@pjox Have you found the solution?

It seems we need to fix the code in bin/train_elmo.py by passing an explicit restart_ckpt_file argument.
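
For what it's worth, here is a rough sketch of what such a change might look like. It is not taken from the repository: the flag names (--save_dir, --log_dir, --restart_ckpt_file) and the helper names are hypothetical, and the only thing it relies on is the train signature quoted above. The train function is passed in as an argument so the snippet can be run without TensorFlow.

```python
import argparse


def build_parser():
    """Build a CLI parser; every flag name here is an assumption for illustration."""
    parser = argparse.ArgumentParser(description="Train a biLM, optionally resuming")
    parser.add_argument("--save_dir", required=True,
                        help="Directory where checkpoints are written")
    parser.add_argument("--log_dir", default=None,
                        help="Directory for logs; falls back to --save_dir")
    parser.add_argument("--restart_ckpt_file", default=None,
                        help="Checkpoint to restore before training starts")
    return parser


def resume_or_train(train_fn, options, data, n_gpus, args):
    """Call train_fn with the signature quoted in this issue.

    When --restart_ckpt_file is omitted, None is forwarded, which matches
    the default of the quoted signature (train from scratch).
    """
    tf_save_dir = args.save_dir
    tf_log_dir = args.log_dir if args.log_dir is not None else args.save_dir
    return train_fn(options, data, n_gpus, tf_save_dir, tf_log_dir,
                    restart_ckpt_file=args.restart_ckpt_file)
```

In bin/train_elmo.py one would presumably pass the real train from bilm/training.py as train_fn. Note that this only forwards the checkpoint path so that the weights are restored; whether the data reader also picks up where it stopped (the perplexity question above) is a separate matter that this sketch does not address.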
