Resuming training from pre-trained model (sudden quit)
Log:
Last 10 lines of StdErr:
File "/train/train_mem.py", line 13, in <module>
train()
File "//train/train.py", line 1295, in train
trainer.train(resume_from_checkpoint=True)
File "/transformers/trainer.py", line 1850, in train
state = TrainerState.load_from_json(os.path.join(resume_from_checkpoint, TRAINER_STATE_NAME))
File "/transformers/trainer_callback.py", line 148, in load_from_json
with open(json_path, "r", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: './work_dirs/llava/checkpoint-1000/trainer_state.json'
It seems that trainer_state.json is not saved during pre-training. Is there a way to include it so that training can be resumed?
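A minimal sketch of guarding against this error, assuming checkpoints are written as `checkpoint-<step>` folders under the output directory (as in the path in the traceback). The helper name `find_resumable_checkpoint` is hypothetical, and it only checks for trainer_state.json; it does not verify that optimizer or scheduler files are present.

```python
import os
import re
from typing import Optional

# File name taken from the FileNotFoundError in the traceback.
TRAINER_STATE_NAME = "trainer_state.json"


def find_resumable_checkpoint(output_dir: str) -> Optional[str]:
    """Return the newest checkpoint-<step> folder containing trainer_state.json, or None."""
    if not os.path.isdir(output_dir):
        return None
    candidates = []
    for name in os.listdir(output_dir):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        path = os.path.join(output_dir, name)
        # Skip folders that lack the state file the Trainer tries to open on resume.
        if match and os.path.isfile(os.path.join(path, TRAINER_STATE_NAME)):
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Highest step number wins.
    return max(candidates)[1]
```

Since `resume_from_checkpoint` in `Trainer.train` accepts a path as well as `True`, the result could be used as `trainer.train(resume_from_checkpoint=ckpt)` when `ckpt` is not `None`, falling back to a plain `trainer.train()` otherwise.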
Even if you add a trainer_state.json file, training will not resume, because it will then ask for the optimizer files and .pth files, which are also not saved. I think the best way is to comment out their function and keep only their "super(LlaVaTrainer, self) ... " line, then let the code run. I have tested this: it does not save the mm_projector.bin file at each stage, but it does save the entire weights at each checkpoint.
You can manually extract the mm_projector weights later. If you don't want to do that, don't worry: at the end of training it automatically saves the trainer_state.json, mm_projector.bin, and config.json files once the last step completes.
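A rough sketch of the manual extraction, assuming the projector parameters can be recognized by the substring `mm_projector` in their names (inferred from the mm_projector.bin file name, not confirmed in this thread). The helper `extract_projector_weights` is hypothetical, and loading the full state dict from a checkpoint is left out because the on-disk format depends on the run.

```python
from typing import Any, Dict, Mapping


def extract_projector_weights(
    state_dict: Mapping[str, Any], key_substring: str = "mm_projector"
) -> Dict[str, Any]:
    """Keep only the entries whose parameter name contains key_substring."""
    # Assumption: projector parameter names contain "mm_projector".
    return {name: value for name, value in state_dict.items() if key_substring in name}
```

With PyTorch, the filtered dictionary could then be written out with something like `torch.save(extract_projector_weights(full_state_dict), "mm_projector.bin")`, where `full_state_dict` is the state dict loaded from the saved checkpoint.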