The third phase of training #24
Comments
Hi, you can continue the training from, e.g., the last checkpoint that was saved. Uncomment lines 565, 567 and 580 (and below) to do so.
On 29 Feb 2024, at 08:56, QwYko-AHU wrote:
Hello, I recently ran into trouble when replicating your paper. During the third stage of training, the program was interrupted partway through and training stopped. In this case, do I have to retrain from scratch, or can I continue training from where it stopped? I saw that the code "# resume_training = False" in line 568 of train_full_model.py is commented out, which seems to mean that training can be resumed?
Yes, exactly.
On 1 Mar 2024, at 09:17, QwYko-AHU wrote:
Hello, thank you very much for your timely and helpful reply; this is really exciting news. I would like to confirm some details. The procedure is to uncomment line 565 in train_full_model.py and change it to "resume_training = True", then change the checkpoint loaded in lines 567, 568 and 569 to the checkpoint saved at the end of the last training run, and finally uncomment lines 580 through 586. Is that right?
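For readers following along, the resume pattern discussed here can be illustrated with a minimal, framework-agnostic sketch. It is not the code in train_full_model.py: `save_checkpoint`, `load_checkpoint`, `train`, `demo`, the JSON file format and the `loss_sum` field are all hypothetical stand-ins, and only the `resume_training` flag name is taken from the thread. The point is that the step counter and the training state are saved together, so a resumed run picks up at the last saved step rather than at step 0.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, state):
    """Write step and state to a temp file, then rename, so a crash never leaves a half-written checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return (step, state) from a checkpoint written by save_checkpoint."""
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]


def train(path, total_steps, resume_training=False, save_every=10, stop_at=None):
    """Toy loop; 'state' stands in for model, optimizer and scheduler state (an assumption)."""
    if resume_training and os.path.exists(path):
        step, state = load_checkpoint(path)  # continue from the last saved step
    else:
        step, state = 0, {"loss_sum": 0.0}   # fresh run
    while step < total_steps:
        if stop_at is not None and step == stop_at:
            return step, state               # simulate an interruption
        state["loss_sum"] += 1.0 / (step + 1)  # placeholder for one training step
        step += 1
        if step % save_every == 0:
            save_checkpoint(path, step, state)
    return step, state


def demo():
    """Interrupt a run, resume it, and compare against an uninterrupted run."""
    workdir = tempfile.mkdtemp()
    ckpt_path = os.path.join(workdir, "ckpt.json")
    interrupted_step, _ = train(ckpt_path, 50, stop_at=25)
    saved_step, _ = load_checkpoint(ckpt_path)
    resumed_step, resumed_state = train(ckpt_path, 50, resume_training=True)
    _, full_state = train(os.path.join(workdir, "full.json"), 50)
    return interrupted_step, saved_step, resumed_step, resumed_state, full_state
```

In this sketch the work done between the last saved checkpoint and the interruption is repeated after resuming, which is the usual cost of checkpoint-based resumption.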
Thank you for your prompt reply and your patient guidance. I wish you every success in your work.
I don’t know, I haven’t seen this before.
On 4 Mar 2024, at 03:56, QwYko-AHU wrote:
Hello, I have one more question about the third stage of training. When I get to "Evaluating at step 112800!", is it normal for the following information to appear? It has been stuck here for a long time, and it barely uses the GPU.
[Screenshot: 2024-03-04.105400.png]
Well, thank you very much for your prompt reply.
Hello, sorry to bother you again. I see in your paper that training the entire model took only 45 hours, which is incredible. Your graphics card has 48 GB of memory and mine has 24 GB, but I have already spent 80 hours on the third stage of training and it is still not finished. I don't know if something is wrong.
I notice that 90% of my training time is spent in "evaluate_model". Is this a problem?
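One way to check a figure like this is to time the training and evaluation phases separately. The sketch below is a generic illustration, not code from the repository: `PhaseTimer`, `run`, `demo`, and the `train_step` and `evaluate_model` callables passed in are hypothetical placeholders, and the `eval_every` parameter is an assumption about how evaluation is scheduled.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock time per named phase."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def fractions(self):
        """Return each phase's share of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}


def run(num_steps, eval_every, timer, train_step, evaluate_model):
    """Toy loop: one training step per iteration, evaluation every eval_every steps."""
    num_evals = 0
    for step in range(1, num_steps + 1):
        with timer.phase("train"):
            train_step()
        if step % eval_every == 0:
            with timer.phase("evaluate"):
                evaluate_model()
            num_evals += 1
    return num_evals


def demo():
    timer = PhaseTimer()
    # The sleeps are placeholders for real work.
    num_evals = run(
        num_steps=20,
        eval_every=5,
        timer=timer,
        train_step=lambda: time.sleep(0.001),
        evaluate_model=lambda: time.sleep(0.01),
    )
    return num_evals, timer.fractions()
```

If evaluation dominates in such a measurement, the usual levers are evaluating less often or on fewer samples. Whether either is appropriate for reproducing the paper's results is a question for the maintainers.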