
The third phase of training #24

Open · QwYko-AHU opened this issue Feb 29, 2024 · 9 comments

@QwYko-AHU commented Feb 29, 2024

Hello, I recently ran into trouble while replicating your paper. For reasons on my end, the program was interrupted partway through the third stage of training, and training stopped. In this case, do I have to retrain from scratch, or can I resume from where it was interrupted? I noticed that the line "# resume_training = False" at line 568 of train_full_model.py is commented out, which seems to suggest that training can be resumed?

@ttanida (Owner) commented Feb 29, 2024 via email

Hi, you can continue the training from e.g. the last checkpoint that was saved. Comment-in lines 565, 567 and 580 (and below) to do so.

@QwYko-AHU (Author)
Hello, thank you very much for your timely and helpful reply; this is really exciting news. I would like to ask about the details. The specific procedure is: uncomment line 565 of train_full_model.py and change it to "resume_training = True", then point the checkpoint loading in lines 567, 568, and 569 to the checkpoint saved at the end of the last run, and finally uncomment lines 580 through 586. Is that right?

@ttanida (Owner) commented Mar 1, 2024 via email

Yes exactly.

@QwYko-AHU (Author)

Thank you for your prompt reply and your patient guidance. I wish you every success in your work.
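(For anyone else resuming stage 3: below is a minimal sketch of the checkpoint-resume pattern the thread describes. The function signature and the checkpoint keys "model", "optimizer", "epoch", and "overall_steps" are illustrative assumptions, not necessarily what train_full_model.py actually saves.)

```python
import torch
from torch import nn

def maybe_resume(model: nn.Module,
                 optimizer: torch.optim.Optimizer,
                 checkpoint_path: str,
                 resume_training: bool = True):
    """Restore model/optimizer state and step counters from a checkpoint.

    The checkpoint keys used here are assumptions for illustration;
    match them to whatever the training script actually stores.
    """
    start_epoch, overall_steps = 0, 0
    if resume_training:
        checkpoint = torch.load(checkpoint_path, map_location="cpu")
        model.load_state_dict(checkpoint["model"])
        optimizer.load_state_dict(checkpoint["optimizer"])
        # Continue counting from where the interrupted run stopped.
        start_epoch = checkpoint.get("epoch", 0)
        overall_steps = checkpoint.get("overall_steps", 0)
    return start_epoch, overall_steps
```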

@QwYko-AHU (Author)

Hello, I have one more question about the third stage of training: when I get to "Evaluating at step 112800!", is it normal for the following information to appear? It has been stuck here for a long time, and it isn't using the GPU much.
[Attached screenshot: Screenshot 2024-03-04 105400]

@ttanida (Owner) commented Mar 4, 2024 via email

I don't know, haven't seen this before.

@QwYko-AHU (Author)

Well, thank you very much for your prompt reply.
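(Editorial aside: one way to tell a genuine hang from slow, largely CPU-bound metric computation is to log per-batch timings inside the evaluation loop. A minimal sketch; the forward call and loader are placeholders, not the repo's actual evaluate_model code.)

```python
import time
import torch

@torch.no_grad()
def timed_eval(model, val_loader, log_every: int = 10):
    """Print elapsed time every few batches so a stall can be
    distinguished from evaluation that is merely slow."""
    model.eval()
    start = time.perf_counter()
    for i, batch in enumerate(val_loader):
        _ = model(batch)  # placeholder for the real forward/generation call
        if i % log_every == 0:
            print(f"eval batch {i}: {time.perf_counter() - start:.1f}s elapsed", flush=True)
```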

@QwYko-AHU (Author)

Hello. I'm sorry to bother you again. I see in your paper that training the entire model took only 45 hours, which is incredible. Your GPU has 48 GB of memory while mine has 24 GB, but I have already spent 80 hours on the third stage of training and it still isn't finished. I don't know if something is wrong.

@QwYko-AHU (Author)

I notice that 90% of my model training time is spent on "evaluate_model". Is this a problem?
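(Editorial aside: autoregressive report generation plus NLG-metric computation during evaluation is often far slower than a training step, so evaluation dominating wall-clock time is plausible. The usual levers are evaluating less often and scoring only part of the validation set. A sketch of that pattern; the parameter names and injected callables are hypothetical, not the repo's actual configuration.)

```python
def training_loop(model, train_loader, val_loader,
                  train_one_step, evaluate_model,
                  evaluate_every_k_steps: int = 4000,
                  num_eval_batches: int = 100):
    """Interleave training with infrequent, capped evaluation.

    Raising evaluate_every_k_steps and lowering num_eval_batches are
    the two knobs that shrink time spent in evaluate_model; a full
    validation pass can be reserved for the end of training.
    """
    for step, batch in enumerate(train_loader):
        train_one_step(model, batch)
        if step > 0 and step % evaluate_every_k_steps == 0:
            # Score only a fixed number of validation batches.
            evaluate_model(model, val_loader, num_batches=num_eval_batches)
```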
