
The third phase of training #24

Open · QwYko-AHU opened this issue Feb 29, 2024 · 9 comments

@QwYko-AHU commented Feb 29, 2024

Hello, I recently ran into trouble while replicating your paper. For reasons on my end, the program was interrupted partway through the third stage of training, and training stopped. In this case, do I have to retrain from scratch, or can I resume from where it was interrupted? I noticed that the line "# resume_training = False" at line 568 of train_full_model.py is commented out, which seems to suggest that training can be resumed?

@ttanida (Owner) commented Feb 29, 2024 via email

Hi, you can continue the training from e.g. the last checkpoint that was saved. Comment-in lines 565, 567 and 580 (and below) to do so.

@QwYko-AHU (Author)
Hello, thank you very much for your timely and helpful reply; this is really exciting news. I would like to ask about the details. The specific procedure is: uncomment line 565 of train_full_model.py and change it to "resume_training = True", then point the checkpoint loading in lines 567, 568, and 569 to the checkpoint saved at the end of the last run, and finally uncomment lines 580 through 586. Is that right?

@ttanida (Owner) commented Mar 1, 2024 via email

Yes exactly.

@QwYko-AHU (Author)

Thank you for your prompt reply and your patient guidance. I wish you every success in your work.
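(For anyone else resuming stage 3: below is a minimal sketch of the checkpoint-resume pattern the thread describes. The function signature and the checkpoint keys "model", "optimizer", "epoch", and "overall_steps" are illustrative assumptions, not necessarily what train_full_model.py actually saves.)

```python
import torch
from torch import nn

def maybe_resume(model: nn.Module,
                 optimizer: torch.optim.Optimizer,
                 checkpoint_path: str,
                 resume_training: bool = True):
    """Restore model/optimizer state and step counters from a checkpoint.

    The checkpoint keys used here are assumptions for illustration;
    match them to whatever the training script actually stores.
    """
    start_epoch, overall_steps = 0, 0
    if resume_training:
        checkpoint = torch.load(checkpoint_path, map_location="cpu")
        model.load_state_dict(checkpoint["model"])
        optimizer.load_state_dict(checkpoint["optimizer"])
        # Continue counting from where the interrupted run stopped.
        start_epoch = checkpoint.get("epoch", 0)
        overall_steps = checkpoint.get("overall_steps", 0)
    return start_epoch, overall_steps
```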

@QwYko-AHU (Author)

Hello, I have one more question about the third stage of training: when I get to "Evaluating at step 112800!", is it normal for the following information to appear? It has been stuck here for a long time, and it isn't using the GPU much.
[Attached screenshot: Screenshot 2024-03-04 105400]

@ttanida (Owner) commented Mar 4, 2024 via email

I don't know, haven't seen this before.

@QwYko-AHU (Author)

Well, thank you very much for your prompt reply.
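(Editorial aside: one way to tell a genuine hang from slow, largely CPU-bound metric computation is to log per-batch timings inside the evaluation loop. A minimal sketch; the forward call and loader are placeholders, not the repo's actual evaluate_model code.)

```python
import time
import torch

@torch.no_grad()
def timed_eval(model, val_loader, log_every: int = 10):
    """Print elapsed time every few batches so a stall can be
    distinguished from evaluation that is merely slow."""
    model.eval()
    start = time.perf_counter()
    for i, batch in enumerate(val_loader):
        _ = model(batch)  # placeholder for the real forward/generation call
        if i % log_every == 0:
            print(f"eval batch {i}: {time.perf_counter() - start:.1f}s elapsed", flush=True)
```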

@QwYko-AHU (Author)

Hello. I'm sorry to bother you again. I see in your paper that training the entire model took only 45 hours, which is incredible. Your GPU has 48 GB of memory while mine has 24 GB, but I have already spent 80 hours on the third stage of training and it still isn't finished. I don't know if something is wrong.

@QwYko-AHU (Author)

I notice that 90% of my model training time is spent on "evaluate_model". Is this a problem?
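(Editorial aside: autoregressive report generation plus NLG-metric computation during evaluation is often far slower than a training step, so evaluation dominating wall-clock time is plausible. The usual levers are evaluating less often and scoring only part of the validation set. A sketch of that pattern; the parameter names and injected callables are hypothetical, not the repo's actual configuration.)

```python
def training_loop(model, train_loader, val_loader,
                  train_one_step, evaluate_model,
                  evaluate_every_k_steps: int = 4000,
                  num_eval_batches: int = 100):
    """Interleave training with infrequent, capped evaluation.

    Raising evaluate_every_k_steps and lowering num_eval_batches are
    the two knobs that shrink time spent in evaluate_model; a full
    validation pass can be reserved for the end of training.
    """
    for step, batch in enumerate(train_loader):
        train_one_step(model, batch)
        if step > 0 and step % evaluate_every_k_steps == 0:
            # Score only a fixed number of validation batches.
            evaluate_model(model, val_loader, num_batches=num_eval_batches)
```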
