
The different results between eval mode and test mode. #26

Open
eyuansu62 opened this issue May 12, 2022 · 14 comments
Labels
help wanted Extra attention is needed

Comments

@eyuansu62

Why do I get different results between eval mode and test mode?
[screenshot: eval vs. test results]

@ChenWu98
Contributor

Hi,

Could you share the command you ran for this experiment?

@eyuansu62
Author

eyuansu62 commented May 13, 2022

The command is as follows:

python -m torch.distributed.launch --nproc_per_node 4 --master_port 12 train.py --seed 2 --cfg Salesforce/T5_3b_finetune_spider_with_cell_value.cfg --run_name T5_3b_finetune_spider_with_cell_value --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 8 --num_train_epochs 400 --adafactor true --learning_rate 1e-4 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_large_finetune_spider_with_cell_value  --overwrite_output_dir --per_device_train_batch_size 2 --per_device_eval_batch_size 8 --generation_num_beams 1 --generation_max_length 512 --input_max_length 512 --ddp_find_unused_parameters true
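
For reference, the flags that control which weights the final eval/predict run on correspond roughly to the following HuggingFace Seq2SeqTrainingArguments; this is only a sketch of the relevant options, not the repo's full configuration:

# Sketch of the checkpoint-selection flags, assuming the HuggingFace
# transformers Seq2SeqTrainingArguments API; all other options omitted.
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="output/T5_large_finetune_spider_with_cell_value",
    evaluation_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=1,
    metric_for_best_model="avr",   # the averaged metric named in the command
    greater_is_better=True,
    load_best_model_at_end=True,   # best eval checkpoint is reloaded before the final do_eval/do_predict
    predict_with_generate=True,
    generation_num_beams=1,
    generation_max_length=512,
)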

@ChenWu98
Contributor

Is the highest eval score the same as the test score?

@eyuansu62
Author

The checkpoint I chose is the one with the highest eval score during training.
As you can see, it is different from the test score.

@ChenWu98
Contributor

Can you run the following command on the same machine (so that the previous checkpoints are still there) and see if the results are different?

python -m torch.distributed.launch --nproc_per_node 4 --master_port 12 train.py --seed 2 --cfg Salesforce/T5_3b_finetune_spider_with_cell_value.cfg --run_name T5_3b_finetune_spider_with_cell_value --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 8 --num_train_epochs 0 --adafactor true --learning_rate 1e-4 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_large_finetune_spider_with_cell_value --per_device_train_batch_size 2 --per_device_eval_batch_size 8 --generation_num_beams 1 --generation_max_length 512 --input_max_length 512 --ddp_find_unused_parameters true
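
If the gap persists, one way to isolate it outside train.py is to load the saved checkpoint directly and run evaluate() and predict() on the same split with identical generation settings, e.g. with the plain HuggingFace Seq2SeqTrainer. This is only a rough sketch; the checkpoint path, eval_dataset, and compute_metrics below are placeholders for what the repo builds internally:

# Rough sketch (not the repo's train.py): run evaluate() and predict() on the
# same split with the same checkpoint and generation settings, assuming the
# HuggingFace Seq2SeqTrainer API.
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

ckpt = "output/T5_large_finetune_spider_with_cell_value/checkpoint-XXXX"  # placeholder path
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

args = Seq2SeqTrainingArguments(
    output_dir="tmp_compare",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_num_beams=1,
    generation_max_length=512,
)
trainer = Seq2SeqTrainer(model=model, args=args, tokenizer=tokenizer,
                         compute_metrics=compute_metrics)  # compute_metrics: placeholder for the repo's metric function

eval_metrics = trainer.evaluate(eval_dataset=eval_dataset)  # eval_dataset: placeholder for the tokenized dev split
pred_output = trainer.predict(test_dataset=eval_dataset)

print(eval_metrics)
print(pred_output.metrics)  # should match eval_metrics if both code paths are deterministic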

@Timothyxxx
Contributor

Timothyxxx commented May 13, 2022

@eyuansu62 Hi, any new progress over there?
We double-checked our experiment logs and didn't find the case you showed. We also looked through the issues of PICARD and saw that you opened a similar issue there, so it is very likely we are facing the same issue, caused by the same factor on your machine.

Hope we can figure that out together!

@eyuansu62
Author

They are still a little different.
[screenshot: eval vs. test results after re-running]

@Timothyxxx
Contributor

Could you double-check the evaluation and prediction JSON files?
That would help us narrow down where the problem lies.

@eyuansu62
Author

I checked the evaluation and prediction JSON files and found they are indeed different, no matter whether do_train=False or num_train_epochs=0.

The differing SQL queries look like the following; only a few conditions are wrong:
select singer.name from concert join singer_in_concert on concert.concert_id = singer_in_concert.concert_id where concert.year = 2014
select singer.name from concert join singer_in_concert on concert.concert_id = singer_in_concert.singer_id where concert.year = 2014
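
A small script along these lines can list all such mismatches (the file names and the assumed JSON layout are placeholders, not necessarily the repo's exact output format):

# Minimal sketch for diffing the two output files; assumes each file is a
# JSON list of strings or of dicts with a "prediction" field.
import json

with open("eval_predictions.json") as f:   # placeholder path
    eval_preds = json.load(f)
with open("test_predictions.json") as f:   # placeholder path
    test_preds = json.load(f)

for i, (a, b) in enumerate(zip(eval_preds, test_preds)):
    pa = a.get("prediction", a) if isinstance(a, dict) else a
    pb = b.get("prediction", b) if isinstance(b, dict) else b
    if pa != pb:
        print(f"example {i}:\n  eval:    {pa}\n  predict: {pb}")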

@Timothyxxx
Contributor

Okay, I will keep this issue active and see if anyone finds a similar problem!

@Timothyxxx Timothyxxx added the help wanted Extra attention is needed label May 18, 2022
@Timothyxxx Timothyxxx pinned this issue May 18, 2022
@ChenWu98
Contributor

I just realized that the command you provided is for T5-3b without using DeepSpeed. I remember that we didn't manage to run it without DeepSpeed even on an A100. What kind of GPU are you using, if you remember?

@eyuansu62
Author

Well, it is actually T5-large in this cfg file; I forgot to change the file name.

@Timothyxxx
Contributor

Timothyxxx commented May 19, 2022

Hey, we asked someone else to test it on their side, and they did not get different results between eval mode and test mode (which is consistent with our results). Therefore we think it may be caused by the machine on your side. Could you provide more info about your hardware and system?

@eyuansu62
Author

[screenshots: hardware and system information]

@ChenWu98 ChenWu98 unpinned this issue Sep 1, 2022
@ChenWu98 ChenWu98 pinned this issue Sep 1, 2022