
The different results between eval mode and test mode. #26

Open
eyuansu62 opened this issue May 12, 2022 · 14 comments
Labels
help wanted Extra attention is needed

Comments

@eyuansu62

Why do I get different results between eval mode and test mode?
[screenshot: eval vs. test results]

@ChenWu98
Contributor

Hi,

Could you share the command you ran for this experiment?

@eyuansu62
Author

eyuansu62 commented May 13, 2022

The command is as follows:

python -m torch.distributed.launch --nproc_per_node 4 --master_port 12 train.py --seed 2 --cfg Salesforce/T5_3b_finetune_spider_with_cell_value.cfg --run_name T5_3b_finetune_spider_with_cell_value --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 8 --num_train_epochs 400 --adafactor true --learning_rate 1e-4 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_large_finetune_spider_with_cell_value  --overwrite_output_dir --per_device_train_batch_size 2 --per_device_eval_batch_size 8 --generation_num_beams 1 --generation_max_length 512 --input_max_length 512 --ddp_find_unused_parameters true
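
For reference, the flags that control which weights the final eval/predict run on correspond roughly to the following HuggingFace Seq2SeqTrainingArguments; this is only a sketch of the relevant options, not the repo's full configuration:

# Sketch of the checkpoint-selection flags, assuming the HuggingFace
# transformers Seq2SeqTrainingArguments API; all other options omitted.
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="output/T5_large_finetune_spider_with_cell_value",
    evaluation_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=1,
    metric_for_best_model="avr",   # the averaged metric named in the command
    greater_is_better=True,
    load_best_model_at_end=True,   # best eval checkpoint is reloaded before the final do_eval/do_predict
    predict_with_generate=True,
    generation_num_beams=1,
    generation_max_length=512,
)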

@ChenWu98
Contributor

Is the highest eval score the same as the test score?

@eyuansu62
Author

The checkpoint I chose is the one with the highest eval score during training.
As you can see, it is different from the test score.

@ChenWu98
Contributor

Can you run the following command on the same machine (so that the previous checkpoints are still there) and see if the results are different?

python -m torch.distributed.launch --nproc_per_node 4 --master_port 12 train.py --seed 2 --cfg Salesforce/T5_3b_finetune_spider_with_cell_value.cfg --run_name T5_3b_finetune_spider_with_cell_value --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 8 --num_train_epochs 0 --adafactor true --learning_rate 1e-4 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_large_finetune_spider_with_cell_value --per_device_train_batch_size 2 --per_device_eval_batch_size 8 --generation_num_beams 1 --generation_max_length 512 --input_max_length 512 --ddp_find_unused_parameters true
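
If the gap persists, one way to isolate it outside train.py is to load the saved checkpoint directly and run evaluate() and predict() on the same split with identical generation settings, e.g. with the plain HuggingFace Seq2SeqTrainer. This is only a rough sketch; the checkpoint path, eval_dataset, and compute_metrics below are placeholders for what the repo builds internally:

# Rough sketch (not the repo's train.py): run evaluate() and predict() on the
# same split with the same checkpoint and generation settings, assuming the
# HuggingFace Seq2SeqTrainer API.
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

ckpt = "output/T5_large_finetune_spider_with_cell_value/checkpoint-XXXX"  # placeholder path
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

args = Seq2SeqTrainingArguments(
    output_dir="tmp_compare",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_num_beams=1,
    generation_max_length=512,
)
trainer = Seq2SeqTrainer(model=model, args=args, tokenizer=tokenizer,
                         compute_metrics=compute_metrics)  # compute_metrics: placeholder for the repo's metric function

eval_metrics = trainer.evaluate(eval_dataset=eval_dataset)  # eval_dataset: placeholder for the tokenized dev split
pred_output = trainer.predict(test_dataset=eval_dataset)

print(eval_metrics)
print(pred_output.metrics)  # should match eval_metrics if both code paths are deterministic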

@Timothyxxx
Contributor

Timothyxxx commented May 13, 2022

@eyuansu62 Hi, any new progress over there?
We double-checked our experiment logs and didn't find the case you showed. We also looked through the issues of PICARD and saw that you opened a similar issue there, so it is very likely we are facing the same issue, caused by the same factor on your machine.

Hope we can figure that out together!

@eyuansu62
Author

They are still a little different.
[screenshot: eval vs. test results after re-running]

@Timothyxxx
Contributor

Could you double-check the evaluation and prediction JSON files?
That would help us narrow down where the problem lies.

@eyuansu62
Author

I checked the evaluation and prediction JSON files and found they are indeed different, no matter whether do_train=False or num_train_epochs=0.

The differing SQL queries look like the following; only a few conditions are wrong:
select singer.name from concert join singer_in_concert on concert.concert_id = singer_in_concert.concert_id where concert.year = 2014
select singer.name from concert join singer_in_concert on concert.concert_id = singer_in_concert.singer_id where concert.year = 2014
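
A small script along these lines can list all such mismatches (the file names and the assumed JSON layout are placeholders, not necessarily the repo's exact output format):

# Minimal sketch for diffing the two output files; assumes each file is a
# JSON list of strings or of dicts with a "prediction" field.
import json

with open("eval_predictions.json") as f:   # placeholder path
    eval_preds = json.load(f)
with open("test_predictions.json") as f:   # placeholder path
    test_preds = json.load(f)

for i, (a, b) in enumerate(zip(eval_preds, test_preds)):
    pa = a.get("prediction", a) if isinstance(a, dict) else a
    pb = b.get("prediction", b) if isinstance(b, dict) else b
    if pa != pb:
        print(f"example {i}:\n  eval:    {pa}\n  predict: {pb}")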

@Timothyxxx
Contributor

Okay, I will keep this issue active and see if anyone finds a similar problem!

@Timothyxxx Timothyxxx added the help wanted Extra attention is needed label May 18, 2022
@Timothyxxx Timothyxxx pinned this issue May 18, 2022
@ChenWu98
Contributor

I just realized that the command you provided is for T5-3b without using DeepSpeed. I remember that we didn't manage to run it without DeepSpeed even on an A100. What kind of GPU are you using, if you remember?

@eyuansu62
Author

Well, it is actually T5-large in this cfg file; I forgot to change the file name.

@Timothyxxx
Contributor

Timothyxxx commented May 19, 2022

Hey, we asked someone else to test it on their side, and they did not get different results between eval mode and test mode (which is consistent with our results). Therefore we think it may be caused by the machine on your side. Could you provide more info about your hardware and system?

@eyuansu62
Author

[screenshots: hardware and system information]

@ChenWu98 ChenWu98 unpinned this issue Sep 1, 2022
@ChenWu98 ChenWu98 pinned this issue Sep 1, 2022