Smaller test/val loss but lower evaluation accuracy #750

Open
shuangyichen opened this issue Jan 17, 2024 · 3 comments

Comments

@shuangyichen

When I finetune llama-7b on gsm-8k with different finetuning methods and compare the test loss and evaluation accuracy across methods, I find that one of the methods has a smaller test/val loss but a lower evaluation accuracy. Is this reasonable?

@qbc2016
Collaborator

qbc2016 commented Jan 18, 2024

Hello! It may be related to the size of your dataset partition: if the test/val set is too small, the loss will be unstable.
On the other hand, the evaluation accuracy depends on only one exact value parsed from the generated text, whereas the val/test loss is averaged over all the tokens the model generates.
We also find that the validation loss may not be a reliable indicator of generalization performance. For more details, please refer to our paper.
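To make the distinction concrete, here is a minimal, self-contained sketch (not FederatedScope code; all names and the toy probabilities are hypothetical) of how a GSM8K-style exact-match accuracy, which parses a single final number, can disagree with a per-token loss averaged over the whole generation:

```python
# Illustrative sketch: per-token loss vs. exact-match accuracy on a GSM8K-style task.
import math
import re

def exact_match_accuracy(generations, references):
    """Accuracy depends on a single parsed value: the last number in the
    generated text must equal the last number in the reference."""
    def parse_final_number(text):
        numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
        return numbers[-1] if numbers else None

    correct = sum(parse_final_number(g) == parse_final_number(r)
                  for g, r in zip(generations, references))
    return correct / len(references)

def mean_token_loss(token_probs):
    """Val/test loss is averaged over *all* target tokens: the negative
    log-probability the model assigns to each gold token."""
    losses = [-math.log(p) for seq in token_probs for p in seq]
    return sum(losses) / len(losses)

# Toy example: method B is more confident on most tokens (lower loss)
# but gets the final number wrong, so its accuracy is lower.
refs = ["... so the answer is 42"]
gen_a = ["The answer is 42"]          # correct final value
gen_b = ["The answer is 41"]          # wrong final value
probs_a = [[0.6, 0.6, 0.6, 0.6]]      # hypothetical per-token gold probabilities
probs_b = [[0.9, 0.9, 0.9, 0.3]]

print(exact_match_accuracy(gen_a, refs), mean_token_loss(probs_a))  # 1.0, ~0.51
print(exact_match_accuracy(gen_b, refs), mean_token_loss(probs_b))  # 0.0, ~0.38
```

In this toy case the second method has the lower average token loss yet zero accuracy, because accuracy is decided entirely by the one parsed value.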
Best regards,

@shuangyichen
Author

I wonder whether the phenomenon discussed in your paper occurs only in the low-fidelity scenario, or in general FL?

@qbc2016
Collaborator

qbc2016 commented Jan 22, 2024

What we observe in the paper is in a low-fidelity scenario. For finetuning LLMs in general FL, it may be interesting to investigate the relationship between the val/test loss and the final evaluation accuracy; I'm not sure there has been a study on this.
