Bump AlpacaEval to 0.6, add AlpacaEval 2 #139

Merged: 1 commit into main on May 25, 2024

Conversation

@hamishivi (Collaborator) commented Apr 5, 2024

Recently, AlpacaEval 0.6 came out with the length-controlled AlpacaEval 2. I think this is worth including in our evals now that we are saturating AlpacaEval 1. This PR adds AlpacaEval 2 as an explicit task in the eval script.

Old Tulu 7B DPO score: 85.09; new score: 84.45
Old Tulu 13B DPO score: 89.46; new score: 88.29
The scores are a little lower, but within noise (~1.5 points). Some nondeterminism in vLLM may also be at play here.
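
For context, a minimal sketch of what the underlying library call looks like (it mirrors the `alpaca_farm_evaluate` alias in the diff below, but the placeholder path and keyword arguments are illustrative assumptions, not this repo's exact code):

```python
from alpaca_eval import evaluate as alpaca_farm_evaluate

# "predictions.json" is a placeholder path: a JSON list of
# {"instruction", "output", "generator"} records, one per AlpacaEval prompt.
# With no annotators_config passed, the library's default annotator is used
# (which one depends on the AlpacaEval 1 vs. 2 setting; see the discussion below).
df_leaderboard, annotations = alpaca_farm_evaluate(
    model_outputs="predictions.json",
    is_return_instead_of_print=True,  # return the leaderboard rather than printing it
)
print(df_leaderboard)
```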

@yizhongw (Contributor)

The PR LGTM. Feel free to merge, but we need to be aware of the performance inconsistency for our new experiments.

The performance drop is a bit concerning if we don't know the exact reason. I would expect vLLM to be quite deterministic, though. Is there any discussion about the nondeterminism?

```diff
@@ -108,7 +108,6 @@ def main(args):
     df_leaderboard, annotations = alpaca_farm_evaluate(
         model_outputs=model_results,
         reference_outputs=args.reference_path,
-        annotators_config="alpaca_eval_gpt4",
```
@yizhongw (Contributor) commented on the diff:

So, I guess we use their default configs all the time?

@hamishivi (Collaborator, Author) replied:

Yeah, AlpacaEval also swaps the default depending on the IS_ALPACA_EVAL_2 flag. Our config there matches their default: https://github.com/tatsu-lab/alpaca_eval/blob/main/src/alpaca_eval/constants.py#L42
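
For readers not following the link, the selection logic in that constants.py is roughly the following (a paraphrase, not a verbatim copy; config names may differ across alpaca_eval versions):

```python
import os

# Paraphrased from alpaca_eval/constants.py: the IS_ALPACA_EVAL_2 environment
# variable (default "True" as of 0.6) picks which annotator config is the default.
IS_ALPACA_EVAL_2 = os.environ.get("IS_ALPACA_EVAL_2", "True").lower() in ("true", "1")

DEFAULT_ANNOTATOR_CONFIG = (
    "weighted_alpaca_eval_gpt4_turbo" if IS_ALPACA_EVAL_2 else "alpaca_eval_gpt4"
)
print(DEFAULT_ANNOTATOR_CONFIG)
```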

@hamishivi (Collaborator, Author)

Re: nondeterminism, there isn't any particular discussion, but let me wait on merging this until I do one or two reruns to make sure. Jacob also mentioned that older vLLM versions have some nondeterminism issues.
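
(As an aside, not part of this PR: one common way to cut generation-side variance is to pin greedy decoding and a fixed seed in vLLM, sketched below. Whether that removes all nondeterminism depends on the vLLM version, and the model name, prompt, and token limit here are placeholders.)

```python
from vllm import LLM, SamplingParams

# Placeholder model/prompts; greedy decoding (temperature=0) plus a fixed seed
# makes runs as repeatable as the vLLM version allows.
llm = LLM(model="allenai/tulu-2-dpo-13b", seed=42)
sampling_params = SamplingParams(temperature=0.0, max_tokens=2048)

prompts = ["Give three tips for staying healthy."]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)
```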

@hamishivi (Collaborator, Author)

I came back to this and got the following after running three times with the new code (AlpacaEval 1 / AlpacaEval 2):
Tulu 2 13B: 78.8 ± 0.3, 9.7 ± 0.2
Tulu 2 13B DPO: 88.5 ± 0.4, 13.3 ± 0.2

And with the old code (again running three times, AlpacaEval 1 only):
Tulu 2 13B: 79.3 ± 0.4
Tulu 2 13B DPO: 88.3 ± 0.2

So I think the 'old' results before were maybe just a little lucky; the old and new code match performance quite closely (within 0.5 points). AlpacaEval also has lower variance than I expected (< 1 point).

Hopefully this is enough; I'm going to merge.

hamishivi merged commit c857718 into main on May 25, 2024
1 check passed