Bump AlpacaEval to 0.6, add AlpacaEval 2 #139

Merged: 1 commit into main on May 25, 2024

Conversation

@hamishivi (Collaborator) commented Apr 5, 2024

Recently, AlpacaEval 0.6 came out with the length-controlled AlpacaEval 2. I think this is worth including in our evals now that we are saturating AlpacaEval 1. This PR adds AlpacaEval 2 as an explicit task in the eval script.

Old Tulu 7B DPO score: 85.09; new score: 84.45
Old Tulu 13B DPO score: 89.46; new score: 88.29
The scores are a little lower, but within noise (~1.5 points). Some nondeterminism in vLLM may also be at play here.
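
For context, a minimal sketch of what the underlying library call looks like (it mirrors the `alpaca_farm_evaluate` alias in the diff below, but the placeholder path and keyword arguments are illustrative assumptions, not this repo's exact code):

```python
from alpaca_eval import evaluate as alpaca_farm_evaluate

# "predictions.json" is a placeholder path: a JSON list of
# {"instruction", "output", "generator"} records, one per AlpacaEval prompt.
# With no annotators_config passed, the library's default annotator is used
# (which one depends on the AlpacaEval 1 vs. 2 setting; see the discussion below).
df_leaderboard, annotations = alpaca_farm_evaluate(
    model_outputs="predictions.json",
    is_return_instead_of_print=True,  # return the leaderboard rather than printing it
)
print(df_leaderboard)
```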

@yizhongw (Contributor)

The PR LGTM. Feel free to merge, but we need to be aware of the performance inconsistency for our new experiments.

The performance drop is a bit concerning if we don't know the exact reason. I would expect vLLM to be quite deterministic, though. Is there any discussion about the nondeterminism?

```diff
@@ -108,7 +108,6 @@ def main(args):
     df_leaderboard, annotations = alpaca_farm_evaluate(
         model_outputs=model_results,
         reference_outputs=args.reference_path,
-        annotators_config="alpaca_eval_gpt4",
```
@yizhongw (Contributor) commented on the diff:

So, I guess we use their default configs all the time?

@hamishivi (Collaborator, Author) replied:

Yeah, AlpacaEval also swaps the default depending on the IS_ALPACA_EVAL_2 flag. Our config there matches their default: https://github.com/tatsu-lab/alpaca_eval/blob/main/src/alpaca_eval/constants.py#L42
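
For readers not following the link, the selection logic in that constants.py is roughly the following (a paraphrase, not a verbatim copy; config names may differ across alpaca_eval versions):

```python
import os

# Paraphrased from alpaca_eval/constants.py: the IS_ALPACA_EVAL_2 environment
# variable (default "True" as of 0.6) picks which annotator config is the default.
IS_ALPACA_EVAL_2 = os.environ.get("IS_ALPACA_EVAL_2", "True").lower() in ("true", "1")

DEFAULT_ANNOTATOR_CONFIG = (
    "weighted_alpaca_eval_gpt4_turbo" if IS_ALPACA_EVAL_2 else "alpaca_eval_gpt4"
)
print(DEFAULT_ANNOTATOR_CONFIG)
```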

@hamishivi (Collaborator, Author)

Re: nondeterminism, there isn't any particular discussion, but let me wait on merging this until I do one or two reruns to make sure. Jacob also mentioned that older vLLM versions have some nondeterminism issues.
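
(As an aside, not part of this PR: one common way to cut generation-side variance is to pin greedy decoding and a fixed seed in vLLM, sketched below. Whether that removes all nondeterminism depends on the vLLM version, and the model name, prompt, and token limit here are placeholders.)

```python
from vllm import LLM, SamplingParams

# Placeholder model/prompts; greedy decoding (temperature=0) plus a fixed seed
# makes runs as repeatable as the vLLM version allows.
llm = LLM(model="allenai/tulu-2-dpo-13b", seed=42)
sampling_params = SamplingParams(temperature=0.0, max_tokens=2048)

prompts = ["Give three tips for staying healthy."]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)
```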

@hamishivi (Collaborator, Author)

I came back to this and got the following after running three times with the new code (AlpacaEval 1 / AlpacaEval 2):
Tulu 2 13B: 78.8 ± 0.3, 9.7 ± 0.2
Tulu 2 13B DPO: 88.5 ± 0.4, 13.3 ± 0.2

And with the old code (again running three times, AlpacaEval 1 only):
Tulu 2 13B: 79.3 ± 0.4
Tulu 2 13B DPO: 88.3 ± 0.2

So I think the 'old' results before were maybe just a little lucky; the old and new code match performance quite closely (within 0.5 points). AlpacaEval also has lower variance than I expected (< 1 point).

Hopefully this is enough; I'm going to merge.

hamishivi merged commit c857718 into main on May 25, 2024
1 check passed