Best of N benchmark #11

natolambert · 2024-02-08T02:46:35Z

Take a few chat models as the “base set”, say 1-3, like Tulu 2 7B and Tulu 2 13B (maybe OLMo-instruct)
Generate ~8 completions per prompt in AlpacaEval (this is the held-out set)
Use each RM to choose the best of the N completions for each prompt, then run AlpacaEval on the selected outputs
Score the delta for each RM in the batch on a set task (AlpacaEval) and a set base model (Tulu)
Could do this with MT-Bench, but two-turn is harder
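
A rough sketch of the selection step described above, in Python. Everything here is hypothetical and for illustration only: `best_of_n`, `select_for_eval`, `rm_delta`, and `toy_length_rm` are made-up names, the reward model is stood in for by a plain scoring callable, and the "delta" is assumed to mean the best-of-N win rate minus the base model's win rate without reranking. It does not call AlpacaEval or any real RM.

```python
from typing import Callable, Dict, List

# Hypothetical RM interface: (prompt, completion) -> scalar reward.
ScoreFn = Callable[[str, str], float]


def best_of_n(prompt: str, completions: List[str], score_fn: ScoreFn) -> str:
    """Return the completion the RM scores highest (first one wins on ties)."""
    if not completions:
        raise ValueError("need at least one completion per prompt")
    scores = [score_fn(prompt, c) for c in completions]
    best_idx = max(range(len(completions)), key=scores.__getitem__)
    return completions[best_idx]


def select_for_eval(candidates: Dict[str, List[str]], score_fn: ScoreFn) -> Dict[str, str]:
    """Pick one completion per prompt; the result is what would be sent to the eval."""
    return {p: best_of_n(p, cs, score_fn) for p, cs in candidates.items()}


def rm_delta(bon_win_rate: float, base_win_rate: float) -> float:
    """Assumed definition of the per-RM delta: BoN score minus unreranked base score."""
    return bon_win_rate - base_win_rate


def toy_length_rm(prompt: str, completion: str) -> float:
    """Stand-in 'reward model' that just prefers longer completions."""
    return float(len(completion))


if __name__ == "__main__":
    # Toy data in place of ~8 sampled completions per AlpacaEval prompt.
    candidates = {
        "prompt-1": ["a", "abc", "ab"],
        "prompt-2": ["xx", "x"],
    }
    print(select_for_eval(candidates, toy_length_rm))
```

Keeping the RM behind a single callable means the same candidate pool can be reranked by every RM in the batch, so any difference in the downstream score comes from the RM's choices rather than from the generations.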

Obvious flaws, but this seems WAY better than nothing.

natolambert · 2024-02-12T22:20:13Z

@yuchenlin is starting this, woohoo!

natolambert · 2024-02-22T20:12:15Z

Partially closed in #30, wrapping up soon.

natolambert assigned yuchenlin Feb 12, 2024

natolambert linked a pull request Apr 25, 2024 that will close this issue

bon eval #111

Open
