multi gpu inference with run_rm.py #95

Open
SeungoneKim opened this issue Apr 1, 2024 · 3 comments · May be fixed by #125
Labels: enhancement (New feature or request)

Comments


SeungoneKim commented Apr 1, 2024

Hello Nathan,

Thank you for this valuable resource! I strongly believe we need more standardized benchmarks for evaluating reward/evaluator models.

I think submit_eval_jobs.py (using AI2's Beaker) supports multi-GPU inference, but run_rm.py doesn't at the moment.
I was wondering if this is intended (correct me if I'm wrong)!
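For context, something like the following is what I have in mind. This is only a minimal sketch assuming a standard transformers sequence-classification RM; the checkpoint name, dtype, and device_map usage are placeholders for illustration, not how run_rm.py currently works:

```python
# Hedged sketch: shard a large classifier-style RM across all visible GPUs with
# device_map="auto" instead of pinning it to a single device. The checkpoint
# name is a placeholder.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "some-large-reward-model"  # placeholder, not a real checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # accelerate splits the weights across available GPUs
).eval()


@torch.no_grad()
def score(prompt: str, completion: str) -> float:
    """Return the scalar reward for one prompt/completion pair."""
    inputs = tokenizer(prompt, completion, return_tensors="pt").to(model.device)
    # For a single-label classifier RM, the lone logit is the reward.
    return model(**inputs).logits[0, 0].item()
```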

Best,
Seungone

@natolambert (Collaborator)

Hey @SeungoneKim -- we just haven't needed it yet (the biggest classifiers are 34B). Happy to add it.

run_dpo.py works nicely with 2, 4, 6, or 8 GPUs. That's why it's included.
Lmk if you want to open a PR :)


SeungoneKim commented Apr 3, 2024

Thanks for your response @natolambert!

I was trying to test generative reward modeling (with GPT-4, Prometheus, and Auto-J), and it seems like run_dpo.py provides slightly different functionality from what I need.

Considering that generative RMs need to generate CoT-style feedback before making their scoring decision, I think it would be best to integrate vLLM and add a separate run_generative_rm.py script. Users could then add more generative RMs by implementing the code that parses the output (reward).

If this makes sense to you, I'll open a pull request for this and try to keep the code style as close to run_rm.py as possible!
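To make the idea concrete, here is a rough sketch of the vLLM path I have in mind; the checkpoint name, tensor_parallel_size, judge prompt, and "[RESULT] <score>" verdict format are all assumptions for illustration, not a finished design:

```python
# Rough sketch of a run_generative_rm.py built on vLLM: generate CoT-style
# feedback that ends in a verdict, then parse the score out of the text.
import re
from vllm import LLM, SamplingParams

llm = LLM(model="prometheus-eval/prometheus-7b-v2.0",  # placeholder checkpoint
          tensor_parallel_size=2)                       # multi-GPU via tensor parallelism
params = SamplingParams(temperature=0.0, max_tokens=512)


def parse_reward(text: str):
    """Pull the final '[RESULT] <score>' verdict out of the generated feedback."""
    match = re.search(r"\[RESULT\]\s*(\d+)", text)
    return float(match.group(1)) if match else None


prompts = ["...judge prompt asking for feedback followed by [RESULT] <1-5>..."]
outputs = llm.generate(prompts, params)
rewards = [parse_reward(out.outputs[0].text) for out in outputs]
```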

@natolambert (Collaborator)

@SeungoneKim generative RMs (via API) are being added in #86, but adding the full generation piece is another can of worms. I agree with your path, I just worry a bit about complexity. It's probably worth having, though.

The API implementation should be closer to what you want to build off of.
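For intuition, the API route boils down to something like the following. This is illustrative only and not the actual #86 implementation; the pairwise prompt, model name, and 'A'/'B' verdict convention are assumptions:

```python
# Illustrative sketch: send a pairwise judge prompt to an API model and parse
# the single-letter verdict into a chosen-vs-rejected decision.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge(question: str, answer_a: str, answer_b: str) -> str:
    messages = [
        {"role": "system",
         "content": "Compare the two answers and reply with only 'A' or 'B'."},
        {"role": "user",
         "content": f"Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}"},
    ]
    resp = client.chat.completions.create(
        model="gpt-4-turbo", messages=messages, temperature=0
    )
    # 'A' means the first completion wins, which maps directly onto
    # accuracy-style chosen-vs-rejected scoring.
    return resp.choices[0].message.content.strip()
```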

Here are preliminary results

Claude results:
Haiku {'Chat': 0.9273743016759777, 'Chat Hard': 0.5197368421052632, 'Safety': 0.8210275184275184, 'Reasoning': 0.7060194658154636}
Sonnet {'Chat': 0.9343575418994413, 'Chat Hard': 0.5657894736842105, 'Safety': 0.8367826605826606, 'Reasoning': 0.6907005374583948}
Opus {'Chat': 0.946927374301676, 'Chat Hard': 0.6030701754385965, 'Safety': 0.8905447525447526, 'Reasoning': 0.7868223795492989}
(reminder) OpenAI results:
GPT-3.5 {'Chat': 0.9217877094972067, 'Chat Hard': 0.4451754385964912, 'Safety': 0.6229577395577396, 'Reasoning': 0.5912315163420091}
GPT-4 Turbo {'Chat': 0.952513966480447, 'Chat Hard': 0.743421052631579, 'Safety': 0.8719219375219376, 'Reasoning': 0.8692366453865881}

natolambert added the enhancement (New feature or request) label on Apr 11, 2024
natolambert linked a pull request (#125) on May 11, 2024 that will close this issue