[Add Model] Pairwise Preference Model #123
Conversation
I created a PairPMPipeline class to use the pairwise preference model, and included an example showing how to use it.
@WeiXiongUST how much of this can be merged with the existing code for PairRM or at least put in the same file? https://github.com/allenai/reward-bench/blob/main/rewardbench/models/pairrm.py
Otherwise LGTM (style is pending)
The training and use of this model are similar to those in the SLiC paper (SLiC-HF: Sequence Likelihood Calibration with Human Feedback). While this preference model is also for pairwise comparison, its training and use are quite different from PairRM, so I think we can refer to it as slicpairpm, since it is most similar to SLiC-HF.
Minor changes to make sure the scripts work. Sorry they're not documented better, will come soon!
rewardbench/models/slicpairpm.py (outdated)
class SlicPairPMPipeline:
    def __init__(self, model_path):
Also, this needs to be modified to match the loading in the scripts.
See reward-bench/scripts/run_rm.py, line 173 (commit a7cf68b):
reward_pipe = pipeline_builder(
Mostly need to take in the args, and if they are not all used that's also fine.
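For reference, a minimal sketch of a constructor that would line up with that call site; the exact keyword arguments forwarded by pipeline_builder are an assumption here, so anything unused is simply absorbed:

```python
class SlicPairPMPipeline:
    """Sketch only: a constructor shaped to match the pipeline_builder call in run_rm.py."""

    # `task`, `model`, and `tokenizer` mirror what run_rm.py passes in;
    # any extra keyword arguments are accepted but ignored so the script
    # does not need to special-case this model.
    def __init__(self, task, model, tokenizer, **kwargs):
        self.task = task
        self.model = model
        self.tokenizer = tokenizer
```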
Modified accordingly. But since it needs an additional tokenizer to prepare the pair (x, a1, a2) as the input, I currently load an additional tokenizer with:
self.tokenizer_data_format = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", use_fast=True)
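To illustrate the point, here is a minimal sketch of how that extra tokenizer could be used to pack the pair (x, a1, a2) into a single input string; the message layout and helper name are illustrative only and may differ from the code in the PR:

```python
from transformers import AutoTokenizer

# Assumption: the Llama-3 chat template is only used to serialize the text;
# the concrete prompt layout in the PR may differ from this sketch.
tokenizer_data_format = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", use_fast=True
)

def format_pair(prompt: str, answer_a: str, answer_b: str) -> str:
    """Render the prompt and both candidate answers into one chat-formatted string."""
    # Hypothetical layout: the instruction and both candidates go into a
    # single user turn so the model can compare them side by side.
    user_turn = f"{prompt}\n\n[Response A]\n{answer_a}\n\n[Response B]\n{answer_b}"
    messages = [{"role": "user", "content": user_turn}]
    return tokenizer_data_format.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```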
We now use task, model, and tokenizer to init the pipeline.
@WeiXiongUST just need to run the following (I think)
This reverts commit d66b833.
Have tested these two commands locally!
Great! @WeiXiongUST, send me the scores and I'll upload them, or I'll run it soon.
Could you help add the new pairwise preference model RLHFlow/pair-preference-model-LLaMA3-8B?
The usage of the model is similar to PairRM: we input a prompt and two responses, and the model returns the probability that the first response is preferred. I have implemented a pipeline in rewardbench/models/pairpm.py and attached an example of using the model for your reference. I am wondering how we should merge such a customized model into RewardBench. Many thanks in advance!
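For concreteness, a rough usage sketch based on the description above; the import path and class name come from this PR, but the task name, call signature, and model class are assumptions rather than the final RewardBench API:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from rewardbench.models.pairpm import PairPMPipeline  # module added in this PR

model_id = "RLHFlow/pair-preference-model-LLaMA3-8B"
# Loading with AutoModelForCausalLM is an assumption about the checkpoint type.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Task name and call signature below are assumed for illustration only.
pipe = PairPMPipeline("text-classification", model=model, tokenizer=tokenizer)

prompt = "What is the capital of France?"
answer_a = "The capital of France is Paris."
answer_b = "I believe it is Lyon."

# Probability that the first response (answer_a) is preferred over answer_b.
prob_a_preferred = pipe([prompt], [answer_a], [answer_b])
print(prob_a_preferred)
```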
The benchmark results are as follows.