Shouldn't the critic model in PPO be using the reward model? #73

Open

zhangjian94cn opened this issue Jul 5, 2023 · 0 comments
The code implements the PPO critic with a Value Head, but the `detach_value_head` function it defines is never called. That means during training the backbone network underneath the value head is also updated by the value loss while it estimates values. Is that reasonable?

def detach_value_head(self):
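For reference, a minimal sketch of how a `detach_value_head` flag is typically wired in a `GPT2HeadWithValueModel`-style implementation (the attribute and layer names here are assumptions for illustration, not taken from this repo): when the flag is set, the hidden states are detached before the value head, so the value loss no longer backpropagates into the shared backbone.

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Scalar value head on top of the transformer hidden states."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.detach_head = False                 # toggled by detach_value_head()
        self.summary = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        if self.detach_head:
            # cut the gradient path: value loss won't update the backbone
            hidden_states = hidden_states.detach()
        return self.summary(hidden_states)

# On the wrapping model, the (currently unused) method would just flip the flag:
# def detach_value_head(self):
#     self.v_head.detach_head = True
```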

Could this line be replaced directly with a forward call to a reward model?

value = self.v_head(hidden_states).squeeze(-1) # (batch, seq_len)

In other words, when initializing GPT2HeadWithValueModel, also pass in an interface to the reward model. Wouldn't that be more reasonable?

class GPT2HeadWithValueModel(GPT2PreTrainedModel):
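A minimal sketch of the alternative suggested above: instead of sharing the policy backbone, build the critic as a separate module initialized from the reward model checkpoint, and have it produce per-token values just like the current `v_head` line does. The checkpoint path and class name below are hypothetical, for illustration only.

```python
import torch
import torch.nn as nn
from transformers import GPT2Model

class RewardModelCritic(nn.Module):
    """Critic whose backbone is initialized from a (GPT-2 based) reward model."""

    def __init__(self, reward_model_name: str = "path/to/reward-model"):
        super().__init__()
        # load the reward model's transformer weights as the critic backbone
        self.backbone = GPT2Model.from_pretrained(reward_model_name)
        self.v_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask=None) -> torch.Tensor:
        hidden_states = self.backbone(
            input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # per-token value estimates, shape (batch, seq_len)
        return self.v_head(hidden_states).squeeze(-1)
```

With a separate critic like this, the policy's backbone is updated only by the policy loss, while the value estimates come from weights that already encode the reward model's preferences.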
