
About the computation of Advantage and State Value in PPO #3

Open
mjbmjb opened this issue May 24, 2018 · 0 comments

Comments

mjbmjb commented May 24, 2018

In your implementation of the Critic, you feed the network both the observation and the action, and it outputs a 1-dimensional value. Can I infer that this is Q(s, a)?
But the advantage you compute is

```python
values = self.critic_target(states_var, actions_var).detach()
advantages = rewards_var - values
```

which estimates q_t - Q(s_t, a_t). I think it should be Advantage = q_t - V(s_t).
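
For reference, a minimal sketch of the suggested alternative, assuming a PyTorch critic that takes only the state and outputs V(s). The class and variable names here are illustrative, not from the repository:

```python
import torch.nn as nn

# Hypothetical state-value critic: takes only the state, outputs V(s).
# (StateValueCritic and hidden_size are illustrative names, not from the repo.)
class StateValueCritic(nn.Module):
    def __init__(self, state_dim, hidden_size=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, states):
        # Returns V(s) with shape (batch, 1); no action input.
        return self.net(states)

# Advantage as suggested above: A_t = q_t - V(s_t), where rewards_var holds
# the empirical returns q_t.
#
#   critic_target = StateValueCritic(state_dim)
#   values = critic_target(states_var).detach().squeeze(-1)  # V(s_t)
#   advantages = rewards_var - values                         # q_t - V(s_t)
```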
