
About the computation of Advantage and State Value in PPO #3

Open
mjbmjb opened this issue May 24, 2018 · 0 comments

Comments

mjbmjb commented May 24, 2018

In your implementation of the Critic, you feed the network both the observation and the action, and it outputs a 1-dimensional value. Can I infer that this is Q(s, a)?
But the advantage you compute is

```python
values = self.critic_target(states_var, actions_var).detach()
advantages = rewards_var - values
```

which estimates q_t - Q(s_t, a_t). I think it should be Advantage = q_t - V(s_t).
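
For reference, a minimal sketch of the suggested alternative, assuming a PyTorch critic that takes only the state and outputs V(s). The class and variable names here are illustrative, not from the repository:

```python
import torch.nn as nn

# Hypothetical state-value critic: takes only the state, outputs V(s).
# (StateValueCritic and hidden_size are illustrative names, not from the repo.)
class StateValueCritic(nn.Module):
    def __init__(self, state_dim, hidden_size=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, states):
        # Returns V(s) with shape (batch, 1); no action input.
        return self.net(states)

# Advantage as suggested above: A_t = q_t - V(s_t), where rewards_var holds
# the empirical returns q_t.
#
#   critic_target = StateValueCritic(state_dim)
#   values = critic_target(states_var).detach().squeeze(-1)  # V(s_t)
#   advantages = rewards_var - values                         # q_t - V(s_t)
```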
