You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
(Updated for clarity)
Apologies if I'm wrong, but it seems to me that there are some mathematical issues in unit 4 "diving deeper..." as well as in the optional section on the proof of the policy gradient theorem.
More major error
The main background issue seems to be an instance of the problems discussed in:
There are two types of objective functions $J(\theta)$: you could take the sum of all rewards $r_0+r_1+r_2\cdots$, or you could take a discounted sum like $r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$. The former is preferred for interpretability. However, it leads to an update algorithm with $\gamma = 1$. This is problematic because we use an estimate on the returns after a given action which is less reliable as we move further from the action taken. If we instead used the latter objective function, we would obtain an update as in Sutton and Barto page 328, which does not agree with the REINFORCE algorithm used in practice because of the additional factor of $\gamma^t$ appearing in the update.
In practice, the algorithm used is the one implemented in your course and, e.g., in stablebaselines. But it is worth noting that it does not update by the gradient of any objective function. That is the result in the first linked paper above. The second linked paper explains that, in general, we can only expect the REINFORCE algorithm to be stable and converge to the expected thing if we use a schedule for both the learning rate and $\gamma$ which moves the learning rate to zero (roughly harmonically) and moves $\gamma$ towards $1$ (in a manner controlled by the learning rate). See Theorem 2 in loc. cit. I haven't seen anyone actually implement that algorithm anywhere... maybe it's not worth it for the sake of this course.
I am not sure how you'd like to proceed. One possibility would be to do the entire discussion with $\gamma = 1$ so that there aren't mathematical errors, and then when implementing the algorithm say something like "We replace the return $G_t$ with the discounted version to account for variance in estimating rewards at later time steps."
Specific, smaller errors
The picture definition of $J(\theta)$ in policy-gradient.mdx is incorrect- $R(\tau)$ should be computed from time $0$ not an unspecified time $t$. The discount $\gamma$ should also probably be omitted (see the discussion in the previous section) to avoid an unaccounted extra factor in the policy gradient theorem.
The statement of the policy gradient theorem in policy-gradient.mdx should involve a sum over $t>=0$, and the more useful version would replace $R(\tau)$ with $G_t$.
The proof of the policy gradient theorem in pg-theorem.mdx looks correct, but we end up caring about a corollary thereof as in page 327 of Sutton&Barto where we replace $R(\tau)$ with $G_t$. The reason we can do this is explained nicely in spinning up.
The text was updated successfully, but these errors were encountered:
(Updated for clarity)
Apologies if I'm wrong, but it seems to me that there are some mathematical issues in unit 4 "diving deeper..." as well as in the optional section on the proof of the policy gradient theorem.
More major error
The main background issue seems to be an instance of the problems discussed in:
Specifically:
I am not sure how you'd like to proceed. One possibility would be to do the entire discussion with$\gamma = 1$ so that there aren't mathematical errors, and then when implementing the algorithm say something like "We replace the return $G_t$ with the discounted version to account for variance in estimating rewards at later time steps."
Specific, smaller errors
The text was updated successfully, but these errors were encountered: