
Unit4 policy gradient errors #285

dylwil3 (Contributor) commented Apr 14, 2023
(Updated for clarity)
Apologies if I'm wrong, but it seems to me that there are some mathematical issues in Unit 4's "diving deeper..." section, as well as in the optional section on the proof of the policy gradient theorem.

More major error

The main background issue seems to be an instance of the problems discussed in:

Specifically:

  1. There are two types of objective function $J(\theta)$: the sum of all rewards, $r_0 + r_1 + r_2 + \cdots$, or a discounted sum, $r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$. The former is preferable for interpretability, but it leads to an update rule with $\gamma = 1$. This is problematic because the update relies on an estimate of the return following a given action, and that estimate becomes less reliable the further we move from the action taken. If we instead used the latter objective function, we would obtain the update from Sutton and Barto, page 328, which does not agree with the REINFORCE algorithm used in practice because of the additional factor of $\gamma^t$ in the update (both update rules are written out after this list).
  2. In practice, the algorithm used is the one implemented in your course and, e.g., in Stable Baselines. But it is worth noting that it does not follow the gradient of any objective function; that is the result of the first linked paper above. The second linked paper explains that, in general, we can only expect the REINFORCE algorithm to be stable and converge to the intended solution if we use a schedule for both the learning rate and $\gamma$ that moves the learning rate to zero (roughly harmonically) and moves $\gamma$ towards $1$ (in a manner controlled by the learning rate); see Theorem 2 in loc. cit. I haven't seen anyone actually implement that algorithm anywhere... maybe it's not worth it for the sake of this course.
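
For concreteness, here are the two update rules side by side (a sketch in the notation of Sutton and Barto, where $\alpha$ is the learning rate and $G_t$ is the discounted return from time $t$); the first is the gradient of the discounted objective, the second is what implementations actually do:

$$
\theta \leftarrow \theta + \alpha\, \gamma^{t}\, G_t\, \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \qquad \text{(Sutton and Barto, p. 328)}
$$

$$
\theta \leftarrow \theta + \alpha\, G_t\, \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \qquad \text{(update used in practice)}
$$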

I am not sure how you'd like to proceed. One possibility would be to do the entire discussion with $\gamma = 1$ so that there are no mathematical errors, and then, when implementing the algorithm, say something like: "We replace the return $G_t$ with its discounted version to account for the variance in estimating rewards at later time steps."
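
To make that concrete, here is a minimal sketch of the update as it is typically implemented (PyTorch-style; the function and variable names are mine, not from the course): the return $G_t$ is discounted, but no extra $\gamma^t$ factor multiplies the $t$-th term of the loss.

```python
import torch

def reinforce_update(log_probs, rewards, optimizer, gamma=0.99):
    """One REINFORCE update from a single episode.

    log_probs: list of scalar tensors log pi(a_t | s_t) collected during the rollout
    rewards:   list of floats r_0, ..., r_{T-1}
    """
    # Discounted return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # Each term is weighted by G_t only -- there is no gamma**t factor,
    # so this is the "practical" update discussed above, not the gradient
    # of the discounted objective from Sutton and Barto, page 328.
    loss = -(torch.stack(log_probs) * returns).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```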

Specific, smaller errors

  • The definition of $J(\theta)$ shown in the image in policy-gradient.mdx is incorrect: $R(\tau)$ should be computed from time $0$, not from an unspecified time $t$. The discount $\gamma$ should also probably be omitted (see the discussion in the previous section) to avoid an unaccounted extra factor in the policy gradient theorem.
  • The statement of the policy gradient theorem in policy-gradient.mdx should involve a sum over $t \geq 0$, and the more useful version would replace $R(\tau)$ with $G_t$ (a corrected statement is written out after this list).
  • The proof of the policy gradient theorem in pg-theorem.mdx looks correct, but what we end up caring about is a corollary of it, as on page 327 of Sutton and Barto, where we replace $R(\tau)$ with $G_t$. The reason we can do this is explained nicely in Spinning Up.
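
For reference, a corrected statement could read as follows (a sketch, with the undiscounted objective $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$ over finite-horizon trajectories, and $G_t$ the return from time $t$ onward):

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right] = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]
$$

The second equality is the "reward-to-go" corollary mentioned above.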

simoninithomas (Member) commented:
Hey there 👋 thanks for the issue. I'm adding it to the to-do list for the next big update (next week) 🤗
