
Unit4 policy gradient errors #285

dylwil3 (Contributor) commented Apr 14, 2023
(Updated for clarity)
Apologies if I'm wrong, but it seems to me that there are some mathematical issues in Unit 4's "diving deeper..." section, as well as in the optional section on the proof of the policy gradient theorem.

More major error

The main background issue seems to be an instance of the problems discussed in:

Specifically:

  1. There are two types of objective function $J(\theta)$: the sum of all rewards, $r_0 + r_1 + r_2 + \cdots$, or a discounted sum, $r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$. The former is preferable for interpretability, but it leads to an update rule with $\gamma = 1$. This is problematic because the update relies on an estimate of the return following a given action, and that estimate becomes less reliable the further we move from the action taken. If we instead used the latter objective function, we would obtain the update from Sutton and Barto, page 328, which does not agree with the REINFORCE algorithm used in practice because of the additional factor of $\gamma^t$ in the update (both update rules are written out after this list).
  2. In practice, the algorithm used is the one implemented in your course and, e.g., in Stable Baselines. But it is worth noting that it does not follow the gradient of any objective function; that is the result of the first linked paper above. The second linked paper explains that, in general, we can only expect the REINFORCE algorithm to be stable and converge to the intended solution if we use a schedule for both the learning rate and $\gamma$ that moves the learning rate to zero (roughly harmonically) and moves $\gamma$ towards $1$ (in a manner controlled by the learning rate); see Theorem 2 in loc. cit. I haven't seen anyone actually implement that algorithm anywhere... maybe it's not worth it for the sake of this course.
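
For concreteness, here are the two update rules side by side (a sketch in the notation of Sutton and Barto, where $\alpha$ is the learning rate and $G_t$ is the discounted return from time $t$); the first is the gradient of the discounted objective, the second is what implementations actually do:

$$
\theta \leftarrow \theta + \alpha\, \gamma^{t}\, G_t\, \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \qquad \text{(Sutton and Barto, p. 328)}
$$

$$
\theta \leftarrow \theta + \alpha\, G_t\, \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \qquad \text{(update used in practice)}
$$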

I am not sure how you'd like to proceed. One possibility would be to do the entire discussion with $\gamma = 1$ so that there are no mathematical errors, and then, when implementing the algorithm, say something like: "We replace the return $G_t$ with its discounted version to account for the variance in estimating rewards at later time steps."
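
To make that concrete, here is a minimal sketch of the update as it is typically implemented (PyTorch-style; the function and variable names are mine, not from the course): the return $G_t$ is discounted, but no extra $\gamma^t$ factor multiplies the $t$-th term of the loss.

```python
import torch

def reinforce_update(log_probs, rewards, optimizer, gamma=0.99):
    """One REINFORCE update from a single episode.

    log_probs: list of scalar tensors log pi(a_t | s_t) collected during the rollout
    rewards:   list of floats r_0, ..., r_{T-1}
    """
    # Discounted return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # Each term is weighted by G_t only -- there is no gamma**t factor,
    # so this is the "practical" update discussed above, not the gradient
    # of the discounted objective from Sutton and Barto, page 328.
    loss = -(torch.stack(log_probs) * returns).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```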

Specific, smaller errors

  • The definition of $J(\theta)$ shown in the image in policy-gradient.mdx is incorrect: $R(\tau)$ should be computed from time $0$, not from an unspecified time $t$. The discount $\gamma$ should also probably be omitted (see the discussion in the previous section) to avoid an unaccounted extra factor in the policy gradient theorem.
  • The statement of the policy gradient theorem in policy-gradient.mdx should involve a sum over $t \geq 0$, and the more useful version would replace $R(\tau)$ with $G_t$ (a corrected statement is written out after this list).
  • The proof of the policy gradient theorem in pg-theorem.mdx looks correct, but what we end up caring about is a corollary of it, as on page 327 of Sutton and Barto, where we replace $R(\tau)$ with $G_t$. The reason we can do this is explained nicely in Spinning Up.
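
For reference, a corrected statement could read as follows (a sketch, with the undiscounted objective $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$ over finite-horizon trajectories, and $G_t$ the return from time $t$ onward):

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right] = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]
$$

The second equality is the "reward-to-go" corollary mentioned above.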

simoninithomas (Member) commented:
Hey there 👋 thanks for the issue. I'm adding it to the to-do list for the next big update (next week) 🤗
