Normalizing advantages or returns #20

Closed · DuaneNielsen opened this issue Feb 26, 2023 · 6 comments

Hi Danijar,

Thanks so much for developing and sharing this algorithm. I've been following your work for some time and I think it's really great.

I'm attempting a re-implementation from scratch in PyTorch and I have a question about the actor loss.

In the paper, the loss is given as:

[Image: the actor loss equation from the paper.]

However, in the code I see something that looks more like an advantage estimate:

      rew, ret, base = critic.score(traj, self.actor)
      offset, invscale = self.retnorms[key](ret)
      normed_ret = (ret - offset) / invscale
      normed_base = (base - offset) / invscale
      advs.append((normed_ret - normed_base) * self.scales[key] / total)

If I'm interpreting the code correctly, normed_base seems to come from the value of the state the actor was in prior to the transition, and normed_ret is the percentile-scaled return as per the paper.
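For reference, here is a minimal sketch of how the offset and inverse scale could be computed from return percentiles, as described in the paper (my own illustration, not the repository's actual retnorms code, which also smooths these statistics over time with an EMA):

    import jax.numpy as jnp

    def retnorm(ret, lo_q=0.05, hi_q=0.95):
      # Hypothetical stand-in for self.retnorms[key]: map a batch of
      # lambda-returns to an offset and inverse scale based on the 5th
      # and 95th percentiles (EMA smoothing over batches omitted here).
      lo = jnp.quantile(ret, lo_q)
      hi = jnp.quantile(ret, hi_q)
      offset = lo
      invscale = jnp.maximum(1.0, hi - lo)
      return offset, invscale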

Also, there is a little trick at the end:

    loss *= sg(traj['weight'])[:-1]

where 'weight' is computed by exponentially discounting the model's future continuation predictions:

    cont = self.heads['cont'](traj).mode()
    traj['cont'] = jnp.concatenate([first_cont[None], cont[1:]], 0)
    discount = 1 - 1 / self.config.horizon
    traj['weight'] = jnp.cumprod(discount * traj['cont'], 0) / discount
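As a toy illustration (my own numbers, not repo code, assuming a horizon of 333 so the discount is roughly 0.997), the cumulative product decays later imagined steps and zeroes out everything after a predicted episode end:

    import jax.numpy as jnp

    discount = 1 - 1 / 333                       # ~0.997
    cont = jnp.array([1.0, 1.0, 1.0, 0.0, 0.0])  # predicted continuation per step
    weight = jnp.cumprod(discount * cont, 0) / discount
    # weight ~= [1.0, 0.997, 0.994, 0.0, 0.0]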

This all makes sense, and is a good policy gradient, but in the paper it's mentioned that it's important to keep the scale of the policy gradient loss proportional to the entropy.

I noticed a big difference in policy gradient scale between using the returns as presented in the paper, and what we have here.

It would be great if you could clarify. Did I just misread the paper, or are these simply implementation details that don't matter a whole lot in practice?

danijar (Owner) commented Feb 26, 2023

Hi @DuaneNielsen, the return is normalized as described in the paper. And the value baseline is normalized by the same statistics so that they are in the same space. They are then subtracted. Effectively, this means the advantage is normalized by the statistics of the return (not of the advantage). Does that answer your question?
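A toy check with made-up numbers (not repo code): normalizing both quantities by the same offset and inverse scale and then subtracting is identical to dividing the raw advantage by the return's scale.

    import jax.numpy as jnp

    ret = jnp.array([10.0, 30.0])    # lambda-returns
    base = jnp.array([12.0, 25.0])   # value baseline
    offset, invscale = 5.0, 20.0     # made-up return statistics
    normed_adv = (ret - offset) / invscale - (base - offset) / invscale
    assert jnp.allclose(normed_adv, (ret - base) / invscale)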

DuaneNielsen (Author) commented Feb 26, 2023

Oh yeah, I get that the value baseline is normalized by the same amount, so the advantage is normalized by the percentile scale. That's clear.

But when I read the paper, I come to the conclusion that I should just use normed_ret (no advantage). Perhaps I'm going wrong here.

Empirically, the scale of normed_ret - normed_base is going to be about 20 times smaller than the scale of normed_ret - zero. I think...

So this will increase the role the entropy loss plays quite a bit, i.e. if you use the advantage loss, you will explore a whole lot more, and if you use the loss as written in the paper, you will explore a lot less. I noticed this when I ran tests.

Does that make sense? Or do you think my reasoning/math is misguided?

In any case, I think you have provided what I was looking for: the advantage estimate is what I should use.

danijar (Owner) commented Feb 27, 2023

The paper shows the gradient of the expectation of the (normalized) return. If you estimate that gradient using Reinforce with a baseline, you get the expectation of the (normalized) advantage times the gradient of the policy log-probability, which is what's implemented in the code.

It's easy to confuse the two, but the first one talks about returns and has the gradient outside of the expectation, whereas the second one talks about advantages and has the gradient inside. So the two are actually the same, despite looking different at first sight.

Thus, despite using the advantage instead of the return, the code correctly trades off expected normalized returns and entropy. This is in contrast to most existing policy gradient algorithms, which trade off entropy against expected returns that are scaled by the scale of the advantages, and I think that makes less sense.
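To illustrate the contrast with toy numbers (my own example, not repo code): DreamerV3 divides the advantage by the spread of the returns, whereas a common alternative, advantage standardization, divides by the advantage's own standard deviation, which shifts the effective weight of the entropy term as advantages shrink.

    import jax.numpy as jnp

    ret = jnp.array([2.0, 40.0, 90.0, 120.0])   # lambda-returns
    base = jnp.array([5.0, 35.0, 95.0, 110.0])  # value baseline
    adv = ret - base

    # DreamerV3-style: scale by the return's 5th-95th percentile range.
    scale = jnp.maximum(1.0, jnp.quantile(ret, 0.95) - jnp.quantile(ret, 0.05))
    adv_return_scaled = adv / scale

    # Common alternative: standardize by the advantage's own statistics.
    adv_standardized = (adv - adv.mean()) / (adv.std() + 1e-8)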

danijar pinned this issue Feb 27, 2023
danijar closed this as completed Feb 27, 2023

DuaneNielsen (Author) commented Feb 27, 2023

Thanks a lot for clearing that up Danijar. I missed the subtlety of the expectation being on the outside vs inside, so I'll try deriving that to prove it to myself. Thanks for sharing the code and thanks for creating this wonderful algorithm!

danijar (Owner) commented Feb 28, 2023

Cool! The derivation for Reinforce is basically this:

grad_pi E_pi(a|s)[ R(s,a) ]
= grad_pi sum_a pi(a|s) R(s,a) 
= sum_a R(s,a) grad_pi pi(a|s)
= sum_a pi(a|s)/pi(a|s) R(s,a) grad_pi pi(a|s)
= sum_a pi(a|s) R(s,a) (grad_pi pi(a|s)) / pi(a|s)
= sum_a pi(a|s) R(s,a) grad_pi log pi(a|s)
= E_pi(a|s)[ R(s,a) grad_pi log pi(a|s) ]
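A quick numerical check of this identity for a small categorical policy (my own toy example, not repo code): the surrogate holds the sampling distribution fixed with stop_gradient, analogous to multiplying the log-probability by a stop-gradient'ed return in an actor loss.

    import jax
    import jax.numpy as jnp

    R = jnp.array([1.0, -2.0, 3.0])               # R(s, a) per action

    def expected_return(logits):
      return jnp.sum(jax.nn.softmax(logits) * R)  # E_pi[ R ]

    def reinforce_surrogate(logits):
      pi = jax.nn.softmax(logits)
      # E_pi[ R log pi ] with the sampling distribution held fixed, so
      # its gradient is E_pi[ R grad log pi ].
      return jnp.sum(jax.lax.stop_gradient(pi) * R * jnp.log(pi))

    logits = jnp.array([0.2, -0.5, 1.0])
    assert jnp.allclose(jax.grad(expected_return)(logits),
                        jax.grad(reinforce_surrogate)(logits), atol=1e-6)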

Then you can also show that subtracting a "baseline" from R(s, a) that doesn't depend on a doesn't introduce any bias:

grad_pi E_pi(a|s)[ V(s) ]
= grad_pi sum_a pi(a|s) V(s)
= V(s) grad_pi sum_a pi(a|s) 
= V(s) grad_pi 1
= V(s) * 0
= 0

If you put these together, you get:

grad_pi E_pi(a|s)[ R(s,a) - V(s) ]
= grad_pi E_pi(a|s)[ A(s,a) ]
= E_pi(a|s)[ A(s,a) grad_pi log pi(a|s) ]

Now if you follow these derivations but with the return and baseline normalized as in DreamerV3, where lo and hi are the 5th and 95th percentiles of the return batch (possibly smoothed out over time with EMAs), you get:

grad_pi E_pi(a|s)[ (R(s,a) - lo) / max(1, hi - lo) - (V(s) - lo) / max(1, hi - lo) ]
= grad_pi E_pi(a|s)[ ((R(s,a) - lo) - (V(s) - lo)) / max(1, hi - lo) ]
= grad_pi E_pi(a|s)[ (R(s,a) - lo - V(s) + lo) / max(1, hi - lo) ]
= grad_pi E_pi(a|s)[ (R(s,a) - V(s)) / max(1, hi - lo) ]
= grad_pi E_pi(a|s)[ A(s,a) / max(1, hi - lo) ]
= E_pi(a|s)[ A(s,a) / max(1, hi - lo) grad_pi log pi(a | s) ]
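Mapping that last line to code, a minimal sketch of the resulting actor loss could look as follows (assumed names and a hypothetical policy object with log_prob and entropy methods; not the repository's actual actor loss):

    import jax.numpy as jnp
    from jax.lax import stop_gradient as sg

    def actor_loss(policy, actions, ret, base, lo, hi, weight, ent_scale):
      # Reinforce with a value baseline, where the advantage is divided
      # by the return range max(1, hi - lo); because the advantage lives
      # in the return's scale, the entropy coefficient keeps a consistent
      # meaning across tasks.
      adv = (ret - base) / jnp.maximum(1.0, hi - lo)
      logpi = policy.log_prob(actions)
      loss = -logpi * sg(adv) - ent_scale * policy.entropy()
      return (loss * sg(weight)).mean()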

By the way, I think this can also be explained much better in the paper. We'll update it at some point.

danijar changed the title from "Advantage or Returns" to "Normalizing advantages or returns" Mar 7, 2023

DuaneNielsen (Author) commented Mar 25, 2023

Just leaving a note to let you know I managed to get the algorithm working in PyTorch. Aside from a few minor details, I think I'm pretty close. The code is still a bit rough, so after more testing I'll probably rewrite the whole thing. Repo here. First reproduction of an Atari result here.
