Normalizing advantages or returns #20

Closed · DuaneNielsen opened this issue Feb 26, 2023 · 6 comments

Hi Danijar,

Thanks so much for developing and sharing this algorithm. I've been following your work for some time and I think it's really great.

I'm attempting a re-implementation from scratch in PyTorch and I have a question about the actor loss.

In the paper, the loss is given as:

[Image: the actor loss equation from the paper.]

However, in the code I see something that looks more like an advantage estimate:

      rew, ret, base = critic.score(traj, self.actor)
      offset, invscale = self.retnorms[key](ret)
      normed_ret = (ret - offset) / invscale
      normed_base = (base - offset) / invscale
      advs.append((normed_ret - normed_base) * self.scales[key] / total)

If I'm interpreting the code correctly, normed_base seems to come from the value of the state the actor was in prior to the transition, and normed_ret is the percentile-scaled return as per the paper.
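For reference, here is a minimal sketch of how the offset and inverse scale could be computed from return percentiles, as described in the paper (my own illustration, not the repository's actual retnorms code, which also smooths these statistics over time with an EMA):

    import jax.numpy as jnp

    def retnorm(ret, lo_q=0.05, hi_q=0.95):
      # Hypothetical stand-in for self.retnorms[key]: map a batch of
      # lambda-returns to an offset and inverse scale based on the 5th
      # and 95th percentiles (EMA smoothing over batches omitted here).
      lo = jnp.quantile(ret, lo_q)
      hi = jnp.quantile(ret, hi_q)
      offset = lo
      invscale = jnp.maximum(1.0, hi - lo)
      return offset, invscale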

Also, there is a little trick at the end:

    loss *= sg(traj['weight'])[:-1]

where 'weight' is computed by exponentially discounting the model's future continuation predictions:

    cont = self.heads['cont'](traj).mode()
    traj['cont'] = jnp.concatenate([first_cont[None], cont[1:]], 0)
    discount = 1 - 1 / self.config.horizon
    traj['weight'] = jnp.cumprod(discount * traj['cont'], 0) / discount
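As a toy illustration (my own numbers, not repo code, assuming a horizon of 333 so the discount is roughly 0.997), the cumulative product decays later imagined steps and zeroes out everything after a predicted episode end:

    import jax.numpy as jnp

    discount = 1 - 1 / 333                       # ~0.997
    cont = jnp.array([1.0, 1.0, 1.0, 0.0, 0.0])  # predicted continuation per step
    weight = jnp.cumprod(discount * cont, 0) / discount
    # weight ~= [1.0, 0.997, 0.994, 0.0, 0.0]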

This all makes sense, and is a good policy gradient, but in the paper it's mentioned that it's important to keep the scale of the policy gradient loss proportional to the entropy.

I noticed a big difference in policy gradient scale between using the returns as presented in the paper, and what we have here.

It would be great if you could clarify. Did I just misread the paper, or are these simply implementation details that don't matter a whole lot in practice?

danijar (Owner) commented Feb 26, 2023

Hi @DuaneNielsen, the return is normalized as described in the paper. And the value baseline is normalized by the same statistics so that they are in the same space. They are then subtracted. Effectively, this means the advantage is normalized by the statistics of the return (not of the advantage). Does that answer your question?
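A toy check with made-up numbers (not repo code): normalizing both quantities by the same offset and inverse scale and then subtracting is identical to dividing the raw advantage by the return's scale.

    import jax.numpy as jnp

    ret = jnp.array([10.0, 30.0])    # lambda-returns
    base = jnp.array([12.0, 25.0])   # value baseline
    offset, invscale = 5.0, 20.0     # made-up return statistics
    normed_adv = (ret - offset) / invscale - (base - offset) / invscale
    assert jnp.allclose(normed_adv, (ret - base) / invscale)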

DuaneNielsen (Author) commented Feb 26, 2023

Oh yeah, I get that the value baseline is normalized by the same amount, so the advantage is normalized by the percentile scale. That's clear.

But when I read the paper, I come to the conclusion that I should just use normed_ret (no advantage). Perhaps I'm going wrong here.

Empirically, the scale of normed_ret - normed_base is going to be about 20 times smaller than the scale of normed_ret - zero. I think...

So this will increase the role the entropy loss plays quite a bit, i.e. if you use the advantage loss, you will explore a whole lot more, and if you use the loss as written in the paper, you will explore a lot less. I noticed this when I ran tests.

Does that make sense? Or do you think my reasoning/math is misguided?

In any case, I think you have provided what I was looking for: the advantage estimate is what I should use.

danijar (Owner) commented Feb 27, 2023

The paper shows the gradient of the expectation of the (normalized) return. If you estimate that gradient using Reinforce with a baseline, you get the expectation of the (normalized) advantage times the gradient of the policy log-probability, which is what's implemented in the code.

It's easy to confuse the two, but the first one talks about returns and has the gradient outside of the expectation, whereas the second one talks about advantages and has the gradient inside. So the two are actually the same, despite looking different at first sight.

Thus, despite using the advantage instead of the return, the code correctly trades off expected normalized returns and entropy. This is in contrast to most existing policy gradient algorithms, which trade off entropy against expected returns that are scaled by the scale of the advantages, and I think that makes less sense.
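To illustrate the contrast with toy numbers (my own example, not repo code): DreamerV3 divides the advantage by the spread of the returns, whereas a common alternative, advantage standardization, divides by the advantage's own standard deviation, which shifts the effective weight of the entropy term as advantages shrink.

    import jax.numpy as jnp

    ret = jnp.array([2.0, 40.0, 90.0, 120.0])   # lambda-returns
    base = jnp.array([5.0, 35.0, 95.0, 110.0])  # value baseline
    adv = ret - base

    # DreamerV3-style: scale by the return's 5th-95th percentile range.
    scale = jnp.maximum(1.0, jnp.quantile(ret, 0.95) - jnp.quantile(ret, 0.05))
    adv_return_scaled = adv / scale

    # Common alternative: standardize by the advantage's own statistics.
    adv_standardized = (adv - adv.mean()) / (adv.std() + 1e-8)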

danijar pinned this issue Feb 27, 2023
danijar closed this as completed Feb 27, 2023

DuaneNielsen (Author) commented Feb 27, 2023

Thanks a lot for clearing that up Danijar. I missed the subtlety of the expectation being on the outside vs inside, so I'll try deriving that to prove it to myself. Thanks for sharing the code and thanks for creating this wonderful algorithm!

danijar (Owner) commented Feb 28, 2023

Cool! The derivation for Reinforce is basically this:

grad_pi E_pi(a|s)[ R(s,a) ]
= grad_pi sum_a pi(a|s) R(s,a) 
= sum_a R(s,a) grad_pi pi(a|s)
= sum_a pi(a|s)/pi(a|s) R(s,a) grad_pi pi(a|s)
= sum_a pi(a|s) R(s,a) (grad_pi pi(a|s)) / pi(a|s)
= sum_a pi(a|s) R(s,a) grad_pi log pi(a|s)
= E_pi(a|s)[ R(s,a) grad_pi log pi(a|s) ]
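A quick numerical check of this identity for a small categorical policy (my own toy example, not repo code): the surrogate holds the sampling distribution fixed with stop_gradient, analogous to multiplying the log-probability by a stop-gradient'ed return in an actor loss.

    import jax
    import jax.numpy as jnp

    R = jnp.array([1.0, -2.0, 3.0])               # R(s, a) per action

    def expected_return(logits):
      return jnp.sum(jax.nn.softmax(logits) * R)  # E_pi[ R ]

    def reinforce_surrogate(logits):
      pi = jax.nn.softmax(logits)
      # E_pi[ R log pi ] with the sampling distribution held fixed, so
      # its gradient is E_pi[ R grad log pi ].
      return jnp.sum(jax.lax.stop_gradient(pi) * R * jnp.log(pi))

    logits = jnp.array([0.2, -0.5, 1.0])
    assert jnp.allclose(jax.grad(expected_return)(logits),
                        jax.grad(reinforce_surrogate)(logits), atol=1e-6)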

Then you can also show that subtracting a "baseline" from R(s, a) that doesn't depend on a doesn't introduce any bias:

grad_pi E_pi(a|s)[ V(s) ]
= grad_pi sum_a pi(a|s) V(s)
= V(s) grad_pi sum_a pi(a|s) 
= V(s) grad_pi 1
= V(s) * 0
= 0

If you put these together, you get:

grad_pi E_pi(a|s)[ R(s,a) - V(s) ]
= grad_pi E_pi(a|s)[ A(s,a) ]
= E_pi(a|s)[ A(s,a) grad_pi log pi(a|s) ]

Now if you follow these derivations but with the return and baseline normalized as in DreamerV3, where lo and hi are the 5th and 95th percentiles of the return batch (possibly smoothed out over time with EMAs), you get:

grad_pi E_pi(a|s)[ (R(s,a) - lo) / max(1, hi - lo) - (V(s) - lo) / max(1, hi - lo) ]
= grad_pi E_pi(a|s)[ ((R(s,a) - lo) - (V(s) - lo)) / max(1, hi - lo) ]
= grad_pi E_pi(a|s)[ (R(s,a) - lo - V(s) + lo) / max(1, hi - lo) ]
= grad_pi E_pi(a|s)[ (R(s,a) - V(s)) / max(1, hi - lo) ]
= grad_pi E_pi(a|s)[ A(s,a) / max(1, hi - lo) ]
= E_pi(a|s)[ A(s,a) / max(1, hi - lo) grad_pi log pi(a | s) ]
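Mapping that last line to code, a minimal sketch of the resulting actor loss could look as follows (assumed names and a hypothetical policy object with log_prob and entropy methods; not the repository's actual actor loss):

    import jax.numpy as jnp
    from jax.lax import stop_gradient as sg

    def actor_loss(policy, actions, ret, base, lo, hi, weight, ent_scale):
      # Reinforce with a value baseline, where the advantage is divided
      # by the return range max(1, hi - lo); because the advantage lives
      # in the return's scale, the entropy coefficient keeps a consistent
      # meaning across tasks.
      adv = (ret - base) / jnp.maximum(1.0, hi - lo)
      logpi = policy.log_prob(actions)
      loss = -logpi * sg(adv) - ent_scale * policy.entropy()
      return (loss * sg(weight)).mean()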

By the way, I think this can also be explained much better in the paper. We'll update it at some point.

danijar changed the title from "Advantage or Returns" to "Normalizing advantages or returns" Mar 7, 2023

DuaneNielsen (Author) commented Mar 25, 2023

Just leaving a note to let you know I managed to get the algorithm working in PyTorch. Aside from a few minor details, I think I'm pretty close. The code is still a bit rough, so after more testing I'll probably rewrite the whole thing. Repo here. First reproduction of an Atari result here.
