This is more of a question than a problem.
In the PPO implementation, why is advantage normalization applied to pg_loss but not to vf_loss? Say we have an RL env with a dense reward ranging from 0 to 1000 per step. With advantage normalization applied to pg_loss alone, we get a 100x scale difference between pg_loss and vf_loss, which, as far as I know, directly affects learning speed (performance): if a loss function is multiplied by a large constant, you would usually want to lower the learning rate. But as far as I know, CleanRL's PPO implementation uses the same lr for both the value function and the policy.
My question is: isn't it more reasonable to apply advantage normalization to both pg_loss and vf_loss so that the two losses have the same scale?
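To make the scale gap concrete, here is a minimal NumPy sketch, not CleanRL's actual code. It assumes a hypothetical batch with returns in the 0 to 1000 range, a critic whose error has a standard deviation of 100, and probability ratios close to 1. The clipped surrogate and the 0.5 * MSE value loss only loosely follow the usual PPO loss shape. All numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size = 4096

# Hypothetical batch: returns on the scale of a 0..1000 per-step reward,
# and value predictions that are off by noise with std 100 (an assumption).
returns = rng.uniform(0.0, 1000.0, size=batch_size)
values = returns + rng.normal(0.0, 100.0, size=batch_size)
advantages = returns - values

# Per-batch advantage normalization, applied to the policy loss only.
norm_adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

# New/old policy probability ratios close to 1 (an assumption).
ratio = np.exp(rng.normal(0.0, 0.01, size=batch_size))
clip_coef = 0.2

# Clipped surrogate policy loss on normalized advantages.
pg_loss1 = -norm_adv * ratio
pg_loss2 = -norm_adv * np.clip(ratio, 1.0 - clip_coef, 1.0 + clip_coef)
pg_loss = np.maximum(pg_loss1, pg_loss2).mean()

# Unclipped value loss on raw, unnormalized returns.
v_loss = 0.5 * ((values - returns) ** 2).mean()

print(f"pg_loss={pg_loss:.5f}  v_loss={v_loss:.1f}")
```

Under these assumptions the value loss is in the thousands while the policy loss stays close to zero. The loss values are only a rough proxy for gradient scale, so this illustrates the concern rather than measuring its effect on training.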
I have a similar issue. When the reward is large, the value-function loss is huge compared to the policy loss, and training is unstable. One way to address it is to rescale the reward, but that breaks when rewards in later time steps differ by several orders of magnitude from those in early time steps.
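One common form of reward rescaling divides each reward by a running standard deviation of the discounted return. The sketch below is a hypothetical helper (`RunningReturnScaler` is not a name from CleanRL or any library), written only to show the idea under the assumption of a single environment and rewards drawn uniformly from 0 to 1000.

```python
import numpy as np


class RunningReturnScaler:
    """Hypothetical helper: scale rewards by the running std of the discounted return."""

    def __init__(self, gamma=0.99, eps=1e-8):
        self.gamma = gamma
        self.eps = eps
        self.ret = 0.0    # running discounted return
        self.count = 0    # number of returns seen so far
        self.mean = 0.0   # running mean of the return
        self.m2 = 0.0     # running sum of squared deviations (Welford)

    def __call__(self, reward, done=False):
        self.ret = self.ret * self.gamma + reward
        # Welford's online update of the mean and variance of the return.
        self.count += 1
        delta = self.ret - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (self.ret - self.mean)
        var = self.m2 / self.count if self.count > 1 else 1.0
        scaled = reward / np.sqrt(var + self.eps)
        if done:
            self.ret = 0.0  # start a fresh return at episode boundaries
        return scaled


rng = np.random.default_rng(0)
scaler = RunningReturnScaler(gamma=0.99)
raw = rng.uniform(0.0, 1000.0, size=2000)
scaled = np.array([scaler(r) for r in raw])
print(np.abs(raw).max(), np.abs(scaled[100:]).max())
```

Because the statistics accumulate over the whole history, the scale adapts more and more slowly as `count` grows, which may be one reason this kind of rescaling struggles when late rewards are orders of magnitude larger than early ones. The first few outputs can also be poorly scaled while the variance estimate is still unreliable.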