
True rewards remaining "zero" in the trajectories in stable baselines2 for custom environments #1167

Open
moizuet opened this issue Jul 26, 2022 · 6 comments
Labels: custom gym env (Issue related to Custom Gym Env), question (Further information is requested)

Comments

moizuet commented Jul 26, 2022

I am using reinforcement learning for mathematical optimization, with the PPO2 agent in Google Colab.
With my custom environment, the episode rewards remain zero in TensorBoard. Also, when I add a print statement for "true_reward" inside the "ppo2.py" file (as shown in the figure), I get nothing but a zero vector.

Due to this, my agent is not learning correctly.

The following things are important to note here:

  1. My environment is giving the agent nonzero rewards (I have checked this thoroughly, roughly as in the sketch after this list), but on the agent side the rewards are not being collected.
  2. This happens most of the time but not always; sometimes, after a fresh install of stable-baselines, the whole system works perfectly.
  3. This happens only with my custom environment and not with other OpenAI Gym environments.
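
For reference, this is roughly how I checked the environment side (a minimal sketch; `CustomEnv` is a placeholder for my actual environment class):

```python
# Quick env-side check: step the raw environment with random actions and
# confirm that step() is actually returning nonzero rewards.
# "CustomEnv" is a placeholder for my actual gym.Env subclass.
env = CustomEnv()
obs = env.reset()
total_reward = 0.0
for _ in range(1000):
    obs, reward, done, info = env.step(env.action_space.sample())
    total_reward += reward
    if done:
        obs = env.reset()
print("sum of rewards over 1000 random steps:", total_reward)
```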

[Screenshots: the print statement added for "true_reward" inside ppo2.py, and its all-zero output]

Miffyli (Collaborator) commented Jul 26, 2022

Hey. Unfortunately we do not have time to offer custom tech support for custom environments. The library code is tested to function (mostly) correctly, so my knee-jerk reply is that something may be off in your environment. I would recommend two things:

  1. Try using stable-baselines3, as it is more actively maintained.
  2. Use the check_env tool to check your environment (see the docs); it is part of SB3. A minimal usage sketch follows below.
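
For reference, a minimal usage sketch of the checker (here `CustomEnv` is just a placeholder for your environment class):

```python
# Minimal sketch of running SB3's environment checker on a custom env.
# "CustomEnv" is a placeholder for your own gym.Env subclass.
from stable_baselines3.common.env_checker import check_env

env = CustomEnv()
check_env(env, warn=True)  # warns/raises if the env violates the Gym API
```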

moizuet (Author) commented Jul 27, 2022

I have checked my environment with check_env, but unfortunately I am still getting the error.

By the way, I forgot to include the TensorBoard plot, shown in the following figure (the episode reward curve is a flat horizontal line). I think you are right, @Miffyli; I am starting to consider migrating to stable-baselines3 (at least my next research project will not be in stable-baselines2).

But my code base is very long (spectral normalization, dense connections, a custom AMSGrad optimizer implementation, and a custom Q-value network method for Soft Actor-Critic to implement the Wolpertinger algorithm), which is the major cause of my hesitation.

[TensorBoard screenshot: episode reward plot showing a flat horizontal line]

Miffyli (Collaborator) commented Jul 27, 2022

Unfortunately I do not have other tips to give and no time to start digging through custom code to find errors :( . I know this is a very bad, maybe even rude-ish, answer that assumes it is a user error, but there are many places where an env implementation can go wrong and cause confusing behaviour like this. If possible, I would recommend taking an environment where the rewards work as expected and changing it step by step towards your final env.
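
One quick way to narrow this down (a rough sketch, not an official tool; `CustomEnv` is again a placeholder) is to wrap the env the same way stable-baselines does internally and check whether nonzero rewards still come out of the vectorized step:

```python
# Sanity check: wrap the custom env in a DummyVecEnv, as stable-baselines does
# internally, and verify that nonzero rewards survive the wrapping.
# "CustomEnv" is a placeholder for the actual environment class.
import numpy as np
from stable_baselines.common.vec_env import DummyVecEnv  # SB3: stable_baselines3.common.vec_env

vec_env = DummyVecEnv([lambda: CustomEnv()])
obs = vec_env.reset()
for _ in range(1000):
    actions = [vec_env.action_space.sample()]
    obs, rewards, dones, infos = vec_env.step(actions)
    if np.any(rewards != 0):
        print("nonzero reward reached the agent side:", rewards)
        break
else:
    print("no nonzero reward seen in 1000 random steps")
```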

moizuet (Author) commented Jul 28, 2022

No problem, I am trying to resolve it, and I will report the reasons as soon as I find them out.
By the way, I want to ask: if we use stable-baselines3, which uses PyTorch (i.e., eager mode of execution), will training be slower than the TensorFlow version of stable-baselines, which uses graph mode (generally faster computation)?

Miffyli (Collaborator) commented Jul 28, 2022

I think in SB3 other things become the bottleneck before PyTorch's eager mode is the slowing factor: handling the data, computing returns, etc. takes much more time than actually running the network graph. I personally cannot speak to performance beyond RL, but AFAIK it is not worth the effort to switch to TF2 just to get a bit of a speed boost.
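
If you want to see where the time actually goes, you could profile a short training run; a rough sketch (using SB3's PPO on a toy env, not a rigorous benchmark):

```python
# Rough profiling sketch: run a short PPO training on CartPole and print the
# functions where cumulative time is spent (data handling vs. network calls).
import cProfile
import pstats

from stable_baselines3 import PPO

model = PPO("MlpPolicy", "CartPole-v1", verbose=0)

profiler = cProfile.Profile()
profiler.enable()
model.learn(total_timesteps=10_000)
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```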

moizuet (Author) commented Jul 29, 2022

I think that if the number of CPUs (for parallel rollouts) is much larger than the number of GPU SMs, the data will always be available for training and the GPUs will always be busy, so eager mode may indeed become the bottleneck (though I agree it may not be too severe). Thanks a lot!
