
True rewards remaining "zero" in the trajectories in stable baselines2 for custom environments #1167

Open
moizuet opened this issue Jul 26, 2022 · 6 comments
Labels: custom gym env (Issue related to Custom Gym Env), question (Further information is requested)

Comments

moizuet commented Jul 26, 2022

I am using reinforcement learning for mathematical optimization, with the PPO2 agent in Google Colab.
With my custom environment, the episode rewards remain zero in TensorBoard. Also, when I add a print statement for "true_reward" inside the "ppo2.py" file (as shown in the figure), I get nothing but a zero vector.

Due to this, my agent is not learning correctly.

The following things are important to note here:

  1. My environment is giving the agent nonzero rewards (I have checked this thoroughly, roughly as in the sketch after this list), but on the agent side the rewards are not being collected.
  2. This happens most of the time but not always; sometimes, after a fresh install of stable-baselines, the whole system works perfectly.
  3. This happens only with my custom environment and not with other OpenAI Gym environments.
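
For reference, this is roughly how I checked the environment side (a minimal sketch; `CustomEnv` is a placeholder for my actual environment class):

```python
# Quick env-side check: step the raw environment with random actions and
# confirm that step() is actually returning nonzero rewards.
# "CustomEnv" is a placeholder for my actual gym.Env subclass.
env = CustomEnv()
obs = env.reset()
total_reward = 0.0
for _ in range(1000):
    obs, reward, done, info = env.step(env.action_space.sample())
    total_reward += reward
    if done:
        obs = env.reset()
print("sum of rewards over 1000 random steps:", total_reward)
```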

[Screenshots: the print statement added for "true_reward" inside ppo2.py, and its all-zero output]

Miffyli (Collaborator) commented Jul 26, 2022

Hey. Unfortunately we do not have time to offer custom tech support for custom environments. The library code is tested to function (mostly) correctly, so my knee-jerk reply is that something may be off in your environment. I would recommend two things:

  1. Try using stable-baselines3, as it is more actively maintained.
  2. Use the check_env tool to check your environment (see the docs); it is part of SB3. A minimal usage sketch follows below.
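
For reference, a minimal usage sketch of the checker (here `CustomEnv` is just a placeholder for your environment class):

```python
# Minimal sketch of running SB3's environment checker on a custom env.
# "CustomEnv" is a placeholder for your own gym.Env subclass.
from stable_baselines3.common.env_checker import check_env

env = CustomEnv()
check_env(env, warn=True)  # warns/raises if the env violates the Gym API
```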

moizuet (Author) commented Jul 27, 2022

I have checked my environment with check_env, but unfortunately I am still getting the error.

By the way, I forgot to include the TensorBoard plot, shown in the following figure (the episode reward curve is a flat horizontal line). I think you are right, @Miffyli; I am starting to consider migrating to stable-baselines3 (at least my next research project will not be in stable-baselines2).

But my code base is very long (spectral normalization, dense connections, a custom AMSGrad optimizer implementation, and a custom Q-value network method for Soft Actor-Critic to implement the Wolpertinger algorithm), which is the major cause of my hesitation.

[TensorBoard screenshot: episode reward plot showing a flat horizontal line]

Miffyli (Collaborator) commented Jul 27, 2022

Unfortunately I do not have other tips to give and no time to start digging through custom code to find errors :( . I know this is a very bad, maybe even rude-ish, answer that assumes it is a user error, but there are many places where an env implementation can go wrong and cause confusing behaviour like this. If possible, I would recommend taking an environment where the rewards work as expected and changing it step by step towards your final env.
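
One quick way to narrow this down (a rough sketch, not an official tool; `CustomEnv` is again a placeholder) is to wrap the env the same way stable-baselines does internally and check whether nonzero rewards still come out of the vectorized step:

```python
# Sanity check: wrap the custom env in a DummyVecEnv, as stable-baselines does
# internally, and verify that nonzero rewards survive the wrapping.
# "CustomEnv" is a placeholder for the actual environment class.
import numpy as np
from stable_baselines.common.vec_env import DummyVecEnv  # SB3: stable_baselines3.common.vec_env

vec_env = DummyVecEnv([lambda: CustomEnv()])
obs = vec_env.reset()
for _ in range(1000):
    actions = [vec_env.action_space.sample()]
    obs, rewards, dones, infos = vec_env.step(actions)
    if np.any(rewards != 0):
        print("nonzero reward reached the agent side:", rewards)
        break
else:
    print("no nonzero reward seen in 1000 random steps")
```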

moizuet (Author) commented Jul 28, 2022

No problem, I am trying to resolve it, and I will report the reasons as soon as I find them out.
By the way, I want to ask: if we use stable-baselines3, which uses PyTorch (i.e., eager mode of execution), will training be slower than the TensorFlow version of stable-baselines, which uses graph mode (generally faster computation)?

Miffyli (Collaborator) commented Jul 28, 2022

I think in SB3 other things become the bottleneck before PyTorch's eager mode is the slowing factor: handling the data, computing returns, etc. takes much more time than actually running the network graph. I personally cannot speak to performance beyond RL, but AFAIK it is not worth the effort to switch to TF2 just to get a bit of a speed boost.
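
If you want to see where the time actually goes, you could profile a short training run; a rough sketch (using SB3's PPO on a toy env, not a rigorous benchmark):

```python
# Rough profiling sketch: run a short PPO training on CartPole and print the
# functions where cumulative time is spent (data handling vs. network calls).
import cProfile
import pstats

from stable_baselines3 import PPO

model = PPO("MlpPolicy", "CartPole-v1", verbose=0)

profiler = cProfile.Profile()
profiler.enable()
model.learn(total_timesteps=10_000)
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```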

moizuet (Author) commented Jul 29, 2022

I think that if the number of CPUs (for parallel rollouts) is much larger than the number of GPU SMs, the data will always be available for training and the GPUs will always be busy, so eager mode may indeed become the bottleneck (though I agree it may not be too severe). Thanks a lot!
