Question about (time) ordering of data / predictions for the continue predictor #65

PaulScemama opened this issue Jun 6, 2023 · 0 comments

Hi, I'm quite inexperienced with Reinforcement Learning, so forgive me if my question is trivial :). I have a quick question about the continue predictor.

In a typical Gym environment with an agent following a random policy, I've seen data-collection loops like this:

import gymnasium as gym

gym_env = gym.make("CartPole-v1")   # any environment; just an example
num_episodes = 10

for _ in range(num_episodes):                                                          # 1
  # First observation of an episode                                                    # 2
  obs, info = gym_env.reset()                                                          # 3
                                                                                       # 4
  done = False                                                                         # 5
  while not done:                                                                      # 6
    action = gym_env.action_space.sample()                                             # 7
    observation, reward, done, _, _ = gym_env.step(action)                             # 8

The continue predictor is supposed to predict whether an episode will terminate or not. As I see it, for each non-episode-initializing step (lines 7-8 above) we get the following (see the small sketch after this list):

  • an action | $a_t$
  • a reward resulting from the action | $r_t$
  • a "next" observation as a result of the action | $x_t$
  • a "done" (or alternatively continue) flag indicating if the episode has terminated | $c_t$

My question is: do we use $x_t$ to predict $c_t$? More specifically, does the stochastic posterior incorporate $x_t$, so that the "model state" (the concatenation of the deterministic and stochastic states) is used to predict $c_t$?

Another way of asking the question: do we use the observation retrieved at the same step at which we receive the continue flag to predict that flag? I.e., in the line observation, reward, done, _, _ = gym_env.step(action), do we incorporate observation into the stochastic state, which is then used to help predict done?
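
To make the ordering I have in mind concrete, here is a toy sketch of the computation I'm imagining; the functions are stand-ins I wrote for illustration, not the actual components in this repo:

import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the world-model components (purely illustrative).
def sequence_model(h_prev, z_prev, a_prev):
    # Deterministic recurrent state h_t from the previous step's state and action.
    return np.tanh(h_prev + z_prev + a_prev)

def posterior(h_t, x_t_embedding):
    # Stochastic state z_t; crucially, it conditions on the embedding of x_t.
    return np.tanh(h_t + x_t_embedding)

def continue_predictor(model_state):
    # Predicts the continue flag c_t from the full model state.
    return 1.0 / (1.0 + np.exp(-model_state.sum()))

# One step t, in the order I currently picture it:
h_prev, z_prev, a_prev = rng.normal(size=(3, 4))
x_t_embedding = rng.normal(size=4)                  # embedding of observation x_t

h_t = sequence_model(h_prev, z_prev, a_prev)        # 1. advance deterministic state
z_t = posterior(h_t, x_t_embedding)                 # 2. posterior incorporates x_t
model_state = np.concatenate([h_t, z_t])            # 3. "model state"
c_t_pred = continue_predictor(model_state)          # 4. predict c_t from it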

Thanks in advance!
