Wrong reward at terminal date when training with SB3 #1228

Open
iezfly commented May 11, 2024

Describe the bug
In StockTradingEnv, when training with an SB3 VecEnv (created via get_sb_env), no reward is computed at the terminal date (say day X) because the training data contains no next-day closing price.
However, when the SB3 algorithm runs collect_rollouts, it still writes the previous step's reward (day X-1's reward) into the rollout_buffer.
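For context, here is a simplified paraphrase of the terminal branch as I read it (my own sketch, not a verbatim copy of the FinRL source): self.reward still holds the value computed on the previous step when the terminal return fires.

```python
# Paraphrased sketch of StockTradingEnv.step() at the terminal date
# (my reading of the code, not the exact FinRL source).
def step(self, actions):
    self.terminal = self.day >= len(self.df.index.unique()) - 1
    if self.terminal:
        # self.reward was last updated on day X-1 and is NOT recomputed here,
        # yet SB3's collect_rollouts still stores it in the rollout_buffer.
        return self.state, self.reward, self.terminal, False, {}
    # ... otherwise: execute trades, advance self.day, recompute self.reward ...
```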

To Reproduce
Steps to reproduce the behavior (a rough reproduction sketch follows the steps):

  1. Pick a short training period so the terminal date is reached quickly while debugging, e.g. 5 days in total.
  2. Set a breakpoint after rollout_buffer.add() in on_policy_algorithm.py (SB3 module), line 232.
  3. Step until after day 5 (i.e. after the 4th experience is added).
  4. Inspect the rollout_buffer variable: rewards[3] is identical to the previous reward.
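A possible reproduction script, assuming a preprocessed DataFrame df, the date range, and env_kwargs as in the FinRL tutorials (all three are placeholders here, not taken from the original report):

```python
# Hypothetical reproduction sketch; df, the date range, and env_kwargs are placeholders.
from stable_baselines3 import PPO
from finrl.meta.env_stock_trading.env_stocktrading import StockTradingEnv

df_small = df[(df["date"] >= "2020-01-02") & (df["date"] <= "2020-01-08")]  # ~5 trading days
env = StockTradingEnv(df=df_small, **env_kwargs)
vec_env, _ = env.get_sb_env()

model = PPO("MlpPolicy", vec_env, n_steps=8, verbose=1)
# Put a breakpoint after rollout_buffer.add(...) in SB3's on_policy_algorithm.py
# and inspect rollout_buffer.rewards after the terminal step.
model.learn(total_timesteps=8)
```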

Expected behavior
After examining the SB3 code, I found it already handles the case where the next state is unobservable at the end of an episode: it bootstraps the reward with the discounted value prediction for the terminal observation, although I'm not sure about the mathematical justification.

            # see GitHub issue #633
            for idx, done in enumerate(dones):
                if (
                    done
                    and infos[idx].get("terminal_observation") is not None
                    and infos[idx].get("TimeLimit.truncated", False)
                ):
                    terminal_obs = self.policy.obs_to_tensor(infos[idx]["terminal_observation"])[0]
                    with th.no_grad():
                        terminal_value = self.policy.predict_values(terminal_obs)[0]  # type: ignore[arg-type]
                    rewards[idx] += self.gamma * terminal_value

But this requires the env to return terminated = False, truncated = True, whereas StockTradingEnv returns terminated = True, truncated = False. (In SB3, truncated and terminated are treated as mutually exclusive.)
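For reference, this is roughly how the SB3 VecEnv collapses the gymnasium 5-tuple and derives the TimeLimit.truncated flag that the bootstrap branch above checks (a paraphrase of the behavior, not a verbatim SB3 excerpt):

```python
# Paraphrased sketch of the VecEnv step handling (not the exact SB3 source).
obs, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated
# Only truncation WITHOUT termination triggers the gamma * V(s_terminal) bootstrap.
info["TimeLimit.truncated"] = truncated and not terminated
if done:
    info["terminal_observation"] = obs  # saved before the automatic reset
```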

I think that in a stock trading scenario the last day of the training data should be a time-limit truncation rather than a true terminal state, because the state still has future reward (value) to collect, unless the agent holds zero shares and the policy would never buy again from that state.

Screenshots
(screenshot of the debugger session omitted)

Additional context
As a workaround, I simply changed the return value of the terminal case in the StockTradingEnv code (line 300) to:
return self.state, 0.0, False, True, {}  # reward = 0, which could cause other issues

However, I have not considered the impact on ElegantRL, Ray, or the portfolio-management environments.
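An alternative I have not tested (a hypothetical sketch of my own, not part of FinRL) would be a thin wrapper that re-labels the data-exhaustion terminal as a truncation, leaving StockTradingEnv itself untouched for the ElegantRL / Ray code paths:

```python
import gymnasium as gym

class TruncateAtDataEnd(gym.Wrapper):
    """Hypothetical wrapper: treat 'ran out of training data' as truncation."""

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if terminated:
            # Re-label so SB3 bootstraps with the predicted terminal value
            # instead of treating the last day as a true terminal state.
            terminated, truncated = False, True
        return obs, reward, terminated, truncated, info

# Usage sketch: wrap before building the VecEnv, e.g.
#   vec_env = DummyVecEnv([lambda: TruncateAtDataEnd(StockTradingEnv(df=df_small, **env_kwargs))])
```

Note that this only fixes the terminated/truncated flags; the stale reward at the final step would still need handling (e.g. zeroing it as in the one-line change above).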
