
Enhancing Termination and Truncation Handling in CleanRL's PPO Algorithm #448

Open
wants to merge 4 commits into base: master

Conversation

Josh00-Lu commented Feb 29, 2024

Description

I noticed that the issues with termination and truncation handling in the PPO (Proximal Policy Optimization) implementation had not yet been addressed, so I have added the relevant handling. Specifically, in a vector environment (vec env), when a termination or truncation occurs at the current time step, the actual next state is provided in "final_observation", and we need to record this information. Also note that here, done = termination or truncation.
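For illustration, here is a minimal sketch of that recording step (not the exact PR code; `envs`, `actions`, and the NumPy-based buffers are assumed from the standard CleanRL rollout loop):

```python
import numpy as np

# Step the vector env; with Gymnasium's same-step autoreset, a finished sub-env
# already returns the first observation of the NEW episode in next_obs.
next_obs, step_rewards, terminations, truncations, infos = envs.step(actions)
next_dones = np.logical_or(terminations, truncations)  # done = termination or truncation

# The true last observation of a finished episode is only available in
# infos["final_observation"], so record it before it is lost.
real_next_obs = next_obs.copy()
if "final_observation" in infos:
    for idx, done in enumerate(next_dones):
        if done and infos["final_observation"][idx] is not None:
            real_next_obs[idx] = infos["final_observation"][idx]
```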

On the other hand, when computing delta and the advantages with GAE (Generalized Advantage Estimation), delta should be computed using termination (so that a truncated episode still bootstraps from the value of the next state), while the advantages should be accumulated using done (so the recursion never crosses an episode boundary). For more details, please refer to the code and comments in this PR.
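As a rough sketch (not the exact PR code, which stores "next_*" buffers rather than indexing with t+1), the GAE computation then distinguishes the two signals like this, assuming `[num_steps, num_envs]` tensors for the rollout buffers:

```python
import torch

def compute_gae(rewards, values, dones, terminations,
                next_value, next_done, next_termination,
                gamma=0.99, gae_lambda=0.95):
    """Sketch: bootstrap the TD error unless the episode terminated,
    but cut the advantage recursion at any episode boundary (done)."""
    num_steps = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    lastgaelam = 0.0
    for t in reversed(range(num_steps)):
        if t == num_steps - 1:
            nextnontermination = 1.0 - next_termination
            nextnondone = 1.0 - next_done
            nextvalues = next_value
        else:
            nextnontermination = 1.0 - terminations[t + 1]
            nextnondone = 1.0 - dones[t + 1]
            nextvalues = values[t + 1]
        # delta uses termination only: a truncated episode still bootstraps from V(s').
        delta = rewards[t] + gamma * nextvalues * nextnontermination - values[t]
        # the recursion uses done: never propagate advantages across an episode boundary.
        advantages[t] = lastgaelam = delta + gamma * gae_lambda * nextnondone * lastgaelam
    return advantages, advantages + values
```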

Notes:

  • Since we only need dones[t+1] and terminations[t+1] when computing advantages with GAE, I have made slight code adjustments to simplify the logic.
  • We now also need to store next_terminations and next_obs.

Types of changes

  • Bug fix
  • New feature
  • New algorithm
  • Documentation

Checklist:

  • I've read the CONTRIBUTION guide (required).
  • I have ensured pre-commit run --all-files passes (required).
  • I have updated the tests accordingly (if applicable).
  • I have updated the documentation and previewed the changes via mkdocs serve.
    • I have explained note-worthy implementation details.
    • I have explained the logged metrics.
    • I have added links to the original paper and related papers.

If you need to run benchmark experiments for performance-impacting changes:

  • I have contacted @vwxyzjn to obtain access to the openrlbenchmark W&B team.
  • I have used the benchmark utility to submit the tracked experiments to the openrlbenchmark/cleanrl W&B project, optionally with --capture_video.
  • I have performed RLops with python -m openrlbenchmark.rlops.
    • For new feature or bug fix:
      • I have used the RLops utility to understand the performance impact of the changes and confirmed there is no regression.
    • For new algorithm:
      • I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
    • I have added the learning curves generated by the python -m openrlbenchmark.rlops utility to the documentation.
    • I have added links to the tracked experiments in W&B, generated by python -m openrlbenchmark.rlops ....your_args... --report, to the documentation.


vwxyzjn (Owner) commented Apr 14, 2024

Hi @Josh00-Lu thanks for the PR. Have you had cases where handling termination / truncation has made a huge difference in the performance?

Josh00-Lu (Author) commented Apr 15, 2024

Hi @Josh00-Lu thanks for the PR. Have you had cases where handling termination / truncation has made a huge difference in the performance?

Yes, especially for MuJoCo environments like Swimmer, where truncations are encountered much more frequently than terminations. When termination is predominant, the problem is not as obvious (and explicit handling does not hurt performance either). I think the improvement is intuitive: it lets GAE perform more reasonable temporal credit assignment in PPO by explicitly distinguishing termination from truncation. Admittedly, extensive, larger-scale experiments may be needed to verify this.

gmarkkula commented

+1 on seeing cases where it makes a big difference. CleanRL consistently performed much worse than SB3 on one of my environments, but when I added bootstrapping from the terminal observation to CleanRL, it achieved the same performance as SB3. See this comment in another thread on this topic.

varadVaidya commented

+1 on massive improvements using this PR's changes. I have a custom quadrotor environment that works well with SB3, but the CleanRL implementation gave poor performance, including catastrophic forgetting of progress on the best seeds (the seeds that gave the best performance with SB3). With the PR's correct handling of truncation, I get performance matching SB3.

Josh00-Lu (Author) commented Apr 23, 2024

+1 on massive improvements using this PR's changes. I have a custom quadrotor environment that works well with SB3, but the CleanRL implementation gave poor performance, including catastrophic forgetting of progress on the best seeds (the seeds that gave the best performance with SB3). With the PR's correct handling of truncation, I get performance matching SB3.

Thank you. It's great to see that!

sdpkjc (Collaborator) commented Apr 23, 2024

Thank you for this PR! I had previously noticed the issue and made a simple modification to the CleanRL PPO by referencing the fix done in DLR-RM/stable-baselines3#658. Here is my code: https://gist.github.com/sdpkjc/71cbba3bd439d754ef9c026c31658ecd#file-ppo_continuous_action-py-L431-L437

It only requires adding a few lines to the existing code.
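For context, a rough sketch of that SB3-style fix (not the exact gist code; `agent`, `rewards`, `step`, `args.gamma`, `device`, and the step outputs are the usual names from CleanRL's ppo_continuous_action.py):

```python
# When a sub-env hit the time limit (truncated but not terminated), bootstrap its
# last reward with the value of the real final observation, which would otherwise
# be discarded by the autoreset.
for idx, trunc in enumerate(truncations):
    if trunc and not terminations[idx]:
        final_obs = torch.as_tensor(
            infos["final_observation"][idx], dtype=torch.float32, device=device
        )
        with torch.no_grad():
            final_value = agent.get_value(final_obs.unsqueeze(0))
        rewards[step, idx] += args.gamma * final_value.item()
```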

I found that this modification offers limited performance improvements in the default MuJoCo environments, but it yields benefits in scenarios that frequently involve truncation.

I also noticed that Costa @vwxyzjn seems to have been interested in this issue previously. In his blog post https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/, he mentioned this problem with PPO. However, he opted not to fix it in his implementation for the sake of fidelity to the original algorithm.


Whether or not to address this issue depends on whether CleanRL prioritizes higher performance or a more faithful implementation of the original algorithm. This is a topic worth discussing. As far as I know, SB3 has addressed this issue.

Perhaps we could consider adding an argument to allow users to choose whether to handle this correctly.

vwxyzjn (Owner) commented Apr 23, 2024

Thanks all for the discussion. I had opted not to implement the correct behavior also because at some point I did a comprehensive eval and it did not seem to matter. There are also issues with an efficient implementation: the current Gymnasium design is quite inefficient in that we need to do extra forward passes whenever there are terminal states.

sail-sg/envpool#194 (comment)

@Josh00-Lu closed this May 27, 2024
pseudo-rnd-thoughts (Collaborator) commented May 28, 2024

Sorry for the late message. Personally, if the original file isn't to be changed, I would advocate for a separate file with this implementation. That would spare the people reporting the problem above from having to make the change themselves.
Second, @vwxyzjn, Gymnasium v1.0.0 updates the vector implementation to match EnvPool's autoreset style, i.e., there is no more final_observation, so most of CleanRL will need to be updated to reflect this change. This means the change suggested here will not cause the additional forward pass.
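To illustrate the difference, here is a rough sketch of a rollout step under that next-step autoreset style; this is my reading of the EnvPool/Gymnasium v1.0 behaviour, with `envs`, `obs`, `policy`, and `num_steps` as placeholder names:

```python
import numpy as np

autoreset = np.zeros(envs.num_envs, dtype=bool)  # which sub-envs reset on this step
for step in range(num_steps):
    actions = policy(obs)
    next_obs, step_rewards, terminations, truncations, infos = envs.step(actions)
    # With next-step autoreset, next_obs for a sub-env with done=True is still the
    # real final observation, so it can be used for bootstrapping directly -- no
    # extra forward pass and no "final_observation" key is needed.
    dones = np.logical_or(terminations, truncations)
    # On a step where autoreset[i] is True, obs[i] was the reset observation of a new
    # episode, so that transition should not be treated as continuing the old episode.
    autoreset = dones
    obs = next_obs
```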

Therefore, my suggestion would be to reopen this PR, include it in a new file (ppo_truncted.py?), and update to Gymnasium v1.0.0 within this PR or afterwards.

@Josh00-Lu
Copy link
Author

Josh00-Lu commented May 29, 2024

Sorry for the late message. Personally, if the original file isn't to be changed, I would advocate for a separate file with this implementation. That would spare the people reporting the problem above from having to make the change themselves. Second, @vwxyzjn, Gymnasium v1.0.0 updates the vector implementation to match EnvPool's autoreset style, i.e., there is no more final_observation, so most of CleanRL will need to be updated to reflect this change. This means the change suggested here will not cause the additional forward pass.

Therefore, my suggestion would be to reopen this PR, include it in a new file (ppo_truncted.py?), and update to Gymnasium v1.0.0 within this PR or afterwards.

I agree with your idea. I have reopened this PR and put the implementation in a new file called ppo_continuous_action_truncted.py, while keeping ppo_continuous_action.py unchanged. If the other files in this repository are updated to Gymnasium v1.0.0, I will adjust this file accordingly.
