PPO Implementation Ignores Time Limits #20

bheijden · 2024-03-30T10:23:48Z

Hi,

The current PPO implementation does not seem to account for time limits. While the EpisodeWrapper from brax is used, which tracks a truncation flag (source) in the info dictionary for correct termination handling, it appears this aspect is overlooked in the implementation.

Related Information:

Issue discussing the treatment of infinite horizon tasks as episodic: DLR-RM/stable-baselines3#284
Paper on Time Limits in Reinforcement Learning: Time Limits in RL (arXiv)

Could this be an oversight, or am I missing a part of the implementation that addresses this?

luchris429 · 2024-04-05T14:40:23Z

I believe there's ongoing discussion on this for CleanRL, though I've not caught up with the latest.

vwxyzjn/cleanrl#198

My understanding is that properly handling this does not usually result in significant performance differences.

sail-sg/envpool#194 (comment)

luchris429 · 2024-04-05T14:42:45Z

That being said, if you would be interested in doing a PR for this with another file (say, ppo_time_limits.py), that would be great!

bheijden · 2024-04-06T08:23:47Z

Thanks, that clears things up. Wasn't sure if it was perhaps handled elsewhere.

Concerning the ablation:
It looks like those benchmarks were run on Atari games, which, as far as I understand, aren't impacted by truncation: episodes there usually just terminate. Truncation matters for tasks with no natural end, which is common in robotics scenarios like Ant or Cheetah. So I'd be cautious about drawing conclusions solely from Atari studies. In fact, there are simpler settings that absolutely require proper truncation handling to be solved, like the infinite-horizon example from Time Limits in RL (arXiv).

If I end up requiring truncation, I'll see if I can cook up a PR.
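
For anyone following along, here is a rough sketch of what treating truncation differently from termination could look like in the advantage computation. It is not the repository's code. It assumes a single-environment rollout of length T from an auto-resetting env, and that `terminated` and `truncated` have already been separated (for instance as `done * (1 - truncation)` and `truncation`, assuming the wrapper's `done` also fires on time-outs). The function name and arguments are hypothetical. Because the value after a time-out belongs to the reset state, this sketch simply drops the TD error on truncated steps, which is one simple approximation rather than exact bootstrapping.

```python
import jax
import jax.numpy as jnp


def gae_with_time_limits(rewards, values, last_value, terminated, truncated,
                         gamma=0.99, gae_lambda=0.95):
    """GAE over a [T]-shaped rollout, distinguishing termination from truncation.

    terminated[t] = 1 if step t ended in a true terminal state.
    truncated[t]  = 1 if step t ended only because the time limit was hit.
    (Hypothetical helper; names and signature are illustrative.)
    """
    rewards = jnp.asarray(rewards, jnp.float32)
    values = jnp.asarray(values, jnp.float32)
    terminated = jnp.asarray(terminated, jnp.float32)
    truncated = jnp.asarray(truncated, jnp.float32)
    last_value = jnp.asarray(last_value, jnp.float32)

    # V(s_{t+1}) for every step; the final entry is the bootstrap value.
    next_values = jnp.concatenate([values[1:], last_value[None]])
    # Either kind of episode end stops advantages flowing across the boundary.
    done = jnp.maximum(terminated, truncated)

    def step(next_adv, xs):
        reward, value, next_value, term, trunc, d = xs
        # True termination: the successor state has zero value.
        delta = reward + gamma * next_value * (1.0 - term) - value
        # Truncation: next_value is the value of the reset state, so the
        # TD error is not meaningful here and is dropped.
        delta = delta * (1.0 - trunc)
        adv = delta + gamma * gae_lambda * (1.0 - d) * next_adv
        return adv, adv

    xs = (rewards, values, next_values, terminated, truncated, done)
    _, advantages = jax.lax.scan(step, jnp.zeros((), jnp.float32), xs, reverse=True)
    return advantages, advantages + values
```

With this masking, a truncated step contributes zero advantage instead of being pushed toward a return that pretends the episode ended with zero future value, while truly terminal steps are handled as usual.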

luchris429 · 2024-04-11T14:00:56Z

That's a good point! I think this could be worth doing in a separate file so people can see the differences. There is a significant downside of doubling the observation size.
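
To illustrate where the extra storage would come from, here is a hedged sketch of the alternative that bootstraps exactly: keep the true successor observation (before auto-reset) next to each observation and fold `gamma * V(next_obs)` into the reward on truncated steps. The `Transition` fields, `bootstrap_truncated_rewards`, and `value_fn` are all hypothetical names for illustration, not anything in the current files.

```python
from typing import Callable, NamedTuple

import jax.numpy as jnp


class Transition(NamedTuple):
    """Hypothetical rollout container for illustration only."""
    obs: jnp.ndarray         # [T, obs_dim]
    next_obs: jnp.ndarray    # [T, obs_dim], true successor before any reset
    reward: jnp.ndarray      # [T]
    terminated: jnp.ndarray  # [T], 1.0 on true terminal steps
    truncated: jnp.ndarray   # [T], 1.0 on time-limit steps


def bootstrap_truncated_rewards(traj: Transition,
                                value_fn: Callable[[jnp.ndarray], jnp.ndarray],
                                gamma: float = 0.99) -> jnp.ndarray:
    """Add gamma * V(next_obs) to the reward wherever the episode was truncated.

    After this adjustment, truncated steps can be treated like ordinary
    episode ends in a standard GAE pass.
    """
    final_values = value_fn(traj.next_obs)  # [T]
    return traj.reward + gamma * traj.truncated * final_values
```

Storing `next_obs` for every step is what roughly doubles the observation memory of the rollout buffer; in exchange, the value estimate at the time limit comes from the state the agent actually reached rather than from the reset state.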

gmarkkula mentioned this issue Apr 10, 2024

PPO timeout proper handling vwxyzjn/cleanrl#198

Open
