
Implement PPO-DNA algorithm for Atari #234

Open
wants to merge 47 commits into master

Conversation

jseppanen
Contributor

@jseppanen commented Jul 19, 2022

Description

Add an implementation of the PPO-DNA algorithm for Atari EnvPool.
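
For context: DNA trains separate policy and value networks, each with its own PPO-style update, and then distills the value network's value estimates into the policy network while a KL penalty keeps the policy distribution approximately unchanged. The snippet below is a minimal, illustrative sketch of that distillation phase in PyTorch; the function, module names, call signatures, and the beta coefficient are hypothetical and not taken from this PR's code.

# Illustrative sketch of the DNA distillation phase (not code from this PR).
# Assumes policy_net(obs) returns (action_logits, value) and value_net(obs)
# returns a value estimate; beta weighs the KL constraint on the policy.
import torch
import torch.nn.functional as F
from torch.distributions import Categorical, kl_divergence

def distill_minibatch(policy_net, value_net, optimizer, obs, old_logits, beta=1.0):
    with torch.no_grad():
        target_values = value_net(obs)             # teacher value targets
        old_dist = Categorical(logits=old_logits)  # policy before distillation
    new_logits, new_values = policy_net(obs)
    new_dist = Categorical(logits=new_logits)
    value_loss = F.mse_loss(new_values, target_values)  # learn the teacher's value estimates
    kl_loss = kl_divergence(old_dist, new_dist).mean()  # keep the policy close to what it was
    loss = value_loss + beta * kl_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()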

Paper reproduction (attempt)

Here are the episodic rewards after 200M environment steps (50M environment interactions with a frame skip of 4), compared to Fig. 6 in the original paper:

  • BattleZone: 82 000 ± 19 000 (comparable to about 60 000 in the paper)
  • DoubleDunk: -3.5 ± 1.1 (worse than about 1.0 in the paper)
  • NameThisGame: 21 700 ± 2 500 (comparable to about 20 000 in the paper)
  • Phoenix: 225 000 ± 76 000 (better than about 80 000 in the paper)
  • Qbert: 12 000 ± 5 000 (worse than about 30 000 in the paper)

However, I used the default networks and environments from the CleanRL PPO Atari implementation, so they probably differ from the setup in the original paper. In summary, compared with the paper's results, this implementation gets better returns on one task, comparable returns on two tasks, and worse returns on two of the five tasks.

Results from Figure 6 in the paper:
[figure6: screenshot of Figure 6 from the paper]

When compared against the CleanRL PPO Atari EnvPool implementation, this implementation performs better on six out of nine tasks. See the detailed learning curves here:
PPO-DNA vs PPO on Atari Envpool
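
(For context, the "default networks" mentioned above refer to the Nature-DQN style CNN used across CleanRL's PPO Atari examples; the sketch below is illustrative only, and the layer sizes in this PR may differ.)

# Rough sketch of the Nature-DQN style CNN torso used in CleanRL's PPO Atari
# examples (input: 4 stacked 84x84 frames); illustrative, not this PR's code.
import torch.nn as nn

cnn_torso = nn.Sequential(
    nn.Conv2d(4, 32, kernel_size=8, stride=4),
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=4, stride=2),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, stride=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 512),
    nn.ReLU(),
)
policy_head = nn.Linear(512, 18)  # one logit per action (18 is illustrative)
value_head = nn.Linear(512, 1)    # scalar state-value estimate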

Reference

Aitchison & Sweetser, "DNA: Proximal Policy Optimization with a Dual Network Architecture"

cc @maitchison

Types of changes

  • Bug fix
  • New feature
  • New algorithm
  • Documentation

Checklist:

  • I've read the CONTRIBUTION guide (required).
  • I have ensured pre-commit run --all-files passes (required).
  • I have updated the documentation.
  • I have updated the tests accordingly (if applicable).

If you are adding new algorithms or your change could result in a performance difference, you may need to (re-)run tracked experiments. See #137 as an example PR.

  • I have contacted vwxyzjn to obtain access to the openrlbenchmark W&B team (required).
  • I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).
  • I have added additional documentation and previewed the changes via mkdocs serve.
    • I have explained note-worthy implementation details.
    • I have explained the logged metrics.
    • I have added links to the original paper and related papers (if applicable).
    • I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
    • I have added the learning curves (in PNG format with width=500 and height=300).
    • I have added links to the tracked experiments.
    • I have updated the overview sections in the docs and the repo.
    • I have added the commands used to run experiments to the benchmark folder (e.g., benchmark/ppo.sh).
  • I have updated the tests accordingly (if applicable).
  • Determine whether to set torch.backends.cuda.matmul.allow_tf32 = False (@vwxyzjn)


@vwxyzjn
Owner

vwxyzjn commented Jul 20, 2022

@maitchison has expressed interest in helping review this PR. Thank you, Matthew! I will also try to read the paper and add some comments.

@maitchison

Small thing

Here's the episodic rewards after 200M environment steps (50M gradient updates), compared to Fig. 6 in the original paper:

should be

Here's the episodic rewards after 200M environment steps (50M environment interactions), compared to Fig. 6 in the original paper:

The algorithm will make many more than 50M gradient updates due to the number of mini-batches.

@jseppanen
Contributor Author

@vwxyzjn sure, I added benchmarks/ppo_dna.sh. Also, maybe I haven't communicated my experiment results clearly enough; I could try to consolidate them into one place.

Disable dropout etc. in teacher network during distillation
Disable dropout etc. during rollouts
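
(Context for the two commits above: in PyTorch, dropout and normalization layers are switched off by putting a module into eval mode. Below is a minimal, hypothetical sketch of doing this for the teacher network while computing distillation targets; it is not code from this PR.)

import torch

@torch.no_grad()
def teacher_targets(teacher_net, obs):
    # Put the teacher into eval mode so dropout / batch-norm style layers
    # behave deterministically while producing distillation targets,
    # then restore whichever mode it was in before.
    was_training = teacher_net.training
    teacher_net.eval()
    targets = teacher_net(obs)
    teacher_net.train(was_training)
    return targets
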
@bragajj
Contributor

bragajj commented Nov 2, 2022

@jseppanen wondering if you're still interested in this PR. I took a quick look at your wandb and see runs in your account for Phoenix-v5, NameThisGame-v5, and DoubleDunk-v5 at 50M steps, just not Pong. Trying to understand where we ended up here.

Was it that you were reporting results at different step counts despite having 50M step runs completed? Or that we just need ppo_envpool 50M results for comparison? Just want to see if I can provide any help or guidance.

@maitchison

Hi @vwxyzjn, I now have some time I can put into this and would be happy to finish off the last few things that need doing. It looks like @jseppanen has got it mostly there, so it shouldn't take too long. I have access to a cluster where I can run any additional experiments if needed.

@vwxyzjn
Owner

vwxyzjn commented Nov 20, 2022

Hey, sorry folks for not replying sooner. I looked into the PR a bit more, and it looks like @jseppanen has already addressed my comments on the 50M steps. Thanks a lot. I generated the following plots using https://github.com/vwxyzjn/ppo-atari-metrics/blob/main/rlops.py:

python rlops.py --wandb-project-name envpool-atari \
    --wandb-entity openrlbenchmark \
    --filters 'ppo_dna_atari_envpool_94fc331?wpn=cleanrl&we=jseppanen' 'ppo_atari_envpool_xla_jax_truncation?metric=charts/avg_episodic_return'   \
    --env-ids BattleZone-v5 DoubleDunk-v5 NameThisGame-v5 Phoenix-v5 Qbert-v5 Pong-v5 BeamRider-v5 Breakout-v5 Tennis-v5 \
    --output-filename compare.png --scan-history

[image: comparison plots generated by rlops.py]

The results look good. I will go ahead and run the Pong-v5 experiments to match the 10M steps, and that should cover all of the experiments. @jseppanen, would you mind moving the runs from your entity to openrlbenchmark? You can move them as shown in the video below:

[video: Screen.Recording.2022-11-19.at.9.17.22.PM.mov]

A note on environment preprocessing

The preprocessing steps should look like the following, but full_action_space=True is not currently supported by EnvPool (sail-sg/envpool#220). Let's put this note in the documentation and not block this PR any longer.

envs = envpool.make(
    args.env_id,
    env_type="gym",
    num_envs=args.num_envs,
    episodic_life=False,  # Machado et al. 2017 (Revisiting ALE: Eval. protocols) p. 6
    repeat_action_probability=0.25,  # Machado et al. 2017 (Revisiting ALE: Eval. protocols) p. 12
    noop_max=1,  # Machado et al. 2017 (Revisiting ALE: Eval. protocols) p. 12 (no-op is deprecated in favor of sticky actions)
    max_episode_steps=int(108000 / 4),  # Hessel et al. 2018 (Rainbow DQN), Table 3: max frames per episode
    reward_clip=True,
    seed=args.seed,
    # full_action_space=True,  # currently not supported by EnvPool; Machado et al. 2017 (Revisiting ALE: Eval. protocols), Table 5
)
