
PPO + JAX + EnvPool + MuJoCo #217 (Open)

vwxyzjn wants to merge 29 commits into master
Conversation

vwxyzjn (Owner) commented Jun 27, 2022

Description

Types of changes

  • Bug fix
  • New feature

Checklist:

  • I've read the CONTRIBUTION guide (required).
  • I have ensured pre-commit run --all-files passes (required).
  • I have updated the documentation and previewed the changes via mkdocs serve.
  • I have updated the tests accordingly (if applicable).

If you are adding new algorithms or your change could result in a performance difference, you may need to (re-)run tracked experiments. See #137 as an example PR.

  • I have contacted @vwxyzjn to obtain access to the openrlbenchmark W&B team (required).
  • I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).
  • I have added additional documentation and previewed the changes via mkdocs serve.
    • I have explained note-worthy implementation details.
    • I have explained the logged metrics.
    • I have added links to the original paper and related papers (if applicable).
    • I have added links to the PR related to the algorithm.
    • I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
    • I have added the learning curves (in PNG format with width=500 and height=300).
    • I have added links to the tracked experiments.
  • I have updated the tests accordingly (if applicable).


vwxyzjn (Owner, Author) commented Jun 27, 2022

It seems that there isn't that much benefit for PPO: the SPS metric is not much better, as shown below.

[screenshot]

Note: there is probably a bug... that's why the sample efficiency suffers.

Maybe I was implementing PPO using the incorrect paradigm with JAX. Any thoughts on this @joaogui1 and @ikostrikov? Thanks!

ikostrikov commented:

I'm not sure if

obs = obs.at[step].set(x)

is indeed in-place inside of jit. I think in this specific case it still creates a new array. I think it's truly in-place only for specific use cases. For example, when memory is donated (on TPU and GPU only). Could you double check that?

Comment on lines 329 to 333:

if args.anneal_lr:
    frac = 1.0 - (update - 1.0) / num_updates
    lrnow = frac * args.learning_rate
    agent_optimizer_state[1].hyperparams["learning_rate"] = lrnow
    agent_optimizer.update(agent_params, agent_optimizer_state)
vwxyzjn (Owner, Author) commented:

[screenshot]

It turns out these four lines of code cut the throughput in half. We are going to need a better learning-rate annealing approach, probably using the official API.
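A minimal sketch of the schedule-based alternative (illustrative values, not the PR's final code; it assumes the optimizer is built with optax): pass a schedule to the optimizer so the annealed learning rate is computed inside the jitted update step instead of being written into the optimizer state from Python.

import optax

# Illustrative values; a real script would derive these from args.
num_updates = 1000
update_epochs = 4
num_minibatches = 4

schedule = optax.linear_schedule(
    init_value=3e-4,
    end_value=0.0,
    transition_steps=num_updates * update_epochs * num_minibatches,
)
tx = optax.chain(
    optax.clip_by_global_norm(0.5),
    optax.adam(learning_rate=schedule),  # the schedule is evaluated at every optimizer step
)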

A reviewer replied:

From my experience, there's a gain if the main for loop can be replaced with lax.fori_loop.
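For illustration, a tiny self-contained sketch (not the PR's code; the parameter update is a stand-in) of staging an outer loop with lax.fori_loop so it is compiled once instead of being unrolled or re-entered from Python:

import jax
import jax.numpy as jnp

def one_update(i, carry):
    params, lr0, num_updates = carry
    frac = 1.0 - i / num_updates          # same annealing arithmetic as above
    lr = frac * lr0
    params = params - lr * 2.0 * params   # stand-in for a real gradient step
    return (params, lr0, num_updates)

# The whole loop runs inside one XLA computation.
params, _, _ = jax.lax.fori_loop(0, 100, one_update, (jnp.ones(4), 3e-4, jnp.float32(100)))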

vwxyzjn (Owner, Author) commented Jun 27, 2022

The latest commit fixes two stupid bugs; we can now match the exact same performance :)

[screenshot]

vwxyzjn (Owner, Author) commented Jun 27, 2022 (quoting @ikostrikov above):

I'm not sure if

obs = obs.at[step].set(x)

is indeed in-place inside of jit. I think in this specific case it still creates a new array. I think it's truly in-place only for specific use cases. For example, when memory is donated (on TPU and GPU only). Could you double check that?

Maybe the documentation meant that if you had created the array inside the jitted function, the operation would be in place? I tested out

print("id(obs) before", id(obs))
obs, dones, actions, logprobs, values, action, logprob, entropy, value, key = get_action_and_value(
    next_obs, next_done, obs, dones, actions, logprobs, values, step, agent_params, key
)
print("id(obs) after", id(obs))

which gives

id(obs) before 140230683526704
id(obs) after 140230683590064

ikostrikov replied:

@vwxyzjn yes, I think it's either for arrays created inside of jit or donated arguments.
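A small assumed example (not from the PR) of the donation point: without donation the jitted function hands back a fresh buffer, while jax.jit(..., donate_argnums=...) lets XLA reuse the input's memory on backends that support it (GPU/TPU).

import jax
import jax.numpy as jnp

def write_step(obs, x, step):
    # Functionally pure; whether the buffer is reused depends on donation.
    return obs.at[step].set(x)

write_jit = jax.jit(write_step)                           # the input buffer stays alive
write_donated = jax.jit(write_step, donate_argnums=(0,))  # obs may be updated in place (GPU/TPU only)

obs = jnp.zeros((128, 4))
x = jnp.ones(4)
obs = write_donated(obs, x, 0)  # after donation, the old obs buffer must not be reused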

Comment on the GAE computation:

advantages = advantages.at[:].set(0.0)  # reset advantages
next_value = critic.apply(agent_params.critic_params, next_obs).squeeze()
lastgaelam = 0
for t in reversed(range(args.num_steps)):
A reviewer commented:

I was looking through your code to get an idea of how other people write RL algorithms in JAX (and how far they jit things), and I think this might be an issue during the first compile step: the for loop will basically be unrolled, and when I tried this the compile time was very long, especially if args.num_steps is big.

I ended up using jax.lax.scan and replaced the loop like this (the code doesn't fit yours exactly, but the idea is there):

    # Assumes values has num_steps + 1 entries (a bootstrap value appended),
    # while rewards and dones have num_steps entries; N = args.num_steps.
    not_dones = ~dones

    value_diffs = gamma * values[1:] * not_dones - values[:-1]
    deltas = rewards + value_diffs

    def body_fun(gae, t):
        gae = deltas[t] + gamma * gae_lambda * not_dones[t] * gae
        return gae, gae

    indices = jnp.arange(N)[::-1]  # iterate t = N - 1, ..., 0
    gae, advantages = jax.lax.scan(body_fun, 0.0, indices)
    advantages = advantages[::-1]  # flip back to forward time order

This also avoids using the .at and .set functions (whose performance I'm still not sure about). Maybe this is useful.

Another reviewer replied:

You can use reverse=True in the scan so you don't have to flip it.
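Putting the two suggestions together, a hedged sketch (adapted from the snippet above, not the PR's exact code) of GAE with a reverse scan; reverse=True walks t from the last step to the first but stacks the outputs in forward order, so no final flip is needed:

import jax
import jax.numpy as jnp

def compute_gae(rewards, values, dones, gamma=0.99, gae_lambda=0.95):
    # Assumes values has num_steps + 1 entries (bootstrap value appended);
    # rewards and dones have num_steps entries each.
    not_dones = 1.0 - dones
    deltas = rewards + gamma * values[1:] * not_dones - values[:-1]

    def body_fun(gae, t):
        gae = deltas[t] + gamma * gae_lambda * not_dones[t] * gae
        return gae, gae

    _, advantages = jax.lax.scan(
        body_fun, jnp.float32(0.0), jnp.arange(rewards.shape[0]), reverse=True
    )
    return advantages

Under jit, the scan body is traced once rather than unrolled num_steps times, which is where the compile-time saving comes from.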

vwxyzjn changed the title from "Jax ppo envpool" to "JAX + PPO + EnvPool + MuJoCo" on Jul 12, 2022.
vwxyzjn changed the title from "JAX + PPO + EnvPool + MuJoCo" to "PPO + JAX + EnvPool + MuJoCo" on Jul 12, 2022.
nico-bohlinger commented:

Jitting the epochs inside update_ppo() results in extremely high start-up times for high epoch values and doesn't provide any speed-up once it's finally running.
Moving the epoch loop into the main function would fix that, like:

for _ in range(args.update_epochs):
    agent_state, loss, pg_loss, v_loss, approx_kl, key = update_ppo(agent_state, storage, key)

Comment on lines +202 to +206:
envs = gym.wrappers.ClipAction(envs)
envs = gym.wrappers.NormalizeObservation(envs)
envs = gym.wrappers.TransformObservation(envs, lambda obs: np.clip(obs, -10, 10))
envs = gym.wrappers.NormalizeReward(envs)
envs = gym.wrappers.TransformReward(envs, lambda reward: np.clip(reward, -10, 10))
vwxyzjn (Owner, Author) commented:

It is desirable to implement these in JAX, which should speed up training and will allow us to use the XLA interface in the future.
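A hedged sketch of what a pure-JAX observation normalizer could look like (RunningStats, update_stats, and normalize_obs are illustrative names, not the PR's API); it mirrors NormalizeObservation plus the clip in TransformObservation:

from typing import NamedTuple

import jax.numpy as jnp

class RunningStats(NamedTuple):
    mean: jnp.ndarray
    var: jnp.ndarray
    count: jnp.ndarray

def update_stats(stats: RunningStats, batch: jnp.ndarray) -> RunningStats:
    # Parallel update of the running mean/variance from a batch of observations.
    batch_mean = batch.mean(axis=0)
    batch_var = batch.var(axis=0)
    batch_count = batch.shape[0]
    delta = batch_mean - stats.mean
    total = stats.count + batch_count
    new_mean = stats.mean + delta * batch_count / total
    m_a = stats.var * stats.count
    m_b = batch_var * batch_count
    new_var = (m_a + m_b + delta**2 * stats.count * batch_count / total) / total
    return RunningStats(mean=new_mean, var=new_var, count=total)

def normalize_obs(stats: RunningStats, obs: jnp.ndarray, clip: float = 10.0) -> jnp.ndarray:
    return jnp.clip((obs - stats.mean) / jnp.sqrt(stats.var + 1e-8), -clip, clip)

The stats would be carried through the rollout like the other state, e.g. initialized as RunningStats(jnp.zeros(obs_dim), jnp.ones(obs_dim), jnp.float32(1e-4)); reward normalization could follow the same pattern.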

51616 (Collaborator) commented Nov 25, 2022

I think it's worth changing to lax.scan and fori_loop. Removing the for loop within rollout increases the speed quite a bit and significantly reduces the compilation time. I can make a pull request for this (and for compute_gae and update_ppo as well). I compared the original rollout and the lax.scan implementation and got the following results:

# Original for loop
Total data collection time: 135.69225978851318 seconds
Total data collection time without compilation: 98.75351285934448 seconds
Approx. compilation time: 36.93875765800476 seconds
# with lax.scan
Total data collection time: 60.91851544380188 seconds
Total data collection time without compilation: 60.029022455215454 seconds
Approx. compilation time: 0.8895087242126465 seconds

The command used is: python cleanrl/ppo_atari_envpool_xla_jax.py --env-id Breakout-v5 --total-timesteps 500000 --num-envs 32

Note: the training code was removed because the collection time correlates with avg_episodic_length, which depends on the random exploration and training dynamics. Removing the training part ensures that the numbers in this test only reflect the rollout function.

vwxyzjn (Owner, Author) commented Nov 25, 2022

@51616 thanks for raising this issue. Could you share the snippet that derived these numbers?

Does lax.scan reduce the rollout time after compilation is finished? nvm, I misread something. It's interesting that the rollout time after compilation is much faster; that would be a good reason to consider using scan. Would you mind preparing the PR?

51616 (Collaborator) commented Nov 26, 2022

@vwxyzjn Here's the code

    def step_once(carry, step, env_step_fn):
        (agent_state, episode_stats, next_obs, next_done, storage, key, handle) = carry
        storage, action, key = get_action_and_value(agent_state, next_obs, next_done, storage, step, key)
        episode_stats, handle, (next_obs, reward, next_done, _) = env_step_fn(episode_stats, handle, action)
        storage = storage.replace(rewards=storage.rewards.at[step].set(reward))
        return ((agent_state, episode_stats, next_obs, next_done, storage, key, handle), None)
    
    def rollout(agent_state, episode_stats, next_obs, next_done, storage, key, handle, global_step,
                step_once_fn, max_steps):
        
        (agent_state, episode_stats, next_obs, next_done, storage, key, handle), _ = jax.lax.scan(
            step_once_fn,
            (agent_state, episode_stats, next_obs, next_done, storage, key, handle),
            jnp.arange(max_steps),  # step indices consumed by step_once for the storage writes
        )
        
        global_step += max_steps * args.num_envs
        return agent_state, episode_stats, next_obs, next_done, storage, key, handle, global_step
    
    rollout_fn = partial(rollout,
                         step_once_fn=partial(step_once, env_step_fn=step_env_wrappeed),
                         max_steps=args.num_steps)
    
    for update in range(1, args.num_updates + 1):
        update_time_start = time.time()
        agent_state, episode_stats, next_obs, next_done, storage, key, handle, global_step = rollout_fn(
            agent_state, episode_stats, next_obs, next_done, storage, key, handle, global_step
        )
        if update == 1:
            start_time_wo_compilation = time.time()
        print("SPS:", int(global_step / (time.time() - start_time)))
        writer.add_scalar("charts/SPS", int(global_step / (time.time() - start_time)), global_step)
        print("SPS_update:", int(args.num_envs * args.num_steps / (time.time() - update_time_start)))
        writer.add_scalar(
            "charts/SPS_update", int(args.num_envs * args.num_steps / (time.time() - update_time_start)), global_step
        )
    print("Total data collection time:", time.time() - start_time, "seconds")
    print("Total data collection time without compilation:", time.time() - start_time_wo_compilation, "seconds")
    print("Approx. compilation time:", start_time_wo_compilation - start_time, "seconds")
    envs.close()
    writer.close()

I can make a PR for this. I also think we should use the output of lax.scan instead of replacing values in place. It might look something like this:

    def step_once(carry, step, env_step_fn):
        (agent_state, episode_stats, obs, done, key, handle) = carry
        action, logprob, value, key = get_action_and_value(agent_state, obs, key)
        
        episode_stats, handle, (next_obs, reward, next_done, _) = env_step_fn(episode_stats, handle, action)
        
        storage = Storage(
            obs=obs,
            actions=action,
            logprobs=logprob,
            dones=done,
            values=value,
            rewards=reward,
            returns=jnp.zeros_like(reward),
            advantages=jnp.zeros_like(reward),
        )
        
        return ((agent_state, episode_stats, next_obs, next_done, key, handle), storage)
    
    def rollout(agent_state, episode_stats, next_obs, next_done, key, handle,
                step_once_fn, max_steps):
        
        (agent_state, episode_stats, next_obs, next_done, key, handle), storage = jax.lax.scan(
            step_once_fn,
            (agent_state, episode_stats, next_obs, next_done, key, handle), (), max_steps)
        
        return agent_state, episode_stats, next_obs, next_done, key, handle, storage
    
    rollout_fn = partial(rollout,
                         step_once_fn=partial(step_once, env_step_fn=step_env_wrappeed),
                         max_steps=args.num_steps)
    
    for update in range(1, args.num_updates + 1):
        update_time_start = time.time()
        agent_state, episode_stats, next_obs, next_done, key, handle, storage = rollout_fn(
            agent_state, episode_stats, next_obs, next_done, key, handle
        )
        if update == 1:
            start_time_wo_compilation = time.time()
        global_step += args.num_steps * args.num_envs
        ...

The code is a bit cleaner and uses the output from lax.scan directly.
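A tiny illustration (hypothetical shapes, not the PR's code) of why this works: lax.scan stacks each per-step output along a new leading axis, so the returned storage pytree already has fields shaped (num_steps, num_envs, ...), matching the (num_steps, num_envs) layout of the storage buffers above.

import jax
import jax.numpy as jnp

def f(carry, _):
    carry = carry + 1
    return carry, jnp.full((8,), carry)  # pretend per-step output of shape (num_envs,)

_, stacked = jax.lax.scan(f, jnp.int32(0), None, length=128)
print(stacked.shape)  # (128, 8) -> (num_steps, num_envs)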

vwxyzjn (Owner, Author) commented Jan 13, 2023:

[screenshot]

pseudo-rnd-thoughts (Collaborator) commented:

@vwxyzjn Was there any reason why this wasn't merged in the end?

vwxyzjn (Owner, Author) commented Aug 30, 2023

Nothing really. If you'd like, feel free to take on the PR :)
