
[Feature Request] Implement Recurrent SAC #201

Open

masterdezign opened this issue Aug 2, 2023 · 17 comments
Labels
enhancement New feature or request

Comments


masterdezign commented Aug 2, 2023

🚀 Feature

Hi!

I would like to implement a recurrent soft actor-critic (SAC). Would this be a sensible contribution?

Motivation

I actually need this algorithm in my projects.

Pitch

The SB3 ecosystem would benefit from an additional recurrent off-policy algorithm. As a new contributor, I might need a little guidance, though.

Alternatives

An alternative would be another off-policy algorithm using LSTM.

Additional context

No response

Checklist

  • I have checked that there is no similar issue in the repo
  • If I'm requesting a new feature, I have proposed alternatives
masterdezign added the enhancement label on Aug 2, 2023

araffin commented Aug 3, 2023

Hello,
this would definitely be a good addition to SB3 Contrib.

Make sure to read the contributing guide carefully.
You might have a look at the R2D2 paper (https://paperswithcode.com/method/r2d2) and at https://github.com/zhihanyang2022/off-policy-continuous-control.

For benchmarking, the best would be to use the "NoVel" envs that are available in the RL Zoo (see https://wandb.ai/sb3/no-vel-envs/reports/PPO-vs-RecurrentPPO-aka-PPO-LSTM-on-environments-with-masked-velocity-SB3-Contrib---VmlldzoxOTI4NjE4).
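
For context, the "NoVel" environments hide the velocity components of the observation so that the task becomes partially observable. A minimal sketch of such a wrapper is below; the class name and indices are purely illustrative (the RL Zoo ships its own implementation):

import gymnasium as gym
import numpy as np


class MaskVelocityWrapper(gym.ObservationWrapper):
    """Illustrative sketch only: zero out velocity entries so the
    environment becomes partially observable."""

    def __init__(self, env: gym.Env, velocity_indices):
        super().__init__(env)
        self.velocity_indices = np.asarray(velocity_indices)

    def observation(self, obs):
        obs = np.asarray(obs, dtype=np.float32).copy()
        obs[self.velocity_indices] = 0.0  # hide the velocities from the agent
        return obs


# Pendulum-v1 observations are [cos(theta), sin(theta), theta_dot],
# so masking index 2 removes the angular velocity.
env = MaskVelocityWrapper(gym.make("Pendulum-v1"), velocity_indices=[2])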

@masterdezign (Author)

Thanks for the references. I will check them out and come back.

@masterdezign (Author)

Just a quick update: I plan to do this by the end of 2023 when I have some free time. Currently, I have three higher-priority projects.


masterdezign commented Dec 21, 2023

Status update:

  1. I've checked the resources that you provided, thanks a lot. I find the code to be nicely written and quite easy to understand.
  2. I managed to solve PendulumNoVel-v1 from rl_zoo3==2.1.0 with RSAC.
  3. However, I have trouble solving MountainCarContinuousNoVel-v0 and LunarLanderContinuousNoVel-v2 using the code above with different configurations.
  4. Therefore, I may need to modify the algorithm (e.g. sharing the same LSTM state between the actor and the critics, using overlapping segments, etc.).
  5. EDIT: I've checked your benchmarks and realized that LunarLander may require more timesteps (it takes up to 5M for PPO LSTM).


masterdezign commented Dec 28, 2023

Comparison

I've got these results on LunarLanderContinuousNoVel-v2 (rl_zoo3==2.1.0) using RSAC with a shared LSTM state (rsac_s) and plain RSAC. In both cases, the configuration was the same:

# ====================================================================================
# gin macros
# ====================================================================================

capacity = 1000
batch_size = 10
segment_len = 50

num_epochs = 500
num_steps_per_epoch = 10000
update_after = 10000
num_test_episodes_per_epoch = 10

# ====================================================================================
# applying the parameters
# ====================================================================================

import basics.replay_buffer_recurrent
import basics.run_fns

basics.replay_buffer_recurrent.RecurrentReplayBuffer.capacity = %capacity
basics.replay_buffer_recurrent.RecurrentReplayBuffer.batch_size = %batch_size
basics.replay_buffer_recurrent.RecurrentReplayBuffer.segment_len = %segment_len

basics.run_fns.train.num_epochs = %num_epochs
basics.run_fns.train.num_steps_per_epoch = %num_steps_per_epoch
basics.run_fns.train.num_test_episodes_per_epoch = %num_test_episodes_per_epoch
basics.run_fns.train.update_after = %update_after

Each run took about 20 hours to compute. Perhaps this rsac_s architecture can now be implemented in sb3-contrib.

[figure: rsac_s results on LunarLanderContinuousNoVel-v2]

araffin added the Maintainers on vacation label on Dec 28, 2023
araffin removed the Maintainers on vacation label on Jan 15, 2024

araffin commented Jan 15, 2024

Hello,
thanks for reporting the updated results =).
Do you have a diagram to share for RSAC vs RSAC_s maybe? (that would make things easier to discuss)

Did you also manage to solve the mountain car problem?

@masterdezign (Author)

Did you also manage to solve the mountain car problem?

I believe so, yes. Let me render the env to verify, since the rewards are not the same for MountainCarContinuousNoVel-v0 (continuous action space) and MountainCar-v0 (discrete action space).

@masterdezign (Author)

Loosely speaking, here they are:


           RSAC                        RSAC_S

     ┌─────┐    ┌─────┐               ┌─────┐
     │ RNN │    │ RNN │             ┌─┤ RNN │..
     └──┬──┘    └──┬──┘             │ └─────┘ .
        │          │                │         .
        │          │                │         .
    ┌───┴───┐  ┌───┴────┐       ┌───┴───┐  ┌────────┐
    │ Actor │  │ Critic │       │ Actor │  │ Critic │
    └───────┘  └────────┘       └───────┘  └────────┘

As you can see, RSAC_S shares the RNN state between the actor and the critic, but only the actor can change the RNN state, whereas in RSAC the actor and critics have their own RNN states.
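
A minimal PyTorch sketch of the RSAC_S idea (class and variable names are illustrative, not taken from the actual implementation): a single LSTM is advanced by the actor path, and the critics consume the same features without backpropagating through the recurrent weights.

import torch
import torch.nn as nn


class SharedRecurrentEncoder(nn.Module):
    # Illustrative sketch of RSAC_S: one LSTM owned by the actor;
    # the critics reuse its output but do not backpropagate through it.
    def __init__(self, obs_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)

    def forward(self, obs_seq: torch.Tensor, hidden):
        # obs_seq: (batch, seq_len, obs_dim); hidden: (h, c) from the previous call
        latent_pi, hidden = self.lstm(obs_seq, hidden)
        # Same features for the critics, but detached: only the actor's loss
        # updates the LSTM, and only the actor advances the recurrent state.
        latent_vf = latent_pi.detach()
        return latent_pi, latent_vf, hidden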


araffin commented Jan 16, 2024

As you can see, RSAC_S shares the RNN state between the actor and the critic, but only the actor can change the RNN state, whereas in RSAC the actor and critics have their own RNN states.

Thanks, this is similar to what is implemented for RecurrentPPO:

if self.lstm_critic is not None:
    latent_vf, lstm_states_vf = self._process_sequence(
        vf_features, lstm_states.vf, episode_starts, self.lstm_critic
    )
elif self.shared_lstm:
    # Re-use LSTM features but do not backpropagate
    latent_vf = latent_pi.detach()
    lstm_states_vf = (lstm_states_pi[0].detach(), lstm_states_pi[1].detach())
else:
    # Critic only has a feedforward network
    latent_vf = self.critic(vf_features)
    lstm_states_vf = lstm_states_pi


masterdezign commented Jan 16, 2024

Update: I just rendered MountainCarContinuousNoVel-v0 and it is not solved yet. I don't quite understand why the total reward is different between the original MountainCar-v0 env and this one. Therefore, I need to check MountainCarContinuousNoVel-v0 (and MountainCarContinuous-v0) in detail.


araffin commented Jan 16, 2024

I can help you with that: the continuous version has a deceptive reward and needs quite a lot of exploration noise.

EDIT: working hyperparameters: https://github.com/DLR-RM/rl-baselines3-zoo/blob/8cecab429726d7e6aaebd261d26ed8fc23b7d948/hyperparams/sac.yml#L2
or https://github.com/DLR-RM/rl-baselines3-zoo/blob/8cecab429726d7e6aaebd261d26ed8fc23b7d948/hyperparams/td3.yml#L5-L6

(note: the gSDE exploration is important there, otherwise a high OU noise would work too)
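
For reference, a minimal sketch of how gSDE is switched on for SAC in SB3; the values below are illustrative only, the tuned hyperparameters are in the linked sac.yml:

from stable_baselines3 import SAC

# Illustrative settings, not the zoo's tuned configuration.
model = SAC(
    "MlpPolicy",
    "MountainCarContinuous-v0",
    use_sde=True,        # state-dependent exploration for consistent noise
    sde_sample_freq=-1,  # -1: sample the noise matrix only at the start of a rollout
    verbose=1,
)
model.learn(total_timesteps=50_000)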

@masterdezign (Author)

Thanks, I'll check those hyperparameters.


masterdezign commented Jan 24, 2024

Indeed, having use_sde=True seems to help solve the MountainCarContinuous-v0 environment. I am curious which gSDE ingredient exactly helps.

Edit: I also tried nearby hyperparameters, and indeed the gSDE contribution seems to be non-negligible.


araffin commented Jan 24, 2024

I am curious which gSDE ingredient does exactly help.

The consistent exploration. To solve this task, you need to build up momentum; a bang-bang-like strategy is one way (this is discussed a bit more in the first version of the paper: https://arxiv.org/pdf/2005.05719v1.pdf).

Edit: I also tried nearby hyperparameters and indeed gSDE contribution seems to be non-negligible.

I did a full hyperparameter search, and with gSDE many configurations work (more than half of those tested): https://github.com/DLR-RM/rl-baselines3-zoo/blob/sde/logs/report_sde_MountainCarContinuous-v0_500-trials-50000-tpe-median_1581693633.csv

@masterdezign (Author)

I am currently checking the two strategies for RNN state initialization proposed in the R2D2 paper (stored state and burn-in).
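
For reference, a minimal sketch of the burn-in strategy (the function and argument names are illustrative, not taken from R2D2 or the work-in-progress code): replay the hidden state saved at storage time, warm it up on the first part of the segment without gradients, then unroll the rest normally for the loss.

import torch

def burn_in_unroll(lstm, obs_seq, stored_hidden, burn_in: int):
    # Start from the hidden state stored with the segment ("stored state"),
    # then warm it up for `burn_in` steps without gradients ("burn-in").
    hidden = stored_hidden
    if burn_in > 0:
        with torch.no_grad():
            _, hidden = lstm(obs_seq[:, :burn_in], hidden)
    # Gradients flow only through the remainder of the segment.
    latent, hidden = lstm(obs_seq[:, burn_in:], hidden)
    return latent, hidden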


masterdezign commented Feb 4, 2024

So far I've got this: a recurrent replay buffer with overlapping chunks supporting the SB3 interface. I also wrote a specification (test) to reduce future surprises.

https://gist.github.com/masterdezign/47b3c6172dd1624bb9a7ef23cbc79c8c

The limitation is n_envs = 1. This can be resolved in the future.
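
For the record, a minimal sketch of the sampling idea (not the gist's actual code): store whole episodes and draw fixed-length segments whose start indices may overlap, with n_envs = 1 as in the gist.

import numpy as np


class TinyRecurrentBuffer:
    # Illustrative sketch only; the real buffer lives in the gist above.
    def __init__(self, segment_len: int = 50):
        self.segment_len = segment_len
        self.episodes = []  # each entry: list of (obs, action, reward, next_obs, done)

    def add_episode(self, transitions):
        if len(transitions) >= self.segment_len:
            self.episodes.append(transitions)

    def sample_segment(self, rng=np.random):
        episode = self.episodes[rng.randint(len(self.episodes))]
        # Any start index is allowed, so consecutive samples from the same
        # episode may overlap, which gives the "overlapping chunks" behaviour.
        start = rng.randint(len(episode) - self.segment_len + 1)
        return episode[start:start + self.segment_len]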

@masterdezign (Author)

Hi! I didn't obtain good results, and then I had to put the project on hold. I plan to resume working on it tomorrow.
