
[Feature Request] Implement Recurrent SAC #201

Open

masterdezign opened this issue Aug 2, 2023 · 17 comments
Labels
enhancement New feature or request

Comments


masterdezign commented Aug 2, 2023

🚀 Feature

Hi!

I would like to implement a recurrent soft actor-critic (SAC). Would this be a sensible contribution?

Motivation

I actually need this algorithm in my projects.

Pitch

The SB3 ecosystem would benefit from an additional recurrent off-policy algorithm. As a new contributor, I might need a little guidance, though.

Alternatives

An alternative would be another off-policy algorithm using LSTM.

Additional context

No response

Checklist

  • I have checked that there is no similar issue in the repo
  • If I'm requesting a new feature, I have proposed alternatives
masterdezign added the enhancement label on Aug 2, 2023

araffin commented Aug 3, 2023

Hello,
this would definitely be a good addition to SB3 Contrib.

Make sure to read the contributing guide carefully.
You might have a look at the R2D2 paper (https://paperswithcode.com/method/r2d2) and at https://github.com/zhihanyang2022/off-policy-continuous-control.

For benchmarking, the best would be to use the "NoVel" envs that are available in the RL Zoo (see https://wandb.ai/sb3/no-vel-envs/reports/PPO-vs-RecurrentPPO-aka-PPO-LSTM-on-environments-with-masked-velocity-SB3-Contrib---VmlldzoxOTI4NjE4).
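
For context, the "NoVel" environments hide the velocity components of the observation so that the task becomes partially observable. A minimal sketch of such a wrapper is below; the class name and indices are purely illustrative (the RL Zoo ships its own implementation):

import gymnasium as gym
import numpy as np


class MaskVelocityWrapper(gym.ObservationWrapper):
    """Illustrative sketch only: zero out velocity entries so the
    environment becomes partially observable."""

    def __init__(self, env: gym.Env, velocity_indices):
        super().__init__(env)
        self.velocity_indices = np.asarray(velocity_indices)

    def observation(self, obs):
        obs = np.asarray(obs, dtype=np.float32).copy()
        obs[self.velocity_indices] = 0.0  # hide the velocities from the agent
        return obs


# Pendulum-v1 observations are [cos(theta), sin(theta), theta_dot],
# so masking index 2 removes the angular velocity.
env = MaskVelocityWrapper(gym.make("Pendulum-v1"), velocity_indices=[2])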

@masterdezign (Author)

Thanks for the references. I will check them out and come back.

@masterdezign (Author)

Just a quick update: I plan to do this by the end of 2023 when I have some free time. Currently, I have three higher-priority projects.


masterdezign commented Dec 21, 2023

Status update:

  1. I've checked the resources that you provided, thanks a lot. I find the code to be nicely written and quite easy to understand.
  2. I managed to solve PendulumNoVel-v1 from rl_zoo3==2.1.0 with RSAC.
  3. However, I have trouble solving MountainCarContinuousNoVel-v0 and LunarLanderContinuousNoVel-v2 using the code above with different configurations.
  4. Therefore, I may need to modify the algorithm (e.g. sharing the same LSTM state between the actor and the critics, using overlapping segments, etc.).
  5. EDIT: I've checked your benchmarks and realized that LunarLander may require more timesteps (it takes up to 5M for PPO LSTM).


masterdezign commented Dec 28, 2023

Comparison

I've got these results on LunarLanderContinuousNoVel-v2 (rl_zoo3==2.1.0) using RSAC with a shared LSTM state (rsac_s) and plain RSAC. In both cases, the configuration was the same:

# ====================================================================================
# gin macros
# ====================================================================================

capacity = 1000
batch_size = 10
segment_len = 50

num_epochs = 500
num_steps_per_epoch = 10000
update_after = 10000
num_test_episodes_per_epoch = 10

# ====================================================================================
# applying the parameters
# ====================================================================================

import basics.replay_buffer_recurrent
import basics.run_fns

basics.replay_buffer_recurrent.RecurrentReplayBuffer.capacity = %capacity
basics.replay_buffer_recurrent.RecurrentReplayBuffer.batch_size = %batch_size
basics.replay_buffer_recurrent.RecurrentReplayBuffer.segment_len = %segment_len

basics.run_fns.train.num_epochs = %num_epochs
basics.run_fns.train.num_steps_per_epoch = %num_steps_per_epoch
basics.run_fns.train.num_test_episodes_per_epoch = %num_test_episodes_per_epoch
basics.run_fns.train.update_after = %update_after

Each run took about 20 hours to compute. Perhaps this rsac_s architecture can now be implemented in sb3-contrib.

[figure: rsac_s results on LunarLanderContinuousNoVel-v2]

araffin added the Maintainers on vacation label on Dec 28, 2023
araffin removed the Maintainers on vacation label on Jan 15, 2024

araffin commented Jan 15, 2024

Hello,
thanks for reporting the updated results =).
Do you have a diagram to share for RSAC vs RSAC_s maybe? (that would make things easier to discuss)

Did you also manage to solve the mountain car problem?

@masterdezign (Author)

Did you also manage to solve the mountain car problem?

I believe so, yes. Let me render the env to verify, since the rewards are not the same for MountainCarContinuousNoVel-v0 (continuous action space) and MountainCar-v0 (discrete action space).

@masterdezign (Author)

Loosely speaking, here they are:


           RSAC                        RSAC_S

     ┌─────┐    ┌─────┐               ┌─────┐
     │ RNN │    │ RNN │             ┌─┤ RNN │..
     └──┬──┘    └──┬──┘             │ └─────┘ .
        │          │                │         .
        │          │                │         .
    ┌───┴───┐  ┌───┴────┐       ┌───┴───┐  ┌────────┐
    │ Actor │  │ Critic │       │ Actor │  │ Critic │
    └───────┘  └────────┘       └───────┘  └────────┘

As you can see, RSAC_S shares the RNN state between the actor and the critic, but only the actor can change the RNN state, whereas in RSAC the actor and critics have their own RNN states.
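
A minimal PyTorch sketch of the RSAC_S idea (class and variable names are illustrative, not taken from the actual implementation): a single LSTM is advanced by the actor path, and the critics consume the same features without backpropagating through the recurrent weights.

import torch
import torch.nn as nn


class SharedRecurrentEncoder(nn.Module):
    # Illustrative sketch of RSAC_S: one LSTM owned by the actor;
    # the critics reuse its output but do not backpropagate through it.
    def __init__(self, obs_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)

    def forward(self, obs_seq: torch.Tensor, hidden):
        # obs_seq: (batch, seq_len, obs_dim); hidden: (h, c) from the previous call
        latent_pi, hidden = self.lstm(obs_seq, hidden)
        # Same features for the critics, but detached: only the actor's loss
        # updates the LSTM, and only the actor advances the recurrent state.
        latent_vf = latent_pi.detach()
        return latent_pi, latent_vf, hidden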


araffin commented Jan 16, 2024

As you can see, RSAC_S shares the RNN state between the actor and the critic, but only the actor can change the RNN state, whereas in RSAC the actor and critics have their own RNN states.

Thanks, this is similar to what is implemented for RecurrentPPO:

if self.lstm_critic is not None:
    latent_vf, lstm_states_vf = self._process_sequence(
        vf_features, lstm_states.vf, episode_starts, self.lstm_critic
    )
elif self.shared_lstm:
    # Re-use LSTM features but do not backpropagate
    latent_vf = latent_pi.detach()
    lstm_states_vf = (lstm_states_pi[0].detach(), lstm_states_pi[1].detach())
else:
    # Critic only has a feedforward network
    latent_vf = self.critic(vf_features)
    lstm_states_vf = lstm_states_pi


masterdezign commented Jan 16, 2024

Update: I just rendered MountainCarContinuousNoVel-v0 and it is not solved yet. I don't quite understand why the total reward is different between the original MountainCar-v0 env and this one. Therefore, I need to check MountainCarContinuousNoVel-v0 (and MountainCarContinuous-v0) in detail.


araffin commented Jan 16, 2024

I can help you with that: the continuous version has a deceptive reward and needs quite a lot of exploration noise.

EDIT: working hyperparameters: https://github.com/DLR-RM/rl-baselines3-zoo/blob/8cecab429726d7e6aaebd261d26ed8fc23b7d948/hyperparams/sac.yml#L2
or https://github.com/DLR-RM/rl-baselines3-zoo/blob/8cecab429726d7e6aaebd261d26ed8fc23b7d948/hyperparams/td3.yml#L5-L6

(note: the gSDE exploration is important there, otherwise a high OU noise would work too)
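
For reference, a minimal sketch of how gSDE is switched on for SAC in SB3; the values below are illustrative only, the tuned hyperparameters are in the linked sac.yml:

from stable_baselines3 import SAC

# Illustrative settings, not the zoo's tuned configuration.
model = SAC(
    "MlpPolicy",
    "MountainCarContinuous-v0",
    use_sde=True,        # state-dependent exploration for consistent noise
    sde_sample_freq=-1,  # -1: sample the noise matrix only at the start of a rollout
    verbose=1,
)
model.learn(total_timesteps=50_000)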

@masterdezign (Author)

Thanks, I'll check those hyperparameters.


masterdezign commented Jan 24, 2024

Indeed, having use_sde=True seems to help solve the MountainCarContinuous-v0 environment. I am curious which gSDE ingredient exactly helps.

Edit: I also tried nearby hyperparameters, and indeed the gSDE contribution seems to be non-negligible.


araffin commented Jan 24, 2024

I am curious which gSDE ingredient does exactly help.

The consistent exploration. To solve this task, you need to build up momentum; a bang-bang-like strategy is one way (this is discussed a bit more in the first version of the paper: https://arxiv.org/pdf/2005.05719v1.pdf).

Edit: I also tried nearby hyperparameters and indeed gSDE contribution seems to be non-negligible.

I did a full hyperparameter search, and with gSDE many configurations work (more than half of those tested): https://github.com/DLR-RM/rl-baselines3-zoo/blob/sde/logs/report_sde_MountainCarContinuous-v0_500-trials-50000-tpe-median_1581693633.csv

@masterdezign (Author)

I am currently checking the two strategies for RNN state initialization proposed in the R2D2 paper (stored state and burn-in).
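
For reference, a minimal sketch of the burn-in strategy (the function and argument names are illustrative, not taken from R2D2 or the work-in-progress code): replay the hidden state saved at storage time, warm it up on the first part of the segment without gradients, then unroll the rest normally for the loss.

import torch

def burn_in_unroll(lstm, obs_seq, stored_hidden, burn_in: int):
    # Start from the hidden state stored with the segment ("stored state"),
    # then warm it up for `burn_in` steps without gradients ("burn-in").
    hidden = stored_hidden
    if burn_in > 0:
        with torch.no_grad():
            _, hidden = lstm(obs_seq[:, :burn_in], hidden)
    # Gradients flow only through the remainder of the segment.
    latent, hidden = lstm(obs_seq[:, burn_in:], hidden)
    return latent, hidden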


masterdezign commented Feb 4, 2024

So far I've got this: a recurrent replay buffer with overlapping chunks supporting the SB3 interface. I also wrote a specification (test) to reduce future surprises.

https://gist.github.com/masterdezign/47b3c6172dd1624bb9a7ef23cbc79c8c

The limitation is n_envs = 1. This can be resolved in the future.
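
For the record, a minimal sketch of the sampling idea (not the gist's actual code): store whole episodes and draw fixed-length segments whose start indices may overlap, with n_envs = 1 as in the gist.

import numpy as np


class TinyRecurrentBuffer:
    # Illustrative sketch only; the real buffer lives in the gist above.
    def __init__(self, segment_len: int = 50):
        self.segment_len = segment_len
        self.episodes = []  # each entry: list of (obs, action, reward, next_obs, done)

    def add_episode(self, transitions):
        if len(transitions) >= self.segment_len:
            self.episodes.append(transitions)

    def sample_segment(self, rng=np.random):
        episode = self.episodes[rng.randint(len(self.episodes))]
        # Any start index is allowed, so consecutive samples from the same
        # episode may overlap, which gives the "overlapping chunks" behaviour.
        start = rng.randint(len(episode) - self.segment_len + 1)
        return episode[start:start + self.segment_len]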

@masterdezign (Author)

Hi! I didn't obtain good results, and then I had to put the project on hold. I plan to resume working on it tomorrow.
