
Implemented CrossQ #243

Open · wants to merge 14 commits into master

Conversation

@danielpalen commented May 4, 2024

This PR implements CrossQ (https://openreview.net/pdf?id=PczQtTsTIX), a novel off-policy deep RL algorithm that carefully uses batch normalization and removes target networks to achieve state-of-the-art sample efficiency at much lower computational cost, since it does not require large update-to-data ratios.

Description

This is a PyTorch implementation based on the original JAX implementation (https://github.com/adityab/CrossQ).
The following plot shows that its performance matches both the results reported in the original paper and the authors' open-source SBX implementation (evaluated on 10 seeds).

[Figure: learning curves comparing this implementation against the paper's reported results and the SBX implementation]
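For readers unfamiliar with the paper, the core of the critic update can be sketched as follows. This is a minimal illustration using SB3-style names, not the exact PR code; replay_data, batch_size, gamma and ent_coef are assumed to be in scope, and critic is assumed to return one Q-value tensor per critic head, as in SB3.

import torch as th
import torch.nn.functional as F

with th.no_grad():
    # Sample next actions from the current policy (there is no target network)
    next_actions, next_log_prob = actor.action_log_prob(replay_data.next_observations)

# Key trick: one joint forward pass over the concatenated (s, a) and (s', a')
# batches, so the batch norm statistics are computed over the mixture of both.
all_obs = th.cat([replay_data.observations, replay_data.next_observations], dim=0)
all_actions = th.cat([replay_data.actions, next_actions], dim=0)
all_q_values = th.cat(critic(all_obs, all_actions), dim=1)  # (2 * batch_size, n_critics)
q_values, next_q_values = th.split(all_q_values, batch_size, dim=0)

with th.no_grad():
    # TD target from the same critic; no_grad blocks the gradient through it
    next_q, _ = next_q_values.min(dim=1, keepdim=True)
    target_q = replay_data.rewards + (1.0 - replay_data.dones) * gamma * (
        next_q - ent_coef * next_log_prob.reshape(-1, 1)
    )

critic_loss = sum(
    F.mse_loss(q_values[:, i : i + 1], target_q) for i in range(q_values.shape[1])
)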

Context

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)

Checklist:

  • I've read the CONTRIBUTION guide (required)
  • The functionality/performance matches that of the source (required for new training algorithms or training-related features).
  • I have updated the tests accordingly (required for a bug fix or a new feature).
  • I have included an example of using the feature (required for new features).
  • I have included baseline results (required for new training algorithms or training-related features).
  • I have updated the documentation accordingly.
  • I have updated the changelog accordingly (required).
  • I have reformatted the code using make format (required)
  • I have checked the codestyle using make check-codestyle and make lint (required)
  • I have ensured make pytest and make type both pass. (required)

Note: we are using a maximum length of 127 characters per line

@araffin self-requested a review May 4, 2024 19:43
@danielpalen (Author) commented:

@araffin in my initial PR it seems one code style check was failing, sorry about that. I fixed it and it passes on my machine now. I hope it will go through now :)

.. autosummary::
  :nosignatures:

  MlpPolicy
@araffin (Member) commented:

Could you add at least the multi-input policy? (so we can try it in combination with HER)
Normally, only the feature extractor should need to change.

And what do you think about adding CnnPolicy?

@danielpalen (Author) replied:

This is a good point. I looked into it but have not added it yet. If I am not mistaken, this would also require some changes to the CrossQ train() function, since concatenating and splitting the batches would then need some control flow depending on the policy used.
For simplicity's sake (for now), and since I did not have time to evaluate the multi-input policy, I did not add it yet.
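For illustration, that control flow could be isolated in a small helper along these lines (a hypothetical sketch; the cat_obs name is made up and not part of the PR):

import torch as th

def cat_obs(obs, next_obs):
    # Concatenate current and next observation batches along the batch
    # dimension, handling both dict observations (MultiInputPolicy) and
    # plain tensors.
    if isinstance(obs, dict):
        return {key: th.cat([obs[key], next_obs[key]], dim=0) for key in obs}
    return th.cat([obs, next_obs], dim=0)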

latent_pi_net = create_mlp(features_dim, -1, net_arch, activation_fn)

if batch_norm:
    # If batch norm is enabled, add torch.nn.BatchNorm1d layers before every linear layer
@araffin (Member) commented:

What do you think about updating create_mlp to allow passing a normalization layer/dropout?

Similar to what is done in DLR-RM/stable-baselines3#1036 and proposed in DLR-RM/stable-baselines3#1069
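As a rough sketch of that direction (the pre_linear_modules parameter is an assumption loosely following the linked proposals, not the actual SB3 API at the time):

from typing import List, Optional, Type

import torch.nn as nn

def create_mlp(
    input_dim: int,
    output_dim: int,
    net_arch: List[int],
    activation_fn: Type[nn.Module] = nn.ReLU,
    pre_linear_modules: Optional[List[Type[nn.Module]]] = None,
) -> List[nn.Module]:
    # Optionally insert e.g. nn.BatchNorm1d before every linear layer
    pre_linear_modules = pre_linear_modules or []
    layers: List[nn.Module] = []
    dims = [input_dim] + list(net_arch)
    for in_dim, out_dim in zip(dims[:-1], dims[1:]):
        layers += [module(in_dim) for module in pre_linear_modules]
        layers += [nn.Linear(in_dim, out_dim), activation_fn()]
    if output_dim > 0:
        layers += [module(dims[-1]) for module in pre_linear_modules]
        layers.append(nn.Linear(dims[-1], output_dim))
    return layers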

@danielpalen (Author) replied:

I think this would make sense, because the way I implemented it right now is not really that nice.


with th.no_grad():
    # Select action according to policy
    self.actor.set_training_mode(False)
@araffin (Member) commented:

Is that needed? self.actor.set_training_mode(False) is already set above.
Or did you mean self.actor.set_training_mode(True)?

@danielpalen (Author) replied:

I added more mode calls than strictly needed. The reason was that I wanted to be very explicit about which mode needs to be used where. I think using the wrong BN mode is one of the big gotchas and sources of error when implementing CrossQ. Since this should serve as a PyTorch reference to aid others when they implement it themselves, I think it is helpful to make the mode very explicit, to clear up possible confusion.
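As a generic illustration of the gotcha (plain PyTorch, unrelated to the PR code): in train mode a forward pass updates the BN running statistics, while in eval mode it only uses them.

import torch as th
import torch.nn as nn

bn = nn.BatchNorm1d(4)
x = th.randn(32, 4)

bn.train()              # forward pass updates running_mean / running_var
_ = bn(x)

bn.eval()               # forward pass uses the running statistics, unchanged
with th.no_grad():
    _ = bn(x)

print(bn.running_mean)  # reflects only the train-mode batch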

self.critic.optimizer.step()

# Compute actor loss
self.critic.set_training_mode(False)
@araffin (Member) commented:

Not needed now, but maybe for later: we should probably deactivate only the batch norm (for instance, if dropout is used, we want it to stay active there).

@danielpalen (Author) replied:

Same thinking as above. But you are right: if we additionally want to use dropout, we should adapt this. Maybe we can just have a set_bn_training_mode function, as sketched below. However, that would be very specific to our use case.
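A minimal sketch of what such a helper could look like (the set_bn_training_mode name follows the suggestion above; it is not part of the PR):

import torch.nn as nn

def set_bn_training_mode(model: nn.Module, training: bool) -> None:
    # Toggle train/eval only for batch norm layers; all other modules
    # (e.g. dropout) keep their current mode.
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.train(training)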

@araffin (Member) commented May 6, 2024

Thanks a lot for the implementation =)

I'll try it later in the week, but how is it in terms of runtime? (SAC vs CrossQ in PyTorch)

@danielpalen (Author) commented May 12, 2024

No worries :)

I just pushed most things you requested. I'll add some more specific responses directly to the questions above.

how is it in terms of runtime? (SAC vs CrossQ in PyTorch)

It seems to be quite a bit slower than the SAC baseline (and the JAX implementation as well).
For 4M steps on HumanoidStandup, SAC took around 12 hours whereas CrossQ took 22 hours. I am not sure if there are some PyTorch implementation details that could help with speed.

@araffin (Member) commented May 17, 2024

I suspect something is wrong with the current implementation (I'm currently investigating whether it is due to my changes or not).
My setting:

BipedalWalker-v3:
  n_timesteps: !!float 2e5
  policy: 'MlpPolicy'
  buffer_size: 300000
  gamma: 0.98
  learning_starts: 10000
  policy_kwargs: "dict(net_arch=dict(pi=[256, 256], qf=[1024, 1024]))"

With the RL Zoo CLI for both SBX and SB3 (see the SBX readme for support):

python train.py --algo crossq --env BipedalWalker-v3 -P --verbose 0 -param n_envs:30 gradient_steps:30 -n 200000

I'm getting much better results with SBX...
I hope it is not the Adam parameters.

@danielpalen (Author) commented:

Did you figure out what the issue is? I was at ICRA until last week, so I didn't have time, but if you haven't found it yet I can also have a look.

Before I pushed my last commit I benchmarked it, and the results looked as expected there.

@araffin added the "Maintainers on vacation" label May 28, 2024
Development

Successfully merging this pull request may close these issues.

[Feature Request] Implement CrossQ