Fix -inconsistency of layers/net_arch usage in cnn policy between different algorithms #587

Open
seheevic wants to merge 2 commits into master

Conversation

@seheevic

@seheevic seheevic commented Nov 28, 2019

Description

Motivation and Context

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)

Checklist:

  • I've read the CONTRIBUTION guide (required)
  • I have updated the changelog accordingly (required).
  • My change requires a change to the documentation.
  • I have updated the tests accordingly (required for a bug fix or a new feature).
  • I have updated the documentation accordingly.
  • I have ensured pytest and pytype both pass.

@araffin
Collaborator

araffin commented Nov 28, 2019

Hello,
thanks for the PR, please fill the PR template completely.

@araffin araffin added the "PR template not filled" label Nov 28, 2019
@seheevic
Author

seheevic commented Nov 28, 2019

I'm not sure whether this is a breaking change or not.
But if this PR is approved, newly created DQN models will be different from the previous version.
And there could be some inconsistency with legacy models.

@araffin araffin removed the "PR template not filled" label Nov 28, 2019
@araffin
Collaborator

araffin commented Nov 28, 2019

This is a breaking change, and I would then change DDPG/SAC/TD3 for consistency, so we can fix #526.

EDIT: layers should default to [] in the case of a CNN and to [64, 64] otherwise, to match the previous behavior.
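A minimal sketch of what that default could look like (hypothetical, not the actual patch; feature_extraction is the flag the policies already use to pick between CNN and MLP):

# Hypothetical default handling inside a policy constructor:
# no extra fully-connected layers after a CNN extractor, and the
# previous [64, 64] MLP default otherwise.
if layers is None:
    layers = [] if feature_extraction == "cnn" else [64, 64]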

@seheevic
Author

seheevic commented Nov 28, 2019

@araffin
Are you saying that, instead of this change, you are planning not to append layers after cnn_extractor for DDPG/SAC/TD3?
That would be a feasible option for now, but is it more plausible to guide users to implement/use cnn_extractor as a (literally) feature-extracting function?
I'm a little confused.

@Miffyli
Collaborator

Miffyli commented Nov 28, 2019

I agree with @araffin: we should change DDPG/SAC/TD3 to match the behavior of the other policies (DQN, A2C, PPO, etc.). I.e. if cnn_extractor is given, it is the whole "feature extraction" network, without any additional layers before the policy/value estimation. This way users only have to change the cnn_extractor part if they have images, and do not have to remember to adjust the layers parameter.
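For illustration, a custom extractor along these lines might look like the following sketch. This is not from the PR: my_cnn and its layer sizes are made up, and the conv/conv_to_fc/linear helpers are the ones nature_cnn itself uses in stable_baselines/common/policies.py.

import tensorflow as tf
from stable_baselines.a2c.utils import conv, conv_to_fc, linear

def my_cnn(scaled_images, **kwargs):
    # The returned tensor is used directly as the features, so any
    # fully-connected layers you want must live here, not in `layers`.
    activ = tf.nn.relu
    layer_1 = activ(conv(scaled_images, 'c1', n_filters=32, filter_size=8, stride=4, **kwargs))
    layer_2 = activ(conv(layer_1, 'c2', n_filters=64, filter_size=4, stride=2, **kwargs))
    flattened = conv_to_fc(layer_2)
    return activ(linear(flattened, 'fc1', n_hidden=256))

It would then be passed via policy_kwargs=dict(cnn_extractor=my_cnn), the same way nature_cnn is used by default.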

@seheevic
Author

seheevic commented Nov 29, 2019

@araffin @Miffyli
Ok, I got it.
I have just one opinion about this when changing DDPG/SAC/TD3 to ignore layers/net_arch.
I didn't fully understand those algorithms, but when it comes to DQN with the dueling functionality, you have to allow some additional layers after both the value stream and the advantage stream, because with the current implementation we can only append layers to the value stream, as I described in my comment on issue #526. (Actually, that's why I think it would be better to allow layers after cnn_extractor.)
Anyway, may I close this PR?

@Miffyli
Collaborator

Miffyli commented Dec 1, 2019

You can close this PR and open a new one, given that the approach and purpose are different (fixing one algorithm vs. normalizing the behavior of most algorithms).
On second thought, you can continue this work in this PR so we have all the discussion in one place. Just rename the PR to something more descriptive.

I do not understand what you mean by "you have to allow adding layers after both advantage and value stream". Do you refer to how vf and pi are usually separated in the actor-critic algorithms when defining the network structure? Dueling DQN could use a similar approach, but this would require too many modifications for not enough benefit.

You can add layers after cnn_extractor by modifying the cnn_extractor itself. If layers also added layers to the network, things would get unnecessarily convoluted, in my opinion.

@seheevic
Author

seheevic commented Dec 2, 2019

@Miffyli
When it comes to Dueling DQN (enabled by default), let me explain more concretely.

In the cnn policy, the network is composed of four parts:
(1) cnn extractor
(2) V(s): state_score
(3) A(s,a): action_score
(4) Q(s,a): final aggregation of V(s) + A(s,a)
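
For reference, the aggregation in (4), as in the dueling DQN paper and the snippet below, centers the advantages before adding them:

Q(s, a) = V(s) + \left( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \right)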

Defining the scope of the cnn-extractor has consequences for this dueling network, which I faced recently. As you can see from the code below, (1)-(3)-(4) is considered the main network and (2) is an optional sub-network for dueling. In the CNN case, layers are appended only to V(s) in the current implementation, which is consistent with the conclusion so far (i.e. the cnn extractor has to describe the whole network, so layers are not appended to A(s,a) in the main network). I guess (and I can see from my own test case) that A(s,a) needs to be a more complex (or at least equally complex) function than V(s).

with tf.variable_scope("model", reuse=reuse):
    with tf.variable_scope("action_value"):
        if feature_extraction == "cnn":
            extracted_features = cnn_extractor(self.processed_obs, **kwargs)
            action_out = extracted_features
        else:
            extracted_features = tf.layers.flatten(self.processed_obs)
            action_out = extracted_features
            for layer_size in layers:
                action_out = tf_layers.fully_connected(action_out, num_outputs=layer_size, activation_fn=None)
                if layer_norm:
                    action_out = tf_layers.layer_norm(action_out, center=True, scale=True)
                action_out = act_fun(action_out)
        action_scores = tf_layers.fully_connected(action_out, num_outputs=self.n_actions, activation_fn=None)

    if self.dueling:
        with tf.variable_scope("state_value"):
            state_out = extracted_features
            for layer_size in layers:
                state_out = tf_layers.fully_connected(state_out, num_outputs=layer_size, activation_fn=None)
                if layer_norm:
                    state_out = tf_layers.layer_norm(state_out, center=True, scale=True)
                state_out = act_fun(state_out)
            state_score = tf_layers.fully_connected(state_out, num_outputs=1, activation_fn=None)
        action_scores_mean = tf.reduce_mean(action_scores, axis=1)
        action_scores_centered = action_scores - tf.expand_dims(action_scores_mean, axis=1)
        q_out = state_score + action_scores_centered
    else:
        q_out = action_scores

So, about this question:

Do you refer to how vf and pi are usually separated in the actor-critic algorithms when defining the network structure? Dueling DQN could use a similar approach, but this would require too many modifications for not enough benefit.

Actually, I didn't mean that we have to be able to define V(s) and A(s,a) separately. I wanted to point out that we need to append layers not only to V(s) but also to A(s,a) in Dueling DQN. If you are going to ignore layers/net_arch entirely after cnn_extractor, don't we need some other named parameter for Dueling DQN? I agree that too much modification would cost more than it is worth, but I feel that the current implementation of Dueling DQN is a somewhat degenerate structure with respect to the CNN policy (more layers, like [256, 256, 256], do not increase the capacity of A(s,a)). Of course we can fully describe the whole dueling network in a custom cnn-extractor as you said, but then we cannot utilize this option that has already been implemented.

@seheevic seheevic changed the title Fix - After cnn features, flexible layers was not appended Fix -inconsistency of layers/net_arch usage w.r.t cnn policy between different algorithms Dec 2, 2019
@seheevic seheevic changed the title Fix -inconsistency of layers/net_arch usage w.r.t cnn policy between different algorithms Fix -inconsistency of layers/net_arch usage in cnn policy between different algorithms Dec 2, 2019
@Miffyli
Collaborator

Miffyli commented Jan 2, 2020

Ah yes, I see the issue here. I do not think we should add extra layers only for one branch of the dueling architecture. The best thing is to follow the structure/diagrams of the original paper, where you feed the output of the CNN extractor to two different fully connected layers (V(s) and A(s, a)) and then combine the two. I.e. the for layer_size in layers: block needs to be removed, and instead you just map straight from extracted_features to state_score with one layer. You can include this change in this PR, as it relates to the whole layers/net_arch mismatch.
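A minimal sketch of that change, assuming the variable names from the snippet above (not the final patch):

# Drop the per-stream `layers` loops and map the shared features
# directly to the two heads, then aggregate as before.
action_scores = tf_layers.fully_connected(extracted_features, num_outputs=self.n_actions, activation_fn=None)
if self.dueling:
    state_score = tf_layers.fully_connected(extracted_features, num_outputs=1, activation_fn=None)
    action_scores_mean = tf.reduce_mean(action_scores, axis=1)
    q_out = state_score + (action_scores - tf.expand_dims(action_scores_mean, axis=1))
else:
    q_out = action_scores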

As for the other algorithms (DDPG/SAC/TD3): Follow this architecture of either only using cnn_extractor or layers/net_arch, but not both at the same time (i.e. no appending extra fully-connected layers after CNN extractor).

PS: Sorry for the long silence.

@seheevic
Author

seheevic commented Jan 14, 2020

[image: dueling DQN architecture diagram, highlighting a 512-unit fully connected head for each of V(s) and A(s, a) after the convolutions]
I need to make sure I understood. Do you mean that the output of the CNN extractor for dueling DQN consists of two heads, one for V(s) and another for A(s, a), as in the attached image above? The default CNN extractor has the signature nature_cnn(scaled_images, **kwargs), and by means of kwargs it looks possible. But then, that means default/custom CNN extractors could return one head or two heads (or concatenated).

@Miffyli
Collaborator

Miffyli commented Jan 14, 2020

For some reason I completely missed the layers you have highlighted in the figure (the 512-unit ones for each head), my bad ^^.

I think cnn_extractor should do at most one fully-connected layer (like it does now), but indeed it does not match the paper's implementation. But if we take the features straight from the convolutions and add hardcoded layer sizes, it will be a headache for users who want to use different networks.

@araffin
Comment: should we be true to the original implementation (conv -> two heads of 512 -> advantage and value), which is more complicated and may have hardcoded values, or keep the current method (conv -> fc 512 -> advantage and value), which is cleaner/easier to implement?

@araffin
Collaborator

araffin commented Jan 15, 2020

which is cleaner/easier to implement?

@Miffyli I don't have much time for that issue right now, so I trust you to make the right decision ;) (unless you really want my point of view, in which case you will need to wait some time)

@Miffyli
Collaborator

Miffyli commented Jan 15, 2020

@seheevic

I think we should deviate from the paper a little bit by splitting the 512-unit output from nature_cnn directly into advantages and value, without any extra layers. The other options include hardcoded values (ewww), and if we include the effects of layers/net_arch we collide with the original purpose of this PR. I.e. state_score is computed with just one mapping from extracted_features, just like action_scores are. This difference from the original paper should be mentioned in the documentation, though.

@mmcenta
Copy link

mmcenta commented Mar 26, 2020

Hello, I just ran into this inconsistency while implementing a project this week; we basically copied the code from DQN's FeedForwardPolicy and adapted it to our needs. I am free now and can work on this issue if you guys want some help!

@Miffyli
Collaborator

Miffyli commented Mar 26, 2020

@mmcenta

Which issue are you referring to here, exactly? The confusion with how net_arch/layers behave across different algorithms, or the stuff related to the dueling network architecture?

@mmcenta

mmcenta commented Mar 27, 2020

@Miffyli Now that I have read the issue a second time, I don't think it is worth changing the behavior of the layers parameter now that v3 is in the oven.

What I wanted to tackle is the fact that the deepq policies do not implement the net_arch parameter. What do you think about using an approach similar to what is done in common/policies.py and updating the documentation?

@Miffyli
Collaborator

Miffyli commented Mar 27, 2020

I agree that we should avoid bigger changes to the current version. I am confused, though: what exactly would net_arch offer to DQN? I see the naming issue and the unclear behaviour (as discussed in the issues), but for DQN there is no need for e.g. separate networks.

@mmcenta

mmcenta commented Mar 28, 2020

I can come up with three reasons for the change:

  • Uniformity: while working on a recent project, I used the Custom Policy documentation page as a reference. I quickly realized that the DQN policies did not conform to that page, and I had to dig into the source code to figure out how to use the policies I needed.
  • Flexibility: as discussed in this PR and in this issue from keras-rl, one can find different ways to split the final layers of the DQNs depending on the implementation. I think that stripping CNN extractors of FC layers and implementing the net_arch parameter is a good way of accommodating both approaches.
  • Ease of use: if you want control over the common layers of a DQN network, you have to implement a custom CNN extractor and add the common FC layers yourself. This is more cumbersome than simply passing the desired architecture via net_arch (see the sketch after this list). I believe this is the issue @seheevic tried to address.
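
For concreteness, a hedged sketch of the kind of call this would enable. This is hypothetical: the deepq policies do not currently accept net_arch, and the layer sizes are made up.

from stable_baselines import DQN
from stable_baselines.deepq.policies import CnnPolicy

# Hypothetical: shared fully-connected layers after the CNN features,
# specified the same way common/policies.py interprets net_arch.
model = DQN(CnnPolicy, "BreakoutNoFrameskip-v4",
            policy_kwargs=dict(net_arch=[256, 256]))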

As for the usefulness of having this level of control, I will admit I'm not well suited to answer, as I'm still inexperienced with deep RL. I think the use case of dueling architectures for deeper MLP policies would require it, but I don't know how common they are. If you believe it is not worth it, we could at least update the Custom Policies page.

@Miffyli
Collaborator

Miffyli commented Mar 31, 2020

@mmcenta

Sorry for the late reply! Such messy times ^^

  1. The docs on custom policies should probably mention which algorithms they apply to. The DQN page does note that it uses its own policies (https://stable-baselines.readthedocs.io/en/master/modules/dqn.html), not the common ones.
  2. (also 3) Personally, I would say having one place for the full network architecture (cnn_extractor, in the case of CNN policies) is more flexible. You can add and modify layers as you please in this one spot. In the case of MLP policies, you still have the layers parameter (but indeed, this was the inconsistent part: it not being named net_arch).

But I do see the benefit of net_arch: in the dueling architecture, the shared features are fed to two separate layers (of size N), which are then mapped to Q-like values and the state value. This is closer to actor-critics with policy and value networks, so in that sense it could be done with net_arch. However, this will get messier with the option to enable/disable the dueling architecture.
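By analogy with the actor-critic convention net_arch=[<shared sizes>..., dict(pi=[...], vf=[...])] from common/policies.py, a dueling spec could hypothetically look like the line below; the q/v keys are made up for illustration and exist in no release.

# Hypothetical net_arch for dueling DQN: no shared FC layers after the
# CNN features, then one 512-unit head each for advantages and state value.
net_arch = [dict(q=[512], v=[512])]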
