
[Question] What modifications do I need to mask the inputs similar to how MaskablePPO masks the outputs? #168

rllyryan opened this issue Mar 21, 2023 · 0 comments
Labels: question (Further information is requested)


rllyryan commented Mar 21, 2023

❓ Question

Hi sb3-contrib community,

Scenario
Currently, I have run into an issue whereby I am not able to generalize the length of my observations (1-dimensional Box) and actions (Discrete), as my environment changes in size according to the data provided to populate it. For context, I am trying to solve a slotting optimization problem in warehouses, where different warehouses can have varying numbers of bins and items. As you might imagine, the actions are swaps, and the number of actions can differ greatly between warehouses. As for the inputs, their length also depends on the number of bins and items, since I am providing essential information such as the demand for each item, the distance of each bin from the packing zone, etc.

I have looked pretty much everywhere (to no avail), but ChatGPT suggested the use of padding. This is how it goes:


  1. Set a maximum length for the observation and action spaces
  2. Populate the environment
  3. Pad the observations and actions where necessary
  4. Use masking on the actions and observations so that the padded values do not affect backpropagation (a rough sketch of how I picture this follows after the list)
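
To make sure I understand the suggestion, here is a minimal sketch of what I picture for steps 1–4, assuming a Gymnasium-style environment API. The bounds `MAX_OBS_LEN`/`MAX_NUM_ACTIONS`, the `num_valid_actions` attribute on my underlying environment, and the wrapper itself are my own inventions, not part of any library:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

# Hypothetical upper bounds, chosen for the largest warehouse I expect to see
MAX_OBS_LEN = 512      # padded observation length
MAX_NUM_ACTIONS = 256  # fixed (padded) number of swap actions


class PaddingWrapper(gym.Wrapper):
    """Pad variable-length observations and expose a fixed-size Discrete action space."""

    def __init__(self, env):
        super().__init__(env)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(MAX_OBS_LEN,), dtype=np.float32)
        self.action_space = spaces.Discrete(MAX_NUM_ACTIONS)

    def _pad(self, obs):
        # Real entries first, zeros for the padded tail
        padded = np.zeros(MAX_OBS_LEN, dtype=np.float32)
        padded[: len(obs)] = obs
        return padded

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        return self._pad(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        return self._pad(obs), reward, terminated, truncated, info

    def action_masks(self):
        # Only the first `num_valid_actions` swaps exist in this particular
        # warehouse; everything above that is padding and must be masked out.
        mask = np.zeros(MAX_NUM_ACTIONS, dtype=bool)
        mask[: self.env.num_valid_actions] = True
        return mask
```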

I found the apply_masking function too (in sb3_contrib/common/maskable/distributions.py):

```python
def apply_masking(self, masks: Optional[np.ndarray]) -> None:
    """
    Eliminate ("mask out") chosen categorical outcomes by setting their probability to 0.

    :param masks: An optional boolean ndarray of compatible shape with the distribution.
        If True, the corresponding choice's logit value is preserved. If False, it is set
        to a large negative value, resulting in near 0 probability. If masks is None, any
        previously applied masking is removed, and the original logits are restored.
    """
    if masks is not None:
        device = self.logits.device
        self.masks = th.as_tensor(masks, dtype=th.bool, device=device).reshape(self.logits.shape)
        HUGE_NEG = th.tensor(-1e8, dtype=self.logits.dtype, device=device)

        logits = th.where(self.masks, self._original_logits, HUGE_NEG)
    else:
        self.masks = None
        logits = self._original_logits

    # Reinitialize with updated logits
    super().__init__(logits=logits)

    # self.probs may already be cached, so we must force an update
    self.probs = logits_to_probs(self.logits)
```
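
For context, my understanding is that these masks reach apply_masking() through MaskablePPO's rollout collection, e.g. via the ActionMasker wrapper. A minimal usage sketch (make_warehouse_env is a placeholder for my own environment factory, not a library function):

```python
import numpy as np
from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker


def mask_fn(env) -> np.ndarray:
    # Delegate to the env's own action_masks(); must return one boolean
    # per discrete action (True = action allowed)
    return env.action_masks()


env = ActionMasker(make_warehouse_env(), mask_fn)  # make_warehouse_env: my own factory
model = MaskablePPO("MlpPolicy", env, verbose=1)
model.learn(10_000)  # masks are fetched every step and passed down to apply_masking()
```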

Edit: I found the forward() method below in the sb3_contrib/common/maskable/policies.py file. Can the apply_masking() method be applied to the features that are produced by the self.extract_features() method?

```python
def forward(
    self,
    obs: th.Tensor,
    deterministic: bool = False,
    action_masks: Optional[np.ndarray] = None,
) -> Tuple[th.Tensor, th.Tensor, th.Tensor]:
    """
    Forward pass in all the networks (actor and critic)

    :param obs: Observation
    :param deterministic: Whether to sample or use deterministic actions
    :param action_masks: Action masks to apply to the action distribution
    :return: action, value and log probability of the action
    """
    # Preprocess the observation if needed
    features = self.extract_features(obs)
    if self.share_features_extractor:
        latent_pi, latent_vf = self.mlp_extractor(features)
    else:
        pi_features, vf_features = features
        latent_pi = self.mlp_extractor.forward_actor(pi_features)
        latent_vf = self.mlp_extractor.forward_critic(vf_features)
    # Evaluate the values for the given observations
    values = self.value_net(latent_vf)
    distribution = self._get_action_dist_from_latent(latent_pi)
    if action_masks is not None:
        distribution.apply_masking(action_masks)
    actions = distribution.get_actions(deterministic=deterministic)
    log_prob = distribution.log_prob(actions)
    return actions, values, log_prob
```
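
To make the input-masking half of my question concrete, here is roughly what I imagine, assuming "masking the inputs" means zeroing the padded observation entries before they reach the MLP (done with a custom features extractor rather than apply_masking, which operates on action logits). BaseFeaturesExtractor is the real SB3 base class; MaskedInputExtractor and the [values | mask] observation layout are my own sketch:

```python
import torch as th
from torch import nn
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

MAX_OBS_LEN = 512  # same hypothetical bound as in the padding sketch above


class MaskedInputExtractor(BaseFeaturesExtractor):
    """Sketch: the observation is [padded values | validity mask] (2 * MAX_OBS_LEN entries).

    Multiplying the values by the mask forces the padded entries to exactly zero
    before the first linear layer, so they contribute nothing to W @ x.
    """

    def __init__(self, observation_space, features_dim: int = 128):
        super().__init__(observation_space, features_dim)
        self.net = nn.Sequential(nn.Linear(MAX_OBS_LEN, features_dim), nn.ReLU())

    def forward(self, observations: th.Tensor) -> th.Tensor:
        values, mask = observations[:, :MAX_OBS_LEN], observations[:, MAX_OBS_LEN:]
        return self.net(values * mask)
```

It would be plugged in via policy_kwargs=dict(features_extractor_class=MaskedInputExtractor). Then again, if the padded entries are already exactly zero, they contribute nothing to the first linear layer anyway, so maybe explicit input masking is redundant and only the action masking truly matters?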

Question/Request
This brought my attention to MaskablePPO, where the masking of actions has already been implemented. As such, I would like to ask for advice on how to modify the algorithm to mask the inputs as well. Please provide filepaths to the relevant source code for the modifications too! Thank you :)

Remarks
If anyone has encountered this problem and solved it, could you share your insights on how you did it? And if anyone has ideas on how to generalize the inputs without the need for masking, please do comment away!

Checklist

  • I have checked that there is no similar issue in the repo
  • I have read the documentation
  • If code there is, it is minimal and working
  • If code there is, it is formatted using the markdown code blocks for both code and stack traces.