
DQN: Action mask is not compatible in vectorized environments #186

Open
nargizsentience opened this issue Jan 29, 2024 · 3 comments
Labels: bug (Something isn't working)

@nargizsentience (Contributor)

What version of AgileRL are you using?
v0.1.19

What operating system and processor architecture are you using?
Windows, 64-bit operating system, x64-based processor

What did you do?
I attempted to add vectorization to the self-play script to train a DQN agent in a PettingZoo AEC env. However, DQN's getAction appears to assume a single action mask shared across all environments, which results in a mismatch between the shapes of the mask and the data fed into np.ma.array.

Steps to reproduce the behaviour:

  1. Run the following reproduction script:
import numpy as np
from agilerl.algorithms.dqn import DQN

state_dim = [4]
action_dim = 2
one_hot = True

dqn = DQN(state_dim, action_dim, one_hot)
state = np.array([[1], [1]])
action_mask = np.array([[0, 1], [1, 0]])
epsilon = 1
action = dqn.getAction(state, epsilon, action_mask)
print(action)
  2. See error:
  File "C:\....\AgileRL\.venv\lib\site-packages\numpy\ma\core.py", line 2900, in __new__
    raise MaskError(msg % (nd, nm))
numpy.ma.core.MaskError: Mask and data not compatible: data size is 2, mask size is 4.

What did you expect to see?
A list of actions, [1, 0], where each action corresponds to its respective action mask and state.

What did you see instead? Describe the bug.
numpy.ma.core.MaskError: Mask and data not compatible: data size is 2, mask size is 4.

Additional context
The current getAction() seems to assume that action_mask is a 1D array whose size corresponds to action_dim. It then samples n actions, where n is the number of observations (state.size()[0]). However, when action_mask is not a 1D array, the mask does not have the same shape as np.arange(0, self.action_dim).
I fixed this issue locally by modifying getAction() (see the sketch below):

  1. Expand the dimensions if action_mask.ndim == 1.
  2. Then randomly sample one action for each action mask.
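For illustration, a minimal standalone sketch of the idea (the helper name sample_masked_actions is made up here, and a mask value of 1 marks a legal action, as in the reproduction script above; this is not the exact getAction() code):

import numpy as np

def sample_masked_actions(action_mask, action_dim, num_samples):
    """Sample one random legal action per action mask (sketch only)."""
    action_mask = np.asarray(action_mask)
    # 1. Expand a single shared mask so every sample gets its own row.
    if action_mask.ndim == 1:
        action_mask = np.tile(action_mask, (num_samples, 1))
    # 2. Randomly sample one legal action per row of the mask.
    all_actions = np.arange(action_dim)
    return np.array(
        [np.random.choice(all_actions[row.astype(bool)]) for row in action_mask]
    )

# With the masks from the reproduction script this prints [1 0].
print(sample_masked_actions(np.array([[0, 1], [1, 0]]), action_dim=2, num_samples=2))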
@nargizsentience added the bug label on Jan 29, 2024
@nicku-a (Contributor) commented Feb 12, 2024

Hey, how did you vectorize the environment? PZ doesn't offer any wrapper for vectorizing envs; we have one for parallel-API PZ envs. This could make a good contribution to the framework!

As for the action mask not working for vectorized envs, that's because it wasn't originally designed to support them, but with more clarity on how your vectorization works I'm sure we can easily implement it.

@nargizsentience (Contributor, Author)

Hi! I referenced your wrapper for the PZ Parallel Env and implemented a wrapper for the AEC API. It might not be fully vectorized, because it waits until the episodes in all environments are terminated or truncated (done) before starting a new set of episodes. The main changes are (see the sketch after this list):

  • I keep track of which environments are done in an array.
  • SubprocVecEnv's step() takes a list of actions and the corresponding environment indices, instead of sending actions to all environments, to handle the case where some episodes are already done.
  • If worker() receives "last", it separates action_mask from the observation and sends obs, action_mask, reward, terminated, truncated, info for all environments back. Since I have a custom training loop, having separate obs and action_mask is not a problem, but they could be put into a dict as well.
  • I update the done array based on which envs are done.
  • I extract the observations and action masks for the environments that are not done and call the agent's getAction to compute actions, so action_mask[i] corresponds to state[i].
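Roughly, my training loop then looks like the sketch below (simplified; the wrapper methods last() and step(actions, env_indices) only mirror the description above and are not part of AgileRL):

import numpy as np

def run_episode_set(vec_env, agent, epsilon, num_envs):
    """Simplified sketch of the custom training loop described above."""
    dones = np.zeros(num_envs, dtype=bool)  # which environments have finished
    obs, action_mask, reward, terminated, truncated, info = vec_env.last()

    while not dones.all():
        active = np.where(~dones)[0]  # environments still running
        # getAction pairs action_mask[i] with obs[i] for every active env
        actions = agent.getAction(obs[active], epsilon, action_mask[active])
        # step() only receives actions for the environments that are not done
        vec_env.step(actions, active)
        obs, action_mask, reward, terminated, truncated, info = vec_env.last()
        dones |= np.logical_or(terminated, truncated)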

@gonultasbu (Contributor)

Hey @nargizsentience, is there a particular reason the environments don't auto-reset themselves? I remember writing the parallelization with auto-resetting in mind, so this is a personal question.

Given your description, I don't think amending getAction to make it vectorization-compatible would be too hard.
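For context, the auto-reset pattern would be roughly the following sketch (worker_last is a hypothetical helper; it relies on PettingZoo AEC envs emptying env.agents once every agent is done):

def worker_last(env):
    """Auto-resetting "last" call inside a hypothetical worker (sketch only)."""
    # Reset immediately once every agent is done, instead of waiting for the
    # other vectorized environments to finish their episodes.
    if not env.agents:
        env.reset()
    obs, reward, terminated, truncated, info = env.last()
    return obs, reward, terminated, truncated, info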
