Questions Related to Multiagent Evaluation #186

Open
paehal opened this issue Jan 16, 2024 · 8 comments
Labels: question (Further information is requested)

Comments


paehal commented Jan 16, 2024

I am currently testing a task involving two agents, each moving to a specified target location. To facilitate this, I've configured self.TARGET_POS in the __init__ of MultiHoverAviary.py to set a different target location for each episode, as follows:

# x/y offsets: each entry is one of two pre-sampled values (one in [-0.75, -0.3],
# one in [0.3, 0.75]), so offsets near zero are avoided; z offset: uniform in [0.3, 0.75]
col0_1 = np.random.choice([np.random.uniform(-0.75, -0.3), np.random.uniform(0.3, 0.75)], size=(num_drones, 2))
col2 = np.random.uniform(0.3, 0.75, size=(num_drones, 1))
result_array = np.hstack([col0_1, col2])   # shape: (num_drones, 3)
self.TARGET_POS = self.INIT_XYZS + result_array

Consequently, I have modified obs_12 in BaseRLAviary.py's _computeObs function to include a length-3 vector related to the target position, renaming it to obs_15.
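
For context, a rough sketch of the kind of change described (the exact state-vector slicing and the action-buffer handling in the real BaseRLAviary._computeObs are simplified here, and the 15-dimensional layout is an assumption):

import numpy as np

def _computeObs(self):
    # Sketch only: append a length-3 relative-target vector to each drone's
    # 12-dimensional kinematic observation, giving obs_15.
    obs_15 = np.zeros((self.NUM_DRONES, 15))
    for i in range(self.NUM_DRONES):
        state = self._getDroneStateVector(i)
        pos, rpy, vel, ang_vel = state[0:3], state[7:10], state[10:13], state[13:16]
        target_rel = self.TARGET_POS[i, :] - pos
        obs_15[i, :] = np.hstack([pos, rpy, vel, ang_vel, target_rel])
    return obs_15.astype('float32')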

Initially, I conducted training and inference with a single agent and confirmed successful learning towards the desired values. Then, when training with multiple agents (N=2) on the same task, the reward obtained during learning was approximately twice that of a single agent, suggesting nearly ideal learning.

The reward function is set as follows:

# Reward: for each drone, up to 3 when it sits exactly on its target, decaying with
# the fourth power of the distance to the target (clipped at 0), summed over drones.
states = np.array([self._getDroneStateVector(i) for i in range(self.NUM_DRONES)])
ret = 0
for i in range(self.NUM_DRONES):
    ret += max(0, 3 - np.linalg.norm(self.TARGET_POS[i,:]-states[i][0:3])**4)
return ret

The issue arises in eval mode, where the system doesn't perform well. Specifically, the agents fail to approach the designated targets. Notably, agent 1 always performs better than agent 2.

After several debugging attempts, I suspect a few causes and would appreciate any insights:

  1. Model Save/Load Issue: I'm not fully versed in Stable-Baselines3's multi-agent training, but I suspect the trained model might not be loading correctly. The file size of the saved model (policy) is the same for both single- and multi-agent training. Could this mean that only one agent's model is being saved in the multi-agent setup? Perhaps only agent 1's model is saved, and only agent 1's policy is operational upon loading? (A save/load sanity check is sketched after this list.)
  2. Mean Reward Calculation: Using evaluate_policy, the mean reward is about 3600, which is comparable to the trained values. Hence, I wonder if the problem lies not in evaluate_policy but in the following predict function:
action, _states = model.predict(obs, deterministic=True)
obs, reward, terminated, truncated, info = test_env.step(action)
  3. Observation Length Change: Since I've included the target position in the observation and changed the observation length to 15, there might be necessary changes I've overlooked due to this adjustment.
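
Regarding point 1, a quick sanity check (a sketch only; the file name is hypothetical, and model/test_env are the objects from the training/evaluation script) is to save the model, reload it, and compare the policy parameters and the shape of the predicted action:

import torch as th
from stable_baselines3 import PPO

model.save("multi_hover_ppo")          # hypothetical file name
loaded = PPO.load("multi_hover_ppo")

# After a save/load round trip the policy weights should match exactly.
for (k, p_saved), (_, p_loaded) in zip(model.policy.state_dict().items(),
                                       loaded.policy.state_dict().items()):
    assert th.allclose(p_saved, p_loaded), f"Mismatch in {k}"

# With a single shared policy, the predicted action should already have the
# stacked multi-agent shape, e.g. (NUM_DRONES, 4) for RPM actions.
obs, info = test_env.reset()
action, _ = loaded.predict(obs, deterministic=True)
print(action.shape)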

Any advice on these issues would be greatly appreciated. Thank you for your time and assistance.


paehal commented Jan 17, 2024

I would like to provide some corrections and additional details related to my previous post about the multiagent experiment. To state the conclusion first, it appears that the problem I encountered with the multiagent setup also occurs in the single agent scenario.

An important detail I initially omitted is that I set ctrl_freq not to 30, but to 80. This might be a significant contributing factor to the issue. Could this change in ctrl_freq be affecting functions like sync? Any insights on this would be highly appreciated.
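
For reference, this is roughly how the evaluation-loop pacing depends on ctrl_freq in the example scripts (a sketch; the EPISODE_LEN_SEC, CTRL_FREQ, and CTRL_TIMESTEP attributes and the sync helper are assumed to match the repo's examples):

import time
from gym_pybullet_drones.utils.utils import sync  # real-time pacing helper

obs, info = test_env.reset()
start = time.time()
# The number of control steps per episode scales with ctrl_freq,
# so changing 30 -> 80 also changes how many times reward is collected.
for i in range(test_env.EPISODE_LEN_SEC * test_env.CTRL_FREQ):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = test_env.step(action)
    sync(i, start, test_env.CTRL_TIMESTEP)  # sleeps so the GUI plays back in real time
    if terminated or truncated:
        obs, info = test_env.reset()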

JacopoPan added the question (Further information is requested) label on Jan 21, 2024
JacopoPan (Member) commented

Hi @paehal

w.r.t. 1, I don't think that's the issue, but you might double-check by following the SB3 instructions to visualize the default models.

w.r.t. 2, I am a bit confused: if the evaluated policy is scoring the same as you saw in training, that tends to rule out 1. You also say that you are obtaining twice the reward of the single-agent case, which leads you to believe the learning should be successful, but then that the same problem (what problem?) "also occurs in the single agent scenario".

In general, do not assume that a high reward necessarily means the system is behaving how you desire; RL is known to "game" the simulation. Are you sure that the high reward you see can be achieved if and only if the drones move as/where you want?

Observation length and control frequency CAN affect learning and control performance, but they should not be "breaking" anything (note, however, that the number of steps per episode is proportional to the control frequency, i.e. to how many times you collect reward, so it can change the reward value per episode).
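
To make the last point concrete, a small back-of-the-envelope example (the episode length is an assumed value):

episode_len_sec = 8                       # assumed episode length in seconds
steps_at_30_hz = episode_len_sec * 30     # 240 reward terms per episode
steps_at_80_hz = episode_len_sec * 80     # 640 reward terms per episode
# With a per-step reward capped at 3 per drone (as in the reward above), the
# single-drone per-episode ceiling goes from ~720 to ~1920 just by changing ctrl_freq.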


paehal commented Jan 29, 2024

Apologies for the delayed response, and thank you for your answer. I have figured out the cause of the issue. I was setting the target location for the drone movement in the __init__ of HoverAviary.py, but I was not aware that this code runs only once per (parallel) environment, when it is constructed. I had assumed that the target location would be re-sampled every time I reset, which seems to be why the learning was not successful. I apologize for any inconvenience caused.
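
For anyone hitting the same issue, a minimal sketch of re-sampling the target at the start of every episode by overriding reset() (assuming the Gymnasium-style reset signature used by the aviary classes; the sampling mirrors the snippet in the first post):

def reset(self, seed=None, options=None):
    # Re-sample a new target for every episode, not only at construction time.
    col0_1 = np.random.choice([np.random.uniform(-0.75, -0.3), np.random.uniform(0.3, 0.75)],
                              size=(self.NUM_DRONES, 2))
    col2 = np.random.uniform(0.3, 0.75, size=(self.NUM_DRONES, 1))
    self.TARGET_POS = self.INIT_XYZS + np.hstack([col0_1, col2])
    return super().reset(seed=seed, options=options)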

I have two additional questions related to this multiagent simulation. If you know, could you please enlighten me?

  1. When conducting a multiagent simulation, I understand that each agent is trained based on the obs set in the _computeObs function. I want to separate the observational information for each agent. How can I implement this? Specifically, I want to set it up so that the drone of agent No.0 cannot obtain the position information of agent No.1's drone, and vice versa for agent No.1's drone.

  2. Is there a way to display the trajectory of the drones when checking their behavior visually with gui=True during training?

I appreciate your help and look forward to your response.

JacopoPan (Member) commented

w.r.t. 1, I think the easiest way is to modify the desired DRL agent in SB3 to have a collection of actor and critic networks (a pair for each agent), slice the observation when training, and recombine the actions when predicting/testing (effectively, you have N independent RL problems and agents, but note that the environment of each is no longer stationary).
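
A rough sketch of the slicing/recombination part at inference time (assuming the stacked observation has shape (NUM_DRONES, obs_dim), the action space is (NUM_DRONES, act_dim), and one independently trained model per agent; models, model_0, model_1, and test_env are placeholders):

import numpy as np

def predict_joint_action(models, obs, deterministic=True):
    # Slice the stacked observation per agent, query each agent's own model,
    # then stack the per-agent actions back into the joint action.
    actions = []
    for i, model in enumerate(models):
        agent_obs = obs[i, :]                 # this agent's observation only
        agent_act, _ = model.predict(agent_obs, deterministic=deterministic)
        actions.append(agent_act)
    return np.vstack(actions)                 # shape: (NUM_DRONES, act_dim)

obs, info = test_env.reset()
action = predict_joint_action([model_0, model_1], obs)
obs, reward, terminated, truncated, info = test_env.step(action)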

w.r.t. 2, you should be able to simply force the GUI for the training environment (you can do it by changing the defaults in the constructors, for example), but it would lead to incredibly slow training; I am not sure it will work too well, especially with multiple agents.
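
If the goal is just to see where the drones have been, one option is to draw a short debug line between consecutive positions at every control step (a sketch using PyBullet's addUserDebugLine; the CLIENT attribute and where you hook this in are assumptions):

import pybullet as p

def _drawTrajectories(self):
    # Call once per control step (e.g. from step()) to leave a breadcrumb trail.
    if not hasattr(self, "_last_pos"):
        self._last_pos = [self._getDroneStateVector(i)[0:3] for i in range(self.NUM_DRONES)]
    for i in range(self.NUM_DRONES):
        cur = self._getDroneStateVector(i)[0:3]
        p.addUserDebugLine(self._last_pos[i], cur,
                           lineColorRGB=[1, 0, 0] if i == 0 else [0, 0, 1],
                           lineWidth=2,
                           lifeTime=0,        # 0 keeps the line until removed
                           physicsClientId=self.CLIENT)
        self._last_pos[i] = cur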


paehal commented Jan 29, 2024

w.r.t. 1, I think the easiest way is to modify the desired DRL agent in SB3 to have a collection of actor and critic networks (a pair for each agent), slice the observation when training, and recombine the actions when predicting/testing (effectively, you have N independent RL problems and agents, but note that the environment of each is no longer stationary).

Thank you for your response. Are you suggesting that we should set up multiple models and train them, as described below? As I asked earlier, my understanding is that in a multi-agent simulation the same policy model is used for all agents. Therefore, if we set up a different model for each agent, does that mean we need to train each of them separately? In any case, it seems like this would be a fairly complex modification, wouldn't it?


# For agent 0
model_0 = PPO('MlpPolicy',
              train_env,
              # tensorboard_log=filename+'/tb/',
              verbose=1,
              batch_size=custom_batch_size,
              **custom_learning_params)

# For agent 1
model_1 = PPO('MlpPolicy',
              train_env,
              # tensorboard_log=filename+'/tb/',
              verbose=1,
              batch_size=custom_batch_size,
              **custom_learning_params)

model_0.learn(total_timesteps=int(1e7) if local else int(1e2))
model_1.learn(total_timesteps=int(1e7) if local else int(1e2))

JacopoPan (Member) commented

I would do the modification inside PPO, to create multiple independent networks operating on different parts of the obs and act vectors of the environment, but yes, it requires understanding the SB3 implementation to a certain degree of depth.


paehal commented Jan 29, 2024

Thanks, I'll ask the experts on the Stable-Baselines3 GitHub.

paehal closed this as completed on Jan 29, 2024

paehal commented Feb 2, 2024

I apologize for any confusion on my part, but I would like to clarify one thing. The _computeObs function returns the state of all agents, but does each agent make decisions based solely on its own information? Until now, I had assumed that each agent outputs actions based on the information of all agents.
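
One quick way to check this empirically (a sketch, assuming the environment exposes standard Gymnasium spaces and that a single shared SB3 policy is used):

# With one shared SB3 policy, the whole stacked observation is flattened and fed
# to that policy, and the whole stacked action comes back out; nothing restricts
# an agent's action to depend only on its own part of the observation.
print(train_env.observation_space.shape)   # e.g. (NUM_DRONES, obs_dim)
print(train_env.action_space.shape)        # e.g. (NUM_DRONES, act_dim)

obs, info = train_env.reset()
action, _ = model.predict(obs, deterministic=True)
print(obs.shape, action.shape)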

paehal reopened this on Feb 2, 2024