
[question] Hyperparameters for Roboschool HumanoidFlagrunHarder #26

Open
doviettung96 opened this issue Jun 5, 2019 · 11 comments

@doviettung96

Hi,
Currently, I have used 4 algorithms from stable-baselines on the Roboschool HumanoidFlagrunHarder task. My evaluation metric is the mean reward over 100 episodes. In short: PPO2 is perfect, A2C gets a mean reward of about 500, DDPG gets a mean reward around 0, and SAC gets a mean reward of about 280. I have been looking for hyperparameter settings in stable-baselines-zoo for A2C, DDPG, and SAC, but could only find the Bullet Humanoid env for SAC (quite close to Roboschool HFH). Thus, do you have any suggestions for A2C, DDPG, SAC on this task? The number of timesteps is 400M for on-policy methods and 20M for off-policy methods. It would be nice if you added them to the set of hyperparameters.
Thanks.

araffin added the question label Jun 5, 2019
araffin (Owner) commented Jun 5, 2019

My evaluation metric is the mean reward over 100 episodes.

You should do an evaluation on a test env after training, using deterministic=True (especially for SAC and DDPG).
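For reference, a minimal evaluation sketch along those lines (the env id and saved-model file name are illustrative, not from the zoo):

import gym
import roboschool  # registers the Roboschool envs
import numpy as np
from stable_baselines import SAC

# Load a previously trained model (hypothetical file name) and evaluate it
# on a fresh test env, using deterministic actions.
model = SAC.load("sac_hfh")
test_env = gym.make("RoboschoolHumanoidFlagrunHarder-v1")

episode_rewards = []
for _ in range(100):
    obs, done, total = test_env.reset(), False, 0.0
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, _ = test_env.step(action)
        total += reward
    episode_rewards.append(total)

print("Mean reward over 100 episodes:", np.mean(episode_rewards))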

PPO2 is perfect

What is the magnitude of the reward for PPO2?

Thus, do you have any suggestions for A2C, DDPG, SAC on this task?

Well, I would suggest running hyperparameter tuning (it is now included in the rl zoo). Random sampling + median pruner usually works quite well, given enough budget (I usually use a budget of 1000 trials; use more if you can).
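As a sketch, the zoo invocation would look something like the following (double-check the exact flag names against train.py --help; the env id is the one in question here):

python train.py --algo sac --env RoboschoolHumanoidFlagrunHarder-v1 -optimize --n-trials 1000 --sampler random --pruner median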

I agree it would be nice to have hyperparameters for Roboschool here (feel free to open a PR if you find good ones ;)). It is not the priority for now (I'm focusing on improving stable-baselines), but I will certainly do that in the future.

@doviettung96 (Author)

@araffin,
I tested all of them after training without that setting. My code is actually very similar to the examples in stable-baselines, so maybe it is not quite right for the off-policy algorithms, which use a deterministic policy.
For PPO2, I get a result of around 1550, which is a little higher than the trained agent from Roboschool (1450 to 1500). I tested it 5 times and took the average; I would say the difference might be due to randomness.
Unfortunately, I would not have enough time to run even 100 trials, because 400M steps of HFH normally take me 2 days for on-policy methods (128 workers) and 20M steps take 4-5 days for off-policy methods. That's why I need a prior.
Yeah, if I could get a good result, I would like to contribute it.

araffin (Owner) commented Jun 5, 2019

I would not have enough time to run even 100 trials

Usually, you don't run hyperparameter tuning on the full budget. You can try on one quarter of it, and because of the pruner, each trial won't use the max budget.
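Concretely, if I remember the zoo's options correctly, the -n / --n-timesteps flag lets you cap the budget per trial (the numbers below are only illustrative):

python train.py --algo sac --env RoboschoolHumanoidFlagrunHarder-v1 -n 5000000 -optimize --n-trials 1000 --sampler random --pruner median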

@doviettung96 (Author)

I have never tried that. Hopefully I can improve the performance that way. Thank you.

@doviettung96 (Author)

Hi @araffin,
I just had a look at the predict function of PPO2, A2C, DDPG, and SAC. For the on-policy methods, predict has deterministic=False by default; for the off-policy methods, deterministic=True is the default.
Thus, I think the low performance of the off-policy methods might not be related to that setting. Anyway, I will also try setting it to True for the on-policy methods and see the result.
Also, for speeding up the off-policy methods, you mentioned that I should use HER. So, should I use mpirun -n 8 in order to use 8 workers (for DDPG, as in the paper)? Or did you mean another way of running it?
Thank you.

araffin (Owner) commented Jun 6, 2019

Thus, I think the low performance of the off-policy methods might not be related to that setting.

The predict method is only used for testing; during training, all policies are stochastic (don't forget to add noise for DDPG). And yes, the default value of deterministic differs between on-policy and off-policy methods, because in one case you explicitly train a stochastic policy, whereas in the other case the noise is only there for the behavioral policy, i.e. for exploration during training.
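For instance, a minimal DDPG sketch with Gaussian action noise for exploration during training and a deterministic action at test time (illustrative settings, not tuned hyperparameters):

import gym
import roboschool
import numpy as np
from stable_baselines import DDPG
from stable_baselines.ddpg.noise import NormalActionNoise

env = gym.make("RoboschoolHumanoidFlagrunHarder-v1")
n_actions = env.action_space.shape[-1]
# Gaussian exploration noise, only used by the behavioral policy during training
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = DDPG("MlpPolicy", env, action_noise=action_noise, verbose=1)
model.learn(total_timesteps=20000)  # illustrative budget

# At test time, ask for the deterministic action explicitly
obs = env.reset()
action, _ = model.predict(obs, deterministic=True)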

you mentioned that I should use HER.

I meant that you can look at the "HER-support" branch of this repo (which has better MPI support), not that you should use HER.
So for DDPG, this works:

mpirun -n 16 python train.py --algo ddpg --env Pendulum-v0

@doviettung96 (Author)

Thank you. I used Gaussian noise for DDPG, so I don't think I have any problem with the policy settings for training/testing.
About mpirun, I got it, but I think I could also try HER anyway. According to the paper, they used MuJoCo envs, so I think I could use it for Roboschool.

@doviettung96 (Author)

Anyway, I get an import error for HER even though I am using the docker image built from docker/Dockerfile.cpu. How can I fix it? I have no intention of using HER, but the file utils.py imports it from stable_baselines. Thanks.

araffin (Owner) commented Jun 8, 2019

HER is only in the master branch for now. It will be released soon (that is why the docker image does not work yet), so you need to install stable-baselines from source.
Anyway, Roboschool envs don't follow the goal interface, so HER won't work with them.
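Installing from source should be something along these lines (assuming the upstream repo; adjust the URL if you are using a fork):

pip install git+https://github.com/hill-a/stable-baselines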

@doviettung96 (Author)

I have not looked at the changes, but is there any significant difference between the master branch and the HER-support branch regarding MPI support for DDPG (or SAC)? Or do I need to install it from source?

@doviettung96 (Author)

I would not have enough time to run even 100 trials

Usually, you don't run hyperparameter tuning on the full budget. You can try on one quarter of it, and because of the pruner, each trial won't use the max budget.

Hi @araffin,
I don't fully understand this. For a normal run, I train on RoboschoolHumanoidFlagrunHarder for 20M timesteps. By "try on one quarter of it", do you mean that I should use 1000 trials with 5M timesteps each? Or should I run 250 trials with 20M timesteps?
About the pruner, yes, I agree that it will stop unpromising trials early, which saves time.
Anyway, when running hyperparameter tuning, I can only use 1 environment for off-policy methods, right? What about the on-policy methods?
Thank you.
