
[question] Hyperparameters for Roboschool HumanoidFlagrunHarder #26

Open
doviettung96 opened this issue Jun 5, 2019 · 11 comments

@doviettung96

Hi,
Currently, I have used 4 algorithms from stable-baselines on the Roboschool HumanoidFlagrunHarder task. My evaluation metric is the mean reward over 100 episodes. In short: PPO2 is perfect, A2C gets a mean reward of about 500, DDPG gets a mean reward around 0, and SAC gets a mean reward of about 280. I have been looking for hyperparameter settings in stable-baselines-zoo for A2C, DDPG, and SAC, but could only find the Bullet Humanoid env for SAC (quite close to Roboschool HFH). Thus, do you have any suggestions for A2C, DDPG, SAC on this task? The number of timesteps is 400M for on-policy methods and 20M for off-policy methods. It would be nice if you added them to the set of hyperparameters.
Thanks.

araffin added the question label Jun 5, 2019
araffin (Owner) commented Jun 5, 2019

My evaluation metric is the mean reward over 100 episodes.

You should do an evaluation on a test env after training, using deterministic=True (especially for SAC and DDPG).
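For reference, a minimal evaluation sketch along those lines (the env id and saved-model file name are illustrative, not from the zoo):

import gym
import roboschool  # registers the Roboschool envs
import numpy as np
from stable_baselines import SAC

# Load a previously trained model (hypothetical file name) and evaluate it
# on a fresh test env, using deterministic actions.
model = SAC.load("sac_hfh")
test_env = gym.make("RoboschoolHumanoidFlagrunHarder-v1")

episode_rewards = []
for _ in range(100):
    obs, done, total = test_env.reset(), False, 0.0
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, _ = test_env.step(action)
        total += reward
    episode_rewards.append(total)

print("Mean reward over 100 episodes:", np.mean(episode_rewards))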

PPO2 is perfect

What is the magnitude of the reward for PPO2?

Thus, do you have any suggestions for A2C, DDPG, SAC on this task?

Well, I would suggest running hyperparameter tuning (it is now included in the rl zoo). Random sampling + median pruner usually works quite well, given enough budget (I usually use a budget of 1000 trials; use more if you can).
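As a sketch, the zoo invocation would look something like the following (double-check the exact flag names against train.py --help; the env id is the one in question here):

python train.py --algo sac --env RoboschoolHumanoidFlagrunHarder-v1 -optimize --n-trials 1000 --sampler random --pruner median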

I agree it would be nice to have hyperparameters for Roboschool here (feel free to open a PR if you find good ones ;)). It is not the priority for now (I'm focusing on improving stable-baselines), but I will certainly do that in the future.

@doviettung96 (Author)

@araffin,
I tested all of them after training without that setting. My code is actually very similar to the examples in stable-baselines, so maybe it is not quite right for the off-policy algorithms, which use a deterministic policy.
For PPO2, I get a result of around 1550, which is a little higher than the trained agent from Roboschool (1450 to 1500). I tested it 5 times and took the average; I would say the difference might be due to randomness.
Unfortunately, I would not have enough time to run even 100 trials, because 400M steps of HFH normally take me 2 days for on-policy methods (128 workers) and 20M steps take 4-5 days for off-policy methods. That's why I need a prior.
Yeah, if I could get a good result, I would like to contribute it.

araffin (Owner) commented Jun 5, 2019

I would not have enough time to run even 100 trials

Usually, you don't run hyperparameter tuning on the full budget. You can try on one quarter of it, and because of the pruner, each trial won't use the max budget.
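Concretely, if I remember the zoo's options correctly, the -n / --n-timesteps flag lets you cap the budget per trial (the numbers below are only illustrative):

python train.py --algo sac --env RoboschoolHumanoidFlagrunHarder-v1 -n 5000000 -optimize --n-trials 1000 --sampler random --pruner median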

@doviettung96 (Author)

I have never tried that. Hopefully I can improve the performance that way. Thank you.

@doviettung96 (Author)

Hi @araffin,
I just had a look at the predict function of PPO2, A2C, DDPG, and SAC. For the on-policy methods, predict has deterministic=False by default; for the off-policy methods, deterministic=True is the default.
Thus, I think the low performance of the off-policy methods might not be related to that setting. Anyway, I will also try setting it to True for the on-policy methods and see the result.
Also, for speeding up the off-policy methods, you mentioned that I should use HER. So, should I use mpirun -n 8 in order to use 8 workers (for DDPG, as in the paper)? Or did you mean another way of running it?
Thank you.

araffin (Owner) commented Jun 6, 2019

Thus, I think the low performance of the off-policy methods might not be related to that setting.

The predict method is only used for testing; during training, all policies are stochastic (don't forget to add noise for DDPG). And yes, the default value of deterministic differs between on-policy and off-policy methods, because in one case you explicitly train a stochastic policy, whereas in the other case the noise is only there for the behavioral policy, i.e. for exploration during training.
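For instance, a minimal DDPG sketch with Gaussian action noise for exploration during training and a deterministic action at test time (illustrative settings, not tuned hyperparameters):

import gym
import roboschool
import numpy as np
from stable_baselines import DDPG
from stable_baselines.ddpg.noise import NormalActionNoise

env = gym.make("RoboschoolHumanoidFlagrunHarder-v1")
n_actions = env.action_space.shape[-1]
# Gaussian exploration noise, only used by the behavioral policy during training
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = DDPG("MlpPolicy", env, action_noise=action_noise, verbose=1)
model.learn(total_timesteps=20000)  # illustrative budget

# At test time, ask for the deterministic action explicitly
obs = env.reset()
action, _ = model.predict(obs, deterministic=True)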

you mentioned that I should use HER.

I meant that you can look at the "HER-support" branch of this repo (which has better MPI support), not that you should use HER.
So for DDPG, this works:

mpirun -n 16 python train.py --algo ddpg --env Pendulum-v0

@doviettung96 (Author)

Thank you. I used Gaussian noise for DDPG, so I don't think I have any problem with the policy settings for training/testing.
About mpirun, I got it, but I think I could also try HER anyway. According to the paper, they used MuJoCo envs, so I think I could use it for Roboschool.

@doviettung96 (Author)

Anyway, I get an import error for HER even though I am using the docker image built from docker/Dockerfile.cpu. How can I fix it? I have no intention of using HER, but the file utils.py imports it from stable_baselines. Thanks.

araffin (Owner) commented Jun 8, 2019

HER is only in the master branch for now. It will be released soon (that is why the docker image does not work yet), so you need to install stable-baselines from source.
Anyway, Roboschool envs don't follow the goal interface, so HER won't work with them.
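Installing from source should be something along these lines (assuming the upstream repo; adjust the URL if you are using a fork):

pip install git+https://github.com/hill-a/stable-baselines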

@doviettung96 (Author)

I have not looked at the changes, but is there any significant difference between the master branch and the HER-support branch regarding MPI support for DDPG (or SAC)? Or do I need to install it from source?

@doviettung96 (Author)

I would not have enough time to run even 100 trials

Usually, you don't run hyperparameter tuning on the full budget. You can try on one quarter of it, and because of the pruner, each trial won't use the max budget.

Hi @araffin,
I don't fully understand this. For a normal run, I train on RoboschoolHumanoidFlagrunHarder for 20M timesteps. By "try on one quarter of it", do you mean that I should use 1000 trials with 5M timesteps each? Or should I run 250 trials with 20M timesteps?
About the pruner, yes, I agree that it will stop unpromising trials early, which saves time.
Anyway, when running hyperparameter tuning, I can only use 1 environment for off-policy methods, right? What about the on-policy methods?
Thank you.
