Model-Ensemble Trust-Region Policy Optimization (ME-TRPO)

ME-TRPO is a deep model-based reinforcement learning algorithm that uses neural networks to represent both the dynamics and the policy. The dynamics are modeled by an ensemble of networks, which captures the uncertainty that arises from limited data. The algorithm alternates among three steps in the style of Dyna: adding transitions to a replay buffer, optimizing the dynamics models on the buffer, and optimizing the policy under the dynamics models. This substantially alleviates the model bias problem in model-based RL, in which the policy exploits errors in the dynamics model. In many MuJoCo domains, we show that it can achieve the same final performance as model-free approaches while using 100x less data. We assume that the reward function can be specified.
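The alternating loop described above can be illustrated with a toy sketch. This is not the code in this repository: the 1-D environment, the function names, and all constants are hypothetical, linear least-squares models fitted on bootstrap resamples stand in for the neural network ensemble, and simple hill climbing on a single policy gain stands in for TRPO.

```python
import numpy as np

def real_step(s, a, rng):
    # Hypothetical noisy 1-D environment standing in for a MuJoCo task.
    return 0.9 * s + 0.5 * a + 0.05 * rng.normal()

def reward(s, a):
    # Known reward function, as the text assumes.
    return -(s ** 2) - 0.01 * (a ** 2)

def collect(gain, n_steps, rng):
    """Roll out the policy a = -gain * s, with exploration noise, in the real environment."""
    s, data = rng.normal(), []
    for _ in range(n_steps):
        a = -gain * s + 0.1 * rng.normal()
        s_next = real_step(s, a, rng)
        data.append((s, a, s_next))
        s = s_next
    return data

def fit_ensemble(buffer, n_models, rng):
    """Fit each member by least squares on its own bootstrap resample of the buffer."""
    arr = np.asarray(buffer)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(arr), size=len(arr))
        w, *_ = np.linalg.lstsq(arr[idx, :2], arr[idx, 2], rcond=None)
        models.append(w)  # w[0] * s + w[1] * a predicts the next state
    return models

def imagined_return(gain, models, starts, horizon, seed):
    """Average return of model rollouts; a random member predicts each step."""
    rng = np.random.default_rng(seed)  # fixed seed so candidates are compared fairly
    total = 0.0
    for s in starts:
        for _ in range(horizon):
            a = -gain * s
            total += reward(s, a)
            w = models[rng.integers(len(models))]
            s = w[0] * s + w[1] * a
    return total / len(starts)

def model_ensemble_loop(n_iters=5, seed=0):
    rng = np.random.default_rng(seed)
    buffer, gain = [], 0.0
    for it in range(n_iters):
        buffer += collect(gain, 50, rng)           # 1. add real transitions to the buffer
        models = fit_ensemble(buffer, 5, rng)      # 2. optimize the dynamics models
        starts = rng.normal(size=20)
        best = imagined_return(gain, models, starts, 20, it)
        for _ in range(20):                        # 3. optimize the policy under the models
            cand = gain + 0.1 * rng.normal()
            score = imagined_return(cand, models, starts, 20, it)
            if score > best:
                gain, best = cand, score
    return gain
```

Calling model_ensemble_loop() returns the learned gain. Because each imagined step is predicted by a randomly chosen ensemble member, a policy that only exploits the error of one member tends to score poorly under the others, which is the intuition behind using an ensemble against model bias.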

Set-up

Install rllab and conda.
Create a Python environment and install dependencies: conda env create -f tf14.yml
Activate the environment: source activate tf14
Place this folder at rllab/sandbox/thanard/me-trpo.
Run: python run_model_based_rl.py trpo -env swimmer

Notes

Environments: swimmer, snake, half-cheetah, and hopper work reliably and converge quickly (on the order of hours). ant and humanoid take a couple of days on a single GPU and are less reliable.
Algorithms: trpo works better than vpg, which works better than bptt.
To run snake, put vendor/mujoco_models/snake.xml under rllab/vendor/mujoco_models.

Logging

The log folder is saved as data/local/ENVNAME/ENVNAME_DATETIME_0001 when running without ec2 (the default).
progress.csv contains real_current_validation_cost, which is the negative of the reward so far.
info.log contains the full logs of data collection, dynamics model optimization, and policy optimization. Note that we minimize the proxy cost, estim_validation_cost. The true cost is reported as real_validation_cost but is not visible to the policy optimizer.
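Since real_current_validation_cost is the negative of the reward, a short script can recover a reward curve from progress.csv. The following is a minimal sketch, assuming progress.csv is a comma-separated file with a header row; the function name load_rewards is hypothetical and not part of this repository.

```python
import csv

def load_rewards(progress_csv_path):
    """Return one reward per logged row, i.e. minus real_current_validation_cost."""
    rewards = []
    with open(progress_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            value = row.get("real_current_validation_cost", "")
            if value:  # skip rows where the column is missing or empty
                rewards.append(-float(value))
    return rewards
```

For example, load_rewards("data/local/ENVNAME/ENVNAME_DATETIME_0001/progress.csv"), with ENVNAME and DATETIME replaced by the values of an actual run, returns the rewards in logging order.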
