Code for the paper "BAIL: Best-Action Imitation Learning for Batch Deep Reinforcement Learning"(link: https://arxiv.org/abs/1910.12179), published at NeurIPS 2020.
The BAIL implementation can be found under spinup/algos/BAIL/
.
Code runs with pytorch >= 1.2, mujoco 150
Create a folder of the path "spinup/algos/BAIL/buffers" and put the RL batches there.
Then you run BAIL in the following procedure:
Under the BAIL folder, first do the returns calculation with python main_get_mcret.py
, which stores the calculated returns in "./results".
Then run BAIL with python main_static_bail.py
or run Progressive BAIL with python main_prog_bail.py
.
We use the logger of Spinup to record algorithmic performances. You may consult Spinup documentation for output and plotting:
(https://spinningup.openai.com/en/latest/user/saving_and_loading.html, https://spinningup.openai.com/en/latest/user/plotting.html). Or you could modify the code and save it in your style.
We now have an optimized version of the buffers, which are much smaller than the original dataset (but the content is the same, just in a different format), they can be downloaded here (about 28 GB in total):
https://drive.google.com/drive/folders/1pjHodByRE0UelFJ0e1cZ2oKhffIsxu3d?usp=sharing
You can load them with joblib, the loaded object is a dictionary, and then each entry is a NumPy array (sample code below).
import joblib
d = joblib.load('DDPG_Final_Sigma0.5_Walker2d-v2_b1.pkl')
for key in d:
print(key)
print(d[key].shape)
""" output:
observations
(1000000, 17)
next_observations
(1000000, 17)
actions
(1000000, 6)
rewards
(1000000,)
dones
(1000000,)
"""
Implementation of the BCQ algorithm: https://github.com/sfujim/BCQ
ENVIRONMENT | BAIL MEAN | BAIL STD |
---|---|---|
sigma=0.1 Hopper B1 | 2173 | 291 |
sigma=0.1 Hopper B2 | 2078 | 180 |
sigma=0.1 Walker B1 | 1125 | 113 |
sigma=0.1 Walker B2 | 3141 | 300 |
sigma=0.1 HC B1 | 5746 | 29 |
sigma=0.1 HC B2 | 7212 | 43 |
sigma=0.5 Hopper B1 | 2054 | 158 |
sigma=0.5 Hopper B2 | 2623 | 282 |
sigma=0.5 Walker B1 | 2522 | 51 |
sigma=0.5 Walker B2 | 3115 | 133 |
sigma=0.5 HC B1 | 1055 | 9 |
sigma=0.5 HC B2 | 7173 | 120 |
SAC Hopper B1 | 3296 | 105 |
SAC Hopper B2 | 1831 | 915 |
SAC Walker B1 | 2455 | 211 |
SAC Walker B2 | 4767 | 130 |
SAC HC B1 | 10143 | 77 |
SAC HC B2 | 10772 | 59 |
SAC Ant B1 | 4284 | 64 |
SAC Ant B2 | 4946 | 148 |
SAC Humanoid B1 | 3852 | 430 |
SAC Humanoid B2 | 3565 | 153 |
M sigma=0 Hopper B1 | 1026 | 0 |
M sigma=0 Hopper B2 | 696 | 233 |
M sigma=0 Walker B1 | 437 | 20 |
M sigma=0 Walker B2 | 500 | 12 |
M sigma=0 HC B1 | 4057 | 69 |
M sigma=0 HC B2 | 4013 | 12 |
M sigma=0 Ant B1 | 753 | 9 |
M sigma=0 Ant B2 | 738 | 4 |
M sigma=0 Humanoid B1 | 4313 | 139 |
M sigma=0 Humanoid B2 | 4053 | 252 |
M sigma=sigma(s) Hopper B1 | 375 | 52 |
M sigma=sigma(s) Hopper B2 | 254 | 102 |
M sigma=sigma(s) Walker B1 | 384 | 21 |
M sigma=sigma(s) Walker B2 | 512 | 24 |
M sigma=sigma(s) HC B1 | 4744 | 19 |
M sigma=sigma(s) HC B2 | 4123 | 19 |
M sigma=sigma(s) Ant B1 | 790 | 9 |
M sigma=sigma(s) Ant B2 | 781 | 6 |
M sigma=sigma(s) Humanoid B1 | 1375 | 387 |
M sigma=sigma(s) Humanoid B2 | 1309 | 372 |
O sigma=0 Hopper B1 | 2602 | 5 |
O sigma=0 Hopper B2 | 3046 | 34 |
O sigma=0 Walker B1 | 2735 | 26 |
O sigma=0 Walker B2 | 3019 | 6 |
O sigma=0 HC B1 | 11265 | 243 |
O sigma=0 HC B2 | 11360 | 265 |
O sigma=0 Ant B1 | 4901 | 65 |
O sigma=0 Ant B2 | 4975 | 108 |
O sigma=0 Humanoid B1 | 4872 | 895 |
O sigma=0 Humanoid B2 | 5320 | 125 |
O sigma=sigma(s) Hopper B1 | 2359 | 153 |
O sigma=sigma(s) Hopper B2 | 2035 | 217 |
O sigma=sigma(s) Walker B1 | 2834 | 120 |
O sigma=sigma(s) Walker B2 | 3200 | 16 |
O sigma=sigma(s) HC B1 | 10258 | 1255 |
O sigma=sigma(s) HC B2 | 10882 | 634 |
O sigma=sigma(s) Ant B1 | 4981 | 91 |
O sigma=sigma(s) Ant B2 | 5067 | 83 |
O sigma=sigma(s) Humanoid B1 | 2129 | 381 |
O sigma=sigma(s) Humanoid B2 | 4328 | 569 |
If you use our code, please cite our paper:
@inproceedings{chen2020bail,
author = {Chen, Xinyue and Zhou, Zijian and Wang, Zheng and Wang, Che and Wu, Yanqiu and Ross, Keith},
booktitle = {Advances in Neural Information Processing Systems},
editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin},
pages = {18353--18363},
publisher = {Curran Associates, Inc.},
title = {BAIL: Best-Action Imitation Learning for Batch Deep Reinforcement Learning},
url = {https://proceedings.neurips.cc/paper/2020/file/d55cbf210f175f4a37916eafe6c04f0d-Paper.pdf},
volume = {33},
year = {2020}
}