Behavior Proximal Policy Optimization

The authors' PyTorch implementation of the ICLR 2023 paper Behavior Proximal Policy Optimization (BPPO). BPPO uses the loss function from Proximal Policy Optimization (PPO) to improve the behavior policy estimated by behavior cloning.

The difference between BPPO and PPO

Compared with the PPO loss function, BPPO introduces no extra constraint or regularization. The only difference is the advantage approximation, which corresponds to the code difference between lines 88-89 of ppo.py and lines 151-155 of bppo.py.
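
For readers unfamiliar with the shared loss, here is a minimal sketch of the standard PPO clipped surrogate objective in plain Python. It is not copied from ppo.py or bppo.py. The function name, argument names, and the default clip_ratio of 0.2 are assumptions made for illustration. The advantage is passed in as an argument, so whichever approximation is used can be plugged in without changing the loss itself.

```python
import math

def clipped_surrogate_loss(log_probs_new, log_probs_old, advantages, clip_ratio=0.2):
    """Standard PPO clipped surrogate loss (to be minimized), averaged over samples.

    Illustrative only: names and the default clip_ratio are assumptions,
    not values taken from this repository.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(log_probs_new, log_probs_old, advantages):
        # Probability ratio between the updated policy and the old policy.
        ratio = math.exp(lp_new - lp_old)
        # Keep the ratio inside [1 - clip_ratio, 1 + clip_ratio].
        clipped = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
        # Pessimistic (lower) bound of the unclipped and clipped objectives.
        total += min(ratio * adv, clipped * adv)
    # Negate because optimizers minimize, while the surrogate is maximized.
    return -total / len(advantages)
```

Only the values passed as advantages would change between the two methods in this sketch; the clipping logic stays the same.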

Overview of the Code

The code consists of 7 Python scripts. The file main.py contains the parameter settings, which are explained in our paper.

Requirements

  • torch 1.12.0
  • mujoco 2.2.1
  • mujoco-py 2.1.2.14
  • d4rl 1.1
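
To compare an existing environment against the versions listed above, a small check along these lines may help. It is a sketch, not part of the repository. The distribution names in PINNED are assumed to match what pip reports on your system, and the helper name find_mismatches is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the list above. The distribution names are an assumption
# and may differ from the names your package manager reports.
PINNED = {
    "torch": "1.12.0",
    "mujoco": "2.2.1",
    "mujoco-py": "2.1.2.14",
    "d4rl": "1.1",
}

def find_mismatches(pinned):
    """Return {name: (expected, found)} for packages that are missing or differ."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            # Drop a local build suffix such as "+cu113" before comparing.
            found = version(name).split("+")[0]
        except PackageNotFoundError:
            found = None  # package is not installed
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches
```

Calling find_mismatches(PINNED) returns an empty dict when every listed package is installed at the pinned version.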

Running the code

  • python main.py: trains the network, storing checkpoints along the way.
  • Example:
python main.py --env hopper-medium-v2
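
To launch several runs in a row, the command above can be wrapped in a short helper. This is a sketch under stated assumptions: build_command and run_all are hypothetical names, only the --env flag shown above is used, and the script is assumed to be started from the repository root.

```python
import subprocess
import sys

def build_command(env_name):
    """Build the argument list for one training run of main.py."""
    return [sys.executable, "main.py", "--env", env_name]

def run_all(env_names, dry_run=True):
    """Return one command per environment; execute them when dry_run is False."""
    commands = [build_command(name) for name in env_names]
    if not dry_run:
        for command in commands:
            # check=True stops the loop on the first failing run.
            subprocess.run(command, check=True)
    return commands
```

For example, run_all(["hopper-medium-v2"], dry_run=False) starts the same run as the command above. Passing other dataset names assumes main.py accepts them.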

Citation

If you use BPPO, please cite our paper as follows:

@article{zhuang2023behavior,
  title={Behavior proximal policy optimization},
  author={Zhuang, Zifeng and Lei, Kun and Liu, Jinxin and Wang, Donglin and Guo, Yilang},
  journal={arXiv preprint arXiv:2302.11312},
  year={2023}
}