
Bias and Variance Reduction in Off-policy Reinforcement Learning

Project 14, Columbia IEOR 2020 Summer Research Initiative

Supervisor: Dr. Henry Lam (Columbia IEOR Dept.)

Team Members: Yihao Li (yl4326@columbia.edu), Jierong Luo (jl5502@columbia.edu)

A new policy gradient estimation method inspired by the idea of Liu et al. (2018) [1]. It focuses on the distribution of (state, action, next state) triples along the trajectory rather than on the trajectory itself, so in theory the variance of the new estimate is not affected by the time horizon. (Most existing estimation procedures suffer variance that grows exponentially with the time horizon.)

New Estimation Method
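The snippet below is a minimal PyTorch sketch of the idea described above, not the code in this repository: each sampled (state, action, next state) transition is reweighted by an estimated stationary-state density ratio and the action likelihood ratio, so no product of per-step importance weights over the whole trajectory appears. The function name `ihp_surrogate` and its argument names are placeholders chosen for illustration.

```python
import torch

def ihp_surrogate(log_pi, pi0_prob, w_hat, rewards):
    """Scalar surrogate whose gradient w.r.t. the policy parameters is the
    triple-based policy-gradient estimate.

    log_pi   -- log pi_theta(a|s) per sampled triple, differentiable w.r.t. theta
    pi0_prob -- behavior-policy probabilities pi_0(a|s), fixed
    w_hat    -- estimated stationary density ratios d_pi(s) / d_pi0(s), fixed
    rewards  -- per-transition rewards (or advantage estimates)
    """
    pi_prob = log_pi.exp().detach()              # target-policy probabilities, no grad
    weights = w_hat * pi_prob / pi0_prob         # per-triple importance weight
    return (weights * rewards * log_pi).mean()   # score-function surrogate
```

Maximizing this surrogate with a standard optimizer gives the gradient ascent step; the density ratios `w_hat` themselves have to be estimated separately, e.g. along the lines of the mini-max estimator in [1].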

In the experiments on the RiverSwim problem, the IHP (infinite horizon paper) method was implemented in two models with a roll-out time horizon of 15000. The parametric model kept the bias on the order of 10e2 and the variance on the order of 10e4.

Figure: Parametric bias

Figure: Parametric variance
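For reference, a self-contained RiverSwim sketch is given below. The transition probabilities and rewards follow one common parameterization of the chain (6 states, a reliable left action with a small reward at the leftmost state, a stochastic right action with a large reward at the rightmost state); the exact numbers used in the experiments may differ, so treat them as assumptions.

```python
import numpy as np

class RiverSwim:
    """Chain MDP: action 0 swims left (deterministic, small reward at state 0),
    action 1 swims right against the current (stochastic, large reward at the
    rightmost state)."""

    def __init__(self, n_states=6, seed=0):
        self.n = n_states
        self.rng = np.random.default_rng(seed)
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        s, reward = self.state, 0.0
        if action == 0:                          # swim left: always succeeds
            self.state = max(s - 1, 0)
            if s == 0:
                reward = 5e-3                    # small reward for staying left
        else:                                    # swim right: fights the current
            u = self.rng.random()
            if u < 0.3:
                self.state = min(s + 1, self.n - 1)
            elif u < 0.9:
                self.state = s                   # pushed back: stay in place
            else:
                self.state = max(s - 1, 0)
            if s == self.n - 1 and self.state == self.n - 1:
                reward = 1.0                     # large reward at the right end
        return self.state, reward
```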

The neural network model reduced the direction variance (measured by the trace of the covariance matrix of the normalized gradient estimates) by about 0.3 (IHP 0.65 vs. baseline 0.9 to 1.0).

Figure: NN variance
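The direction-variance number above can be computed roughly as follows; this is a sketch of the reported measure (trace of the covariance matrix of the unit-normalized gradient estimates), not the repository's exact code.

```python
import torch

def direction_variance(grads):
    """grads: (N, D) tensor, one flattened gradient estimate per row.
    Returns the trace of the sample covariance matrix of the unit-normalized
    gradients, i.e. the sum of per-dimension variances."""
    unit = grads / grads.norm(dim=1, keepdim=True)    # normalize each estimate
    centered = unit - unit.mean(dim=0, keepdim=True)  # deviation from mean direction
    # trace(cov) equals the total squared deviation / (N - 1); computing it this
    # way avoids forming the D x D covariance matrix explicitly.
    return (centered.pow(2).sum() / (grads.shape[0] - 1)).item()
```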

The off-policy optimization learning curves show that the IHP method has a significant advantage over all the benchmarks.

Figure: NN optimization, in-policy

Figure: NN optimization, off-policy

The parametric model was implemented by Jierong Luo in MATLAB; the neural network model was implemented by Yihao Li in Python (PyTorch).

[1] Liu, Q., Li, L., Tang, Z., & Zhou, D. (2018). Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems (pp. 5356-5366). https://arxiv.org/pdf/1810.12429.pdf

