BDQN-PyTorch

Implementation of Efficient Exploration through Bayesian Deep-Q Networks.

An MXNet implementation can be found here.

To see it in practice, update main.py with the desired experiment and run python main.py.

A Thompson Sampling of deep explorative RL

Bayesian Deep Q-Networks (BDQN) is an RL method that applies the function approximation capabilities of deep neural networks to problems in reinforcement learning. The model follows the paper Efficient Exploration through Bayesian Deep Q-Networks by Kamyar Azizzadenesheli, Emma Brunskill, and Anima Anandkumar.

Summary of the algorithm:

In terms of implementation, BDQN is the same as DDQN (Deep Reinforcement Learning with Double Q-learning, by Hado van Hasselt et al.) except in two respects. In the last layer, BDQN uses Bayesian linear regression (BLR) instead of the linear regression used in DDQN. For exploration, BDQN uses Thompson sampling instead of the naive ε-greedy strategy used in DDQN.

As mentioned above, BDQN has the same architecture as DDQN, except that the last layer of DDQN is removed. We call the output of the remaining network the representation φ(·) and place a BLR layer on top of it. The input to the network is the state of the environment, x, and the output is the feature representation φ(x), which in turn is the input to the BLR block.

BDQN Architecture

The input to the network part of BDQN is a 4 × 84 × 84 tensor holding a rescaled, mean-scale version of the last four observations. The first convolutional layer has 32 filters of size 8 with a stride of 4. The second convolutional layer has 64 filters of size 4 with a stride of 2. The last convolutional layer has 64 filters of size 3 and is followed by a fully connected layer of size 512. We add a BLR layer on top of this.
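The layer sizes above can be sketched as a small PyTorch module. This is an illustrative sketch rather than the code in this repository: the class name FeatureNet, the ReLU activations, and the stride of 1 for the last convolution are assumptions, since the text does not state them. Under those assumptions the spatial size shrinks 84 → 20 → 9 → 7, so the fully connected layer sees 64 × 7 × 7 inputs.

```python
import torch
import torch.nn as nn


class FeatureNet(nn.Module):
    """Maps a stack of four 84x84 frames to a 512-dimensional feature vector phi(x).

    Hypothetical name; activations and the third conv stride are assumptions.
    """

    def __init__(self, feature_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 84x84 -> 20x20
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),  # 20x20 -> 9x9
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),  # 9x9 -> 7x7 (stride assumed)
            nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.Sequential(nn.Linear(64 * 7 * 7, feature_dim), nn.ReLU())

    def forward(self, x):
        # x has shape (batch, 4, 84, 84); the result has shape (batch, feature_dim).
        return self.fc(self.conv(x))


if __name__ == "__main__":
    phi = FeatureNet()(torch.zeros(2, 4, 84, 84))
    print(phi.shape)
```

The BLR layer is not part of this module; it would consume the returned features.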

BLR, a closed-form way of computing the posterior

In both DDQN (linear regression) and BDQN (Bayesian linear regression) the common assumptions are as follows:

The layer before the last layer provides features φ(·), suitable for linear models.
The state-action value Q(x,a) follows the generative model y = w_a^T φ(x) + ε, where y is a sample of Q(x,a) and, for simplicity, ε is assumed to be mean-zero Gaussian noise with variance σ_n².

In linear regression, the question is: given a set of (x,a,y) samples, which w_a minimizes the least-squares error, i.e. which w_a best maps (x,a) to y? In the Bayesian setting, we assume w_a is drawn from a prior distribution, e.g. a mean-zero Gaussian with variance σ². Given the data, the question in BLR is what the posterior distribution of w_a is. The useful property of BLR is that, given the data, the posterior distribution of w_a, and therefore of Q(x,a), can be computed in closed form, and due to conjugacy the distribution over samples of Q(x,a) also has a closed form.
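For a mean-zero Gaussian prior with variance σ² and Gaussian noise with variance σ_n², the posterior over w_a is again Gaussian, with covariance (Φ^T Φ / σ_n² + I / σ²)^(-1) and mean equal to that covariance times Φ^T y / σ_n², where Φ stacks the feature vectors of the samples for action a row by row. The NumPy sketch below computes this standard result for one action. The function name blr_posterior and its arguments are illustrative and are not taken from this repository.

```python
import numpy as np


def blr_posterior(features, targets, prior_var=1.0, noise_var=1.0):
    """Closed-form Gaussian posterior over the weights of one action.

    features: array of shape (n, d), one feature vector phi(x) per row.
    targets:  array of shape (n,), the regression targets y.
    Returns the posterior mean (d,) and covariance (d, d).
    """
    d = features.shape[1]
    # Posterior precision = data term + prior term.
    precision = features.T @ features / noise_var + np.eye(d) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ features.T @ targets / noise_var
    return mean, cov


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = np.array([1.0, -2.0, 0.5])
    phi = rng.normal(size=(200, 3))
    y = phi @ true_w + 0.1 * rng.normal(size=200)
    mean, cov = blr_posterior(phi, y, prior_var=1.0, noise_var=0.01)
    print(mean)
```

With no data the posterior falls back to the prior, and as more samples arrive the covariance shrinks, which is what lets the uncertainty drive exploration.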

Given this property, we can compute the posterior distribution of the Q-function at each time step. As Thompson sampling strategies suggest, we draw a Q-function from the posterior and act optimally with respect to it for that time step.
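The action-selection step can be sketched as follows, assuming a Gaussian posterior (mean and covariance) over w_a is already available for each action. The function name thompson_action and its arguments are hypothetical and are not taken from this repository. One weight vector is sampled per action, the sampled Q-values w_a^T φ(x) are computed, and the greedy action under that sample is returned.

```python
import numpy as np


def thompson_action(phi, means, covs, rng):
    """Pick an action by Thompson sampling over per-action weight posteriors.

    phi:   feature vector phi(x) of the current state, shape (d,).
    means: sequence of posterior means, one (d,) array per action.
    covs:  sequence of posterior covariances, one (d, d) array per action.
    rng:   a numpy.random.Generator used to draw the samples.
    """
    sampled_q = [
        rng.multivariate_normal(mean, cov) @ phi  # sampled Q(x, a)
        for mean, cov in zip(means, covs)
    ]
    # Act greedily with respect to the sampled Q-function.
    return int(np.argmax(sampled_q))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    phi = np.array([1.0, 1.0])
    means = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
    covs = [np.eye(2), np.eye(2)]
    print(thompson_action(phi, means, covs, rng))
```

When the posteriors are wide, different actions can win from one draw to the next, which produces exploration. As they concentrate, the choice approaches the greedy action under the posterior means.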

You can read more about the implementation of the BDQN in the paper.
