
Deep Reinforcement Learning for delayed rewards problems

These experiments study Deep RL methods for Temporal Credit Assignment of delayed rewards in decision processes. Different environment configurations, reward strategies and model-free policy evaluation algorithms are compared. The experiments also contrast different neural architectures & loss functions for learning more 'immediate' rewards from delayed episodic rewards.

Introduction

Reinforcement Learning is one of the most promising areas in modern artificial intelligence. It revolves around training an agent to navigate an environment: the agent interacts with its surroundings by performing actions and receiving rewards in return, and it continuously improves its behavioural policy to maximise its cumulative reward.
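
The snippet below is a minimal sketch of this agent-environment interaction loop. It assumes a Gym-like reset/step interface; the names RandomAgent and run_episode are illustrative placeholders and are not part of this repository.

```python
# Minimal sketch of the agent-environment feedback loop (illustrative only).
import random


class RandomAgent:
    """Picks actions uniformly at random; a learning agent would improve a policy instead."""

    def __init__(self, n_actions):
        self.n_actions = n_actions

    def act(self, state):
        return random.randrange(self.n_actions)

    def update(self, state, action, reward, next_state):
        pass  # a learning agent would update its policy/value estimates here


def run_episode(env, agent):
    # Assumes env.reset() -> state and env.step(action) -> (next_state, reward, done)
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = agent.act(state)
        next_state, reward, done = env.step(action)
        agent.update(state, action, reward, next_state)
        total_reward += reward
        state = next_state
    return total_reward
```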
In many practical settings, the reward feedback for an action is not available to the agent immediately; it can be delayed by several time steps. This is the problem of delayed rewards. Such decision processes are modelled here with the generalised framework of a graph: the agent traverses the vertices (which represent the states of the environment) by performing actions (which are represented by the edges of the graph). One or a few of the vertices can be associated with probabilistic rewards, which are credited when the agent reaches them. The setting can also contain long-term dependencies, where, for example, a state or action yields a reward only if the agent has previously visited a specific state or performed a specific action. The reward accumulated from all of these, however, becomes available only at the end of an episode.

The challenge is then to develop efficient algorithms that analyse such delayed rewards and assign their credit back in time to the state-action pairs throughout the episode. We propose to conduct biased random walks over the graph to sample episodes. These episodes are then used in a model-free framework which learns from experience to decompose the accumulated reward and redistribute it among the state-action pairs of the episode, converting delayed rewards into more immediate ones. Each experiment is discussed in more detail in the subsequent sections.
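
As a concrete illustration of this setting, the sketch below shows a graph environment with probabilistic vertex rewards whose total is revealed only at the end of an episode, together with a biased random walk used to sample episodes. This is not the repository's actual implementation; the class name, the bias dictionary keyed by edges, and the fixed horizon are assumptions made for the example.

```python
# Sketch of a delayed-reward graph environment and biased random walk sampling
# (illustrative assumptions, not the repository's implementation).
import random


class DelayedRewardGraph:
    def __init__(self, adjacency, reward_probs, reward_values, horizon=50):
        self.adjacency = adjacency          # vertex -> list of neighbouring vertices
        self.reward_probs = reward_probs    # vertex -> probability that a reward is credited
        self.reward_values = reward_values  # vertex -> reward magnitude when credited
        self.horizon = horizon              # episode length

    def sample_episode(self, start, bias=None):
        """Biased random walk; per-step rewards stay hidden until the episode ends."""
        bias = bias or {}                   # (vertex, next_vertex) -> transition weight
        state, trajectory, hidden_return = start, [], 0.0
        for _ in range(self.horizon):
            neighbours = self.adjacency[state]
            weights = [bias.get((state, nxt), 1.0) for nxt in neighbours]
            next_state = random.choices(neighbours, weights=weights)[0]
            trajectory.append((state, next_state))
            if random.random() < self.reward_probs.get(next_state, 0.0):
                hidden_return += self.reward_values.get(next_state, 0.0)
            state = next_state
        # Only the accumulated (delayed, episodic) reward is observed by the agent.
        return trajectory, hidden_return
```

Episodes sampled this way can then be fed to a model trained on (trajectory, episodic return) pairs; its per-step predictions of the return provide the redistributed, more immediate reward signal mentioned above.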