A quick background review

Reinforcement Learning (RL) is "a way of programming agents by reward and punishment without needing to specify how the task is to be achieved" (Kaelbling, Littman, & Moore, 1996).

The basic RL problem involves states (s), actions (a) and rewards (r). In the typical formulation, at each time step the agent observes the current state s, selects an action a, and the environment returns a reward r together with the next state.

The goal of RL is to select actions a that move the agent through states s so as to maximize future reward r (a minimal interaction loop is sketched after the list below). Three key additional components in RL are:

  • Policy (π): the agent's behaviour, in other words, how it selects an action (a) in a given state (s)
  • Value function (V): a prediction of future reward, i.e. how much reward the agent can expect from taking action a in state s
  • Model: the agent's representation of the environment, learnt from experience
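
To make this concrete, here is a minimal sketch of the interaction loop in Python. Everything in it (the `ToyEnv` chain environment, the uniform random policy) is a hypothetical stand-in chosen only for illustration, not part of any particular library.

```python
import random

class ToyEnv:
    """A tiny 5-state chain: action 1 moves right, action 0 moves left.

    The episode ends (with reward 1) when the rightmost state is reached.
    This environment is purely illustrative.
    """
    n_states, n_actions = 5, 2

    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        self.s = max(0, min(self.n_states - 1, self.s + (1 if a == 1 else -1)))
        r = 1.0 if self.s == self.n_states - 1 else 0.0   # reward only at the goal
        done = self.s == self.n_states - 1
        return self.s, r, done

def random_policy(s, n_actions):
    # pi(a|s): here simply uniform over actions
    return random.randrange(n_actions)

env = ToyEnv()
s, total_reward = env.reset(), 0.0
for t in range(20):
    a = random_policy(s, env.n_actions)   # agent selects action a in state s
    s, r, done = env.step(a)              # environment returns reward r and next state
    total_reward += r                     # accumulate reward over the episode
    if done:
        break
print("return:", total_reward)
```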

Approaches to RL

  • Value-based RL: Estimate the optimal value function Q∗(s, a). This is the maximum value achievable under any policy
  • Policy-based RL: Search directly for the optimal policy π∗. This is the policy achieving maximum future reward
  • Model-based RL: Build a model of the environment. Plan (e.g. by lookahead) using the model

There are also actor-critic techniques (e.g. DDPG) which learn both a policy and a value function simultaneously.
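
As a concrete, deliberately simplified instance of the value-based approach, the sketch below runs tabular Q-learning on the hypothetical ToyEnv chain from the previous snippet (assumed to be defined there); the hyperparameters are illustrative, not tuned.

```python
import random

# Q[s][a] estimates the optimal value Q*(s, a); the greedy policy
# argmax_a Q(s, a) is recovered from it.
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def greedy_action(q_row):
    # argmax over actions, breaking ties at random
    best = max(q_row)
    return random.choice([a for a, q in enumerate(q_row) if q == best])

Q = [[0.0] * ToyEnv.n_actions for _ in range(ToyEnv.n_states)]
env = ToyEnv()
for episode in range(200):
    s, done = env.reset(), False
    while not done:
        # epsilon-greedy behaviour: mostly exploit, occasionally explore
        if random.random() < epsilon:
            a = random.randrange(ToyEnv.n_actions)
        else:
            a = greedy_action(Q[s])
        s_next, r, done = env.step(a)
        # Q-learning update: bootstrap from the best action in the next state
        target = r + (0.0 if done else gamma * max(Q[s_next]))
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next

# States before the goal should now prefer action 1 (move right)
print([greedy_action(Q[s]) for s in range(ToyEnv.n_states)])
```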

Further classification of RL

Model-based learning attempts to model the environment and then, based on that model, choose the most appropriate policy. Model-free learning attempts to learn the optimal policy (or value function) directly from experience, without building an explicit model of the environment.
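
The sketch below is a minimal illustration of the model-based idea, assuming the same toy chain setting as above: it builds empirical transition and reward estimates from observed (s, a, r, s') tuples and picks actions by one-step lookahead against a given state-value table V (which a full planner would itself compute from the model, e.g. by value iteration).

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] visit counts
reward_sum = defaultdict(float)                  # accumulated reward per (s, a)

def update_model(s, a, r, s_next):
    """Record one observed transition to refine the learned model."""
    counts[(s, a)][s_next] += 1
    reward_sum[(s, a)] += r

def lookahead(s, V, n_actions, gamma=0.99):
    """Pick the action whose learned model predicts the best one-step outcome."""
    best_a, best_val = 0, float("-inf")
    for a in range(n_actions):
        n = sum(counts[(s, a)].values())
        if n == 0:
            continue                             # no data for this action yet
        r_hat = reward_sum[(s, a)] / n           # estimated expected reward
        v_next = sum(c / n * V[s2] for s2, c in counts[(s, a)].items())
        val = r_hat + gamma * v_next
        if val > best_val:
            best_a, best_val = a, val
    return best_a
```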

  • On-policy methods:
    • attempt to evaluate or improve the policy that is used to make decisions,
    • often use soft action choice, i.e. π(s,a) > 0, ∀a,
    • commit to always exploring and try to find the best policy that still explores,
    • may become trapped in local minima.
  • Off-policy methods:
    • evaluate one policy while following another, e.g. evaluating the greedy policy while following a more exploratory scheme (compare the update rules sketched after this list),
    • the policy used for behaviour should be soft,
    • policies may not be sufficiently similar,
    • may be slower (only the part after the last exploration is reliable), but remains more flexible if alternative routes appear.
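
The distinction shows up most clearly in the temporal-difference targets of SARSA (on-policy) and Q-learning (off-policy). The sketch below reuses the Q-table conventions from the value-based snippet above and is illustrative only.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy target: bootstraps from the action a' the behaviour policy
    # actually took in the next state, so it evaluates the policy being followed.
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy target: bootstraps from the best action in the next state,
    # evaluating the greedy policy while behaving e.g. epsilon-greedily.
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
```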

Based on: