
MDPs

Creating an MDP

An MDP is a namedtuple with the following elements:

  • initial_states: a list of initial states and their probabilities: [(probability, state)]
  • actions: a function that takes a state and returns all available actions: state -> [actions]
  • transitions: a function that takes a state and an action and returns all possible new states and rewards, that is, state, action -> [(probability, (new_state, reward))]
  • discount: the discount factor, between 0 and 1

Note: if new_state == None, the MDP has reached a terminal node.
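
For illustration, here is a minimal toy MDP built directly from these four elements. This is a sketch: the namedtuple is defined locally and the field names follow the list above; if the library exposes its own MDP namedtuple, use that instead.

from collections import namedtuple

# local container mirroring the fields described above (an assumption; perl may
# provide its own namedtuple class)
MDP = namedtuple("MDP", ["initial_states", "actions", "transitions", "discount"])

def actions(state):
    # both actions are always available in this toy example
    return ["stop", "go"]

def transitions(state, action):
    if action == "stop":
        # deterministic: end the episode with reward 0 (None marks a terminal node)
        return [(1.0, (None, 0))]
    # "go": reward 1 or 0 with equal probability, then terminate
    return [(0.5, (None, 1)), (0.5, (None, 0))]

toy_mdp = MDP(
    initial_states=[(1.0, "start")],  # start in state "start" with probability 1
    actions=actions,
    transitions=transitions,
    discount=0.9,
)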

Solving a version of Blackjack

from perl.mdp import value_iteration
from perl.mdp.blackjack import Blackjack

# create Blackjack MDP
blackj = Blackjack(list(range(10)) + ["A"])

# solve the MDP using value iteration
values, policy = value_iteration(blackj)

# print optimal value and policy
print("Value:")
print("state | value")
for state, value in sorted(values.items()):
    print("{}       {:.2f}".format(state, value))
    
print("Policy:")
print("state | action")
for state, action in sorted(policy.items()):
    print("{}         {:}".format(state, action))

Reinforcement learning

Posterior sampling

Here is an example of how to use posterior sampling to solve an MDP:

import numpy as np
import toyplot as tp

from perl.mdp.numberline import Numberline
from perl.rl.environment import mdp_to_env
from perl.rl.simulator import live
from perl.rl.algorithms import FixedPolicy, PosteriorSampling

from perl.bayesian import Beta

# create a small MDP we want to solve
mdp = Numberline(3)
env = mdp_to_env(mdp)

# Run posterior sampling for 100 episodes
PS = PosteriorSampling(mdp, d_reward=lambda: Beta(0.1, 0.1))
rewards = live(env, PS, 100, verbose=20)
performance = sum(rewards) / len(rewards)

# Evaluate policy
final_rewards = live(env, FixedPolicy(PS.optimal_policy), 1000)
final_performance = sum(final_rewards) / len(final_rewards)

# Plot rewards over time
canvas = tp.Canvas(500, 300)
axes = canvas.cartesian(label="Cumulative reward while learning", 
                        xlabel="episode (learning: {:.3f} - final: {:.3f})".format(performance, final_performance),
                        ylabel="reward")
axes.plot(np.cumsum(rewards))
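
In a Jupyter notebook the canvas renders inline; when running the script elsewhere, one option is to write it to a standalone HTML file with toyplot's own renderer. This is a usage note, not part of the original example.

import toyplot.html

# write the figure to a standalone HTML file
toyplot.html.render(canvas, "rewards.html")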