An MDP is a namedtuple with the following elements:
- `initial_states`: list of initial states and their probabilities: `[(probability, state)]`
- `actions`: function that takes a state and returns all available actions: `state -> [actions]`
- `transitions`: function that takes a state and an action and returns all possible new states and rewards: `state, action -> [(probability, (new_state, reward))]`
- `discount`: discount factor between 0 and 1
Note: if `new_state == None`, the MDP has reached a terminal state.
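
As an illustration, here is a minimal sketch of an MDP in this format. The field names and the toy "flip until you stop or bust" dynamics are assumptions made for this example, not taken from the library:

```python
from collections import namedtuple

# Field names assumed to match the elements listed above
MDP = namedtuple("MDP", ["initial_states", "actions", "transitions", "discount"])

def actions(state):
    # in every state we can either stop or flip a coin
    return ["stop", "flip"]

def transitions(state, action):
    if action == "stop":
        # terminal transition: new_state == None, reward equals the count so far
        return [(1.0, (None, state))]
    # flip: heads increments the count, tails busts with zero reward
    return [(0.5, (state + 1, 0)), (0.5, (None, 0))]

# start at count 0 with probability 1
coin_mdp = MDP(initial_states=[(1.0, 0)], actions=actions,
               transitions=transitions, discount=1.0)
```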
For example, we can create a Blackjack MDP and solve it with value iteration:

```python
from perl.mdp import value_iteration
from perl.mdp.blackjack import Blackjack

# create Blackjack MDP
blackj = Blackjack(list(range(10)) + ["A"])

# solve the MDP using value iteration
values, policy = value_iteration(blackj)

# print optimal values and policy
print("Value:")
print("state | value")
for state, value in sorted(values.items()):
    print("{} {:.2f}".format(state, value))

print("Policy:")
print("state | action")
for state, action in sorted(policy.items()):
    print("{} {}".format(state, action))
```
Here is an example of how to use posterior sampling to solve an MDP:
```python
import numpy as np
import toyplot as tp

from perl.mdp.numberline import Numberline
from perl.rl.environment import mdp_to_env
from perl.rl.simulator import live
from perl.rl.algorithms import FixedPolicy, PosteriorSampling
from perl.bayesian import Beta

# create a small MDP we want to solve
mdp = Numberline(3)
env = mdp_to_env(mdp)

# run posterior sampling for 100 episodes
PS = PosteriorSampling(mdp, d_reward=lambda: Beta(0.1, 0.1))
rewards = live(env, PS, 100, verbose=20)
performance = sum(rewards) / len(rewards)

# evaluate the learned policy over 1000 episodes
final_rewards = live(env, FixedPolicy(PS.optimal_policy), 1000)
final_performance = sum(final_rewards) / len(final_rewards)

# plot cumulative reward over time
canvas = tp.Canvas(500, 300)
axes = canvas.cartesian(label="Cumulative reward while learning",
                        xlabel="episode (learning: {:.3f} - final: {:.3f})".format(performance, final_performance),
                        ylabel="reward")
axes.plot(np.cumsum(rewards))
```
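
The `d_reward` argument specifies the prior over each reward distribution; here a `Beta(0.1, 0.1)` prior, which is conjugate to 0/1 rewards. As a plain-Python illustration of how such a posterior updates with observations (perl's `Beta` API may differ):

```python
# Beta-Bernoulli conjugate update, written out by hand for illustration
alpha, beta = 0.1, 0.1         # Beta(0.1, 0.1) prior, as passed to d_reward above
for reward in [1, 0, 1, 1]:    # hypothetical 0/1 rewards for one (state, action)
    alpha += reward            # successes increment alpha
    beta += 1 - reward         # failures increment beta
print(alpha / (alpha + beta))  # posterior mean reward estimate (~0.74 here)
```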