From c633245f043c3c9239ed43725252b1c1caa27565 Mon Sep 17 00:00:00 2001 From: Frank Dellaert Date: Sun, 28 Apr 2024 07:13:00 -0700 Subject: [PATCH] Move optimal policy to 3.6 --- S35_vacuum_decision.ipynb | 508 +------------------------------- S36_vacuum_RL.ipynb | 598 ++++++++++++++++++++++++++++++++++---- 2 files changed, 558 insertions(+), 548 deletions(-) diff --git a/S35_vacuum_decision.ipynb b/S35_vacuum_decision.ipynb index a2a9d686..f9686d89 100644 --- a/S35_vacuum_decision.ipynb +++ b/S35_vacuum_decision.ipynb @@ -116,10 +116,10 @@ "\n", "In contrast, if the robot is able to know its location (using perception), it can act\n", "opportunistically when it reaches the hallway, and immediately move up.\n", - "The key idea here is that the optimal action at any moment in time depends\n", + "The key idea here is that the best action at any moment in time depends\n", "on the state in which the action is executed.\n", "The recipe of which action to execute in each state is called a *policy*,\n", - "and determining optimal policies is the main goal for this section." + "and defining policies and their associated *value function* is the main goal for this section." ] }, { @@ -674,7 +674,7 @@ "\n" ], "text/plain": [ - "" + "" ] }, "execution_count": 9, @@ -737,7 +737,7 @@ "" ], "text/plain": [ - "" + "" ] }, "execution_count": 11, @@ -943,7 +943,7 @@ "" ], "text/plain": [ - "" + "" ] }, "execution_count": 16, @@ -1002,8 +1002,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "To implement optimal planning as a search for an optimal policy, we now derive an objective function to maximize.\n", - " Above, we defined the utility for a specific sequence of $n$ actions as\n", + "We would now like to characterize the quality of any given policy.\n", + "Above, we defined the utility for a specific sequence of $n$ actions as\n", "\n", "$$\n", "U(a_1, \\dots, a_n, x_1, \\dots x_{n+1}) =\n", @@ -1230,7 +1230,7 @@ "Collecting the unknown $V^\\pi$ terms on the left hand side and the known $\\bar{R}(x,\\pi(x))$ \n", "terms on the right hand side, we obtain\n", "\n", - "$$V^\\pi(x) - \\gamma \\sum_{x'} P(x'|x, \\pi(x)) V^\\pi(x') = \\bar{R}(x,\\pi(x))$$\n", + "$$V^\\pi(x) - \\gamma \\sum_{x'} P(x'|x, \\pi(x)) V^\\pi(x') = \\bar{R}(x,\\pi(x)).$$\n", "\n", " \n", "To make this explicit yet concise for our vacuum cleaning robot example,\n", @@ -1406,490 +1406,6 @@ "Why is the value function in the living room $100$ and not the immediate reward $10$?" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Optimal Policy and Value Function\n", - "\n", - "> The optimal policy maximizes the value function.\n", - "\n", - "Now that we know how to compute the value function for an arbitrary policy $\\pi$,\n", - "we turn our attention to computing the **optimal value function**,\n", - "which can be used to construct the **optimal policy** $\\pi^*$.\n", - "\n", - "To begin, we recall the famous **principle of optimality**\n", - "as stated by Bellman in a\n", - "[1960 article in the IEEE Transactions on Automatic Control](https://doi.org/10.1109/TAC.1960.6429288):\n", - "\n", - "* *An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.*\n", - "\n", - "This principle enables a key step in the derivation of a recursvie formulation for the optimal policy." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The optimal value function $V^*: {\\cal X} \\rightarrow {\\cal A}$\n", - "is merely the value function for the optimal policy.\n", - "This can be written mathematically as\n", - "\n", - "$$\n", - "\\begin{aligned}\n", - "V^*(x) &= \\max_\\pi V^{\\pi}(x) \\\\\n", - "&=\n", - "\\max_\\pi \\left\\{ \\bar{R}(x,\\pi(x)) + \\gamma \\sum_{x'} P(x'|x, \\pi(x)) V^\\pi(x') \\right\\}\\\\\n", - "&=\n", - "\\max_\\pi \\left\\{ \\bar{R}(x,\\pi(x)) + \\gamma \\sum_{x'} P(x'|x, \\pi(x)) V^*(x') \\right\\}\\\\\n", - "&=\n", - "\\max_a \\left\\{ \\bar{R}(x,a) + \\gamma \\sum_{x'} P(x'|x, a) V^*(x') \\right\\} \\\\\n", - "\\end{aligned}\n", - "$$\n", - "\n", - "In the above, the second line follows immediately by substituting\n", - "our earlier expression for $V^\\pi$ into the maximization.\n", - "The third line is more interesting.\n", - "By applying the principle of optimality, we replace $V^\\pi(x')$ with $V^*(x')$.\n", - "Simply put, if remaining decisions from state $x'$ must constitute an optimal policy,\n", - "the corresponding value function at $x'$ will be the optimal value function for $x'$.\n", - "For the fourth line,\n", - "because the value function has been written in recursive form,\n", - "$\\pi$ is only applied to the current state (i.e., when $\\pi$ is evaluated in the optimization,\n", - "it always appears as $\\pi(x)$).\n", - "Therefore, we can write the optimization\n", - "as a maximization with respect to the *action* applied in the *current state*, rather than as a\n", - "maximization with respect to the entire policy $\\pi$!\n", - "\n", - "\n", - "This equation is known as the **Bellman equation**.\n", - "It is named after Richard Bellman, the mathematician\n", - "who discovered it, and it is one of the most important equations in all of computer science.\n", - "The Bellman equation has a very nice interpretation: \n", - "the optimal value function of a state is the maximum expected reward \n", - "*plus* the discounted expected value function when acting optimally in the future." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Using Bellman's equation, it is straightforward to compute the optimal policy from a given state.\n", - "\n", - "$$\n", - "\\pi^*(x) = \\arg\n", - "\\max_a \\left\\{ \\bar{R}(x,a) + \\gamma \\sum_{x'} P(x'|x, a) V^*(x') \\right\\} \n", - "$$\n", - "\n", - "This computation is performed so often that it is convenient to introduce the so-called $Q$-function\n", - "\n", - "$$\n", - "\\begin{aligned}\n", - "Q^*(x,a) \\doteq \\bar{R}(x,a) + \\gamma \\sum_{x'} P(x'|x, a) V^*(x') \n", - "\\end{aligned}\n", - "$$\n", - "\n", - "which allows us to write the optimal policy as\n", - "\n", - "\n", - "$$\n", - "\\pi^*(x) = \\arg\n", - "\\max_a Q^*(x,a)\n", - "$$\n", - "\n", - "We will see the $Q$-function again, when we discuss reinforcement learning.\n", - "The code for computing a Q-value from a value function is given below:" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": {}, - "outputs": [], - "source": [ - "def Q_value(value_function, x, a, gamma=0.9):\n", - " \"\"\"Calculate Q(x,a) from given value function\"\"\"\n", - " return T[x,a] @ (R[x,a] + gamma * value_function)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We will describe two methods for determining the optimal policy.\n", - "The first method, policy iteration, iteratively improves candidate policies,\n", - "ultimately converging to the optimal policy $\\pi^*$.\n", - "The second method, value iteration, iteratively improves an estimate of $V^*$,\n", - "ultimately converging to the optimal value function." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Policy Iteration" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "> By iteratively improving an estimate of the optimal policy, we eventually find $\\pi^*$.\n", - "\n", - "One way to compute an optimal policy is to start with an initial guess\n", - "at the optimal policy, and then iteratively improve our guess until no futher improvements are possible.\n", - "This is exactly the approach taken by **policy iteration**.\n", - "In particular, policy iteration generates a sequence of policies\n", - "$\\pi^0, \\pi^1, \\dots \\pi^n$, such that $\\pi^{i+1}$ is better than policy $\\pi^i$.\n", - "This process ends when no further improvement is possible, which\n", - "occurs when $\\pi^{i+1} = \\pi^i.$\n", - "\n", - "To improve the policy $\\pi^i$, we update the action chosen *for each state* by applying\n", - "Bellman's equation using $\\pi^i$ in place of $\\pi^*$.\n", - "The can be achieved with the following algorithm:\n", - "\n", - "Start with a random policy $\\pi^0$ and $i=0$, and repeat until convergence:\n", - "1. Compute the value function $V^{\\pi^i}$\n", - "2. Improve the policy for each state $x \\in {\\cal X}$ using the update rule: \n", - "\n", - "$$\n", - "\\pi^{i+1}(x) \\leftarrow\\arg \\max_a \\sum_{x'} \\{P(x'|x, a) \\{R(x, a, x') + \\gamma V^i(x')\\}\n", - "$$\n", - "\n", - "3. Increment $i$\n", - "\n", - "Notice that this algorithm has the side benefit of computing \n", - "successively better approximations to the value function at each iteration.\n", - "Because there are a finite number of actions that can be applied in each state, there are only finitely many ways to update\n", - "a policy. Therefore, we expect this policy iteration algorithm to converge in finite time." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We already know how to do step one, via `calculate_value_function`. The second step of the algorithm is easily\n", - "implemented with the following code:" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": {}, - "outputs": [], - "source": [ - "def update_policy(value_function):\n", - " \"\"\"Update policy given a value function\"\"\"\n", - " new_policy = [None for _ in range(5)]\n", - " for x, room in enumerate(vacuum.rooms):\n", - " Q_values = [Q_value(value_function, x, a) for a in range(4)]\n", - " new_policy[x] = np.argmax(Q_values)\n", - " return new_policy\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The whole policy iteration algorithm then simply iterates these until the policy no longer changes. If no initial policy is given, we can\n", - "start with a zero value function\n", - "$V^{\\pi^0}(x) = 0$ for all $x$:" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "metadata": {}, - "outputs": [], - "source": [ - "def policy_iteration(pi=None, max_iterations=100):\n", - " \"\"\"Do policy iteration, starting from policy `pi`.\"\"\"\n", - " for _ in range(max_iterations):\n", - " value_for_pi = calculate_value_function(pi) if pi is not None else np.zeros((5,))\n", - " new_policy = update_policy(value_for_pi)\n", - " if new_policy == pi:\n", - " return pi, value_for_pi\n", - " pi = new_policy\n", - " raise RuntimeError(\"No stable policy found after {max_iterations} iterations\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "On the other hand, if we have a guess for the initial policy, we can intialize\n", - "$\\pi^0$ accordingly.\n", - "For example, we can start with a not-so-smart `always_right` policy:" - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "metadata": {}, - "outputs": [], - "source": [ - "RIGHT = vacuum.action_space.index(\"R\")\n", - "\n", - "always_right = [RIGHT, RIGHT, RIGHT, RIGHT, RIGHT]\n" - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "['L', 'L', 'R', 'U', 'U']\n" - ] - } - ], - "source": [ - "optimal_policy, optimal_value_function = policy_iteration(always_right)\n", - "print([vacuum.action_space[a] for a in optimal_policy])\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Starting with the `always_right` policy, our policy iteration algorithm converges to an\n", - "intuitively pleasing policy.\n", - "In the dining room and kitchen we go `left`, in the office we go `right`, and in the hallway and dining room we go `up`.\n", - "This is significantly different from the `always_right` policy (which might be better named `almost_always_wrong`).\n", - "In fact, it is exactly the `reasonable_policy` that we created above.\n", - "We already knew that it should be pretty good at getting to the living room as fast as possible. 
In fact, it is optimal!\n", - "\n", - "We also print out the optimal value function below, which shows that if we are close to the living room the value function is very high, but it is a bit lower in the office in the dining room:" - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " Living Room : 100.00\n", - " Kitchen : 97.56\n", - " Office : 85.66\n", - " Hallway : 97.56\n", - " Dining Room : 85.66\n" - ] - } - ], - "source": [ - "for i,room in enumerate(vacuum.rooms):\n", - " print(f\" {room:12}: {optimal_value_function[i]:.2f}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The optimal policy is also obtained when we start without a policy, starting with a zero value function:" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "['L', 'L', 'R', 'U', 'U']\n" - ] - } - ], - "source": [ - "optimal_policy, _ = policy_iteration()\n", - "print([vacuum.action_space[a] for a in optimal_policy])\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Just to be sure, let us sanity check the solution above using the Monte Carlo estimate of the policy, which should give the same answer:" - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "V(Living Room) ~ 100.00\n", - "V(Kitchen) ~ 97.93\n", - "V(Office) ~ 83.87\n", - "V(Hallway) ~ 97.93\n", - "V(Dining Room) ~ 85.84\n" - ] - } - ], - "source": [ - "nr_samples = 100\n", - "horizon = 100\n", - "X = VARIABLES.discrete_series('X', range(1, horizon+1), vacuum.rooms)\n", - "A = VARIABLES.discrete_series('A', range(1, horizon), vacuum.action_space)\n", - "for x1, room in enumerate(vacuum.rooms):\n", - " V_x1 = approximate_value_function(x1, optimal_policy, nr_samples, horizon)\n", - " print(f\"V({room}) ~ {V_x1:.2f}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "These values are remarkably similar to the exact values computed above!" 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Value Iteration\n", - "\n", - "> Dynamic programming can be used to obtain the optimal value function.\n", - "\n", - "Recall Bellman's equation, which must hold for each state $x$.\n", - "\n", - "$$\n", - "V^*(x) = \\max_a \\left\\{ \\bar{R}(x,a) + \\gamma \\sum_{x'} P(x'|x, a) V^*(x') \\right\\} \n", - "$$\n", - "\n", - "Sadly, this is not a linear equation (the maximization operation is not linear), so we cannot solve this\n", - "equation for $V^*$ as a system of linear equations.\n", - "**Value iteration** approximates $V^*$ by constructing a sequence of estimates,\n", - "$V^0, V^1, \\dots , V^n$ that converges to $V^*$.\n", - "Starting with an initial guess, $V^0$, at each iteration we update\n", - "our approximation of the value function for each state by the update rule:\n", - "\n", - "$$\n", - "V^{i+1}(x) \\leftarrow \\max_a \\left\\{ \\bar{R}(x,a) + \\gamma \\sum_{x'} P(x'|x, a) V^i(x') \\right\\} \n", - "$$\n", - "\n", - "Notice that the right hand side includes two terms:\n", - "the expected reward (which we can compute exactly), and a term in $V^i$ (our current best guess at the value function).\n", - "Value iteration operates by iteratively using our *current best guess* of $V^*$ along with the *known* expected reward\n", - "to update the approximation.\n", - "Unlike policy iteration, we do not expect value iteration to converge to the exact result in finite time.\n", - "Therefore, we cannot use $V^{i+1} = V^i$ as our termination condition.\n", - "Instead, we often use a condition such as $|V^{i+1} - V^i| < \\epsilon$, for some small value of $\\epsilon$\n", - "as the termination condition.\n", - "\n", - "Finally, note that we can define Q values for the $k^{th}$ iteration as\n", - "\n", - "$$\n", - "Q(x, a; V^i) \\doteq \\bar{R}(x,a) + \\gamma \\sum_{x'} P(x'|x, a) V^i(x'),\n", - "$$\n", - "\n", - "and hence a value update is simply\n", - "\n", - "$$\n", - "V^{i+1}(x) \\leftarrow \\max_a Q(x, a; V^i).\n", - "$$\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In code, this is actually easier than policy iteration:" - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[100. 98. 90. 98. 90.]\n", - "[100. 97.64 86.76 97.64 86.76]\n", - "[100. 97.58 85.92 97.58 85.92]\n", - "[100. 97.56 85.72 97.56 85.72]\n", - "[100. 97.56 85.68 97.56 85.68]\n", - "[100. 97.56 85.67 97.56 85.67]\n", - "[100. 97.56 85.66 97.56 85.66]\n", - "[100. 97.56 85.66 97.56 85.66]\n", - "[100. 97.56 85.66 97.56 85.66]\n", - "[100. 97.56 85.66 97.56 85.66]\n" - ] - } - ], - "source": [ - "V_k = np.full((5,), 100)\n", - "for k in range(10):\n", - " Q_k = np.sum(T * (R + 0.9 * V_k), axis=2) # 5 x 4\n", - " V_k = np.max(Q_k, axis=1) # max over actions\n", - " print(np.round(V_k,2))\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Compare with optimal value function:" - ] - }, - { - "cell_type": "code", - "execution_count": 31, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[100. 
97.56 85.66 97.56 85.66]\n" - ] - } - ], - "source": [ - "print(np.round(optimal_value_function, 2))\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "And we can easily *extract* the optimal policy:" - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "policy = [0 0 1 2 0]\n", - "['L', 'L', 'R', 'U', 'L']\n" - ] - } - ], - "source": [ - "Q_k = np.sum(T * (R + 0.9 * V_k), axis=2)\n", - "pi_k = np.argmax(Q_k, axis=1)\n", - "print(f\"policy = {pi_k}\")\n", - "print([vacuum.action_space[a] for a in pi_k])\n" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -1906,13 +1422,11 @@ "- The value function, $V^\\pi:{\\cal X} \\rightarrow \\mathbb{R}$, associated with a given policy $\\pi$.\n", "- The use of policy rollouts to approximate the value function $V^\\pi$.\n", "- Exact calculation of the value function for a fixed policy.\n", - "- The optimal policy and value function, governed by the Bellman equation.\n", - "- Two algorithms to compute those: policy iteration and value iteration.\n", "\n", - "There are two important extensions to MDPs that are not covered in this section:\n", + "We'll leave the concepts of an optimal policy and how to compute it for the next section. There are two other extensions to MDPs that we did not cover:\n", "\n", "- Partially Observable MDPs (or POMDPS) are appropriate when we cannot directly observe the state.\n", - "- Reinforcement learning, a way to learn MDP policies from *data*. This will be covered next. " + "- Reinforcement learning, a way to learn MDP policies from *data*. This will be covered in the next section as well." ] } ], @@ -1938,7 +1452,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.18" + "version": "3.9.19" }, "latex_metadata": { "affiliation": "Georgia Institute of Technology", diff --git a/S36_vacuum_RL.ipynb b/S36_vacuum_RL.ipynb index 82cf5393..d395392c 100644 --- a/S36_vacuum_RL.ipynb +++ b/S36_vacuum_RL.ipynb @@ -71,7 +71,7 @@ "for assignment, value in conditional.enumerate():\n", " x, a, y = assignment[0], assignment[1], assignment[2]\n", " R[x, a, y] = 10.0 if y == vacuum.rooms.index(\"Living Room\") else 0.0\n", - " T[x, a, y] = value\n" + " T[x, a, y] = value" ] }, { @@ -81,27 +81,110 @@ "id": "nAvx4-UCNzt2" }, "source": [ - "# Reinforcement Learning\n", + "# Learning to Act Optimally\n", "\n", - "> We will talk about model-based and model-free learning.\n", + "> Learning to act optimally in a stochastic world.\n", "\n", - "\"Splash" + "\"Splash\n", + "\n", + "When a Markov Decision Process is fully specified we can *compute* an optimal policy.\n", + "Below we first define optimal value functions and examine its properties, most notably the Bellman equation.\n", + "We then discuss value iteration and policy iteration, two algorithms to calculate the optimal value function and its associated optimal policy. However, both these algorithms need a fully-defined MDP.\n", + "\n", + "When the MPD is not known in advance, however, we have to *learn* an optimal policy over time. There are two main approaches: model-based and model-free." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Exploring to get Data\n", + "## The Optimal Value Function\n", + "\n", + "> The optimal policy maximizes the value function.\n", + "\n", + "We now turn our attention to defining the *optimal* value function,\n", + "which can be used to construct the **optimal policy** $\\pi^*$.\n", + "From Section 3.5 we know how to compute the value function for an arbitrary policy $\\pi$:\n", + "\n", + "$$V^\\pi(x) = \\bar{R}(x,\\pi(x)) + \\gamma \\sum_{x'} P(x'|x, \\pi(x)) V^\\pi(x').$$\n", + "\n", "\n", - "> Where we gather experience." + "To begin, we recall the famous **principle of optimality**\n", + "as stated by Bellman in a\n", + "[1960 article in the IEEE Transactions on Automatic Control](https://doi.org/10.1109/TAC.1960.6429288):\n", + "\n", + "> *An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Let's adapt the `policy_rollout` code from the previous section to generate a whole lot of experiences of the form $(x,a,x',r)$." + "This principle enables a key step in the derivation of a recursive formulation for the optimal policy. Indeed, the **optimal value function** $V^*: {\\cal X} \\rightarrow \\mathbb{R}$\n", + "is merely the value function for the optimal policy.\n", + "This can be written mathematically as\n", + "\n", + "$$\n", + "\\begin{aligned}\n", + "V^*(x) &= \\max_\\pi V^{\\pi}(x) \\\\\n", + "&=\n", + "\\max_\\pi \\left\\{ \\bar{R}(x,\\pi(x)) + \\gamma \\sum_{x'} P(x'|x, \\pi(x)) V^\\pi(x') \\right\\}\\\\\n", + "&=\n", + "\\max_\\pi \\left\\{ \\bar{R}(x,\\pi(x)) + \\gamma \\sum_{x'} P(x'|x, \\pi(x)) V^*(x') \\right\\}\\\\\n", + "&=\n", + "\\max_a \\left\\{ \\bar{R}(x,a) + \\gamma \\sum_{x'} P(x'|x, a) V^*(x') \\right\\} \\\\\n", + "\\end{aligned}\n", + "$$\n", + "\n", + "In the above, the second line follows immediately by using the definition of $V^\\pi$ above. The third line is more interesting.\n", + "By applying the principle of optimality, we replace $V^\\pi(x')$ with $V^*(x')$.\n", + "Simply put, if remaining decisions from state $x'$ must constitute an optimal policy,\n", + "the corresponding value function at $x'$ will be the optimal value function for $x'$.\n", + "For the fourth line,\n", + "because the value function has been written in recursive form,\n", + "$\\pi$ is only applied to the current state (i.e., when $\\pi$ is evaluated in the optimization,\n", + "it always appears as $\\pi(x)$).\n", + "Therefore, we can write the optimization\n", + "as a maximization with respect to the *action* applied in the *current state*, rather than as a\n", + "maximization with respect to the entire policy $\\pi$!\n", + "\n", + "\n", + "This equation is known as the **Bellman equation**.\n", + "It is named after Richard Bellman, the mathematician\n", + "who discovered it, and it is one of the most important equations in all of computer science.\n", + "The Bellman equation has a very nice interpretation: \n", + "the optimal value function of a state is the maximum expected reward \n", + "*plus* the discounted expected value function when acting optimally in the future."
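To make the Bellman equation concrete before moving on, here is a minimal numerical sketch on a *hypothetical* two-state, two-action MDP (the arrays `T` and `Rbar`, the helper `value_of`, and all numbers below are made up for illustration and are not part of the vacuum example): it computes $V^\pi$ for every deterministic policy by solving the linear system from Section 3.5, takes the state-wise maximum, and checks that the result indeed satisfies the Bellman equation.

```python
import numpy as np
from itertools import product

# Hypothetical toy MDP: T[x, a, x'] = P(x'|x, a), Rbar[x, a] = expected reward.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
Rbar = np.array([[1.0, 0.0],
                 [0.0, 2.0]])
gamma, n = 0.9, 2

def value_of(pi):
    """Value function of a deterministic policy pi, via the linear system of Section 3.5."""
    P = np.array([T[x, pi[x]] for x in range(n)])     # P(x'|x, pi(x))
    r = np.array([Rbar[x, pi[x]] for x in range(n)])  # expected reward under pi
    return np.linalg.solve(np.eye(n) - gamma * P, r)

# V*(x) = max over all deterministic policies of V^pi(x):
V_star = np.max([value_of(pi) for pi in product(range(2), repeat=n)], axis=0)

# ...and V* satisfies the Bellman equation: max over actions of Rbar + gamma * expected V*.
print(np.allclose(V_star, np.max(Rbar + gamma * T @ V_star, axis=1)))  # True
```

Note how the maximization over entire policies collapses to a per-state maximization over actions, exactly as in the derivation above.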
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Action Values and the Optimal Policy\n", + "\n", + "Using Bellman's equation, it is straightforward to compute the optimal policy $\\pi^*$ from a given state $x$:\n", + "\n", + "$$\n", + "\\pi^*(x) = \\arg\n", + "\\max_a \\left\\{ \\bar{R}(x,a) + \\gamma \\sum_{x'} P(x'|x, a) V^*(x') \\right\\}.\n", + "$$\n", + "\n", + "This computation is performed so often that it is convenient to introduce the so-called **$Q$-function**, which is the value of being in state $x$ and taking action $a$, for a given value function $V$:\n", + "\n", + "$$\n", + "\\begin{aligned}\n", + "Q(x,a; V) \\doteq \\bar{R}(x,a) + \\gamma \\sum_{x'} P(x'|x, a) V(x') \n", + "\\end{aligned}\n", + "$$\n", + "\n", + "Another name for Q-values is **action values**, to be contrasted with the state values, i.e., the value function $V(x)$. They allow us to write the optimal policy $\\pi^*(x)$ simply as picking, for any given state $x$, the action $a$ with the highest action value $Q(x,a; V^*)$ computed from the optimal value function $V^*$:\n", + "\n", + "$$\n", + "\\pi^*(x) = \\arg \\max_a Q(x,a; V^*)\n", + "$$\n", + "\n", + "We will use $Q$-values in many of the algorithms in this section, and an efficient way to compute a Q-value from a value function is given below:" ] }, { @@ -109,6 +192,393 @@ "execution_count": 3, "metadata": {}, "outputs": [], + "source": [ + "def Q_value(V, x, a, gamma=0.9):\n", + " \"\"\"Calculate Q(x,a) from given value function\"\"\"\n", + " return T[x,a] @ (R[x,a] + gamma * V)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A very efficient way to compute all Q-values for all state-action pairs at once, using `numpy`, is\n", + "\n", + "```python\n", + "Q = np.sum(T * (R + gamma * V), axis=2)\n", + "```\n", + "\n", + "which we will also use below. It yields a matrix of size $|X| \\times |A|$." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise\n", + "\n", + "1. Try to understand the function `Q_value` above for calculating the Q-values. Use the notebook to investigate the calculation for specific values of $x$ and $a$.\n", + "\n", + "2. Similarly, try to understand the \"vectorized\" form above that yields the entire table of Q-values at once." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Policy Iteration\n", + "\n", + "> By iteratively improving an estimate of the optimal policy, we eventually find $\\pi^*$." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will describe two methods for determining the optimal policy.\n", + "The method we describe below, policy iteration, iteratively improves candidate policies, ultimately converging to the optimal policy $\\pi^*$.\n", + "The second method, value iteration, iteratively improves an estimate of $V^*$, ultimately converging to the optimal value function.\n", + "Both, however, need access to the MDP's transition probabilities and the reward function.\n", + "\n", + "**Policy Iteration** starts with an initial guess at the optimal policy, and then iteratively improves that guess until no further improvements are possible.\n", + "In particular, policy iteration generates a sequence of policies\n", + "$\\pi^0, \\pi^1, \\dots, \\pi^n$, such that $\\pi^{i+1}$ is better than policy $\\pi^i$.\n", + "This process ends when no further improvement is possible, which\n", + "occurs when $\\pi^{i+1} = \\pi^i.$\n", + "\n", + "To improve the policy $\\pi^i$, we update the action chosen *for each state* by applying\n", + "Bellman's equation using $\\pi^i$ in place of $\\pi^*$.\n", + "This can be achieved with the following algorithm:\n", + "\n", + "Start with a random policy $\\pi^0$ and $i=0$, and repeat until convergence:\n", + "1. Compute the value function $V^{\\pi^i}$\n", + "2. Improve the policy for each state $x \\in {\\cal X}$ using the update rule: \n", + "\n", + "$$\n", + "\\pi^{i+1}(x) \\leftarrow \\arg \\max_a Q(x,a; V^{\\pi^i})\n", + "$$\n", + "\n", + "3. Increment $i$\n", + "\n", + "Notice that this algorithm has the side benefit of computing \n", + "successively better approximations to the value function at each iteration.\n", + "Because there are a finite number of actions that can be applied in each state, there are only finitely many ways to update\n", + "a policy. Therefore, we expect this policy iteration algorithm to converge in finite time." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We already know how to do step (1) above, using `calculate_value_function`.\n", + "The second step of the algorithm is easily implemented with the following code:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "def update_policy(value_function):\n", + "    \"\"\"Update policy given a value function\"\"\"\n", + "    new_policy = [None for _ in range(5)]\n", + "    for x, room in enumerate(vacuum.rooms):\n", + "        Q_values = [Q_value(value_function, x, a) for a in range(4)]\n", + "        new_policy[x] = np.argmax(Q_values)\n", + "    return new_policy\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The whole policy iteration algorithm then simply iterates these two steps until the policy no longer changes.
If no initial policy is given, we can\n", + "start with a zero value function\n", + "$V^{\\pi^0}(x) = 0$ for all $x$:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "def policy_iteration(pi=None, max_iterations=100):\n", + "    \"\"\"Do policy iteration, starting from policy `pi`.\"\"\"\n", + "    for _ in range(max_iterations):\n", + "        value_for_pi = vacuum.calculate_value_function(R, T, pi) if pi is not None else np.zeros((5,))\n", + "        new_policy = update_policy(value_for_pi)\n", + "        if new_policy == pi:\n", + "            return pi, value_for_pi\n", + "        pi = new_policy\n", + "    raise RuntimeError(f\"No stable policy found after {max_iterations} iterations\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "On the other hand, if we have a guess for the initial policy, we can initialize\n", + "$\\pi^0$ accordingly.\n", + "For example, we can start with a not-so-smart `always_right` policy:" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "RIGHT = vacuum.action_space.index(\"R\")\n", + "\n", + "always_right = [RIGHT, RIGHT, RIGHT, RIGHT, RIGHT]\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['L', 'L', 'R', 'U', 'U']\n" + ] + } + ], + "source": [ + "optimal_policy, optimal_value_function = policy_iteration(always_right)\n", + "print([vacuum.action_space[a] for a in optimal_policy])\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Starting with the `always_right` policy, our policy iteration algorithm converges to an\n", + "intuitively pleasing policy.\n", + "In the living room and kitchen we go `left`, in the office we go `right`, and in the hallway and dining room we go `up`.\n", + "This is significantly different from the `always_right` policy (which might be better named `almost_always_wrong`).\n", + "In fact, it is exactly the `reasonable_policy` that we created in Section 3.5.\n", + "We already knew that it should be pretty good at getting to the living room as fast as possible.
In fact, it is optimal!\n", + "\n", + "We also print out the optimal value function below, which shows that if we are close to the living room the value function is very high, but it is a bit lower in the office and in the dining room:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " Living Room : 100.00\n", + " Kitchen     : 97.56\n", + " Office      : 85.66\n", + " Hallway     : 97.56\n", + " Dining Room : 85.66\n" + ] + } + ], + "source": [ + "for i,room in enumerate(vacuum.rooms):\n", + "    print(f\" {room:12}: {optimal_value_function[i]:.2f}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The optimal policy is also obtained when we start without an initial policy, starting from a zero value function:" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['L', 'L', 'R', 'U', 'U']\n" + ] + } + ], + "source": [ + "optimal_policy, _ = policy_iteration()\n", + "print([vacuum.action_space[a] for a in optimal_policy])\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Value Iteration\n", + "\n", + "> Dynamic programming can be used to obtain the optimal value function.\n", + "\n", + "Let us restate Bellman's equation, which must hold for each state $x$:\n", + "\n", + "$$\n", + "V^*(x) = \\max_a \\left\\{ \\bar{R}(x,a) + \\gamma \\sum_{x'} P(x'|x, a) V^*(x') \\right\\}.\n", + "$$\n", + "\n", + "If we have $n$ states, we also have $n$ equations, so it seems like we should be able to solve for the $n$ unknown values $V^*(x)$.\n", + "Sadly, they are not *linear* equations, as the maximization operation is not linear.
Hence, unlike the case when the policy is fixed, we cannot just solve a system of linear equations to recover $V^*$.\n", + "\n", + "**Value iteration** approximates $V^*$ by constructing a sequence of estimates,\n", + "$V^0, V^1, \\dots , V^n$ that converges to $V^*$.\n", + "Starting with an initial guess, $V^0$, at each iteration we update\n", + "our approximation of the value function for each state by the update rule:\n", + "\n", + "$$\n", + "V^{i+1}(x) \\leftarrow \\max_a \\left\\{ \\bar{R}(x,a) + \\gamma \\sum_{x'} P(x'|x, a) V^i(x') \\right\\} \n", + "$$\n", + "\n", + "Notice that the right hand side includes two terms:\n", + "the expected reward (which we can compute exactly), and a term in $V^i$ (our current best guess at the value function).\n", + "Value iteration operates by iteratively using our *current best guess* $V^i$ along with the *known* expected reward to update the approximation.\n", + "Unlike policy iteration, we do not expect value iteration to converge to the exact result in finite time.\n", + "Therefore, we cannot use $V^{i+1} = V^i$ as our termination condition.\n", + "Instead, we often use a condition such as $|V^{i+1} - V^i| < \\epsilon$, for some small value of $\\epsilon$\n", + "as the termination condition.\n", + "\n", + "Finally, note that we can once again use the Q-values to obtain a very concise description for the value update:\n", + "\n", + "$$\n", + "V^{i+1}(x) \\leftarrow \\max_a Q(x, a; V^i).\n", + "$$\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In code, this is actually easier than policy iteration, using the concise vectorized Q-table update we discussed above:" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[100. 98. 90. 98. 90.]\n", + "[100. 97.64 86.76 97.64 86.76]\n", + "[100. 97.58 85.92 97.58 85.92]\n", + "[100. 97.56 85.72 97.56 85.72]\n", + "[100. 97.56 85.68 97.56 85.68]\n", + "[100. 97.56 85.67 97.56 85.67]\n", + "[100. 97.56 85.66 97.56 85.66]\n", + "[100. 97.56 85.66 97.56 85.66]\n", + "[100. 97.56 85.66 97.56 85.66]\n", + "[100. 97.56 85.66 97.56 85.66]\n" + ] + } + ], + "source": [ + "V_k = np.full((5,), 100)\n", + "for k in range(10):\n", + " Q_k = np.sum(T * (R + 0.9 * V_k), axis=2) # 5 x 4\n", + " V_k = np.max(Q_k, axis=1) # max over actions\n", + " print(np.round(V_k,2))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Compare with optimal value function:" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[100. 97.56 85.66 97.56 85.66]\n" + ] + } + ], + "source": [ + "print(np.round(optimal_value_function, 2))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And we can easily *extract* the optimal policy:" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "policy = [0 0 1 2 0]\n", + "['L', 'L', 'R', 'U', 'L']\n" + ] + } + ], + "source": [ + "Q_k = np.sum(T * (R + 0.9 * V_k), axis=2)\n", + "pi_k = np.argmax(Q_k, axis=1)\n", + "print(f\"policy = {pi_k}\")\n", + "print([vacuum.action_space[a] for a in pi_k])\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise\n", + "\n", + "1. Above we initialized the value function at 100 everywhere. 
Examine the effect on convergence of initializing it differently.\n", + "\n", + "2. Implement a convergence criterion that stops the iterations after convergence." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Model-based Reinforcement Learning\n", + "\n", + "> Just explore, then solve the MDP.\n", + "\n", + "We can attempt to *learn* the MDP and then solve it. Both policy and value iteration require access to the transition probabilities $T$ and the reward function $R$. However, when faced with a new environment, we might not know how our robot will behave. And likewise, we might not have access to the reward function: how can we know in advance where we will find pots of gold?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One way to learn the MDP is to randomly explore. Let's adapt the `policy_rollout` code from the previous section to generate a whole lot of *experiences* of the form $(x,a,x',r)$." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], "source": [ "def explore_randomly(x1, horizon=N):\n", " \"\"\"Roll out states given a random policy, for given horizon.\"\"\"\n", @@ -127,34 +597,25 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Let us use it to create 499 experiences and show the first 10:" + "Let us use it to create 499 experiences and show the first 5:" ] }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[(0, 1, 0, 10.0), (0, 1, 1, 0.0), (1, 1, 1, 0.0), (1, 0, 1, 0.0), (1, 3, 4, 0.0), (4, 1, 4, 0.0), (4, 2, 1, 0.0), (1, 0, 0, 10.0), (0, 1, 1, 0.0), (1, 3, 1, 0.0)]\n" + "[(0, 3, 0, 10.0), (0, 2, 0, 10.0), (0, 3, 3, 0.0), (3, 3, 3, 0.0), (3, 2, 3, 0.0)]\n" ] } ], "source": [ "data = explore_randomly(vacuum.rooms.index(\"Living Room\"), horizon=500)\n", - "print(data[:10])\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Model-based Reinforcement Learning\n", - "\n", - "> Just count, then solve the MDP." + "print(data[:5])\n" ] }, { @@ -163,7 +624,7 @@ "source": [ "We can *estimate* the transition probabilities $T$ and reward table $R$ from the data, and then we can use the algorithms from before to calculate the value function and/or optimal policy.\n", "\n", - "The math is just a variant of what we saw in the learning section of the last chapter. The rewards is easiest:\n", + "The math is just a variant of what we saw in the learning section of the last chapter. 
The rewards are the easiest to estimate:\n", "\n", "$$\n", "R(x,a,x') \\approx \\frac{1}{N(x,a,x')} \\sum_{x,a,x'} r\n", @@ -182,7 +643,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 15, "metadata": {}, "outputs": [], "source": [ @@ -206,20 +667,20 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "array([[17., 22., 21., 21.],\n", - " [25., 26., 26., 30.],\n", - " [32., 25., 24., 22.],\n", - " [26., 22., 14., 18.],\n", - " [28., 42., 29., 29.]])" + "array([[30., 22., 23., 30.],\n", + " [23., 24., 26., 25.],\n", + " [13., 25., 19., 19.],\n", + " [23., 32., 32., 33.],\n", + " [28., 25., 30., 17.]])" ] }, - "execution_count": 6, + "execution_count": 16, "metadata": {}, "output_type": "execute_result" } @@ -239,7 +700,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 17, "metadata": {}, "outputs": [ { @@ -253,9 +714,9 @@ " [0.2 0. 0. 0.8 0. ]]\n", "estimate:\n", "[[1. 0. 0. 0. 0. ]\n", - " [0.23 0.77 0. 0. 0. ]\n", + " [0.32 0.68 0. 0. 0. ]\n", " [1. 0. 0. 0. 0. ]\n", - " [0.33 0. 0. 0.67 0. ]]\n" + " [0.2 0. 0. 0.8 0. ]]\n" ] } ], @@ -273,7 +734,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 18, "metadata": {}, "outputs": [ { @@ -302,7 +763,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In summary, learning in this context can simply be done by gathering lots of experiences, and estimating models for how the world behaves." + "In summary, learning in this context can simply be done by gathering lots of experiences, and estimating models for how the world behaves. After that, you can use either policy or value iteration to recover the optimal policy." ] }, { @@ -311,26 +772,20 @@ "source": [ "## Model-free Reinforcement Learning\n", "\n", - "> All you need is Q, la la la la." + "> All you need is Q." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "A different, model-free approach is **Q_learning**. In the above we tried to *model* the world by trying estimate the (large) transition and reward tables. However, remember from the previous section that there is a much smaller table of Q-values $Q(x,a)$ that also allow us to act optimally, because we have\n", - "\n", - "$$\n", - "\\pi^*(x) = \\arg \\max_a Q^*(x,a)\n", - "$$\n", - "\n", - "where the Q-values are defined as\n", + "A different, model-free approach is **Q_learning**. In the above we tried to *model* the world by trying estimate the (large) transition and reward tables. However, remember from the previous section that there is a much smaller table of Q-values $Q(x,a)$ that also allow us to act optimally. This is because we can calculate the optimal policy $\\pi^*(x)$ from the optimal Q-values $Q^*(x,a) \\doteq Q(x, a; V^*)$:\n", "\n", "$$\n", - "Q^*(x,a) \\doteq \\bar{R}(x,a) + \\gamma \\sum_{x'} P(x'|x, a) V^*(x')\n", + "\\pi^*(x) = \\arg \\max_a Q^*(x,a).\n", "$$\n", "\n", - "This begs the question whether we can simply learn the Q-values instead, which might be more *sample-efficient*, i.e., we would get more accurate values with less training data, as we have less quantities to estimate.\n", + "This begs the question whether we can simply learn the Q-values instead, which might be more *sample-efficient*. 
In other words, we would get more accurate values with less training data, as we have less quantities to estimate.\n", "\n", "To do this, remember that the Bellman equation can be written as \n", "\n", @@ -361,18 +816,18 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[[70.16615474 66.74151053 74.12119126 67.74557498]\n", - " [65.90656811 53.42049487 52.92272442 53.19868122]\n", - " [48.86928102 54.59959115 49.762931 51.03132376]\n", - " [56.57556793 52.43518011 68.36941774 61.11737138]\n", - " [59.89698619 53.19146259 53.26810274 53.01350551]]\n" + "[[86.40766254 77.92098421 84.74734752 78.34155364]\n", + " [77.06276123 72.92342935 72.46003579 66.34761267]\n", + " [46.22374231 72.00930878 47.51226021 54.96612765]\n", + " [58.46553085 67.75606372 85.08905827 73.9787325 ]\n", + " [74.71184623 63.98503874 72.80753072 66.4324432 ]]\n" ] } ], @@ -393,6 +848,47 @@ "source": [ "These values are not yet quite accurate, as you can ascertain yourself by changing the number of experiences above, but note that an optimal policy can be achieved before we even converge." ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exploration vs Exploitation\n", + "\n", + "The above assumed that we gather data by acting *randomly*, but that might be very inefficient. Indeed, we might be spending a lot of time - literally - bumping our heads into the walls. A better idea might be to act randomly at first (exploration), but as time progresses, spend more and more time refining the optimal policy by trying to act optimally (exploitation).\n", + "\n", + "Greedy action selection can lead to bad learning outcomes. We will use Q-learning as an example, but similar problems exist for other reinforcement learning methods. During Q-learning, upon reaching a state $x$, the **greedy action selection** method is to simply pick the action $a^*$ according to the *current* estimate of the Q-values:\n", + "\n", + "$$\n", + "a^* = \\arg \\max_a \\hat{Q}(x,a).\n", + "$$\n", + "\n", + "Unfortunately, this tends to often lead to Q-learning getting stuck in local minima of the policy search space: state-action pairs that might be more promising are never visited as their correct (higher) Q-values have not been estimated correctly, so they always get passed over.\n", + "\n", + "Epsilon-greedy or $\\epsilon$-greedy methods balance exploration with exploitation while learning. Instead of always choosing the best possible action according to the current estimate, we could simply choose an action at random a fraction of the time, say with probability $\\epsilon$. This is the **epsilon-greedy** method. Typical values for $\\epsilon$ are 0.01 or even 0.1, i.e., 10% of the time we choose to act randomly. Schemes also exist to decrease $\\epsilon$ over time.\n", + "\n", + "### Exercise\n", + "\n", + "Think about how to apply $\\epsilon$-greedy methods in the model-based reinforcement learning method we discussed above." 
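As a concrete sketch of $\epsilon$-greedy action selection during learning (the helper `epsilon_greedy` and the seeded NumPy generator below are our own illustrative choices, not code from this notebook), one could write:

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded for reproducibility

def epsilon_greedy(Q, x, epsilon=0.1):
    """With probability epsilon pick a random action (explore); otherwise
    pick the action with the highest current Q-value estimate for state x (exploit)."""
    num_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))
    return int(np.argmax(Q[x]))
```

Such a rule could replace the purely random action choice used when gathering experiences above, possibly with $\epsilon$ decreased over time so that later experiences come increasingly from near-optimal behavior.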
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Summary\n", + "\n", + "We discussed\n", + "\n", + "- The optimal policy and value function, governed by the Bellman equation.\n", + "- Two algorithms to compute those: policy iteration and value iteration.\n", + "- A model-based method to learn from experience.\n", + "- A model-free method, Q-learning, that updates the action values.\n", + "- Balancing exploitation and exploration.\n", + "\n", + "The field of reinforcement learning is much richer, and we will return to it several times throughout this book.\n", + "\n" + ] } ], "metadata": { @@ -420,7 +916,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.18" + "version": "3.9.19" }, "latex_metadata": { "affiliation": "Georgia Institute of Technology",