
3.5. Markov Decision Processes#

For controlled Markov chains, planning is the process of choosing the control inputs. This leads to the concept of Markov decision processes (MDPs).

Splash image with thinking vacuum robot

Previously, in Section 3.2, we described how conditional probability distributions can be used to model uncertainty in the effects of actions. We defined the belief state \(b_{k+1}\) to be the posterior probability \(P(X_{k+1} | a_1 \dots a_k)\) for the state \(X_{k+1}\) at time \(k+1\) given the sequence of actions \(a_1 \dots a_k\). In every example, the sequence of actions was predetermined, and we merely calculated probabilities associated with performing these actions from some specified initial state, governed by the probability distribution \(P(X_1)\).

In this section, we consider the problem of choosing which actions to execute. Making these decisions requires that we have quantitative criteria for evaluating actions and their effects. We encode these criteria using reward functions. Because the effects of actions are uncertain, it is not possible to know the reward that will be obtained by executing a specific action (or a sequence of actions). Thus, we will again invoke the concept of expectation to compute expected future benefits of applying actions.

As we work through these concepts, it will rapidly become apparent that executing a predefined sequence of actions is not the best way to face the future. Suppose, for example, that our vacuuming robot wants to move from the office to the living room. Because it is not possible to know how many times we should “move right” to arrive at the hallway, we might construct a sequence of actions to move right many times (to increase the probability of arriving at the dining room), then move up many times (to increase the probability of arriving at the kitchen), then move left many times (to increase the probability of arriving at the living room).

This kind of plan might make sense if the robot is unable to know its current location. For example, moving for a long time to the right should eventually bring the robot to the dining room with high probability, but if the robot moves to the right for a short time, it could remain in the office, arrive at the hallway, or arrive at the dining room.

In contrast, if the robot is able to know its location (using perception), it can act opportunistically when it reaches the hallway, and immediately move up. The key idea here is that the best action at any moment in time depends on the state in which the action is executed. The recipe of which action to execute in each state is called a policy, and defining policies and their associated value function is the main goal for this section.

3.5.1. Reward Functions#

Reward functions provide a quantitative way to evaluate the effects of actions.

How should our robot evaluate the effects of actions? One simple approach might be to specify the robot’s goal, and then determine how effective an action might be with respect to that goal, e.g., by using the conditional probability tables for the various actions. If, for example, the goal is to arrive to the living room, we could evaluate the probability that a particular action would achieve the goal.

This approach has several limitations. First, if the robot is not currently in a room that is adjacent to the living room, it is not clear how to measure the possible progress toward the living room for a specific action. If, for example, the robot is in the dining room, would it be better to move to the kitchen or to the hallway as an intermediate step toward the living room? Second, this approach does not allow the robot to consider benefits that could arise from visiting other states. For example, even though the living room is the goal, arriving to the kitchen might not be such a bad outcome, if for example, the kitchen floor is also in need of cleaning. Finally, it may be the case that the benefit of executing an action depends not only on the destination, but also on the state in which the action is executed. For example, entering the living room from the hallway might be less desirable than entering from the kitchen (if, for example, guests are likely to arrive in the entrance hallway).

Reward functions provide a very general solution that addresses all of these limitations. Let \(\cal X\) denote the set of states and \(\cal A\) the set of actions. The reward function, \(R : {\cal X} \times {\cal A} \times {\cal X} \rightarrow \mathbb{R}\), assigns a numeric reward to specific state transitions under specific actions. In particular, \(R(x_k, a_k, x_{k+1})\) is the reward obtained by arriving to state \(x_{k+1}\) from state \(x_k\) as a result of executing action \(a_k\).

In what follows we assume that the reward function is time-invariant, i.e., the benefit of moving up from the hallway to the living room does not change as time passes. This allows us to use the more compact notation \(R(x, a, x')\) to represent the reward obtained by arriving to (the next) state \(x'\) by executing action \(a\) in state \(x\). We will frequently use this notation in the remainder of the section.

This form of the reward function is very general, allowing us to encode context dependent rewards that depend on the state in which an action is executed. It is also common to specify rewards merely as a function of state. In this case, we denote by \(R(x)\) the reward obtained for arriving to state \(x\), regardless of the action that was applied, and regardless of the previous state. An example of such a reward function is given below, implemented as a python function. It simply returns a reward of 10 upon entering the living room:

def reward_function(state: str, action: str, next_state: str):
    """Reward that returns 10 upon entering the living room."""
    return 10.0 if next_state == "Living Room" else 0.0

print(reward_function("Kitchen", "L", "Living Room"))
print(reward_function("Kitchen", "L", "Kitchen"))
10.0
0.0

3.5.2. Expected Reward#

Maximizing immediate expected reward leads to greedy planning.

Because the effects of actions are uncertain, there is no way for the robot to deterministically predict the value of the next state, \(x'\), for a particular action \(a\); therefore, it is not possible to evaluate \(R(x,a,x')\) ahead of time. In this case, we can use the expected reward to evaluate the stochastic effects of the action. As we have seen in the previous chapter, this expectation provides a good estimate of the average reward that will be received if action \(a\) is executed from state \(x\) many times, even though it provides no guarantees about the reward that will be obtained at any specific moment in time.

We denote the expected reward for executing action \(a\) in state \(x\) by \(\bar{R}(x,a)\), and its value is obtained by evaluating the expectation

\[ \bar{R}(x,a) \doteq E[R(x,a,X')] = \sum_{x'} P(x'|x,a) R(x, a, x'). \]

Note that we use the upper case \(X'\) to indicate that the expectation is taken with respect to the random next state. Accordingly, the sum is over all possible next states, and the reward \(R(x, a, x')\) for each next state is weighted by the probability \(P(x'|x,a)\) of arriving to that state from state \(x\) by executing action \(a\).

To calculate the expected reward in code, we first create a multidimensional array \(R(x,a,x')\) that contains the rewards for every possible transition \(x,a \rightarrow x'\):

R = np.empty((5, 4, 5), float)
for x, room_x in enumerate(vacuum.rooms):
    for a, action in enumerate(vacuum.action_space):
        for y, room_y in enumerate(vacuum.rooms):
            R[x, a, y] = 10.0 if room_y == "Living Room" else 0.0

# For example, taking action "L" in "Kitchen":
R[vacuum.rooms.index("Kitchen"), vacuum.action_space.index("L")]
array([10.,  0.,  0.,  0.,  0.])

The result above is simple to interpret: if the robot starts in the kitchen and goes left, it obtains a reward of 10 if it arrives to the living room (index 0), and 0 otherwise.

Observe that \(R\) is a multidimensional array, with shape \((5, 4, 5)\), corresponding to the dimensionality of the arguments of the reward function \(R(x,a,x')\). We can do the same with the transition probabilities, creating a multidimensional array \(T\) with an entry for every possible transition \(x,a \rightarrow x'\), containing the probability \(P(x'|x,a)\):

# First create a conditional with ids 0, 1, and 2 for state, action, and next state:
conditional = gtsam.DiscreteConditional((2,5), [(0,5), (1,4)], vacuum.action_spec)

# Now use `enumerate` to iterate over all possible assignments to the conditional:
T = np.empty((5, 4, 5), float)
for assignment, P_y_given_ax in conditional.enumerate():
    x, a, y = assignment[0], assignment[1], assignment[2]
    T[x, a, y] = P_y_given_ax

# For example, taking action "L" in "Kitchen":
T[vacuum.rooms.index("Kitchen"), vacuum.action_space.index("L")]
array([0.8, 0.2, 0. , 0. , 0. ])

As expected, if we execute the action L when the robot is in the kitchen, we end up in the living room with 80% chance, while staying in place (the kitchen) 20% of the time.

Then, finally, calculating the expected reward \(\bar{R}(x,a)\) for a particular state-action pair \((x,a)\) becomes a simple dot product:

x, a = vacuum.rooms.index("Kitchen"), vacuum.action_space.index("L")
print(f"Expected reward (Kitchen, L) = {np.dot(T[x,a], R[x,a])}")
Expected reward (Kitchen, L) = 8.0

This makes total sense: if we stay in the kitchen, we get zero reward, but if we succeed in moving to the living room, we receive a reward of 10. The latter happens in only 80% of the cases, though, so the expected reward is 8.

Equipped with the definition of expected reward, we can introduce a first, greedy planning algorithm: Given the current state \(x_k\), execute the action that maximizes the expected reward:

\[ a_k^* = \arg \max_{a \in {\cal A}} E[ R(x_k, a, X_{k+1})] \]

For example, in the kitchen, we can calculate the expected reward for all actions:

x = vacuum.rooms.index("Kitchen")
for a in range(4):
    print(f"Expected reward ({vacuum.rooms[x]}, {vacuum.action_space[a]}) = {np.dot(T[x,a], R[x,a])}")
Expected reward (Kitchen, L) = 8.0
Expected reward (Kitchen, R) = 0.0
Expected reward (Kitchen, U) = 0.0
Expected reward (Kitchen, D) = 0.0

It turns out that \(8.0\) is the best we can do by greedily choosing a single action, as the expected immediate reward for any other action is zero. Hence, whenever we find ourselves in the Kitchen, a good course of action is to always go left!
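As a minimal sketch of this greedy rule in code (the helper name greedy_action is ours, not part of the code above; it assumes the arrays T and R defined earlier):

def greedy_action(x: int) -> int:
    """Return the index of the action with the highest expected immediate reward in state x."""
    expected_rewards = [T[x, a] @ R[x, a] for a in range(len(vacuum.action_space))]
    return int(np.argmax(expected_rewards))

# In the kitchen, the greedy action is "L":
print(vacuum.action_space[greedy_action(vacuum.rooms.index("Kitchen"))])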

3.5.3. Utility for a Defined Sequence of Actions#

Utility is the total discounted reward over finite or infinite horizons.

The greedy strategy above focuses on the immediate benefit of applying an action. For our robot, and for most robots operating in the real world, it is important to perform effectively over a prolonged period of time, not merely for the next instant. In order to maximize long-term benefit, instead of considering only the next-stage reward (as the greedy strategy does) we could consider the sum of all rewards achieved in the future. There are two immediate disadvantages to a direct implementation of this approach:

  • First, because the effects of actions are uncertain, the further we look into the future, the more uncertain we will be about the robot’s anticipated state \(X_{k+l}\). Therefore, it makes sense to discount rewards that might occur far into the future, i.e., for increasing values of \(l\).

  • Second, it is often convenient to reason with an infinite time horizon, e.g., consider the case when the robot will operate forever. While this is certainly not a realistic expectation, reasoning over an infinite time horizon often simplifies the mathematical complexities of planning into the future. If we merely compute the sum of all future rewards, this sum will diverge to infinity as \(l\) approaches infinity.

We can deal with both of these disadvantages by multiplying the reward at time step \(k + l\) by a discounting factor \(\gamma^l\), with \(0 < \gamma \leq 1\). We refer to \(\gamma\) as the discount factor and to the term \(\gamma^l R(x_{k+l},a_{k+l},x_{k+l+1})\) as a discounted reward. Note that for \(\gamma = 1\), there is no discount, and all future rewards are treated with equal weight.

We now use this to define a more general utility function. Suppose the robot executes a sequence of actions, \(a_1, \dots, a_n\), starting in state \(X_1=x_1\), and passes through state sequence \(x_1,x_2,x_3\dots x_{n+1}\). We define the utility function \(U: {\cal A}^n \times {\cal X}^{n+1} \rightarrow \mathbb{R}\) as

\[ U(a_1, \dots, a_n, x_1, \dots x_{n+1}) = R(x_1,a_1, x_2) + \gamma R(x_2, a_2, x_3) + \dots + \gamma^{n-1} R(x_{n}, a_{n}, x_{n+1})\]

We can write this more compactly as the summation

\[ U(a_1, \dots, a_n, x_1, \dots x_{n+1}) = \sum_{k=1}^{n} \gamma^{k-1} R(x_k, a_k, x_{k+1}) \]

For \(n < \infty\), we refer to this as a finite-horizon utility function. Note that the influence of future rewards decreases exponentially with the time horizon, and that the use of \(\gamma^{k-1}\) with \(\gamma < 1\) ensures that the sum will converge in the infinite-horizon case (under mild assumptions about \(R\), e.g., bounded rewards).
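To illustrate the formula, here is a small sketch (a hypothetical helper, not used elsewhere in this section) that computes the discounted sum for a given list of per-step rewards; the rollout code later in this section computes the same quantity from sampled trajectories:

def utility_of_rewards(rewards, gamma=0.9):
    """Sum of discounted rewards, weighting the reward at step k (1-based) by gamma^(k-1)."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Entering the living room at step 2 and staying there one more step:
print(utility_of_rewards([0.0, 10.0, 10.0]))  # 0 + 0.9*10 + 0.81*10 = 17.1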

The expression above is defined for a specific sequence of actions and a specific sequence of states. When planning, as we have noted above, we are unable to know the future states with certainty. We can, again, deal with this difficulty by computing the expected utility for a given sequence of actions, \(E[U(a_1, \dots, a_n, X_1, \dots X_{n+1})]\). We can now formulate a slightly more sophisticated version of our planning problem: choose the sequence of actions \(a_{1:n}\) that maximizes the expected utility:

\[ a_1^*, \dots a_n^* = \arg \max_{a_1 \dots a_n \in {\cal A}^n} E[U(a_1, \dots, a_n, X_1, \dots X_{n+1})] \]

As formulated above, this problem could be solved by merely enumerating every possible action sequence, and choosing the sequence that maximizes the expectation. Obviously this is not a computationally tractable approach. Not only does the number of possible action sequences grow exponentially with the time horizon \(n\), but computing the expectation for a specific action sequence is also computationally heavy. We can, however, approximate this optimization process using the concept of rollouts, as we will now see.
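For instance, with the four actions of our vacuuming robot and a modest horizon of \(n = 10\), there are already

\[ |{\cal A}|^n = 4^{10} = 1{,}048{,}576 \]

candidate action sequences, and evaluating the expectation for each of them requires summing over a number of state sequences that also grows exponentially with \(n\).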

3.5.4. Approximating Expected Utility Using Control Tape Rollouts#

Rollin’, rollin’, rollin’…

Consider the computation required to determine the expected utility for a sequence of only two actions:

\[ E[U(a_1,a_2, X_1, X_2, X_3)] = E[R(X_1, a_1, X_2) + \gamma R(X_2, a_2, X_3)] \]

Computing the expectation requires summing over all combinations of values for states \(X_1, X_2, X_3\). Clearly, as \(n\) becomes large, this is not a tractable computation. We can approximate this computation by recalling the relationship between the expected value for a probability distribution and the average value over many realizations of the underlying random process. Namely, as we have seen before, the expected value of a random variable (in this case the value of the utility function) corresponds to what we expect to observe as the average of that random variable over many trials. This immediately suggests an approximation algorithm:

  • Generate many sample trajectories for the action sequence \(a_1, \dots, a_n\);

  • compute the average of the discounted rewards over these sample trajectories.

A specific sequence of actions, \(a_1, \dots, a_n\), is sometimes called a control tape. The process of computing a sample trajectory and evaluating the corresponding utility is called a rollout for the control tape. Each rollout produces one sample trajectory, which gives one corresponding utility value. As an example, suppose the robot starts in the office and executes the sequence \(a_1, a_2, a_3, a_4\) = R,U,L,L (i.e., the robot executes actions move right, move up, move left, move left) in an attempt to reach the living room.

Because we take care to specify the Markov chain in reverse topological order (in the code below), we can use the GTSAM method sample to do the rollouts for us:

markovChain = gtsam.DiscreteBayesNet()
for k in reversed(range(1,N)):
    markovChain.add(X[k+1], [X[k], A[k]], vacuum.action_spec)
#| caption: A Markov chain with 5 states and 4 actions. The actions are given.
#| label: fig:markov_chain_5
show(markovChain, hints={"A":2, "X":1, "Z":0}, boxes={A[k][0] for k in range(1,N)})
def perform_rollout(x1, actions):
    """Roll out states given actions as a dictionary."""
    given_values = actions.copy()
    given_values[X[1]] = x1
    given = VARIABLES.assignment(given_values)
    return markovChain.sample(given)

To execute a specific rollout for the control tape R,U,L,L, we can use the following code:

actions = {A[1]:"R", A[2]:"U", A[3]:"L", A[4]:"L"}
rollout = perform_rollout("Office", actions)
pretty(rollout)
Variable      value
A1            R
A2            U
A3            L
A4            L
X1            Office
X2            Office
X3            Office
X4            Office
X5            Office

It is important to remember that any individual rollout corresponds to a sample trajectory from a stochastic process. If you execute the above code several times, you should observe that the robot does not always arrive to the living room.

The code below executes the rollout and computes the corresponding utility for the sample trajectory. Sample trajectories that do not arrive to the living room will have zero utility.

def reward(rollout, k):
    """Return state, action, next_state triple for given rollout at time k."""
    state = rollout[X[k][0]]
    action = rollout[A[k][0]]
    next_state = rollout[X[k+1][0]]
    return R[state, action, next_state]

def rollout_reward(rollout, horizon=N, gamma=1.0):
    """Calculate reward for a given rollout"""
    discounted_rewards = [gamma**(k-1) * reward(rollout,k) for k in range(1,horizon)]
    return sum(discounted_rewards)

print(rollout_reward(rollout))
0.0

The following code executes 20 rollouts for the action sequence R,U,L,L, and prints the utility for each sample trajectory. You can see that in many cases the robot fails to arrive to the living room (thus earning zero utility)! This is because each of the first two actions has a 0.2 probability of failure, and if either of them fails, the robot is unable to reach the living room using this control tape.

def control_tape_reward(x1, actions):
    """Calculate reward given a dictionary of actions"""
    rollout = perform_rollout(x1, actions)
    return rollout_reward(rollout)

print([control_tape_reward("Office", actions) for i in range(20)])
[0.0, 30.0, 30.0, 30.0, 30.0, 30.0, 30.0, 30.0, 0.0, 30.0, 30.0, 0.0, 30.0, 30.0, 30.0, 30.0, 30.0, 30.0, 30.0, 0.0]

Finally, the expected utility of the action sequence R,U,L,L is approximated by simply averaging over all 20 rollouts:

np.mean([control_tape_reward("Office", actions) for i in range(20)])
19.5
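As a quick consistency check on these numbers: the tape earns a utility of 30 exactly when the first two actions succeed, which happens with probability \(0.8 \times 0.8\), so

\[ E[U] \approx 0.8 \times 0.8 \times 30 = 19.2, \]

which agrees well with the empirical average of 19.5.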

3.5.4.1. Exercises:#

  • Execute both code segments above multiple times and observe the effect.

  • The rewards above seem to be always either zero or 30. Why is that?

3.5.5. Policies#

A policy is a function that specifies which action to take in each state.

In the example above, the control tape rollout for the action sequence R,U,L,L failed many times. The reason for this failure may seem obvious: the robot executed the same sequence of actions, regardless of the state trajectory. If the robot had been able to choose its actions based on the current state, it would have chosen to move right until it reached the hallway, at which time it would have chosen to move up until reaching the living room. Clearly, the robot could make better choices if it were allowed to dynamically choose which action to execute based on its current state.

A policy, \(\pi: {\cal X} \rightarrow {\cal A}\) is a mapping from states to actions. Specifying a policy instead of a control tape has the potential to significantly improve the robot’s performance, by adapting the action sequence based on the actual state trajectory that occurs during execution.

The code below defines a fairly intuitive policy. If in the office, move right. If in the dining room or kitchen, move left. If in the hallway or living room, move up. Note that we implement policies as a simple list in Python, so pi[0] is the action taken in the state with index 0 (the Living Room). The policy stores indices into action_space, which keeps the code that follows efficient:

RIGHT = vacuum.action_space.index("R")
LEFT  = vacuum.action_space.index("L")
UP    = vacuum.action_space.index("U")
DOWN  = vacuum.action_space.index("D")

reasonable_policy = [UP, LEFT, RIGHT, UP, LEFT]

Once we have a given policy, \(\pi\), we can compute a policy rollout in a manner analogous to computing control tape rollouts described above. In particular, at each state, instead of sampling from the distribution \(P(X_{k+1} | a_k, x_k)\) we sample from the distribution \(P(X_{k+1} | \pi(x_k), x_k)\). In other words, instead of simulating a pre-specified action \(a_k\), we choose \(a_k = \pi(x_k)\).

Here is a function that computes a rollout given a policy, rather than a control tape:

def policy_rollout(x1, pi, horizon=N):
    """Roll out states given a policy pi, for given horizon."""
    rollout = gtsam.DiscreteValues()
    x = x1
    for k in range(1, horizon):
        a = pi[x]
        rollout[X[k][0]] = x
        rollout[A[k][0]] = a
        next_state_distribution = gtsam.DiscreteDistribution(X[k+1], T[x, a])
        x = next_state_distribution.sample()
    rollout[X[horizon][0]] = x
    return rollout

pretty(policy_rollout(vacuum.rooms.index("Office"), reasonable_policy, horizon=5))
Variable      value
A1            R
A2            U
A3            U
A4            U
X1            Office
X2            Hallway
X3            Living Room
X4            Living Room
X5            Living Room

Our policy-based rollouts seem to be much better!

In summary, we proposed two methods to plan an action sequence, a greedy approach that maximizes the expected reward for executing a single action, and an optimization-based method that chooses a fixed action sequence to maximize the expected utility of the action sequence. We described how to implement the computations required for the latter approach using control tape rollouts.

With the introduction of policies, planning is reduced to the search for an appropriate policy. In the best case, this policy would maximize the expected utility. Because policies are state-dependent, the combinatorics of enumerating all possible policies precludes any approach that attempts to explicitly enumerate candidate policies. Instead, we will focus on methods that evaluate the utility associated with a policy. As we will now see, this is somewhat more complicated than evaluating the utility for a fixed sequence of actions.

3.5.6. The Value Function (for a given policy)#

The value function \(V^\pi\) measures the expected utility from each state, under a given policy \(\pi\).

We would now like to characterize the quality of any given policy. Above, we defined the utility for a specific sequence of \(n\) actions as

\[ U(a_1, \dots, a_n, x_1, \dots x_{n+1}) = R(x_1,a_1, x_2) + \gamma R(x_2, a_2, x_3) + \dots \gamma^{n-1} R(x_{n}, a_{n}, x_{n+1}) \]

and used the expected utility, \(E[U(a_1, \dots, a_n, X_1, \dots X_{n+1})]\) as a quantitative measure of the efficacy of the specific action sequence. We can apply this same type of reasoning for a policy, \(\pi\). When evaluating policies, it is typical to use a discounted reward over an infinite time horizon. In this case, we define the value function \(V^\pi:{\cal X} \rightarrow \mathbb{R}\) for policy \(\pi\) as

\[ V^\pi(x_1) \doteq E [R(x_1, \pi(x_1), X_2) + \gamma R(X_2, \pi(X_2), X_3) + \gamma^2 R(X_3, \pi(X_3), X_4) + \dots] \]

where the expectation is taken over the possible values of the states \(X_2, X_3 \dots\). Note that the policy \(\pi\) is deterministic, but that \(\pi(X_i)\) is the random action that results from applying the deterministic policy to the stochastic state \(X_i\).

The value function for a policy can be written in a convenient recursive form, which we obtain with the following derivation.

\[\begin{split} \begin{aligned} V^\pi(x_1) &= E [R(x_1, \pi(x_1), X_2) + \gamma R(X_2, \pi(X_2), X_3) + \gamma^2 R(X_3, \pi(X_3), X_4) + \dots] \\ &= \sum_{x_2} P(x_2|x_1, \pi(x_1)) R(x_1, \pi(x_1), x_2) \\ & + \sum_{x_2} P(x_2|x_1, \pi(x_1)) E [\gamma R(x_2, \pi(x_2), X_3) + \gamma^2 R(X_3, \pi(X_3), X_4) + \dots]\\ &= \sum_{x_2} P(x_2|x_1, \pi(x_1)) \{R(x_1, \pi(x_1), x_2) + \gamma V^\pi(x_2)\} \end{aligned} \end{split}\]

where the second line is obtained by explicitly writing the summation for the expectation taken with respect to the random state \(X_2\), and the third line is obtained by noticing that \(E [\gamma R(x_2, \pi(x_2), X_3) + \gamma^2 R(X_3, \pi(X_3), X_4) + \dots] = \gamma V^\pi(x_2)\).

We can apply the distributivity property to this expression to obtain

\[\begin{split} \begin{aligned} V^\pi(x_1) &= \sum_{x_2} P(x_2|x_1, \pi(x_1)) R(x_1, \pi(x_1), x_2) + \gamma \sum_{x_2} P(x_2|x_1, \pi(x_1)) V^\pi(x_2) \\ &= \bar{R}(x_1,\pi(x_1)) + \gamma \sum_{x_2} P(x_2|x_1, \pi(x_1)) V^\pi(x_2) \end{aligned} \end{split}\]

in which the term \(\bar{R}(x_1,\pi(x_1))\) is the expected reward for applying action \(a = \pi(x_1)\) in state \(x_1\), which we already know how to compute from the reward function \(R\) and transition probabilities \(T\). Now, by substituting \(x\) for \(x_1\) and \(x'\) for \(x_{2}\), we can generalize this expression to apply to the state \(x\) at any arbitrary time:

\[V^\pi(x) = \bar{R}(x,\pi(x)) + \gamma \sum_{x'} P(x'|x, \pi(x)) V^\pi(x')\]

This has a very nice interpretation: the value of a state under a given policy is the expected reward \(\bar{R}(x,\pi(x))\) under that policy, plus the discounted expected value of the next state.
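To make the recursion concrete, here is a minimal sketch (the helper name policy_backup is ours) that evaluates the right-hand side for every state, using the arrays T and R and a policy pi as defined earlier, together with a candidate value vector V:

def policy_backup(V, pi, gamma=0.9):
    """For each state x: expected reward under pi plus discounted expected value of the next state."""
    return np.array([T[x, pi[x]] @ R[x, pi[x]] + gamma * T[x, pi[x]] @ V for x in range(5)])

Starting from any initial guess (e.g., np.zeros(5)) and applying this backup repeatedly converges to \(V^\pi\) whenever \(\gamma < 1\); the next two subsections instead approximate \(V^\pi\) by sampling, and then compute it exactly by solving a linear system.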

3.5.6.1. Thought exercise#

Without a discount factor, we would always be hopeful that if we take one more step, we will find a pot of gold. Reflect on what various values for the discount factor mean in real life.

3.5.7. Approximating the Value Function#

We can use rollouts to approximate the value function as well.

Just as we approximated the expected utility of an action sequence using control tape rollouts, we can approximate the value function by sampling over a number of policy rollouts. This process provides a Monte Carlo approximation of the value function for a given policy. Of course we cannot apply a policy rollout over an infinite time horizon, so we apply the rollout only to some finite number of time steps, say \(N_{\rm{ro}}\).

Note that this is an approximation in two ways:

  1. We approximate the expectation by averaging over \(N_{\rm{Samples}}\) sample trajectories.

  2. We only roll out for \(N_{\rm{ro}}\) steps. We can improve the approximation by increasing the number of samples or by increasing the horizon, but at a cost linear in their product, i.e., \(O(N_{\rm{ro}} N_{\rm{Samples}})\).

You might worry that evaluating sample paths of length \(N_{\rm{ro}}\) could lead to significant errors in our approximation. In fact, it is easy to determine an upper bound on this error. Since \(R(x, a, x') \) is finite, we know that there is some upper bound \(R_{\rm{max}}\) such that \(R(x, a, x') < R_{\rm{max}}\), for all possible \(x,a,x'\). We can use this fact to find a bound on \(V^\pi\):

\[V^\pi(x_k) \leq \max_{x_k, x_{k+1}, \dots} \sum_{i=0}^\infty \gamma ^i R(x_{k+i},\pi(x_{k+i}), x_{k+i+1}) \leq \sum_{i=0}^\infty \gamma ^i R_{\rm{max}} = \frac{R_{\rm{max}}}{1 - \gamma}\]

in which the final equality holds for \(0 < \gamma < 1\). The same reasoning bounds the error introduced by truncating the rollout, i.e., the contribution of all rewards beyond step \(N_{\rm{ro}}\) that the finite sum

\[ \sum_{i=0}^{N_{\rm{ro}}}\gamma^i R(x_{k+i},\pi(x_{k+i}), x_{k+i+1})\]

ignores: that contribution is at most \(\sum_{i=N_{\rm{ro}}+1}^{\infty} \gamma^i R_{\rm{max}} = \gamma^{N_{\rm{ro}}+1} R_{\rm{max}} / (1-\gamma)\), which decays exponentially as the rollout horizon grows.
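As a rough numeric illustration of these bounds (this snippet is ours, assuming \(R_{\rm{max}} = 10\) and \(\gamma = 0.9\) as in our example):

R_max, gamma, N_ro = 10.0, 0.9, 50
print(f"{R_max / (1 - gamma):.2f}")                      # upper bound on V^pi: 100.00
print(f"{gamma**(N_ro + 1) * R_max / (1 - gamma):.2f}")  # truncation error bound: 0.46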

Because we have functions to sample a policy rollout and to calculate the value of a rollout, the code is simple enough:

def approximate_value_function(x1, pi, nr_samples=10, horizon=N, gamma=0.9):
    """ Approximate the value function by performing `nr_samples` rollouts
        starting from x1, and averaging the result.
    """
    rollouts = [policy_rollout(x1, pi, horizon) for _ in range(nr_samples)]
    rewards = [rollout_reward(rollout, horizon, gamma) for rollout in rollouts]
    return sum(rewards)/nr_samples

nr_samples=10
for x1, room in enumerate(vacuum.rooms):
    V_x1 = approximate_value_function(x1, reasonable_policy, nr_samples)
    print(f"V({room}) ~ {V_x1:.2f}")
V(Living Room) ~ 34.39
V(Kitchen) ~ 29.49
V(Office) ~ 16.00
V(Hallway) ~ 32.39
V(Dining Room) ~ 19.25

The above calculation was done with a horizon of \(N=5\) and 10 samples. Of course, we can use more samples and a longer horizon to obtain a much more accurate estimate:

nr_samples = 100
horizon = 50
X = VARIABLES.discrete_series('X', range(1, horizon+1), vacuum.rooms)
A = VARIABLES.discrete_series('A', range(1, horizon), vacuum.action_space)
for x1, room in enumerate(vacuum.rooms):
    V_x1 = approximate_value_function(x1, reasonable_policy, nr_samples, horizon)
    print(f"V({room}) ~ {V_x1:.2f}")
V(Living Room) ~ 99.43
V(Kitchen) ~ 97.64
V(Office) ~ 84.74
V(Hallway) ~ 97.37
V(Dining Room) ~ 85.89

3.5.8. Computing The Value Function*#

For any fixed policy we can compute the value function exactly.

We can compute the exact value of \(V^\pi\) by solving a system of linear equations. Recall our recursive definition of the value function:

\[V^\pi(x) = \bar{R}(x,\pi(x)) + \gamma \sum_{x'} P(x'|x, \pi(x)) V^\pi(x')\]

This equation holds for every possible value \(x\) for the state. Hence, if there are \(n\) possible states, we obtain \(n\) linear equations in \(n\) unknowns. Each of these \(n\) equations is obtained by evaluating \(V^\pi(x)\) for a specific value of \(x\). Collecting the unknown \(V^\pi\) terms on the left hand side and the known \(\bar{R}(x,\pi(x))\) terms on the right hand side, we obtain

\[V^\pi(x) - \gamma \sum_{x'} P(x'|x, \pi(x)) V^\pi(x') = \bar{R}(x,\pi(x)).\]

To make this explicit yet concise for our vacuum cleaning robot example, let us define the scalar \(T^\pi_{xy}\doteq P(y|x,\pi(x))\) as the transition probability from state \(x\) to state \(y\) under policy \(\pi\). In addition, we use the abbreviations L,K,O,H, and D for the rooms, and use the shorthand \(V^\pi_x\doteq V^\pi(x)\) for the value of state \(x\) under policy \(\pi\). Using this notation, we can evaluate the above expression. For \(x = L\), we obtain

\[ V^\pi_L - \gamma \sum_{x'\in \{L,K,O,H,D\}} T^\pi_{Lx'} V^\pi_{x'} = \bar{R}(L,\pi(L)) \]

or, after some algebra:

\[ (1 - \gamma T^\pi_{LL}) V^\pi_{L} - \gamma T^\pi_{LK} V^\pi_{K} - \gamma T^\pi_{LO} V^\pi_{O} - \gamma T^\pi_{LH} V^\pi_{H} - \gamma T^\pi_{LD} V^\pi_{D} = \bar{R}(L,\pi(L)) \]

If we apply this same process for each of the five rooms, we obtain the following five equations:

\[\begin{split} \begin{aligned} (1 - \gamma T^\pi_{LL}) V^\pi_{L} - \gamma T^\pi_{LK} V^\pi_{K} - \gamma T^\pi_{LO} V^\pi_{O} - \gamma T^\pi_{LH} V^\pi_{H} - \gamma T^\pi_{LD} V^\pi_{D} &= \bar{R}(L,\pi(L)) \\ - \gamma T^\pi_{KL} V^\pi_{L} + (1 - \gamma T^\pi_{KK}) V^\pi_{K} - \gamma T^\pi_{KO} V^\pi_{O} - \gamma T^\pi_{KH} V^\pi_{H} - \gamma T^\pi_{KD} V^\pi_{D} &= \bar{R}(K,\pi(K)) \\ - \gamma T^\pi_{OL} V^\pi_{L} - \gamma T^\pi_{OK} V^\pi_{K} + (1 - \gamma T^\pi_{OO}) V^\pi_{O} - \gamma T^\pi_{OH} V^\pi_{H} - \gamma T^\pi_{OD} V^\pi_{D} &= \bar{R}(O,\pi(O)) \\ - \gamma T^\pi_{HL} V^\pi_{L} - \gamma T^\pi_{HK} V^\pi_{K} - \gamma T^\pi_{HO} V^\pi_{O} + (1 - \gamma T^\pi_{HH}) V^\pi_{H} - \gamma T^\pi_{HD} V^\pi_{D} &= \bar{R}(H,\pi(H)) \\ - \gamma T^\pi_{DL} V^\pi_{L} - \gamma T^\pi_{DK} V^\pi_{K} - \gamma T^\pi_{DO} V^\pi_{O} - \gamma T^\pi_{DH} V^\pi_{H} + (1 - \gamma T^\pi_{DD}) V^\pi_{D} &= \bar{R}(D,\pi(D)) \end{aligned} \end{split}\]

The unknowns in these equations are \(V^\pi_{L}, V^\pi_{K}, V^\pi_{O}, V^\pi_{H}, V^\pi_{D}\). All of the other terms are either transition probabilities or expected rewards, whose values are either given, or can easily be computed.

In code, this becomes:

def calculate_value_system(pi, gamma=0.9):
    """Calculate A, b matrix of linear system for value computation."""
    b = np.empty((5,), float)
    AA = np.empty((5,5), float)
    for x, room in enumerate(vacuum.rooms):
        a = pi[x] # action under policy
        b[x] = T[x,a] @ R[x,a] # expected reward under policy pi
        AA[x] = -gamma * T[x,a]
        AA[x,x] += 1
    return AA,b

def calculate_value_function(pi, gamma=0.9):
    """Calculate value function for given policy"""
    AA, b = calculate_value_system(pi, gamma)
    return np.linalg.solve(AA,b)

Here is the linear system for the policy reasonable_policy:

AA, b = calculate_value_system(reasonable_policy)
print(f"A =\n{AA}")
print(f"b = {b}")
A =
[[ 0.1  -0.   -0.   -0.   -0.  ]
 [-0.72  0.82 -0.   -0.   -0.  ]
 [-0.   -0.    0.82 -0.72 -0.  ]
 [-0.72 -0.   -0.    0.82 -0.  ]
 [-0.   -0.   -0.   -0.72  0.82]]
b = [10.  8.  0.  8.  0.]

When we calculate the value function \(V^\pi\) under the policy reasonable_policy we see that our exact value function was well approximated by the Monte Carlo estimate above:

value_for_pi = calculate_value_function(reasonable_policy)
print("V(reasonable_policy):")
for i,room in enumerate(vacuum.rooms):
    print(f"  {room:12}: {value_for_pi[i]:.2f}")
V(reasonable_policy):
  Living Room : 100.00
  Kitchen     : 97.56
  Office      : 85.66
  Hallway     : 97.56
  Dining Room : 85.66
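As a sanity check (this snippet is ours), we can verify that the solution of the linear system is a fixed point of the recursion from Section 3.5.6, using the arrays T and R and the value vector value_for_pi computed above:

gamma = 0.9
for x in range(5):
    a = reasonable_policy[x]
    backup = T[x, a] @ R[x, a] + gamma * T[x, a] @ value_for_pi
    print(f"{vacuum.rooms[x]:12}: {backup:.2f} vs {value_for_pi[x]:.2f}")

Each row should print two (nearly) identical numbers, confirming that \(V^\pi(x) = \bar{R}(x,\pi(x)) + \gamma \sum_{x'} P(x'|x, \pi(x)) V^\pi(x')\) holds for the computed solution.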

3.5.8.1. Exercise#

Why is the value function in the living room \(100\) and not the immediate reward \(10\)?

3.5.9. Summary#

Markov Decision Processes, or MDPs, can be used to model decision making in a stochastic environment, albeit with complete knowledge of the state. This is a rich subject, and we introduced many new concepts in this section:

  • The reward function, \(R : {\cal X} \times {\cal A} \times {\cal X} \rightarrow \mathbb{R}\).

  • The expected reward, \(\bar{R}(x,a)\) for executing action \(a\) in state \(x\), and a corresponding greedy planning algorithm.

  • The utility function, \(U: {\cal A}^n \times {\cal X}^{n+1} \rightarrow \mathbb{R}\), as the sum of discounted rewards.

  • The notion of rollouts to approximate the expected utility of actions.

  • The policy \(\pi: {\cal X} \rightarrow {\cal A}\), as a mapping from states to actions.

  • The value function, \(V^\pi:{\cal X} \rightarrow \mathbb{R}\), associated with a given policy \(\pi\).

  • The use of policy rollouts to approximate the value function \(V^\pi\).

  • Exact calculation of the value function for a fixed policy.

We’ll leave the concept of an optimal policy, and how to compute it, for the next section. There are two other extensions to MDPs that we did not cover:

  • Partially Observable MDPs (or POMDPs) are appropriate when we cannot directly observe the state.

  • Reinforcement learning, a way to learn MDP policies from data. This will be covered in the next section as well.