# 3.6. Learning to Act Optimally#

Learning to act optimally in a stochastic world.

When a Markov Decision Process is fully specified we can *compute* an optimal policy.
Below we first define optimal value functions and examine their properties, most notably the Bellman equation.
We then discuss value iteration and policy iteration, two algorithms to calculate the optimal value function and its associated optimal policy. However, both these algorithms need a fully-defined MDP.

When the MDP is not known in advance, however, we have to *learn* an optimal policy over time. There are two main approaches: model-based and model-free.

## 3.6.1. The Optimal Value Function#

The optimal policy maximizes the value function.

We now turn our attention to defining the *optimal* value function,
which can be used to construct the **optimal policy** \(\pi^*\).
From Section 3.5 we know how to compute the value function for an arbitrary policy \(\pi\):

To begin, we recall the famous **principle of optimality**
as stated by Bellman in a
1960 article in the IEEE Transactions on Automatic Control [Bellman and Kalaba, 1960]:

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

This principle enables a key step in deriving a recursive formulation for the optimal policy. Indeed, the **optimal value function** \(V^*: {\cal X} \rightarrow {\cal A}\)
is merely the value function for the optimal policy.
This can be written mathematically as

In the above, the second line follows immediately by using the definition of \(V^\pi\) above. The third line is more interesting.
By applying the principle of optimality, we replace \(V^\pi(x')\) with \(V^*(x')\).
Simply put, if remaining decisions from state \(x'\) must constitute an optimal policy,
the corresponding value function at \(x'\) will be the optimal value function for \(x'\).
For the fourth line,
because the value function has been written in recursive form,
\(\pi\) is applied only to the current state (i.e., when \(\pi\) is evaluated in the optimization,
it always appears as \(\pi(x)\)).
Therefore, we can write the optimization
as a maximization with respect to the *action* applied in the *current state*, rather than as a
maximization with respect to the entire policy \(\pi\)!

This equation is known as the **Bellman equation**.
It is named after Richard Bellman, the mathematician
who discovered it, and it is one of the most important equations in all of computer science.
The Bellman equation has a very nice interpretation:
the optimal value function of a state is the maximum expected reward
*plus* the discounted expected value function when acting optimally in the future.

## 3.6.2. Action Values and the Optimal Policy#

Using Bellman’s equation, it is straightforward to compute the optimal policy \(\pi^*\) from a given state \(x\):

This computation is performed so often that it is convenient to introduce the so-called **\(Q\)-function**, which is the value of being in state \(x\) and taking action \(a\), for a given value function \(V\):

Another name for Q-values is **action values**, to be contrasted with the state values, i.e., the value function \(V(x)\). They allow us to write the optimal policy \(\pi^*(x)\) simply as picking, for any given state \(x\), the action \(a\) with the highest action value \(Q(x,a; V^*)\) computed from the optimal value function \(V^*\):

We will use \(Q\)-values in many of the algorithms in this section, and an efficient way to compute a Q-value from a value function is given below:

```
def Q_value(V, x, a, gamma=0.9):
"""Calculate Q(x,a) from given value function"""
return T[x,a] @ (R[x,a] + gamma * V)
```

A very efficient way to compute all Q-values for all state-action pairs at once, using `numpy`

, is

```
Q = np.sum(T * (R + gamma * V), axis=2)
```

which we will also use below. It yields a matrix of size \(|X| \times |A|\).

### 3.6.2.1. Exercise#

Try to understand the function

`Q_value`

above for calculating the Q-values. Use the notebook to investigate the calculation for specific values of \(x\) and \(a\).Similarly, try to understand the “vectorized” form above that yields the entire table of Q-values at once.

## 3.6.3. Policy Iteration#

By iteratively improving an estimate of the optimal policy, we eventually find \(\pi^*\).

We will describe two methods for determining the optimal policy. The method we describe below, policy iteration, iteratively improves candidate policies, ultimately converging to the optimal policy \(\pi^*\). The second method, value iteration, iteratively improves an estimate of \(V^*\), ultimately converging to the optimal value function. Both, however, need access to the MDP’s transition probabilities and the reward function.

**Policy Iteration** starts with an initial guess at the optimal policy, and then iteratively improves our guess until no further improvements are possible.
In particular, policy iteration generates a sequence of policies
\(\pi^0, \pi^1, \dots \pi^n\), such that \(\pi^{i+1}\) is better than policy \(\pi^i\).
This process ends when no further improvement is possible, which
occurs when \(\pi^{i+1} = \pi^i.\)

To improve the policy \(\pi^i\), we update the action chosen *for each state* by applying
Bellman’s equation using \(\pi^i\) in place of \(\pi^*\).
This can be achieved with the following algorithm:

Start with a random policy \(\pi^0\) and \(i=0\), and repeat until convergence:

Compute the value function \(V^{\pi^i}\)

Improve the policy for each state \(x \in {\cal X}\) using the update rule:

Increment \(i\)

Notice that this algorithm has the side benefit of computing successively better approximations to the value function at each iteration. Because there are a finite number of actions that can be applied in each state, there are only finitely many ways to update a policy. Therefore, we expect this policy iteration algorithm to converge in finite time.

We already know how to do step (1) above, using the`calculate_value_function`

.
The second step of the algorithm is easily implemented with the following code:

```
def update_policy(value_function):
"""Update policy given a value function"""
new_policy = [None for _ in range(5)]
for x, room in enumerate(vacuum.rooms):
Q_values = [Q_value(value_function, x, a) for a in range(4)]
new_policy[x] = np.argmax(Q_values)
return new_policy
```

The whole policy iteration algorithm then simply iterates these until the policy no longer changes. If no initial policy is given, we can start with a zero value function \(V^{\pi^0}(x) = 0\) for all \(x\):

```
def policy_iteration(pi=None, max_iterations=100):
"""Do policy iteration, starting from policy `pi`."""
for _ in range(max_iterations):
value_for_pi = vacuum.calculate_value_function(R, T, pi) if pi is not None else np.zeros((5,))
new_policy = update_policy(value_for_pi)
if new_policy == pi:
return pi, value_for_pi
pi = new_policy
raise RuntimeError("No stable policy found after {max_iterations} iterations")
```

On the other hand, if we have a guess for the initial policy, we can initialize
\(\pi^0\) accordingly.
For example, we can start with a not-so-smart `always_right`

policy:

```
RIGHT = vacuum.action_space.index("R")
always_right = [RIGHT, RIGHT, RIGHT, RIGHT, RIGHT]
```

```
optimal_policy, optimal_value_function = policy_iteration(always_right)
print([vacuum.action_space[a] for a in optimal_policy])
```

```
['L', 'L', 'R', 'U', 'U']
```

Starting with the `always_right`

policy, our policy iteration algorithm converges to an
intuitively pleasing policy.
In the dining room and kitchen we go `left`

, in the office we go `right`

, and in the hallway and dining room we go `up`

.
This is significantly different from the `always_right`

policy (which might be better named `almost_always_wrong`

).
In fact, it is exactly the `reasonable_policy`

that we created in Section 3.5.
We already knew that it should be pretty good at getting to the living room as fast as possible. In fact, it is optimal!

We also print out the optimal value function below, which shows that if we are close to the living room the value function is very high, but it is a bit lower in the office in the dining room:

```
for i,room in enumerate(vacuum.rooms):
print(f" {room:12}: {optimal_value_function[i]:.2f}")
```

```
Living Room : 100.00
Kitchen : 97.56
Office : 85.66
Hallway : 97.56
Dining Room : 85.66
```

The optimal policy is also obtained when we start without a policy, starting with a zero value function:

```
optimal_policy, _ = policy_iteration()
print([vacuum.action_space[a] for a in optimal_policy])
```

```
['L', 'L', 'R', 'U', 'U']
```

## 3.6.4. Value Iteration#

Dynamic programming can be used to obtain the optimal value function.

Let us restate Bellman’s equation, which must hold for each state \(x\):

If we have \(n\) states, and since we would then have \(n\) equations, it seems like we should be able to solve for the \(n\) unknown values \(V^*(x)\).
Sadly, they are not *linear* equations, as the maximization operation is not linear. Hence, unlike the case when the policy is fixed, we cannot just solve a system of linear equations to recover \(V^*\).

**Value iteration** approximates \(V^*\) by constructing a sequence of estimates,
\(V^0, V^1, \dots , V^n\) that converges to \(V^*\).
Starting with an initial guess, \(V^0\), at each iteration we update
our approximation of the value function for each state by the update rule:

Notice that the right hand side includes two terms:
the expected reward (which we can compute exactly), and a term in \(V^i\) (our current best guess at the value function).
Value iteration operates by iteratively using our *current best guess* \(V^i\) along with the *known* expected reward to update the approximation.
Unlike policy iteration, we do not expect value iteration to converge to the exact result in finite time.
Therefore, we cannot use \(V^{i+1} = V^i\) as our termination condition.
Instead, we often use a condition such as \(|V^{i+1} - V^i| < \epsilon\), for some small value of \(\epsilon\)
as the termination condition.

Finally, note that we can once again use the Q-values to obtain a concise description for the value update:

In code, this is actually easier than policy iteration, using the concise vectorized Q-table update we discussed above:

```
V_k = np.full((5,), 100)
for k in range(10):
Q_k = np.sum(T * (R + 0.9 * V_k), axis=2) # 5 x 4
V_k = np.max(Q_k, axis=1) # max over actions
print(np.round(V_k,2))
```

```
[100. 98. 90. 98. 90.]
[100. 97.64 86.76 97.64 86.76]
[100. 97.58 85.92 97.58 85.92]
[100. 97.56 85.72 97.56 85.72]
[100. 97.56 85.68 97.56 85.68]
[100. 97.56 85.67 97.56 85.67]
[100. 97.56 85.66 97.56 85.66]
[100. 97.56 85.66 97.56 85.66]
[100. 97.56 85.66 97.56 85.66]
[100. 97.56 85.66 97.56 85.66]
```

Compare with optimal value function:

```
print(np.round(optimal_value_function, 2))
```

```
[100. 97.56 85.66 97.56 85.66]
```

And we can easily *extract* the optimal policy:

```
Q_k = np.sum(T * (R + 0.9 * V_k), axis=2)
pi_k = np.argmax(Q_k, axis=1)
print(f"policy = {pi_k}")
print([vacuum.action_space[a] for a in pi_k])
```

```
policy = [0 0 1 2 0]
['L', 'L', 'R', 'U', 'L']
```

### 3.6.4.1. Exercise#

Above we initialized the value function at 100 everywhere. Examine the effect on convergence of initializing it differently.

Implement a convergence criterion that stops the iterations after convergence.

## 3.6.5. Model-based Reinforcement Learning#

Just explore, then solve the MDP.

We can attempt to *learn* the MDP and then solve it. Both policy and value iteration require access to the transition probabilities \(T\) and the reward function \(R\). However, when faced with a new environment, we might not know how our robot will behave. And likewise, we might not have access to the reward function: how can we know in advance where we will find pots of gold?

One way to learn the MDP is to randomly explore. Let’s adapt the `policy_rollout`

code from the previous section to generate a whole lot of *experiences* of the form \((x,a,x',r)\).

```
def explore_randomly(x1, horizon=N):
"""Roll out states given a random policy, for given horizon."""
data = []
x = x1
for _ in range(1, horizon):
a = np.random.choice(4)
next_state_distribution = gtsam.DiscreteDistribution(X[1], T[x, a])
x_prime = next_state_distribution.sample()
data.append((x, a, x_prime, R[x, a, x_prime]))
x = x_prime
return data
```

Let us use it to create 499 experiences and show the first 5:

```
data = explore_randomly(vacuum.rooms.index("Living Room"), horizon=500)
print(data[:5])
```

```
[(0, 2, 0, 10.0), (0, 1, 0, 10.0), (0, 1, 1, 0.0), (1, 1, 1, 0.0), (1, 1, 1, 0.0)]
```

We can *estimate* the transition probabilities \(T\) and reward table \(R\) from the data, and then we can use the algorithms from before to calculate the value function and/or optimal policy.

The math is just a variant of what we saw in the learning section of the last chapter. The rewards are the easiest to estimate:

where \(N(x,a,x')\) counts how many times an experience \((x,a,x')\) was recorded. The transition probabilities are a bit trickier:

where \(N(x,a)=\sum_{x'} N(x,a,x')\) is the number of times we took action \(a\) in a state \(x\).

The code associated with that is fairly simple, modulo some numpy trickery to deal with division by zero and *broadcasting* the division:

```
R_sum = np.zeros((5, 4, 5), float)
T_count = np.zeros((5, 4, 5), float)
count = np.zeros((5, 4), int)
for x, a, x_prime, r in data:
R_sum[x, a, x_prime] += r
T_count[x, a, x_prime] += 1
R_estimate = np.divide(R_sum, T_count, where=T_count!=0)
xa_count = np.sum(T_count, axis=2)
T_estimate = T_count/np.expand_dims(xa_count, axis=-1)
```

Above `T_count`

corresponds to \(N(x,a,x')\), and the variable `xa_count`

is \(N(x,a)\). It is good to check the latter to see whether our experiences were more or less representative, i.e., visited all state-action pairs:

```
xa_count
```

```
array([[21., 27., 14., 18.],
[25., 33., 24., 32.],
[22., 29., 22., 18.],
[25., 19., 25., 23.],
[35., 29., 31., 27.]])
```

This seems pretty good. If not, we can always gather more data, which we encourage you to experiment with.

We can compare the ground truth transition probabilities \(T\) with the estimated transition probabilities \(\hat{T}\), e.g., for the living room:

```
print(f"ground truth:\n{T[0]}")
print(f"estimate:\n{np.round(T_estimate[0],2)}")
```

```
ground truth:
[[1. 0. 0. 0. 0. ]
[0.2 0.8 0. 0. 0. ]
[1. 0. 0. 0. 0. ]
[0.2 0. 0. 0.8 0. ]]
estimate:
[[1. 0. 0. 0. 0. ]
[0.26 0.74 0. 0. 0. ]
[1. 0. 0. 0. 0. ]
[0.22 0. 0. 0.78 0. ]]
```

Not bad. And for the rewards:

```
print(f"ground truth:\n{R[0]}")
print(f"estimate:\n{np.round(R_estimate[0],2)}")
```

```
ground truth:
[[10. 0. 0. 0. 0.]
[10. 0. 0. 0. 0.]
[10. 0. 0. 0. 0.]
[10. 0. 0. 0. 0.]]
estimate:
[[10. 0. 77.1 87.8 77.1]
[10. 0. 77.1 87.8 77.1]
[10. 87.8 77.1 87.8 77.1]
[10. 87.8 77.1 0. 77.1]]
```

In summary, learning in this context can simply be done by gathering lots of experiences, and estimating models for how the world behaves. After that, you can use either policy or value iteration to recover the optimal policy.

## 3.6.6. Model-free Reinforcement Learning#

All you need is Q.

A different, model-free approach is **Q-learning**. In the above we tried to *model* the world by trying estimate the (large) transition and reward tables. However, remember from the previous section that there is a much smaller table of Q-values \(Q(x,a)\) that also allow us to act optimally. This is because we can calculate the optimal policy \(\pi^*(x)\) from the optimal Q-values \(Q^*(x,a) \doteq Q(x, a; V^*)\):

This begs the question whether we can simply learn the Q-values instead, which might be more *sample-efficient*. In other words, we would get more accurate values with less training data, as we have less quantities to estimate.

To do this, recall that the Bellman equation can be written as

allowing us to rewrite the Q-values from above as

This gives us a way to estimate the Q-values, as we can approximate the above using a Monte Carlo estimate, summing over our experiences:

Unfortunately the estimate above *depends* on the optimal Q-values. Hence, the final Q-learning algorithm applies this estimate gradually, by “alpha-blending” between old and new estimates, which also averages over the reward:

In code:

```
alpha = 0.5 # learning rate
gamma = 0.9 # discount factor
Q = np.zeros((5, 4), float)
for x, a, x_prime, r in data:
old_Q_estimate = Q[x,a]
new_Q_estimate = r + gamma * np.max(Q[x_prime])
Q[x, a] = (1.0-alpha) * old_Q_estimate + alpha * new_Q_estimate
print(Q)
```

```
[[71.10760673 56.51571804 71.86947021 61.15575665]
[64.30420026 56.76864697 55.52510742 50.28349699]
[34.85384007 54.23063203 44.37379727 34.21037444]
[44.46931243 50.62645036 70.02940026 58.47624531]
[58.53277203 49.65672609 51.67363614 46.63022521]]
```

These values are not yet quite accurate, as you can ascertain yourself by changing the number of experiences above, but note that an optimal policy can be achieved before we even converge.

### 3.6.6.1. Exploration vs Exploitation#

The above assumed that we gather data by acting *randomly*, but that might be very inefficient. Indeed, we might be spending a lot of time - literally - bumping our heads into the walls. A better idea might be to act randomly at first (exploration), but as time progresses, spend more and more time refining the optimal policy by trying to act optimally (exploitation).

Greedy action selection can lead to bad learning outcomes. We will use Q-learning as an example, but similar problems exist for other reinforcement learning methods. During Q-learning, upon reaching a state \(x\), the **greedy action selection** method is to simply pick the action \(a^*\) according to the *current* estimate of the Q-values:

Unfortunately, this tends to often lead to Q-learning getting stuck in local minima of the policy search space: state-action pairs that might be more promising are never visited as their correct (higher) Q-values have not been estimated correctly, so they always get passed over.

Epsilon-greedy or \(\epsilon\)-greedy methods balance exploration with exploitation while learning. Instead of always choosing the best possible action according to the current estimate, we could simply choose an action at random a fraction of the time, say with probability \(\epsilon\). This is the **epsilon-greedy** method. Typical values for \(\epsilon\) are 0.01 or even 0.1, i.e., 10% of the time we choose to act randomly. Schemes also exist to decrease \(\epsilon\) over time.

### 3.6.6.2. Exercise#

Think about how to apply \(\epsilon\)-greedy methods in the model-based reinforcement learning method we discussed above.

## 3.6.7. Summary#

We discussed

The optimal policy and value function, governed by the Bellman equation.

Two algorithms to compute those: policy iteration and value iteration.

A model-based method to learn from experience.

A model-free method, Q-learning, that updates the action values.

Balancing exploitation and exploration.

The field of reinforcement learning is much richer, and we will return to it several times throughout this book.