# 4.5. Planning for Logistics#

Our warehouse robot needs to know its location in the warehouse with more precision than the vacuuming robot.

In Chapter 3, we modeled a robot system as an MDP, and saw how to compute both the optimal value function and the corresponding optimal policy. The state space in Chapter 3 was very simple, having only five states. For our vacuuming robot, it was sufficient to know the room in which the robot was located, but this level of detail is not sufficient for our warehouse robot. In order to position itself accurately with respect to the shelves that contain inventory, the warehouse robot needs a quantitative description of its location, namely, coordinates in a warehouse coordinate frame. Therefore, for our warehouse robot, we represent the state space by a discrete grid, whose coordinates define the location of the robot. This representation has significantly more states than for the vacuuming robot. In this section, we show how to perform value iteration on such a grid, and briefly discuss the implications of estimating the state using sensor data.

## 4.5.1. Value Iteration in 2D#

It works, but can be slow.

Recall that value iteration starts with an initial guess, \(V^0\), and at each iteration we update our approximation of the value function by the update rule:

This can be *very* expensive if applied naively, when the number of states is large. Indeed, if we use the 5000-state finite element discretization of the warehouse, the conditional probability table (CPT) above contains \(5k \times 4 \times 5k = 100M\) entries.

There are ways to deal with this, computationally. Indeed, most transitions have zero probability, and hence we could store the CPT in a *sparse* multi-dimensional array instead. Alternatively, we can implement the Q value calculation as a *function*. This is what we will do below, and we will also immediately switch to using the 2D coordinates \((i,j)\) for the discretized cells:

```
goal = 35, 50
proximity = logistics.base_map + logistics.proximity_map_on
def Q_value(x, a, V):
"""Given a state x=(i,j) action a, and value image, calculate Q value."""
i, j = x
if x==goal: return 0 # only reward when *first* arriving at goal
if logistics.base_map[i,j]: return 0 # invalid state
if a == 0: next = (i,j-1) if j>0 else x # LEFT
if a == 1: next = (i,j+1) if j<99 else x # RIGHT
if a == 2: next = (i-1,j) if i>0 else x # UP
if a == 3: next = (i+1,j) if i<49 else x # DOWN
reward = 100 if next==goal else 0
r, c = next
if logistics.proximity_map_on[i,j]: reward -= 50 # in proximity is bad!
if logistics.proximity_map_on[r,c]: reward -= 50 # as is going in proximity.
return reward + 0.975 * V[r,c]
def update_V(x, V):
"""Given a state x=(i,j) and value image, calculate updated value."""
Q_x = [Q_value(x, a, V) for a in range(4)]
return max(Q_x)
```

Note above we assumed actions were *deterministic*, but the function can be easily adapted to non-deterministic actions. We awarded 100 points upon reaching the goal, but make sure that happens only once, or otherwise the value of that state will be infinite. In addition, when the state is near an obstacle (we use the `proximity_map_on`

likelihood image from Section 4.3), the agent gets punished by 100 points. Below, we do value iteration using a brute force loop, which is still remarkably fast:

```
I, J = np.meshgrid(range(50), range(100), indexing='ij')
states = np.dstack((I, J)).reshape(5000,2)
k = 0
V_k = np.full((50,100), 0.0)
error = float('inf')
while error>1e-5:
V_plus = np.empty((50,100), float)
for i,j in states:
V_plus[i,j] = update_V((i,j), V_k)
error = np.max(np.abs(V_plus-V_k))
V_k = V_plus
k += 1
print(f"after {k} iterations, error = {error}")
```

```
after 86 iterations, error = 0.0
```

```
#| caption: The converged value function.
#| label: fig:converged-value-function
px.imshow(V_k, color_continuous_scale='Turbo',origin="upper", aspect='equal')
```

Notice that the *gradient* of the value function determines the optimal policy: at any cell, we can just look around in all 4 directions and move to the cell with the largest value.

### 4.5.1.1. Thought Exercise#

Why might the algorithm terminate after 86 iterations?

## 4.5.2. The Intuition behind Value Iteration#

Costs and rewards propagate.

It is instructive to look at an animation for how the value function evolves in successive iterations of the value iteration algorithm:

The most prominent phenomenon that one can make out in the animation above is that the reward from reaching the goal state, in the middle of the warehouse, gradually propagates towards every location. It gradually decreases in value because of the discount factor. It also does not penetrate the obstacles in the map, because those states simply have a value of zero.

There is a deep connection between value iteration and the all-source-shortest-path algorithm that you might have seen in an algorithms class. Deterministic actions set up a graph, and while in this case we have no penalty to moving in the the graph, the discount factor plays the same role in practice. In fact, if we set the discount factor to zero and give actions a cost, both algorithms will return identical results.

## 4.5.3. Closing the Loop#

Adding a state estimator closes the sense-think-act loop.

When the robot acts in the world, it senses, estimates its state, and executes a policy. We can take any of the algorithms from the last section and use them to estimate the robot’s current state. The value function gives us a policy, which the robot can then execute. After that, repeat.

To calculate the policy, we should take into account the uncertainty in actions. The policy we calculated above for the warehouse may not be optimal, because we assumed deterministic actions in the calculation of the Q values. In Section 3.5 we saw how to account for uncertain, Gaussian actions. We could do the same here to potentially obtain a better policy. In particular, the robot will be more cautious and stay even further away from obstacles, since the robot does not always perfectly execute commands.

### 4.5.3.1. Thought Exercise#

Estimating a single location for the robot is not always possible: posterior densities can be *multi-modal*, and so do not know any better than that we are in one of a number of possible locations. What should we do in that case?

## 4.5.4. Partially Observable MDPs#

POMDPs facilitate planning to sense.

To *really* calculate the optimal policy, we would need to know how future actions
will affect sensing. Even when we take into account the uncertainty in the actions, there is something we have not taken into account: uncertain *sensing*. Indeed, when calculating the optimal policy we have assumed that we have perfect knowledge of the state. The probabilistic model that takes into account uncertain sensing is called a **partially observable Markov decision process**, or POMDP.

While out of scope for this book, it is important to be aware of this limitation of the MDP framework. A POMDP policy could improve on this by planning to sense: for example, the robot could deliberately move closer to an RFID beacon on purpose, to re-localize the robot when it is not very certain about its location.

POMDPs are computationally incredibly expensive, because the state is no longer simply *where* we are, but also *what we believe* the state is. Hence, POMDPs are about planning in the *belief space*, which is a very high-dimensional concept.