# 3.7. Chapter Summary

In Chapter 2, we considered actions and sensing operations that were executed at a specific moment in time, with no dependence on or interaction with past or future actions or observations. In this chapter, we considered the more realistic, and more interesting, case in which the interdependence between sequences of actions and observations can be exploited via probabilistic inference. This led to the introduction of much of the essential machinery that underlies robotics perception and planning, and that will be developed throughout the remainder of the book.

## 3.7.1. Models

The key modeling tool that we introduced is the language of Bayesian networks. Bayes nets are a type of graphical model named after the Reverend Thomas Bayes, who also gave his name to the practice of interpreting probability distributions as *beliefs* rather than observed frequencies.

A foundational Bayes net family is the Markov chain, which models a state evolving over time. State transitions are modeled using a conditional distribution of the current state, conditioned only on the previous state, invoking the Markov property. A Markov chain is then a sequence of such transition conditionals, and this in itself is an incredibly powerful idea. Typically, these transitions are *stationary*, in that they do not depend on time, but we also introduced *controlled* Markov chains to capture the dependence of the state on actions. This is an essential tool in robotics: modeling how the state of the robot and/or its environment changes, probabilistically, as a result of the robot's actions.
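As a concrete sketch, a controlled Markov chain over a small discrete state space can be simulated by sampling from the transition conditional selected by each action. The states, actions, and probability tables below are made-up illustrations, not from the text:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical two-state, two-action controlled Markov chain:
# T[action][i, j] = P(next state = j | current state = i, action taken)
T = {
    "stay": np.array([[0.9, 0.1],
                      [0.2, 0.8]]),
    "move": np.array([[0.3, 0.7],
                      [0.6, 0.4]]),
}

def step(state: int, action: str) -> int:
    """Sample the next state from the transition conditional."""
    return int(rng.choice(2, p=T[action][state]))

# Roll out a short controlled trajectory from state 0.
state, trajectory = 0, [0]
for action in ["move", "stay", "move"]:
    state = step(state, action)
    trajectory.append(state)
print(trajectory)  # a list of four sampled states
```

Note that each row of each transition table sums to one, so every `step` call is a draw from a valid conditional distribution.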

Unrolling a Bayes net fragment over time was then generalized to the concept of Dynamic Bayes Nets (DBNs), allowing us to also model noisy sensors measuring aspects of the state. The DBN framework allowed us to precisely capture the fact that these measurements only depend on the current state, and that the states themselves are Markov, conditioned on a given action. These two modeling assumptions, coupled with reasoning in a discrete-time setting, are so universal in robotics that the particular action-state-measurement Dynamic Bayes Net from Section 3.3 can be used to describe almost all robot modeling approaches.

One of the key properties of Dynamic Bayes nets is that they are easy to sample from, using ancestral sampling, which allows us to simulate what a given sequence of actions might accomplish, and what we would measure in these hypothetical roll-outs. This is used to illustrate, or even power, various perception and planning algorithms, which we discuss below.
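Ancestral sampling itself is straightforward: visit the variables in topological order, sampling each one from its conditional given its already-sampled parents. A minimal sketch for a discrete action-state-measurement DBN follows; all tables, dimensions, and the action sequence are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical DBN with 3 states, 2 actions, and 2 measurement outcomes.
prior = np.array([0.5, 0.3, 0.2])  # P(X1)
T = np.array([  # T[a, i, j] = P(X_{t+1}=j | X_t=i, A_t=a)
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
    [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.0, 0.0, 1.0]],
])
M = np.array([[0.9, 0.1],  # M[i, k] = P(Z_t=k | X_t=i)
              [0.5, 0.5],
              [0.1, 0.9]])

def ancestral_sample(actions):
    """Simulate one roll-out: sample each variable given its sampled parents."""
    x = int(rng.choice(3, p=prior))                    # state from the prior
    states, measurements = [x], [int(rng.choice(2, p=M[x]))]
    for a in actions:
        x = int(rng.choice(3, p=T[a, x]))              # transition, given action
        states.append(x)
        measurements.append(int(rng.choice(2, p=M[x])))  # measure the new state
    return states, measurements

states, measurements = ancestral_sample([0, 1, 1])
print(states, measurements)
```

Running the sampler many times with the same action sequence yields the hypothetical roll-outs mentioned above, which can then be summarized or visualized.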

## 3.7.2. Reasoning

Bayes nets are great for *modeling*, and in Section 3.4 we introduced Hidden Markov Models that allow us to reason about a sequence of hidden states, observed via noisy measurements. Hidden Markov Models have been around for a long time and transformed areas such as speech recognition. They are exactly what we need for robot localization over time, as well. Beyond the simple vacuum cleaning robot example, they can be generalized to nearly any robot/environment combo that we can model using discrete states transitioning over time. In our example we used just a single discrete sensor, but the HMM can accommodate multiple sensors, even continuous ones.

We then discussed efficient inference using factor graphs. Naive inference, be it computing the maximum a posteriori (MAP) estimate or the full posterior over the hidden states, is exponential in complexity. Hence, we introduced a new graphical formalism, **factor graphs**, which focuses only on the hidden variables, capturing the effect of priors, transitions, and measurements, all assumed given, as **factors**.
We showed how an HMM can easily be converted to a factor graph, and this really generalizes to any dynamic Bayes net with given actions and measurements. We then sketched out, for an HMM, the max-product and sum-product algorithms, which respectively compute the MAP estimate and the full posterior, with time complexity that is *linear* in the number of time steps. Such is the power of dynamic programming.
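To make the linear-time claim concrete, here is a small max-product sketch on a chain, in the style of the Viterbi algorithm: a forward pass that eliminates one variable per step, followed by backtracking. The two-state HMM and its probability tables are made-up numbers, not from the text:

```python
import numpy as np

# Tiny hypothetical HMM: 2 hidden states, 3 observation symbols.
prior = np.array([0.6, 0.4])
T = np.array([[0.7, 0.3],       # T[i, j] = P(X_{t+1}=j | X_t=i)
              [0.4, 0.6]])
M = np.array([[0.5, 0.4, 0.1],  # M[i, k] = P(Z_t=k | X_t=i)
              [0.1, 0.3, 0.6]])

def viterbi(measurements):
    """Max-product on the chain: O(n) in the number of time steps."""
    n, S = len(measurements), len(prior)
    delta = prior * M[:, measurements[0]]  # best score ending in each state
    back = np.zeros((n, S), dtype=int)     # argmax bookkeeping for backtracking
    for t in range(1, n):
        scores = delta[:, None] * T        # scores[i, j]: come from i, land in j
        back[t] = scores.argmax(axis=0)    # best predecessor for each state j
        delta = scores.max(axis=0) * M[:, measurements[t]]
    # Backtrack from the best final state to recover the MAP sequence.
    path = [int(delta.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

print(viterbi([0, 1, 2]))  # → [0, 0, 1]
```

The sum-product variant has the same structure, with `max`/`argmax` replaced by a sum over the eliminated variable (plus a backward pass to obtain marginals), which is why both are instances of the same elimination scheme.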

In Section 3.5, we introduced Markov Decision Processes or MDPs, which we used to model decision making in a stochastic environment, albeit with complete knowledge of the state. This is a rich subject, and we introduced many new concepts, including reward, utility functions, rollouts, and policies and their value.

Finally, we considered learning optimal policies in Section 3.6. After deriving the Bellman equation we discussed two algorithms to compute the optimal policy/value function: policy iteration and value iteration. However, both of these assume full knowledge of the world model (the transition probabilities between states) and the reward structure. In a model-based method we try to estimate these from experience gathered by exploring the environment. A more direct alternative is Q-learning, a model-free method that instead estimates the action values (Q-values) directly. Finally, we discussed how to balance exploitation and exploration to avoid getting stuck in local corners of the policy space.
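The Q-learning update itself fits in a single line. Below is a hedged sketch on a made-up two-state, two-action MDP, with epsilon-greedy action selection illustrating the exploration/exploitation trade-off; the environment, parameters, and episode length are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def env_step(s, a):
    """Assumed toy dynamics: action 1 tends to reach state 1, which pays off."""
    s_next = int(rng.random() < (0.9 if a == 1 else 0.2))
    reward = 1.0 if s_next == 1 else 0.0
    return s_next, reward

Q = np.zeros((2, 2))                   # Q[s, a]: estimated action values
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # step size, discount, exploration rate

s = 0
for _ in range(5000):
    # Epsilon-greedy: mostly exploit the current Q, occasionally explore.
    a = int(rng.integers(2)) if rng.random() < epsilon else int(Q[s].argmax())
    s_next, r = env_step(s, a)
    # Model-free Q-learning update: bootstrap from the best next action value.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print(Q.argmax(axis=1))  # greedy policy; action 1 should look best in both states
```

Note that no transition probabilities or reward tables are ever estimated: the agent learns the Q-values purely from sampled experience, which is what makes the method model-free.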

## 3.7.3. Background and History

Markov chains date back to (you guessed it) Andrey Markov, who used them to study, among other things, the statistics of language. In fact, attesting to the importance and generality of the concept, any finite-context large language model can be viewed as a Markov chain, admittedly with a rather vast state space.

In this book, we use factor graphs as the organizing structure for probabilistic inference. In later chapters we will expand their use to continuous variables, and will see that factor graphs aptly describe the independence assumptions and sparse nature of the large nonlinear least-squares problems that arise in robotics. But their usefulness extends far beyond that: they are at the core of the sparse linear solvers we use as building blocks; they clearly show the nature of filtering and incremental inference; and they lead naturally to distributed and/or parallel versions of robotics algorithms. A deeper dive into factor graphs, including some historical notes and their use in robotics, can be found in the review papers [Dellaert and Kaess, 2017] and [Dellaert, 2021].

Hidden Markov Models (HMMs) were introduced in 1966 by Baum and Petrie [1966], and really broke through when they were applied in the speech recognition work at IBM [Jelinek *et al.*, 1975].
The dynamic programming algorithms we introduced for HMMs are actually special cases for linear, chain-structured factor graphs.
In the HMM literature, they are known as the Viterbi and Forward-Backward algorithms, respectively. However, because at every step, one variable is eliminated from the maximization, both max-product and sum-product algorithms are in fact instances of the **elimination algorithm** (see again the review paper by Dellaert and Kaess [2017]), which works for *arbitrary* factor graphs, albeit not necessarily in \(O(n)\) time.

The *principle of optimality* was stated by Bellman in a
seminal 1960 article in the IEEE Transactions on Automatic Control [Bellman and Kalaba, 1960].
One of the most respected contemporary sources for Markov decision processes and reinforcement learning is the book by Sutton and Barto [Sutton and Barto, 2018], which was recently updated to a second edition. Q-learning dates from the Ph.D. thesis of Chris Watkins [Watkins, 1989].
The seminal book by Pearl [Pearl, 1988] introduced general Bayes nets for probabilistic
reasoning to the artificial intelligence community.
A modern comprehensive treatment of graphical models for probabilistic inference is
provided in [Koller and Friedman, 2009].