Chapter Summary

2.7. Chapter Summary#

Splash image with heroic trash-sorting robot in sunset light

In this chapter, we used the example of a trash sorting robot to introduce key concepts from probability theory, decision theory, and even machine learning, and showed how these concepts can be used to begin building robotic systems that operate in real world settings, where significant uncertainty may exist.

2.7.1. Models#

Probability theory provides the mathematical foundation for dealing with uncertainties that confront robotic systems. In this chapter, we used probability to reason about the uncertain category of an object in the robot’s work space. We began by introducing probability distributions, which assign probability values to events (i.e., to subsets of the set of all possible outcomes). In the case of discrete outcomes, such as our five categories of trash, these probability distributions are called probability mass functions or PMFs, and they assign finite probability to each possible category. When the set of outcomes is continuous, such as the scale we used to measure an object’s weight, these probability distributions are called probability density functions or PDFs, and they assign probability values to subsets of real numbers, for example, to intervals of the form \([a,b]\), and this probability value can be computed by integrating the PDF over the interval. Interestingly, for the continuous case, zero probability is assigned to any individual outcome, since this would correspond to an interval of the form \([a,a]\), and for any continuous function \(f\), we have \(\int_a^a f(u)du = 0\).

While probability distributions give some insight into the behavior of individual random outcomes, in robotics we are often interested in average behavior over long periods of time. For the example of our trash sorting robot, we might like to characterize the average performance of the system over days, or even weeks. Expectation is a property of probability distributions that gives insight into average behavior over many trials. The expected value of a discrete random variable is given by

(2.50)#\[\begin{equation} E[X] = \sum_{i=1}^n x_i p_X(x_i) \end{equation}\]

This expression bears strong resemblance to the equation for the weighted average of a set of data values, and this is no coincidence. In fact, this is the essence of the weak law of large numbers, which says that, as the number of data points goes to infinity, the average of a set of data points will approach the expected value of the probability distribution from which the data were drawn.

When there are multiple sources of uncertainty in the world, we can use joint probability distributions to characterize the stochastic relationships between different random quantities. For example, what is the probability that a piece of trash will conduct electricity and simultaneously weigh between \(1.5\) and \(3.0\) kg? A joint PMF for two discrete random variables \(X\) and \(Z\) is denoted by \(p_{XZ}(x,z)\), and is defined by \(p_{XZ}(x,z) = P(X=x, Z=z)\). (Recall that we use upper case letters to denote random variables, and lower case letters to denote possible values taken by these random variables.) This can be extended to any number of random variables, so that, in theory, all uncertainties in the world could be modeled by a single joint probability distribution. However, specifying a complete joint probability distribution is exceedingly expensive. Consider that if we have \(n\) random variables which take on \(N_1, N_2,\dots N_n\) possible values, respectively, the size of a table to represent the joint probability distribution would be \(N_1 \times N_2 \times \dots N_n\), i.e., the size of this data structure grows exponentially with the number of random variables.

Happily, for most real-world scenarios, we really don’t need the complete joint probability distribution. The reason for this is that the interactions between random quantities often have a fairly local kind of behavior. Because of this property, we can often capture the essential interactions among random variables using conditional probability distributions, which capture the individual stochastic relationships between smaller sets of random variables. For example, we can describe the conditional probability that an object is scrap metal if it conducts electricity, or, conversely, the probability that an object will conduct electricity if it is scrap metal. Conditional probability distributions are typically used to model sensor behavior; the conditional probability \(P(Z| X=x)\) denotes the probability distribution for random variable \(Z\) given that the random variable \(X\) takes the value \(x\). This kind of model is sometimes called a forward sensor model, since it models the behavior of the sensor given the state of the world. Note that the conditional distribution \(P(Z | X=x)\) is itself a valid probability distribution. For example, if \(Z\) is a discrete random variable, we would have \(\sum_i P(Z=z_i | X=x) =1\).

We also saw how probabilistic models can be incorporated into computational solutions via sampling. In particular, we developed an algorithm that uses the cumulative distribution function to generate samples for a given particular probability distribution. We used this sampling algorithm to investigate the empirical relationship between outcomes generated from a probability distribution and the expected value of the distribution.

2.7.2. Reasoning#

While the various kinds of probability models nicely describe the stochastic nature of the world, it is not immediately obvious how they could be used to make inferences about the world, or as a basis for decision-making in an uncertain world. The key idea that enables this kind of reasoning is captured in Bayes’ theorem:

(2.51)#\[\begin{equation} P(X|Z=z)=\frac{P(Z=z|X)P(X)}{P(Z=z)} \end{equation} \]

Note that the forward model, \(P(Z=z|X)\) appears on the right hand side, and is used to compute the inverse model \(P(X | Z=z)\). For example, if we are given the value of a sensor reading, \(Z\), Bayes’ theorem can be used to compute the probability associated to the state \(X\). For this reason, Bayes’ theorem is sometimes referred to as an inversion theorem.

We can use Bayes’ theorem to compute the maximum a posteriori estimate (or MAP estimate) for the value of \(X\) given sensor reading \(Z =z\) by solving the optimization problem:

(2.52)#\[\begin{equation} x^*_{MAP} = \arg \max_x P(x|z) \end{equation}\]

Note that this computation requires that we have access to the prior probability distribution \(P(X)\). In cases where this is not available, we may instead choose to compute the maximum likelihood estimate (or MLE), which is given by

(2.53)#\[\begin{equation} x^*_{MLE} = \arg \max_x L(x;z) = \arg \max_x P(Z=z|X) \end{equation}\]

in which \(L(x;z) = P(Z=z|X)\) is called the likelihood. Note that the likelihood is not a valid probability distribution, since typically \(\sum_i P(Z=z | X=x_i) \neq 1\).

2.7.3. Background and History#

The origins of probability theory can be traced back to games of chance in ancient societies, but the first real attempts to formalize the study of probability came during the Renaissance, in the works of mathematicians such as Cardano, Pascal, and Fermat. These early mathematical approaches focused mainly on games of chance, with a strong empirical flavor. The line between statistics and probability theory was a blurry one in Renaissance times.

It was Bayes, in the eighteenth century, who pioneered the idea of using evidence, together with ideas from probability theory, to draw inferences. While what we now know as Bayes’ theorem is a general result that does not depend on the specific probability distributions under consideration, Bayes studied the specific case of inferring the parameter of a binomial distribution given observed outcomes. The more general development is due largely to Laplace, in the years following the death of Bayes.

What we think of today as probability theory was formalized in the early 1930’s by Kolmogorov. It was Kolmogorov who formulated the three axioms that form the basis for modern probability theory:

For any event \(A\), \(P(A) \geq 0\).
\(P(\Omega)=1\).
For disjoint events \(A\) and \(B\), \(P(A\cup B) = P(A) + P(B)\).

Equipped with these three axioms and a background in real analysis, one can derive most all of the important results that comprise modern probability theory. The Renaissance mathematicians were interested in understanding random phenomena. Kolmogorov, a Russian mathematician, was interested in establishing a rigorous theoretical foundation for probability theory.

One of the best recent books we have found useful is the book “Introduction to Probability for Data Science” [Chan, 2023]. The classic reference for statistical reasoning, including maximum likelihood estimation and Bayesian decision theory is [Duda et al., 2012]. Anders Hald has written two volumes on the history of probability theory, one that covers the period from Bernoulli, De Moivre and Laplace to the mid-twentieth century, with particular attention to estimation problems [Hald, 1998], and one that covers developments prior to 1750 [Hald, 2003].

Chapter Summary

Contents

2.7. Chapter Summary#

2.7.1. Models#

2.7.2. Reasoning#

2.7.3. Background and History#