# 1.3. The Mathematics of Robotics#

Probability theory and statistics, linear algebra, and optimization are key mathematical tools for modern robotics.

We use three different broad areas of mathematics in this book. One is *probability theory and statistics*, the next is *linear algebra*, and finally, we rely extensively on *optimization methods*. In this section, we discuss each of them in turn. We also preview how graphical models like *Bayes Networks* and *Factor Graphs* will be used in each of these areas as convenient representations. Since graph theory is also a branch of mathematics, these can be seen as a fourth, over-arching mathematical theme.

## 1.3.1. Probability Theory and Statistics#

Statisticians study data. Probability theorists understand why that’s a good idea.

It is often the case that robots do not have access to a complete and correct model of the environment. If sensors are used to determine the world state, there will invariably be errors and uncertainties associated to the sensor measurements. For example, the trash sorting robot has several sensors, all of which report data that is corrupted by noise, and these measurements are therefore subject to multiple possible interpretations. Furthermore, the effects of the robot’s own actions might be uncertain, as is the case with our vacuum cleaning robot, which sometimes fails to arrive at its desired destination, or with the logistics robot, whose motion is affected by factors such as variable friction of the factory floor and variation in motor performance.

*Probability theory* provides a rigorous mathematical framework for modeling and reasoning about uncertainty. Uncertain quantities are characterized by *probability distributions* that characterize the likelihood of various possibilities, such as the probability that a piece of trash will be made of glass vs. metal. *Conditional probability distributions* describe uncertainty in the relationships among various quantities, such as the actual weight of an object and the weight returned by a sensor (as for the trash sorting robot), or the distance traveled by a robot and the commanded motion (for our logistics robot). The concept of *expectation* quantifies what we would expect to see *on average* after many trials or many observations. For example, if we command the logistics robot to move forward by one meter and repeat this many times, what would be the average distance moved by the robot at each step.

In addition, probability theory provides sound inference tools, such as *Bayes’ Theorem*, which relates the probability of an outcome given the occurrence of particular evidence to the probabilities associated with the evidence, the outcome itself, and the conditional relationship between the two. Bayes’ theorem is at the heart of the Bayes filter, which we introduce in Chapter 4 to solve the localization problem, and is one of the most important tools used by roboticists to deal with uncertainty.

Probability distributions are among the most important mathematical objects in robotics. Unfortunately, for many problems the probability distributions are too complex to represent exactly, and mathematically manipulating these distributions (e.g., propagating uncertainty forward through a sequence of actions) is computationally intractable. In such cases, it is convenient to approximate the probability distribution using *sampling*.
In particular, we can represent a probability distribution by a set of samples, each with an associated weight.

Sampling-based methods require the ability to generate samples from a target probability distribution. Most all programming languages include a random number generator of some sort. The most common of these will return a sample from the uniform distribution on the unit interval, however functions that generate samples from Gaussian distributions are also commonly available. For other kinds of distributions, it is possible to write a customized random sample generator using methods that we will introduce in Chapter 2. Given the ability to generate random samples from arbitrary distributions, it is a simple matter to generate sample trajectories for stochastic systems.

We can also use samples to directly represent the probability distribution of the robot state as the robot moves through the world. Rather than compute the exact state probability distribution for the given motion and measurement uncertainty models, we can generate samples from both the motion and measurement probability distributions, and use these to update parameters associated to a set of state estimates. This method, which we introduce in Chapter 4 to solve the localization problem for our logistics robot, is referred to as *particle filtering*; each state estimate corresponds to a sample, which can be thought of as a particle, and the sample’s associated weight is an estimate of the probability mass assigned to a specific value of the state (or to a specific small region in the state space).

We can also use sampling to build paths and trajectories for robots. For example, in Chapter 5 we introduce *Probabilistic Road Maps*(*PRMs*) and *Rapidly-Exploring Random Trees* (*RRTs*) to find collision-free paths. For both PRMs and RRTs, path planning is reduced to the problem of generating random samples in the configuration space, and connecting these with appropriate, local paths. In the original versions of these algorithms, samples were drawn from a uniform distribution, but over the years, many alternative sampling schemes have been proposed.

*Statistics* is concerned with the analysis and interpretation of data. For example, if we measure the weight of 100 pieces of cardboard trash, what is the average weight. If the vacuum cleaning robot attempts to navigate from the living room to the kitchen several times, how often does it succeed. If a LIDAR sensor returns multiple range measurements to an object in its field of view, how does that data vary with respect to the average distance returned by the sensor?

Probability theory and statistics seem to be concerned with the same thing – characterizing and reasoning about uncertainty – but they are, in fact, quite different. Probability theory deals with mathematical objects (probability distributions) that are *given*; statistics deals with data that are *observed*. As roboticists, we have direct access to data, but not to the underlying probability models that characterize the data. In general, the latter are either (a) estimated from first principles, or (b) estimated using data. For former, we might know that around the holiday season, many people open presents and place the wrapping paper in the trash, and that this will affect the probabilities associated to the various kinds of trash that arrive to our trash sorting robot. For the latter, a LIDAR manufacturer might take many measurements in a strictly controlled and highly calibrated environment, and use this data to estimate the distribution of measurement errors for the LIDAR sensor.

One might wonder whether using data, via statistics, to estimate probability distributions is a good idea. In fact, there are a number of results from probability theory that explicitly address this question. One example is the *weak law of large numbers*, which says (informally) that, if data values are sampled from an underlying probability distribution, the average of the data will tend toward the expected value of the associated random variable as the data set becomes large. There are many related theorems (sometimes called *limit theorems*), but they generally share the same insight: as the size of a data set becomes large, the statistics of that data set will become increasingly good approximations for various properties of the underlying probability distribution from which the data set was generated. This insight provides a theoretical basis for the use of many probabilistic methods in practice.

**Graphical models** are useful to represent, manipulate, and compute with all of the above. Besides sampling, another way to model complex probability distributions is by using a graph framework called *Bayes networks*, which we will recognize as an excellent tool for *modeling* uncertainty in both sensing *and* acting. For inference, on the other hand, we introduce the use of *factor graphs*, which are more convenient and compact to describe relative probabilities when sensor measurements are available and can be conditioned upon.

Graphical models and statistics come together when we are trying to *learn* the parameters of Bayes networks from data. As alluded to above, we can then use these parameters to infer outcomes from measurement data, via a conversion to factor graphs. Or, we can sample from Bayes networks directly to simulate the behavior of a system. We will encounter a very convenient algorithm, ancestral sampling, that can be applied to sample from any Bayes network with ease. For example, when studying reinforcement learning in Chapter 3, we will generate sample trajectories for our logistics robot using ancestral sampling. We can use these sample trajectories to compute the expected performance over a future time horizon. For example, to evaluate the quality of a candidate control policy, we can generate a large number of sample paths (or roll-outs), and compute the average cost (or reward) over these trajectories. This provides an approximation of the value function for the policy, the quality of which increases with the number of sample paths.

## 1.3.2. Linear Algebra#

Linear algebra is a powerful mathematical tool, even in a nonlinear world.

Linear algebra is ubiquitous in robotics, from low-level control to high-level reasoning about optimal policies. This may seem surprising at first. After all, almost everything in robotics, from low-level dynamic models to high-level policy optimization is nonlinear. Nevertheless, linear algebra can be used to solve many problems in robotics.

Perhaps the most obvious use for linear algebra is to solve systems of linear equations. These occur frequently in robotics. For example, in Chapter 3, we estimate the value function for a specific policy by solving a system of linear equations. Interestingly, for a *specific, given* policy, this problem is linear, even though the problem of computing the optimal value function is nonlinear. This property leads to an efficient implementation of an algorithm called *policy iteration* that (a) uses the current estimate of the optimal value function to update a guess for the optimal policy, (b) computes the exact value function for this improved policy by solving a system of linear equations, and (c) iterates by using this new value function as the new guess for the optimal value function. The linear solver is the workhorse of this algorithm.

However, one of the most powerful uses of linear algebra comes in when we represent probability distributions over continuous variables. A Gaussian density is described by a vector and a matrix: the mean and covariance of the Gaussian density. This simple representation of densities in potentially high-dimensional continuous spaces is both compact and allows for manipulation with methods from linear algebra. Unsurprisingly, complex densities over many continuous variables can be represented using Gaussian Bayes networks and/or factor graphs, which then allow for very efficient manipulation through libraries such as GTSAM. Finally, even non-linear relationships can be approximated using linear approximations, which will come in handy when we talk about optimization below.

The Kalman filter is one of the seminal tools in robotics. In particular, for linear systems with Gaussian noise, such as the logistics robot of Chapter 4, uncertainty grows as the robot executes actions; if each action has uncertain results, then a sequence of actions will accumulate uncertainty. Happily, by using sensor measurements, it is possible to reduce uncertainty. A number of questions immediately come to mind. How do we mathematically model the propagation of uncertainty? How do we quantify the reduction in uncertainty that can be achieved using sensor data? What is the optimal way to combine predictions (using models of actions and the associated uncertainty) and observations (from sensor data) to update estimates of the robot’s state? The answers to these questions lead to the development of the well-known Kalman Filter, which we will introduce in Chapter 4. It turns out that the answer to each of these questions is formulated using linear algebra.

A less obvious use of linear algebra is to represent an object’s orientation. We all understand that orientation involves angles and trigonometric functions such as sine and cosine, so it may be surprising that we can build matrices, called *rotation matrices* that encode the orientation of a Cartesian coordinate frame. This approach works for both 2D and 3D scenarios, and best of all, composition of rotations (e.g., a drone yaws by angle \(\phi\) about its \(z\)-axis, then rolls forward by angle \(\theta\) about its \(x\)-axis) is achieved by merely multiplying the appropriate rotations matrices: \(R_ {final} = R_{z,\phi} R_{x,\theta}\). Furthermore, the inverse of a rotation matrix corresponds to performing the opposite rotation. In short, object rotations can be handled simply by applying basic tools from linear algebra.

We can take this idea a bit further, and build a slightly larger matrix that includes both a rotation matrix (as a sub-matrix) and the position of the origin of a coordinate frame. These matrices are called *homogeneous transformation matrices*, and they can be used to encode both position and orientation. As with rotation matrices, successive combinations of translation and rotation can be realized by merely multiplying the appropriate homogeneous transformation matrices, and performing the opposite translation and rotation corresponds to the inverse of the matrix. Rotation matrices and homogeneous transformation matrices are introduced for the 2D case (rotation and translation in the plane) for the car-like robot of Chapter 6, and for the 3D case for the drone in Chapter 7.

Perhaps most surprisingly, the relationship between velocities can always be encoded as a linear mapping from one vector space to another. Consider a differential drive robot with two wheels that rotate independently. As these wheels rotate, the robot will move in the world, changing its position and orientation. The *instantaneous* relationship between the angular velocities of the two wheels and the linear and angular velocities of the robot is linear! The matrix that encodes this relationship is called a *Jacobian* matrix (which may include time-varying entries that are nonlinear functions of configuration variables). You may remember Jacobian matrices from an advanced calculus class. If so, you may recall that the Jacobian of a function relates the derivatives of the function’s input to the derivatives of its output. Even for highly nonlinear functions, the instantaneous relationship between these derivatives is linear, and expressed by the Jacobian matrix. We will see Jacobian matrices for omnidirectional robots in Chapter 4, DDRs in Chapter 5, and for drone dynamics in Chapter 7.

## 1.3.3. Optimization#

Why settle, if you can have the best?

Many of the computational problems in robotics are either intractable (i.e., have high computational complexity) or cannot be solved in closed form (e.g., determining the exact form for probability distributions). In both cases, roboticists have worked to develop efficient and accurate computational methods to solve these problems. In this book, there are two main computational approaches: sampling-based approaches, which we discussed above, and optimization methods. For the latter we will almost exclusively resort to factor graphs as a way to represent the problem space and objective function. We will also use the open source library GTSAM, a powerful software system that facilitates computationally efficient solutions for a wide range of optimization problems that can be described by factor graphs.

*Optimization* is the process of finding extremal values of performance criteria, all of which, in this book, will be expressed as scalar-valued functions of a finite number of inputs. In some cases, we search for the single, scalar input that will minimize a cost function, such as when choosing the best action for a trash sorting robot in Chapter 2. In other cases, we might search for a sequence of input commands that will yield an ideal system trajectory, such as for drone flight in Chapter 7. In still other cases, we might try to find millions of weights for a neural network to minimize recognition error of our computer vision system, such as the Convolutional Neural Nets (CNNs) of Chapter 5.

In general, we can express such problems as

in which \(x\) is called the optimization variable (or variables in the case where \(x\) is a vector) and \(f(\cdot)\) is called the objective function. For this formulation, when maximizing, \(f(\cdot)\) can be thought of as a reward. We could have framed the problem as a minimization, \(\min_x f(x)\), in which case \(f(\cdot)\) should be thought of as a cost to be minimized. It is easy to convert between these two forms (e.g., simply multiply the objective function by \(-1\)), but it is often helpful to express problems specifically as either minimizing cost or maximizing reward, based on the semantics of the problem.

Many optimization problems can be solved using the method of gradient descent. Such methods construct a sequence of estimates, \(x^1, x^2, \dots\) until a minimal value of cost is found. The incremental update rule for the estimates is given by

in which \(\alpha\) is a step-size parameter. In some cases, the gradient \(\nabla f(x) \) can be computed in closed form, while for more complex functions it may be necessary to use numerical approximations of the gradient. When working with neural networks, the cost function can be written as a sum

Here, \(x\) denotes the weights assigned to the connections in the network, and \(s\) denotes a specific example in the data set \(S\). Since differentiation is linear, the gradient of this functional can be expressed as

If the data set is very large, computing \(|S|\) gradients will be prohibitive. The method of stochastic gradient descent deals with this problem by randomly selecting a few samples from the data set, \(S' \subset S\), and using the approximation

We use stochastic gradient descent in chapter 5 to optimize the weights in a deep neural network.

Quite often in robotics, the optimization variables can be written as \(x_1, x_2, \dots x_n\), in which the subscripts denote discrete instants in time. In this case, there are typically well-defined relationships between each \(x_k\) and \(x_{k+1}\). This is true, for example, when a robot moves through the world, at each step \(k\) executing command \(u_k\) and collecting data \(z_k\). Estimating the state trajectory \(x_1, \dots, x_n\) can be formulated as an optimization problem in which the value of \(u_k\) acts as a kind of constraint on the relationship between \(x_k\) and \(x_{k+1}\), as we will see in Chapter 4 when we solve the localization problem. Similarly, if we wish to optimize the trajectory of a drone (as in Chapter 7), the optimization problem begins by finding a sequence of states \(x_1, \dots, x_n\) that maximize performance criteria. In this case, in order to ensure smooth flight, \(x_k\) and \(x_{k+1}\) should not be too far apart. For problems of this sort, when there are specific relationships between the optimization variables, and especially when they enjoy this kind of sequential structure, we can solve the optimization using factor graphs, which are extremely computationally efficient when the graph that encodes variable interdependence is sparse.