Deep Reinforcement Learning

6.6. Deep Reinforcement Learning#

Deep Q-learning and policy gradient.

Splash image with steampunk autonomous car

Deep reinforcement learning (DRL) applies the power of deep learning to bring reinforcement learning to much more complex domains than what we were able to tackle with the Markov Decision Processes and RL concepts introduced in Chapter 3. The use of large, expressive neural networks has allowed researchers and practitioners alike to work with high bandwidth sensors such as video streams and LIDAR, and bring the promise of RL into real-world domains such as autonomous driving. This is still a field of active discovery and research, however, and we can give but a brief introduction here about what is a vast literature and problem space.

6.6.1. RL and Autonomous Driving#

A simple example in the autonomous driving domain is lane switching. Suppose we are driving along at 3-lane highway, and we can see some ways ahead, and -using the rear-view mirror- some ways behind us. We are driving at a speed that is comfortable to us, but other cars have different ideas about their optimal driving speed. Hence, sometimes we would like to change lanes, and we could learn a policy to do this for us. As discussed in Section 6.5, this is lateral control. A more sophisticated example would also allow us to adapt our speed to the traffic pattern, but by relying on a smart cruise control system we could safely ignore the longitudinal control problem.

To turn this into a reinforcement learning problem, we first need to define a state space \({\cal X}\) and an action space \({\cal A}\). There are a variety of ways to engineer this aspect of the problem. For example, we could somehow encode the longitudinal distance and lane index for each of the K closest cars, where K is a parameter, say 5 or 10. One problem is that the number of cars that are actually present is variable, which is difficult to deal with. Another approach is to make this into an image processing problem, by creating a finite element representation of the road before and behind us, and marking each cell as occupied or not. The latter is fairly compatible with automotive sensors such as LIDAR.

In terms of actions, the easiest approach is to have a number of discrete choices to go left, right, or stay in the current lane. We could be more sophisticated about it and have both “aggressive” and “slow” versions of these in addition to a default version, akin to the motion primitives we previously discussed.

Actually implementing this on an autonomous vehicle, or even sketching an implementation in a notebook with recorded or simulated data, is beyond what we can accomplish in a jupyter notebook. Hence, we will be content below to sketch three popular foundational methods from deep reinforcement learning, without actually implementing them here. At the end of this chapter we provide some references where you can delve into these topics more deeply.

6.6.2. Deep Q-Learning#

DQN is an early deep learning RL method akin to Q-learning.

Recall from Section 3.6 that we can define a policy in terms of Q-values, sometimes also called state-action values, and that we can define the optimal policy as

(6.68)#\[\begin{equation} \pi^*(x) = \arg \max_a Q^*(x,a) \end{equation}\]

where \(Q^*(x,a)\) denote the Q-values for the optimal policy. In Q-learning, we start with some random Q-values and then iteratively improve an estimate \(\hat{Q}(x,a)\) for the optimal Q-values by alpha-blending between old and new estimates:

(6.69)#\[\begin{equation} \hat{Q}(x,a) \leftarrow (1-\alpha) \hat{Q}(x,a) + \alpha~\text{target}(x,a,x'). \end{equation}\]

Above, the “target value”

(6.70)#\[\begin{equation} \text{target}(x,a,x') \doteq R(x,a,x') + \gamma \max_{a'} \hat{Q}(x',a') \end{equation}\]

is a value that we think is an improvement on the previous value \(\hat{Q}(x,a)\). Indeed: \(\text{target}(x,a,x')\) uses the current estimate of the Q-values for future states, but improves on this by using the known rewards \(R(x,a,x')\) for the current action \(a\) in the current state \(x\).

In the deep Q-network or DQN method we use a supervised learning approach to Q-learning. We train a neural network, parameterized by \(\theta\), to approximate the optimal Q-values:

(6.71)#\[\begin{equation} Q^*(x,a) \approx \hat{Q}(x,a; \theta) \end{equation}\]

It might be worthwhile at this point to re-visit Section 5.6, where we introduced neural networks and how to train them using stochastic gradient descent (SGD).

In the context of RL, the DQN method uses two additional ideas that are crucial in making the training converge to something sensible in difficult problems. The first is splitting the training into execution and experience replay phases:

during the execution phase, the policy is executed (possibly with some degree of randomness) and the experiences \((x,a,r,x')\), with \(r\) the reward, are stored in a dataset \(D\);
during experience replay, we randomly sample from these experiences to create mini-batches of data, which are in turn used to perform SGD on the parameters \(\theta\).

The second idea is to calculate the target values

(6.72)#\[\begin{equation} \text{target}(x,a,x') \doteq R(x,a,x') + \gamma \max_{a'} \hat{Q}(x',a'; \theta^{old}) \end{equation}\]

with the parameters \(\theta^{old}\) from the previous epoch, to provide a more stable approximation target. The mini-batch loss we minimize using SGD is then

(6.73)#\[\begin{equation} \mathcal{L}_{\text{DQN}}(\theta; D) \doteq \sum_{(x,a,r,x')\in D} [\text{target}(x,a,x') - Q(x,a; \theta)]^2 \end{equation}\]

With this basic scheme, a team from DeepMind was able to achieve human or super-human performance on about 50 Atari 2600 games in 2015 [Mnih et al., 2015]. DQN is a so-called off-policy method, in that each execution phase uses the best policy we computed so far, but we can still replay earlier experiences gathered with “lesser” policies. Nothing in the experience replay phase references the policy: every experience leads to a valid Q-value backup and a valid supervised learning signal.

6.6.3. Policy Optimization#

Policy optimization takes a black box optimization approach to a deep policy.

Whereas the above gets at an optimal policy indirectly, via deep Q-learning, a different and very popular idea is to directly parameterize the policy using a neural network, with weights \(\theta\). It is common to make this a stochastic policy,

(6.74)#\[\begin{equation} \pi(a|x; \theta) \end{equation}\]

where \(a \in {\cal A}\) is an action, \(x \in {\cal X}\) is a state, and the policy outputs a probability for each action \(a\) based on the state \(x\). One of the reasons to prefer stochastic policies is that they are differentiable, as they output continuous values rather than discrete actions. This allows us to optimize for them via gradient descent, as we explore in the next section.

In Chapter 5 we used supervised learning to train neural networks, and we just applied this for learning Q-values in DQN. It is useful to consider how this might work for training a policy. Recall from Section 5.6 that we defined the empirical cross-entropy loss as

(6.75)#\[\begin{equation} \mathcal{L}_{\text{CE}}(\theta; D) \doteq - \sum_{(x,y=c)\in D} \sum_c \log p_c(x;\theta) \end{equation}\]

where \(D\) is a dataset of supervised learning examples. Above \(y\) takes on a discrete class value \(c\), but another very common way to write this is via a “one-hot encoding”. Let \(y_c\) be 1 if \(y=c\) and 0 otherwise, then our loss becomes:

(6.76)#\[\begin{equation} \mathcal{L}_{\text{CE}}(\theta; D) = -\sum_{(x,y)\in D} \sum_c y_c \log p_c(x;\theta) \end{equation}\]

This formulation is equivalent, but now we are summing over all classes for each data point, with \(y_c\) acting as a weight, either one or zero. When someone is so kind as to give us the optimal action \(y_a\) (as a one-hot encoding) for every state \(x\) in some dataset \(D\), we can apply this same loss function to a stochastic policy, obtaining

(6.77)#\[\begin{equation} \mathcal{L}_{\text{CE}}(\theta; D) = -\sum_{(x,y)\in D} \sum_a y_a \log \pi(a| x; \theta) \end{equation}\]

In policy optimization we gather data by rolling out a set of trajectories \(\tau_i\). In supervised learning we have a dataset \(D\) and labels \(y_c\), but we have to proceed a bit differently in a reinforcement learning setting. In particular, for on-policy RL we gather data by executing our current best guess for the policy for some rollout length or horizon \(H\), and we do this many different times, each time obtaining a trajectory \(\tau_i\). That still leaves the training signal: where does that come from? The key idea is to estimate how good a particular action is by estimating the state-action values \(Q\) from the rollout rewards. In detail, we estimate the expected discounted reward starting at \(x_{it}\), and taking action \(a_{it}\), as

(6.78)#\[\begin{equation} \hat{Q}(x_{it},a_{it}) \doteq \sum_{k=t}^H \gamma^{k-t}R(x_{ik},a_{ik},x_{ik}'). \end{equation}\]

Note in each rollout we can only sum until \(k=H\), so Q-values earlier in the rollout will be estimated more accurately. Regardless, we can then use these estimated Q-values as an alternative to the “one or zero” weight above, obtaining the surrogate loss \(\mathcal{L}(\theta)\) by averaging over the trajectories \(\tau_i\):

(6.79)#\[\begin{equation} \mathcal{L}(\theta) = - \sum_i \sum_{t=1}^H \hat{Q}(x_{it},a_{it}) \log \pi(a_{it}|x_{it} \theta). \end{equation}\]

As you can see, once again estimated Q-values are involved, and they are expected to converge to the optimal Q-values over time. This time, however, we parameterize the policy instead. The Q-values act as weights in the surrogate loss function that looks very much like the supervised cross-entropy loss, but now our supervision comes from experience.

Putting this all together yields the basic policy optimization algorithm:

Initialize \(\theta\)
Until convergence:
1. roll out a number of trajectories \(\tau_i\) using the current policy \(\pi(a;x,\theta)\)
2. try to change the parameters \(\theta\) as to decrease the surrogate loss function \(\mathcal{L}(\theta)\)

A simple, gradient-free approach for step 2 is simple hill-climbing, aka stochastic search:

perturb \(\theta\) to \(\theta'\)
set \(\theta \leftarrow \theta'\) iff \(\mathcal{L}(\theta') < \mathcal{L}(\theta)\)

The perturbation step could be as simple as adding some Gaussian noise to the weights of the network. More sophisticated “black-box” approaches such as genetic algorithms or evolution strategies can be applied to this problem instead, but they all share the property that they are “gradient-free”, which seems to be a sub-optimal strategy. Hence, we next look at a gradient descent approach to policy optimization.

6.6.4. Policy Gradient Methods#

Policy gradient methods are akin to policy iteration, with a neural flavor.

In a nutshell, policy gradient methods calculate the gradient of the surrogate loss \(\mathcal{L}(\theta)\) defined above with respect to the policy parameters \(\theta\):

(6.80)#\[\begin{equation} \nabla_\theta \mathcal{L}(\theta) \leftarrow - \sum_i \sum_{t=1}^H \hat{Q}(x_{it},a_{it}) \nabla_\theta \log \pi(a_{it}|x_{it}, \theta), \end{equation}\]

where \(\nabla_\theta \log \pi(a_{it}|x_{it}, \theta)\) is the gradient of the logarithm of the stochastic policy. This is easily obtained via back-propagation using any neural network framework of choice. In the case that actions are discrete, as in our example above, a stochastic policy network typically ends with a softmax function. Recall that if the network outputs a vector of raw scores (the logits) \(z \in \mathbb{R}^K\), the softmax is defined as

(6.81)#\[\begin{equation} \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}, \quad i = 1,\dots,K. \end{equation}\]

Thus, the logits are the raw outputs before applying the softmax, and \(\nabla_\theta \log \pi(a_{it}|x_{it}, \theta)\) is computed with respect to these values. We then use gradient descent to update the policy parameters:

(6.82)#\[\begin{equation} \theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}(\theta) \end{equation}\]

where \(\alpha\) is a learning rate.

The algorithm above, using the estimated Q-values, is almost identical to the REINFORCE method [Williams, 1992]. That algorithm further improves on performance by not using the raw Q-values but rather the difference between the Q-values and some baseline policy. This has the effect of reducing the variance in the estimated Q-values due to using only a finite amount of data. The REINFORCE algorithm was introduced in 1992 and hence pre-dates the deep-learning revolution by about 20 years. It should also be said that in DRL, the neural networks that are used are typically not very deep. Several modern methods, such as “proximal policy optimization” (PPO) [Schulman et al., 2017] apply a number of techniques to improve this basic method even further and make it more sample-efficient. PPO is now one of the most often-used DRL methods.