*Chelsea Sidrane*

#### What is Reinforcement Learning?

In reinforcement learning, we aim to teach computers how to make decisions on their own. Say we want to teach a computer to drive a car. In order to do this, we will write a program that takes in the “world state” — here this is the state of the car itself and the state of the roadway — current velocity, speed limit, surrounding cars, etc. Given this input, the program then outputs what driving “move” to make next. We call this program a *policy*. If we’re doing deep reinforcement learning, our policy will have tunable parameters that we, the programmers, adjust via training*. *We often start out training by randomly initializing our tunable parameters. Initially, the “moves” that our policy spits out will be bad. We then collect data by trying out our policy and then use this data to tune the parameters until the policy makes good decisions for the driving task.

This data is usually in the form of state, action, and reward tuples: $latex (s,a,r)$. The s is the world state, the a is the action the policy took in that state, say, moving over one lane to the left, and the reward is something that the RL designers construct for the system. Specifically, the designer specifies a *reward function *that generates a real-valued reward given the state s and action a. So if we the RL designers, as experienced drivers, think that moving over one lane to the left when the policy finds itself behind a slow car is a good move, then we might award one point to our system for making that decision. We might also construct negative rewards for actions that could result in bad outcomes. For example, we might assign a negative reward for getting within 1 foot of any other car, as this could cause a crash.

#### Inverse Reinforcement Learning

Sometimes, it’s hard to specify these rewards. For example, we all (mostly) know what it means to “drive well” as Abbeel and Ng describe in their paper from 2004 [1]. Driving well involves keeping a safe distance from the car in front of you and from the cars on either side, driving within some margin of the speed limit, switching lanes to allow faster cars to pass you, but not switching lanes too often, slowing down and speeding up smoothly, making turns with a wide enough radius so as to keep the passengers in your car from feeling sick, and so forth. But how do we assign relative numerical weights to all of those components that ought to play into “reward”? Is keeping a safe distance from the car in front of you equally as important as accelerating smoothly? Or more important? If it’s more important, is it twice as important? Or 1.3 times as important?

#### The Ambiguity Problem

These questions are inherently difficult to answer, so sometimes instead of creating a reward function, we call in an expert to demonstrate good driving behavior, and then learn from them. One way we can learn from the expert is to directly copy their behavior, but something else we can do is to try and learn the expert’s reward function, assuming they are performing roughly optimally with respect to it. This is known as *Inverse Reinforcement Learning*. But why would we want to do this instead of directly copying the expert’s behavior? We might choose to learn the expert’s underlying reward function because we want to understand *why* the expert behaves like they do. Ultimately, we want to achieve the same outcomes as the expert [2], and the reward function more succinctly and directly describes desirable outcomes than does a policy demonstrating expert behavior [1]. Knowing the desired outcome is helpful when the expert has a different action space than our computer learner. For example, an expert human driver has actions available like “scoot forward in the driver’s seat to get a better view” but an autonomous car most likely has a camera that is fixed to its mount which cannot “scoot forward in the driver’s seat”. We may still be able to achieve the same outcome as the expert — driving well– but we may not be able to copy the expert’s actual actions to help us get there.

Okay so we’ve called our driving expert in and she’s given us some demonstrations of good driving. In one of these examples she stays in the same lane on the highway for several miles. There is an opportunity to move to a faster lane, but the open space that is available would mean that the distance to the car in front would be less than her stated, desired distance of 200 feet. However, as we look at the data, we also realize that switching lanes in that time interval would have put her above her desired lane switching frequency. As we try to reconstruct the relative weights in her reward function, in this scenario it’s unclear how to assign credit for her decision. Which was the more important factor in her decision to remain in her lane, lane switching frequency or distance to the car in front? Were the factors equally important? Was one much more important? We can’t know by looking at this data point. We could scale the weighting between the two criteria however we’d like, but because both criteria recommend the same behavior in this situation, the behavior wouldn’t change as the relative weighting changes. This is known as *reward function ambiguity*. Here we are talking about how there may be ambiguity within a single example trajectory. While this particular ambiguity may be resolved by looking at additional data, there will always exist multiple reward functions to describe observed behavior [3].

#### Maximum Entropy IRL

There have been several strategies proposed to choose among this set of reward functions that fit the observed behavior. One such strategy is called *Maximum Entropy Inverse Reinforcement Learning. *Before we get into any of the technical details, the main idea of maximum entropy inverse reinforcement learning is this:

We want to incorporate the expert data into our reconstruction of the reward function while maintaining equal preference (in terms of negative or positive rewards) over stuff (state-action pairs) that we don’t have data about.

Okay now for the technical details:

In maximum entropy inverse reinforcement learning we are going to consider a stochastic policy. In our policy we will execute a certain action with some *probability* as opposed to definitely executing one action depending on the world state.

We have some probability of choosing action $latex a_1$ at starting state $latex s_1$ which takes us to state $latex s_2$ with some probability, … and so forth. In this way, we can compute the probability of the trajectory $latex \tau = s_1, s_2$ under our policy. In fact, under our policy, we can compute the probability of any possible trajectory $latex \tau$. These probabilities make up a distribution over trajectories.

We want to construct a reward function and then compute an optimal or approximately optimal policy with respect to that reward function. We want the distribution over trajectories under that policy to match the expert’s distribution over trajectories, but we don’t have access to the expert’s full distribution — only sample demonstrations, which make up our data set. However if all we’re worried about is fitting the data set, there are many reward functions that would do so, as we explained earlier. Ziebart proposed that the way to break the “tie” between all of these reward functions was to select the one that was “maximally non-committal” regarding missing information [3]. Specifically, we want the distribution over trajectories from the resulting policy to be “maximally non-committal” with respect to trajectories that it doesn’t have data about.

This idea of a “maximally non-committal” distribution given the data was formalized by Jaynes in 1957 [4], who defined this distribution using an idea from information theory: entropy.

What’s entropy? Let’s take a quick digression.

______________________________________

#### Entropy

Entropy measures the “surprise” of a distribution. If you have a random variable $latex X$ which takes values $latex x\in\mathcal{X} $, if $latex x$ occurs with low probability, then $latex \frac{1}{p(x)} $ is a large number — you can say it is very surprising to see value $latex x$. If we take a log of this quantity to get $latex \log_2 \frac{1}{p(x)}$ the behavior is still qualitatively similar — $latex \log_2 \frac{1}{p(x)}$ increases as $latex p(x) $ approaches 0.

We then take an expectation over the possible values to obtain the entropy, H:

$latex H(X) = \mathbb{E}\left[log_2 \frac{1}{p(x)}\right] = – \sum_{x\in\mathcal{X}} p(x) log_2(p(x)) $

The entropy of a distribution is largest when it is uniformly distributed, meaning that $latex X$ takes on each $latex x$ with equal probability: $latex p(x) = \frac{1}{\mid \mathcal{X} \mid} $. Entropy can also be interpreted as measuring the uncertainty in a distribution. A sharply peaked distribution has low entropy and we can be more confident in predicting the value of a sample (we will be less often “surprised”) than when compared to a uniform distribution (we will always be “surprised”).

______________________________________

Jaynes concluded that the distribution that introduces the least bias with respect to yet unseen data is the distribution that maximizes the likelihood of the data and has the maximum entropy. So in Maximum Entropy IRL we solve the ambiguity problem by selecting a reward function such that the resulting distribution over trajectories maximizes the likelihood of the expert data and also has the maximum entropy. Maximizing the entropy in this way represents an acknowledgement that we want to be *uncertain* or *ambivalent* about all trajectories that have the same ‘distance’ to our dataset. Note that we do have to define the notion of distance. We do this through defining feature functions for the trajectories that turn each trajectory into a feature vector.

To wrap things up, we’ll give an example of the outcome of using Maximum Entropy IRL. Say I have a driving dataset where I only drive straight on local roads. If I used MaxEnt IRL to extract a reward function for this dataset, I would obtain a reward function that indicated preference for going straight above turning left or right, but would give equal reward values for going both left and right.

And that’s it! Thanks for reading.

#### References

[1] Abbeel, Pieter, and Andrew Y. Ng. “Apprenticeship learning via inverse reinforcement learning.” In *Proceedings of the twenty-first international conference on Machine learning*, p. 1. ACM, 2004.

[2] Finn, Chelsea. “Inverse Reinforcement Learning.” Lecture video, CS294-112 Deep Reinforcement Learning Sp17. URL

[3] Ziebart, Brian D., Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. “Maximum entropy inverse reinforcement learning.” In *Aaai*, vol. 8, pp. 1433-1438. 2008.

[4] Jaynes, Edwin T. “Information theory and statistical mechanics.” *Physical review* 106, no. 4 (1957): 620.

Other useful references on IRL:

Lecture Notes by Professor Katerina Fragkiadaki