## Learning biases and rewards simultaneously

I’ve finally uploaded to arXiv our work on inferring human biases alongside IRL, which was published at ICML 2019.

## Summary of the paper

## The IRL Debate

Here’s a quick tour of the debate about inverse reinforcement learning (IRL) and cognitive biases, featuring many of the ideas from the first chapter of the Value Learning sequence:

I had the intuition that the impossibility theorem was like the other no-free-lunch theorems in ML: not actually relevant for what ML could do in practice. So we tried to learn and correct for systematic biases in IRL.

## The idea behind the algorithms

The basic idea was to learn the *planning algorithm* by which the human produces demonstrations, and try to ensure that the planning algorithm captured the appropriate systematic biases. We used a Value Iteration Network to give an inductive bias towards “planners” but otherwise did not assume anything about the form of the systematic bias. [1] Then, we could perform IRL by figuring out which reward would cause the planning algorithm to output the given demonstrations. The reward would be “debiased” because the effect of the biases on the policy would already be accounted for in the planning algorithm.

How could we learn the planning algorithm? Well, one baseline method is to assume that we have access to some tasks where the rewards are known, and use those tasks to learn what the planning algorithm is. Then, once that is learned, we can infer the rewards for new tasks that we haven’t seen before. This requires the planner to generalize across tasks.

However, it’s kind of cheating to assume access to ground truth rewards, since we usually wouldn’t have them. What if we learned the planning algorithm and rewards simultaneously? Well, the no-free-lunch theorem gets us then: maximizing the true reward and minimizing the negative of the true reward would lead to the same policy, and so you can’t distinguish between them, and so the output of your IRL algorithm could be the true reward or the *negative* of the true reward. It would be really bad if our IRL algorithm said exactly the opposite of what we want. But surely we can at least assume that humans are not expected utility *minimizers* in order to eliminate this possibility.

So, we make the assumption that the human is “near-optimal”. We initialize the planning algorithm to be optimal, and then optimize for a planning algorithm that is “near” the optimal planner, in gradient-descent-space, that combined with the (learned) reward function explains the demonstrations. You might think that a minimizer is in fact “near” a maximizer; empirically this didn’t turn out to be the case, but I don’t have a particularly compelling reason why that happened.
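The maximizer/minimizer degeneracy described above can be checked on a toy problem. The following is a minimal sketch, not the method from the paper: it uses plain tabular value iteration instead of a Value Iteration Network, and the helper names (`q_values`, `greedy_policy`), the random MDP, and the discount `gamma` are all assumptions made for illustration. It plans once as a maximizer of a reward `reward` and once as a minimizer of `-reward`, then compares the resulting policies.

```python
import numpy as np

def q_values(reward, transitions, gamma=0.9, iters=200, maximize=True):
    """Tabular value iteration. With maximize=False the planner is an
    expected-utility minimizer instead of a maximizer."""
    n_states, n_actions, _ = transitions.shape
    pick = np.max if maximize else np.min
    q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        v = pick(q, axis=1)  # state values under this planner
        # Reward depends on the current state only; transitions @ v has shape (S, A).
        q = reward[:, None] + gamma * transitions @ v
    return q

def greedy_policy(q, maximize=True):
    """Action chosen in each state by a maximizing or minimizing planner."""
    return np.argmax(q, axis=1) if maximize else np.argmin(q, axis=1)

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
# Hypothetical random MDP: transitions[s, a] is a distribution over next states.
transitions = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
reward = rng.normal(size=n_states)

# Planner A maximizes `reward`; planner B minimizes `-reward`.
policy_max = greedy_policy(q_values(reward, transitions, maximize=True), maximize=True)
policy_min = greedy_policy(q_values(-reward, transitions, maximize=False), maximize=False)

same_policy = bool(np.array_equal(policy_max, policy_min))
print("policies identical:", same_policy)
```

Because the two (planner, reward) pairs produce the same behavior, demonstrations alone cannot tell them apart. Some extra assumption is needed to break the tie, which is the role played by the “near-optimal” initialization.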

## Results

Here’s the graph from our paper, showing the performance of various algorithms on some simulated human biases (higher = better). Both of our algorithms get access to the simulated human policies on multiple tasks. Algorithm 1 is the one that gets access to ground-truth rewards for some tasks, while Algorithm 2 is the one that instead tries to ensure that the learned planner is “near” the optimal planner. “Boltzmann” and “Optimal” mean that the algorithm assumes that the human is Boltzmann rational and optimal respectively.

Our algorithms work better on average, mostly by being robust to the specific kind of bias that the demonstrator had: they tend to perform on par with the better of the Boltzmann and Optimal baseline algorithms. Surprisingly (to me), the second algorithm sometimes outperforms the first, even though the first algorithm has access to more data (since it gets access to the ground-truth rewards in some tasks). This could be because the second algorithm exploits the assumption that the demonstrator is near-optimal, which the first algorithm doesn’t do, even though the assumption is correct for most of the models we test. On the other hand, maybe it’s just random noise.
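For readers unfamiliar with the two baseline assumptions, here is a small sketch of what “Optimal” and “Boltzmann rational” mean as models of the demonstrator, using the standard definitions. The function names, the rationality parameter `beta`, and the example Q-values are assumptions for illustration, not taken from the paper’s code.

```python
import numpy as np

def optimal_policy(q):
    """Optimal demonstrator: probability 1 on the highest-valued action in each state."""
    probs = np.zeros_like(q, dtype=float)
    probs[np.arange(q.shape[0]), np.argmax(q, axis=1)] = 1.0
    return probs

def boltzmann_policy(q, beta=1.0):
    """Boltzmann-rational demonstrator: P(a | s) is proportional to exp(beta * Q(s, a))."""
    # Subtracting the row maximum leaves the distribution unchanged
    # and avoids overflow in exp.
    logits = beta * (q - q.max(axis=1, keepdims=True))
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

# Hypothetical Q-values for 2 states and 3 actions.
q = np.array([[1.0, 2.0, 0.0],
              [0.5, 0.5, 3.0]])

noisy = boltzmann_policy(q, beta=1.0)   # mistakes are random noise around the best action
sharp = boltzmann_policy(q, beta=50.0)  # large beta approaches the optimal policy
print(noisy.round(3))
print(sharp.round(3))
```

Neither model can represent a systematic bias: under both, the most likely action is always the one with the highest Q-value, so any consistent deviation by the demonstrator gets attributed to the reward.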

## Implications

## Superintelligent AI alignment

The most obvious way that this is relevant to AI alignment is that it is progress on ambitious value learning, where we try to learn a utility function that encodes all of human values.

“But wait,” you say, “didn’t you argue that ambitious value learning is unlikely to work?”

Well, yes. At the time that I was doing this work, I believed that ambitious value learning was the only option, and that it seemed hard but not doomed. This was the obvious thing to do to try and advance it. But this was over a year ago; the reason it’s only now coming out is that it took a while to publish the paper. (In fact, it predates my state of the world work.) But it’s true that *now* I’m not very hopeful about ambitious value learning, and so this paper’s contribution towards it doesn’t seem particularly valuable to me. However, a few others remain optimistic about ambitious value learning, and if they’re right, this research might be useful for that pathway to aligned AI.

I do think that the paper contributes to narrow value learning, and I still think that this very plausibly will be relevant to AI alignment. It’s a particularly direct attack on the specification problem, with the goal of inferring a specification that leads to a policy that would outperform the demonstrator. That said, I am no longer very optimistic about approaches that require a specific structure (in this case, world models fed into a differentiable planner with an inductive bias that then produces actions), and I am also less optimistic about using approaches that try to mimic expected value calculations, rather than trying to do something more like norm inference.

(However, I still expect that the impossibility result in preference learning will only be a problem in theory, not in practice. It’s just that this particular method of dealing with it doesn’t seem like it will work.)

## Near-term AI issues

In the near term, we will need better ways than reward functions to specify to an AI system the behavior that we want. Inverse reinforcement learning is probably the leading example of how we could do this. However, since the specific algorithms require much better differentiable planners before they will perform on par with existing algorithms, it may be some time before they are useful. In addition, it’s probably better to use specific bias models in the near term. Overall, I think these methods or ideas are about as likely to be used in the near term as the average paper (which is to say, not very likely).

[1] A Value Iteration Network is a fully differentiable neural network that embeds an approximate value iteration algorithm inside a feed-forward classification network.

Planned summary:

Typically, inverse reinforcement learning assumes that the demonstrator is optimal, or that any mistakes they make are caused by random noise. Without a model of *how* the demonstrator makes mistakes, we should expect that IRL would not be able to outperform the demonstrator. So, a natural question arises: can we learn the systematic mistakes that the demonstrator makes from data? While there is an impossibility result here, we might hope that it is only a problem in theory, not in practice.

In this paper, my coauthors and I propose that we learn the cognitive biases of the demonstrator, by learning their planning algorithm. The hope is that the cognitive biases are encoded in the learned planning algorithm. We can then perform bias-aware IRL by finding the reward function that when passed into the planning algorithm results in the observed policy. We have two algorithms which do this, one which assumes that we know the ground-truth rewards for some tasks, and one which tries to keep the learned planner “close to” the optimal planner. In a simple environment with simulated human biases, the algorithms perform better than the standard IRL assumptions of perfect optimality or Boltzmann rationality, but they lose a lot of performance by using an imperfect differentiable planner to learn the planning algorithm.

Planned opinion:

Although this only got published recently, it’s work I did over a year ago. I’m no longer very optimistic about ambitious value learning, and so I’m less excited about its impact on AI alignment now. In particular, it seems unlikely to me that we will need to infer all human values perfectly, without any edge cases or uncertainties, which we then optimize as far as possible. I would instead want to build AI systems that start with an adequate understanding of human preferences, and then learn more over time, in conjunction with optimizing for the preferences they know about. However, this paper is more along the former line of work, at least for long-term AI alignment.

I do think that this is a contribution to the field of inverse reinforcement learning—it shows that by using an appropriate inductive bias, you can become more robust to (cognitive) biases in your dataset. It’s not clear how far this will generalize, since it was tested on simulated biases on simple environments, but I’d expect it to have at least a small effect. In practice though, I expect that you’d get better results by providing more information, as in T-REX.

I like this example of “works in practice but not in theory.” Would you associate “ambitious value learning vs. adequate value learning” with “works in theory vs. doesn’t work in theory but works in practice”?

One way that “almost rational” is much closer to optimal than “almost anti-anti-rational” is ye olde dot product, but a more accurate description of this case would involve dividing up the model space into basins of attraction. Different training procedures will divide up the space in different ways; this is actually sort of the reverse of a Monte Carlo simulation, where one of the properties you might look for is ergodicity (eventually visiting all points in the space).

Potentially. I think the main question is whether adequate value learning will work in practice.