Planned summary:
Typically, inverse reinforcement learning (IRL) assumes that the demonstrator is optimal, or that any mistakes they make are caused by random noise. Without a model of how the demonstrator makes mistakes, we should not expect IRL to outperform the demonstrator. So a natural question arises: can we learn the demonstrator’s systematic mistakes from data? While there is an impossibility result here, we might hope that it is only a problem in theory, not in practice.
In this paper, my coauthors and I propose learning the cognitive biases of the demonstrator by learning their planning algorithm, with the hope that the biases end up encoded in the learned planner. We can then perform bias-aware IRL by finding the reward function that, when passed into the learned planning algorithm, results in the observed policy. We propose two algorithms that do this: one assumes that we know the ground-truth rewards for some tasks, and the other tries to keep the learned planner “close to” the optimal planner. In a simple environment with simulated human biases, both algorithms perform better than the standard IRL assumptions of perfect optimality or Boltzmann rationality, but they lose a lot of performance because the differentiable planner used to learn the planning algorithm is itself imperfect.
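To make the setup concrete, here is a minimal sketch of the first algorithm’s flavor, not our actual implementation: a differentiable planner with a few learnable parameters is fit on tasks whose rewards are known, and is then frozen while we optimize a reward vector so that the planner’s output matches the demonstrator’s observed policy on the new task. The names (`SoftPlanner`, `fit_planner`, `infer_reward`) and the particular soft-value-iteration parameterization are illustrative assumptions, not from the paper.

```python
# Minimal sketch of bias-aware IRL with a differentiable planner.
# Assumes tabular MDPs with known transitions and demonstrations given as
# state-conditional action distributions. Illustrative only.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftPlanner(nn.Module):
    """Differentiable planner: soft value iteration with learnable parameters.

    The learnable temperature and discount are a crude stand-in for the
    demonstrator's (possibly biased) planning algorithm.
    """

    def __init__(self, horizon=10):
        super().__init__()
        self.horizon = horizon
        self.log_temp = nn.Parameter(torch.zeros(()))       # softmax temperature
        self.discount_logit = nn.Parameter(torch.ones(()))  # sigmoid -> discount

    def forward(self, reward, transitions):
        # reward: (S,) per-state reward; transitions: (S, A, S) row-stochastic.
        temp = self.log_temp.exp()
        gamma = torch.sigmoid(self.discount_logit)
        v = torch.zeros_like(reward)
        for _ in range(self.horizon):
            q = reward[:, None] + gamma * (transitions @ v)  # (S, A)
            v = temp * torch.logsumexp(q / temp, dim=-1)     # soft Bellman backup
        return F.softmax(q / temp, dim=-1)                   # policy: (S, A)


def fit_planner(planner, known_reward_tasks, steps=500, lr=0.1):
    """Learn the planner on tasks whose ground-truth rewards are known."""
    opt = torch.optim.Adam(planner.parameters(), lr=lr)
    for _ in range(steps):
        loss = torch.zeros(())
        for reward, transitions, demo_policy in known_reward_tasks:
            pred = planner(reward, transitions)
            loss = loss + F.kl_div(pred.log(), demo_policy, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()
    return planner


def infer_reward(planner, transitions, demo_policy, steps=500, lr=0.1):
    """Bias-aware IRL: find the reward that, fed through the frozen learned
    planner, best reproduces the demonstrator's observed policy."""
    for p in planner.parameters():
        p.requires_grad_(False)
    reward = torch.zeros(transitions.shape[0], requires_grad=True)
    opt = torch.optim.Adam([reward], lr=lr)
    for _ in range(steps):
        pred = planner(reward, transitions)
        loss = F.kl_div(pred.log(), demo_policy, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()
    return reward.detach()
```

In this sketch the second algorithm would drop the known-reward tasks and instead keep the learned planner close to an optimal (or Boltzmann-rational) planner, for example via initialization or a regularization term, while jointly learning the reward.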
Planned opinion:
Although this was only published recently, it’s work I did over a year ago. I’m no longer very optimistic about ambitious value learning, and so I’m less excited about its impact on AI alignment now. In particular, it seems unlikely to me that we will need to infer all human values perfectly, without any edge cases or uncertainties, and then optimize them as far as possible. I would instead want to build AI systems that start with an adequate understanding of human preferences and then learn more over time, while optimizing for the preferences they already know about. However, this paper is more along the former line of work, at least as far as long-term AI alignment is concerned.
I do think that this is a contribution to the field of inverse reinforcement learning: it shows that with an appropriate inductive bias, you can become more robust to (cognitive) biases in your dataset. It’s not clear how far this will generalize, since it was only tested on simulated biases in simple environments, but I’d expect it to have at least a small effect. In practice, though, I expect that you’d get better results by providing more information, as in T-REX.