Learning preferences by looking at the world

Link post

We’ve written up a blog post about our recent paper that I’ve been linking to but haven’t really announced or explained. The key idea is that since we’ve optimized the world towards our preferences, we can infer these preferences just from the state of the world. We present an algorithm called Reward Learning by Simulating the Past (RLSP) that can do this in simple environments, but my primary goal is simply to show that there is a lot to be gained by inferring preferences from the world state.

The rest of this post assumes that you’ve read at least the non-technical part of the linked blog post. This post is entirely my own and may not reflect the views of my coauthors.

Other sources of intuition

The story in the blog post is that when you look at the state of the world, you can figure out what humans have put effort into, and thus what they care about. There are other intuition pumps that you can use as well:

  • The world state is “surprisingly” ordered and low-entropy. Anywhere you see such order, you can bet that a human was responsible for it, and that the human cared about it.

  • If you look across the world, you’ll see many patterns recurring again and again—vases are usually intact, glasses are usually upright, and laptops are usually on desks. Patterns that wouldn’t have happened without humans are likely something humans care about.

How can a single state do so much?

You might be wondering how a single state could possibly contain so much information, and you would be right to wonder. This method depends crucially on two assumptions: known dynamics (i.e. a model of “how the world works”) and a good featurization.

Known dynamics. This is what allows you to simulate the past, and figure out what “must have happened”. Using the dynamics, the robot can figure out that breaking a vase is irreversible, and that Alice must have taken special care to avoid doing so. This is also what allows us to distinguish between effects caused by humans (which we care about) and effects caused by the environment (which we don’t care about).
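To make “simulating the past” concrete, here is a minimal toy sketch. It is not the RLSP algorithm from the paper: the two-state vase world, the Boltzmann-rational one-step model of Alice, the effort cost, and every name and number below are assumptions made up for illustration. The only idea it shares with the post is that, given known dynamics, you can marginalize over everything Alice might have done in the past and ask which reward makes the observed state likely.

```python
import math

# Hypothetical toy world (not from the paper): a vase is intact or broken.
INTACT, BROKEN = 0, 1
ACTIONS = ("careful", "careless")

def transition(state, action):
    """Known dynamics: {next_state: probability}. Breaking is irreversible."""
    if state == BROKEN:
        return {BROKEN: 1.0}
    if action == "careful":
        return {INTACT: 1.0}
    return {INTACT: 0.5, BROKEN: 0.5}

def policy(state, vase_weight, effort_cost=0.1, beta=5.0):
    """Assumed human model: Alice is Boltzmann-rational over one-step outcomes."""
    scores = {}
    for action in ACTIONS:
        expected = sum(p * (vase_weight if s2 == INTACT else 0.0)
                       for s2, p in transition(state, action).items())
        cost = effort_cost if action == "careful" else 0.0
        scores[action] = beta * (expected - cost)
    z = sum(math.exp(v) for v in scores.values())
    return {a: math.exp(v) / z for a, v in scores.items()}

def likelihood_of_state(observed, vase_weight, horizon):
    """P(observed state | reward), summing over all possible past trajectories."""
    dist = {INTACT: 1.0, BROKEN: 0.0}  # assume the vase started out intact
    for _ in range(horizon):
        nxt = {INTACT: 0.0, BROKEN: 0.0}
        for s, ps in dist.items():
            for a, pa in policy(s, vase_weight).items():
                for s2, pt in transition(s, a).items():
                    nxt[s2] += ps * pa * pt
        dist = nxt
    return dist[observed]

def posterior(observed, hypotheses, horizon):
    """Bayes rule with a uniform prior over candidate reward weights."""
    lik = {h: likelihood_of_state(observed, h, horizon) for h in hypotheses}
    z = sum(lik.values())
    return {h: l / z for h, l in lik.items()}

# Hypotheses: Alice does not care about the vase (0.0) or values it (1.0).
post = posterior(INTACT, hypotheses=(0.0, 1.0), horizon=10)
```

In this toy, seeing an intact vase after ten steps puts most of the posterior on “Alice values the vase”, because under the assumed dynamics a careless Alice would very likely have broken it by now. The inference comes entirely from the dynamics and the human model; the observation itself is a single state.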

If you take away the knowledge of dynamics, much of the oomph of this method is gone. You could still look for and preserve repetitions in the state—maybe there are a lot of intact vases and no broken vases, so you try to keep vases intact. But this might also lead you to making sure that nobody puts warning signs near cliffs, since most cliffs don’t have warning signs near them.
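A dynamics-free heuristic of this kind might look like the sketch below. The observations are invented purely for illustration, and `majority_patterns` is a hypothetical helper, not something from the paper.

```python
from collections import Counter

# Made-up observations for illustration only: (object, condition) pairs.
observations = [
    ("vase", "intact"), ("vase", "intact"), ("vase", "intact"),
    ("cliff", "no warning sign"), ("cliff", "no warning sign"),
    ("cliff", "warning sign"),
]

def majority_patterns(observations):
    """For each object type, return the most common condition observed."""
    counts = {}
    for obj, condition in observations:
        counts.setdefault(obj, Counter())[condition] += 1
    return {obj: c.most_common(1)[0][0] for obj, c in counts.items()}

# "Norms" a dynamics-free agent would try to preserve.
norms = majority_patterns(observations)
```

Here `norms` says to keep vases intact, which is what we want, but it also says to keep cliffs free of warning signs. Without dynamics, there is no way to tell a pattern that humans worked to produce from one that is merely the default.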

But notice that dynamics are an empirical fact about the world, and do not depend on “values”. We should expect powerful AI systems to have a good understanding of dynamics. So I’m not too worried about the fact that we need to know dynamics for this to work well.

Features. A good featurization, on the other hand, allows you to focus on reward functions that are “reasonable” or “about the important parts”. It eliminates a vast swathe of strange, implausible reward functions that you otherwise would not be able to eliminate. If you didn’t have a good featurization and instead allowed any function mapping from states to rewards, then you would typically learn some degenerate reward, such as mapping the observed state to reward 1 and mapping everything else to reward 0. (IRL faces the same problem of degenerate rewards. Since we observe strictly less than IRL does, we face the same problem.)
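The contrast between a feature-based reward and a degenerate tabular one can be sketched as follows. The state layout, the single hand-picked feature, and all function names are assumptions for illustration; the paper's environments and features are different.

```python
# Hypothetical state: (vase_intact, room_alice_is_in, hour_of_day).
observed = (True, "kitchen", 9)
similar = (True, "hallway", 10)   # differs only in details we expect not to matter
broken = (False, "kitchen", 9)

def features(state):
    """Assumed hand-picked featurization: only the vase's condition is kept."""
    vase_intact, _room, _hour = state
    return (1.0 if vase_intact else 0.0,)

def linear_reward(state, weights):
    """Reward restricted to a weighted sum of features."""
    return sum(w * f for w, f in zip(weights, features(state)))

def indicator_reward(state):
    """Degenerate tabular reward: 1 for the observed state, 0 for all others."""
    return 1.0 if state == observed else 0.0

weights = (1.0,)
feature_based = [linear_reward(s, weights) for s in (observed, similar, broken)]
degenerate = [indicator_reward(s) for s in (observed, similar, broken)]
```

The feature-based reward treats `observed` and `similar` identically and only penalizes the broken vase. The indicator reward “explains” the observation just as well, but it penalizes every state other than the exact one observed, so it says nothing useful about what Alice cares about.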

I’m not sure whether features are more like empirical facts, or more like values. It sure seems like there are very natural ways to understand the world that imply a certain set of features, and that a powerful AI system is likely to have these features; but maybe it only feels this way because we humans actually use those features to understand the world. I hope to test this in future work by trying out RLSP-like algorithms in more realistic environments where we first learn features in an unsupervised manner.

Connection to impact measures

Preferences inferred from the state of the world are kind of like impact measures in that they allow us to infer all of the “common sense” rules that humans follow that tell us what not to do. The original motivating example for this work was a more complicated version of the vase environment, which is the standard example for negative side effects. (It was more complicated because at the time I thought it was important for there to be “repetitions” in the environment, e.g. multiple intact vases.)

Desiderata. I think that there are three desiderata for impact measures that are very hard to meet in concert. Let us say that an impact measure must also specify the set of reward functions it is compatible with. For example, attainable utility preservation (AUP) aims to be compatible with rewards whose codomain is [0, 1]. Then the desiderata are:

  • Prevent catastrophe: The impact measure prevents all catastrophic outcomes, regardless of which compatible reward function the AI system optimizes.

  • Do what we want: There exists some compatible reward function such that the AI system does the things that we want, despite the impact measure.

  • Value agnostic: The design of the impact measure (both the penalty and the set of compatible rewards) should be agnostic to human values.

Note that the first two desiderata are about what the impact measure actually does, as opposed to what we can prove about it. The second one is an addition I’ve argued for before.

With both relative reachability and AUP, I worry that any setting of the hyperparameters will lead to a violation of either the first desideratum (if the penalty is not large enough) or the second one (if the penalty is too large). For intermediate settings, both desiderata would be violated.

When we infer preferences from the state of the world, we are definitely giving up on being value agnostic, but we are gaining significantly on the “do what we want” desideratum: the point of inferring preferences is that we do not also penalize positive impacts that we want to happen.

Test cases. You might wonder why we didn’t try using RLSP on the environments in relative reachability. The main problem is that those environments don’t satisfy our key assumption: that a human has been acting to optimize their preferences for some time. So if you try to run RLSP in that setting, it is very likely to fail. I think this is fine, because RLSP is exploiting a fact about reality that those environments fail to model.

(This is a general problem with benchmarks: they often do not include important aspects of the real problem under consideration, because the benchmark designers didn’t realize that those aspects were important for a solution.)

This is related to the fact that we are not trying to be value agnostic. If you’re trying to come up with a value agnostic, objective measure of impact, then it makes sense to create some simple gridworld environments and claim that any objective measure of impact should give the same result on those environments, since one action is clearly more impactful than the other. However, since we’re not trying to be value agnostic, that argument doesn’t apply.

If you take the test cases, put them in a more realistic context, make your model of the world sufficiently large and powerful, don’t worry about compute, and imagine a variant of RLSP that somehow learns good features of the world, then I would expect that RLSP could solve most of the impact measure test cases.

What’s the point?

Before people start pointing out how a superintelligent AI system would game the preferences learned in this way, let me be clear: the goal is not to use the inferred preferences as a utility function. There are many reasons this is a bad idea, but one argument is that unless you have a good mistake model, you can’t exceed human performance—which means that (for the most part) you want to leave the state the way it already is.

In other words, we are also not trying to achieve the “Prevent catastrophe” desideratum above. We are instead going for the weaker goal of preventing some bad outcomes, and learning more of human preferences without increasing the burden on the human overseer.

You can also think of this as a contribution to the overall paradigm of value learning: the state of the world is an especially good source of information about our preferences on what not to do, which are particularly hard to get feedback on.

If I had to point towards a particular concrete path to a good future, it would be the one that I outlined in Following human norms. We build AI systems that have a good understanding of “common sense” or “how to behave normally in human society”; they accelerate technological development and improve decision-making; if we really want to have a goal-directed AI that is not under our control but that optimizes for our values then we solve the full alignment problem in the future. Inferring preferences or norms from the world state could be a crucial part of helping our AI systems understand “common sense”.

Future work


There are a bunch of reasons why you couldn’t take RLSP, run it on the real world and hope to get a set of preferences that prevent you from causing negative impacts. Many of these are interesting directions for future work:

Things we don’t affect. We can’t affect quasars even if we wanted to, and so quasars are not optimized for our preferences, and RLSP will not be able to infer anything about our preferences about quasars.

We are optimized for the environment. You might reply that we don’t really have strong preferences about quasars (but don’t we?), but even then evolution has optimized us to prefer our environment, even though we haven’t optimized it. For example, you could imagine that RLSP infers that we don’t care about the composition of the atmosphere, or infers that we prefer there to be more carbon dioxide in the atmosphere. Thanks to Daniel Filan for making this point way back at the genesis of this project.

Multiple agents. RLSP assumes that there is exactly one human acting in the environment; in reality there are billions, and they do not have the same preferences.

Non-static preferences. As Stuart Armstrong likes to put it, our values are underdefined, changeable, and manipulable, whereas RLSP assumes they are static.

Not robust to misspecification and imperfect models. If you have an incorrect model of the dynamics, or a bad featurization, you can get very bad results. For example, if you can tell the difference between dusty vases and clean vases, but you don’t realize that by default dust accumulates on vases over time, then you would infer that Alice actively wants her vase to be dusty.
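The dusty vase failure can be reproduced in a small toy sketch. As before, this is not RLSP itself: the two-state world, the three actions, the dust accumulation probability, the Boltzmann-rational one-step model of Alice, and the candidate reward weights are all assumptions chosen for illustration. The same inference is run twice, once with dynamics in which dust settles on its own and once with a misspecified model in which it does not.

```python
import math

CLEAN, DUSTY = 0, 1
ACTIONS = ("noop", "clean", "add_dust")

def true_dynamics(state, action):
    """Dust settles on an untouched clean vase with probability 0.3 per step."""
    if action == "clean":
        return {CLEAN: 1.0}
    if action == "add_dust" or state == DUSTY:
        return {DUSTY: 1.0}
    return {CLEAN: 0.7, DUSTY: 0.3}

def wrong_dynamics(state, action):
    """Misspecified model: dust never accumulates on its own."""
    if action == "clean":
        return {CLEAN: 1.0}
    if action == "add_dust":
        return {DUSTY: 1.0}
    return {state: 1.0}

def policy(state, dust_weight, dynamics, effort_cost=1.0, beta=5.0):
    """Assumed human model: Boltzmann-rational over one-step outcomes."""
    scores = {}
    for action in ACTIONS:
        expected = sum(p * (dust_weight if s2 == DUSTY else 0.0)
                       for s2, p in dynamics(state, action).items())
        cost = 0.0 if action == "noop" else effort_cost
        scores[action] = beta * (expected - cost)
    z = sum(math.exp(v) for v in scores.values())
    return {a: math.exp(v) / z for a, v in scores.items()}

def state_likelihood(observed, dust_weight, dynamics, horizon=10):
    """P(observed state | reward), marginalizing over past trajectories."""
    dist = {CLEAN: 1.0, DUSTY: 0.0}  # assume the vase started out clean
    for _ in range(horizon):
        nxt = {CLEAN: 0.0, DUSTY: 0.0}
        for s, ps in dist.items():
            for a, pa in policy(s, dust_weight, dynamics).items():
                for s2, pt in dynamics(s, a).items():
                    nxt[s2] += ps * pa * pt
        dist = nxt
    return dist[observed]

def posterior(observed, dynamics, hypotheses=(-2.0, 0.0, 2.0)):
    """Uniform prior over: dislikes dust, indifferent, likes dust."""
    lik = {h: state_likelihood(observed, h, dynamics) for h in hypotheses}
    z = sum(lik.values())
    return {h: l / z for h, l in lik.items()}

with_wrong_model = posterior(DUSTY, wrong_dynamics)
with_true_model = posterior(DUSTY, true_dynamics)
```

With the misspecified dynamics, the only way a vase can become dusty is for Alice to spend effort making it so, and the posterior concentrates on “Alice likes dust”. With the true dynamics, a dusty vase is also well explained by an indifferent Alice, so the posterior is split roughly evenly between liking dust and not caring.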

Using a finite-horizon policy for Alice instead of an infinite-horizon policy. The math in RLSP assumes that Alice was optimizing her reward over an episode that would end exactly when the robot is deployed, so that the observed state is Alice’s “final state”. This is clearly a bad model, since Alice will still be acting in the environment after the robot is deployed. For example, if the robot is deployed the day before Alice is scheduled to move, the robot might infer that Alice really wants there to be a lot of moving boxes in her living space (rather than realizing that this is an instrumental goal in a longer-term plan).

There’s no good reason for using a finite-horizon policy for Alice. We were simply following Maximum Causal Entropy IRL, which makes this assumption (one that is much more reasonable when you observe demonstrations rather than the state of the world), and didn’t realize our mistake until we were nearly done. The finite-horizon version worked sufficiently well that we didn’t redo everything for the infinite-horizon case, which would have been a significant amount of work.
