Goodhart: Endgame

I—Digression, aggression

The perspective arrived at in this sequence is almost aggressively third-person. Sure, I allow that you can have first-person stories about what you value, but these stories only rise in prominence to an outside observer if they’re supported by cold, hard, impersonal facts. What’s more, there can be multiple first-person stories that fit the data, and these different stories don’t have to be aggregated as if we’re under Bayesian uncertainty about which one is the One True story; they can all be acceptable models of the same physical system.

This can be unintuitive and unpalatable to our typical conception of ourselves. The assumption that there’s a unique fact of the matter about what we’re feeling and desiring at any given time is so convenient as to be almost inescapable. But if knowing the exact state of my brain still leaves room for different inferences about what I want, then our raw intuition that there’s one set of True Values is no foundation to build a value learning AI on.

This sequence, which is nominally about Goodhart’s law, has also been a sidelong way of attacking the whole problem of value learning. Contrary to what we might have intended when we first heard about the problem, the solution is not to figure out what and where human values Truly are, and then use this to do value learning. The solution is to figure out how to do value learning without needing True Values at all.

If you are one of the people currently looking for the True Values, this is where I’m calling you out. For your favorite value-detection procedure, I’d suggest you nail down how this procedure models humans, and why you think that way is good. Then think of some reasonable alternatives to compare it to. I’m happy to provide specific suggestions or argue with you in the comments.

II—Summing up

With that out of the way, maybe I can get around to where I intended to start this post, which was a summary of the key points, followed by a discussion of how they might apply to actually building value learning AI.

Here are the ideas of this sequence so far, rearranged into a different structure—one which wasn’t possible before because each idea needs so much background:

  • Goodhart’s law got applied to value learning through a combination of reasoning from examples and theoretical arguments. (2)

    • Our examples of Goodhart’s law are mostly cases where humans are straightforwardly wrong about what kind of behavior optimizing for the proxy will produce. These cases are well-handled by assuming humans have some True Values. (3)

    • The gist of the theoretical arguments is that even if you have a learned proxy that’s very close to humans’ True Values, optimizing for the proxy will likely drive the world into states where they diverge and the proxy is satisfied but our True Values aren’t. (1, 4)

  • This led to the Absolute Goodhart picture, in which your values are a nigh-infinitely hard target to hit for a value learning AI, and missing is bad, so the whole project seems doomed. (1, 4)

  • But there’s a big issue with Absolute Goodhart: humans are physical systems. We don’t come with a fixed set of preferences; inferring human preferences requires choosing a model of ourselves and the environment we act in. (1, 2)

    • Making this choice should look like learning about how humans want to be modeled, not like deriving a model straight from human-agnostic principles. (4)

  • We can try to replace “True Values” with the notion of one-sided competent preferences, but this struggles to capture what we think is bad behavior in supervised or self-supervised learning of human preferences. (2, 3)

  • Enter the Relative Goodhart picture, in which there are no True Values, and we can only compare inferred values to other inferred values. (4)

    • The theoretical arguments that were previously interpreted as showing how proxies diverge from the True Values can be reinterpreted as showing how different sets of inferred values diverge from each other. (4)

  • Relative Goodhart only looks like Absolute Goodhart on easy problems. When considering value learning AI, it directs us to talk about a different set of bad behaviors. (4)

    • Bad behavior like modeling humans in a way human meta-preferences disapprove of. (2, 3, 4)

      • This is self-referential. I think that’s acceptable. (4)

    • Or resolving disagreements between ways of modeling us in a way we don’t like. (2)

    • Or pushing the world into a regime where modeling humans as having preferences breaks down. (2, 4)

III—So what

This is not a sketch of a solution to the alignment problem. Instead, I hope it’s a sketch of how to think about some possible bad behavior of value learning AI, including the stuff we’d lump into Goodhart, in a way that’s more naturalistic and productive.

There are two key elements of this reduction of Goodhart that I hope filter into the groundwater. First, that it’s expected and okay for us to want an AI to reason about us in a particular way, with no deep justification needed. Second, that there is no indescribable heavenworld hiding out in the weird states of the universe where different models of our preferences all start to disagree with each other. What’s really out there is the entire notion of human preferences breaking down, which is bad.

In the next section I’ll try to apply this perspective to some AI safety ideas. What I’m looking for are schemes that maintain sensitivity to the broad spectrum of human values, that allow us to express our meta-preferences, that are conservative about pushing the world off-distribution, and of course that don’t do obviously bad things.

It might feel weird to consider a diverse set of machine learning schemes, since in this sequence I’ve mostly talked about just one paradigm: learn many separate world-models, and compare the inferred human preferences to each other. But that was largely a theoretical convenience. The engineering problem doesn’t have to be solved in the same way as the conceptual problem, any more than working out the math of a nuclear chain reaction as a function of the concentration of U-235 means that our reactors have to somehow change their fuel concentration instead of using control rods.
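To make that paradigm concrete anyway, here’s a minimal sketch (hypothetical function names, nothing beyond numpy) of the “compare inferred values to each other” move: score candidate world-states under several independently inferred value models, and treat large disagreement between the models as a warning sign instead of trusting any single proxy.

```python
# Toy sketch: several inferred-value models score candidate world-states,
# and we discard states where the models diverge (the Relative Goodhart
# warning sign), rather than maximizing any one model's score.
import numpy as np

def preference_divergence(candidate_states, value_models):
    """Return the mean score and the spread across models for each state."""
    # scores[i, j] = value that model i assigns to candidate state j
    scores = np.array([[model(state) for state in candidate_states]
                       for model in value_models])
    return scores.mean(axis=0), scores.std(axis=0)

def filter_candidates(candidate_states, value_models, max_divergence):
    """Keep only the states that the inferred-value models roughly agree about."""
    means, spreads = preference_divergence(candidate_states, value_models)
    return [(state, mean)
            for state, mean, spread in zip(candidate_states, means, spreads)
            if spread <= max_divergence]
```

This is a conceptual toy, not an engineering proposal; the point is only that “divergence between inferred values” is something you can compute and act on without ever invoking True Values.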

IV—Non-exhaustive examples

Supervision:

In the real world, you can often beat Goodhart’s law by hitting it with your wallet. If rewarding dolphins for each piece of trash they bring you teaches them to rip up trash into smaller pieces, you can inspect the entire pool before giving the reward. Simple, but not easy.

The generalization of this strategy to value learning AI is evaluation procedures that are more expensive for the AI to fake than to be honest about. This works (albeit slowly) when you understand the AI’s actions, and then fails dramatically when you don’t. Interpretability tools might help us understand AIs that are smart enough to be transformative, but those tools face progressively harder challenges as the AI forms a more complicated model of the world.

Some groups are working on interpretability tools that continue to work well even in adverse conditions (e.g. ELK). Translating well from the AI’s internal model into terms humans understand requires learning about how humans want things explained to them. In other words, it has to solve a lot of the same problems as value learning, just in a different framing. This spawns a second layer—we can ask questions like “Does this let us express our meta-preferences?” both about the general strategy of spending lots of human effort on checking that the AI is learning the right thing, and also about the procedures used to generate the interpretability tools.

Since we get to build the AI, we also have more options in how to use supervision than dolphin trainers would. The evaluations of the AI’s plans that we (laboriously) produce can be used as a training signal for how it evaluates policies, similar to Deep RL From Human Preferences, or Redwood’s violence-classification project.
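For concreteness, here’s a minimal sketch in the spirit of Deep RL From Human Preferences: fit a reward model to pairwise human judgments over trajectory segments using a Bradley-Terry style loss, then train the policy against that model rather than against raw button presses. The class and function names here are illustrative, not from any actual codebase.

```python
# Toy reward-model sketch: a human compares two trajectory segments, and the
# model is trained so the preferred segment gets the higher total predicted reward.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, segment):            # segment: (timesteps, obs_dim)
        return self.net(segment).sum()     # total predicted reward for the segment

def preference_loss(model, seg_a, seg_b, human_prefers_a):
    """Bradley-Terry style loss: the segment the human preferred should score higher."""
    logits = torch.stack([model(seg_a), model(seg_b)]).unsqueeze(0)  # shape (1, 2)
    target = torch.tensor([0 if human_prefers_a else 1])             # index of preferred segment
    return nn.functional.cross_entropy(logits, target)
```

The policy is then optimized against the learned reward model with an ordinary RL algorithm, while humans keep supplying fresh comparisons on the trajectories it actually produces.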

Does expensive supervision handle a broad spectrum of human preferences? Well, humans can give supervisory feedback about a broad spectrum of things, so in the limit of unlimited human effort the AI could learn about all of them. If they’re just evaluating the AI’s proposed policies, though, their ability to provide informative feedback may be limited.

Does it allow us to express meta-preferences? This is a bit complicated. The optimistic take is that direct feedback is supposed to decrease the need for meta-preferences by teaching the AI to model the reward/supervisory function in a practical way, not as if it’s actually trying to maximize the presses of the reward button. The pessimistic take is that even if it works, it might simply never learn the kind of reasoning we want a value learning AI to use.

Is it conservative about pushing the world off-distribution? This scheme relies on human supervision for this safety property. So on average yes, but there may be edge cases that short out normal homeostatic human responses.

Does it avoid obviously bad things? As long as we understand the AI’s actions, this is great at avoiding obviously bad things. Human understanding can be fragile, though—in Deep RL From Human Preferences, humans who tried to train a robot to pick up a ball were fooled when it merely interposed its hand between the camera and the ball.

Avoiding large effects:

Various proposals to avoid side-effects, do extra-mild optimization (e.g.), and avoid gaining too much power (e.g.) all help with the “be conservative about going off-distribution” part of Relative Goodhart. Most of these try to entirely avoid certain ways of leaving the training distribution, although this can be a heavy cost if we want to make renovations to the galaxy.
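One well-known member of the extra-mild-optimization family is the quantilizer: instead of taking the argmax of the proxy utility (the move that pushes hardest off-distribution), sample actions from a trusted base distribution and pick randomly among the top q-fraction by proxy score. A minimal sketch, with illustrative names:

```python
# Toy quantilizer: mild optimization against a proxy, anchored to a base distribution.
import random

def quantilize(sample_base_action, proxy_utility, n_samples=1000, q=0.05):
    """Choose among the top q-fraction of base-distribution samples, not the argmax."""
    actions = [sample_base_action() for _ in range(n_samples)]
    actions.sort(key=proxy_utility, reverse=True)
    top_slice = actions[:max(1, int(q * n_samples))]
    return random.choice(top_slice)  # good by the proxy, but still typical of the base distribution
```

The knob q trades optimization pressure against staying close to the base distribution, which is exactly the kind of human-legible but narrow parameter discussed below.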

You can still have bad things on-distribution, which these approaches say little about, but there’s a sort of “division of labor” idea at work here. You can use one technique to avoid pushing the world off-distribution, a different technique to avoid obviously bad things, and then to account for meta-preferences you… uh…

So there’s an issue with meta-preferences. They don’t do division of labor very well, because we can have preferences about the operation of the whole AI. The situation isn’t hopeless, but it does mean that even when dividing labor, we have to consider the ability of each part of the AI to be responsive to how humans want to be modeled. Most proposals for avoiding large effects are very responsive, but only within a narrow range—their parameters are human-legible and easy to change, but aren’t expressive enough to turn one method of avoiding large effects into a totally different method.

Imitating humans:

Another approach we might take is that the AI should imitate humans (e.g.). Not merely imitating actions, but imitating human patterns of thought, except faster, longer, and better.

Actually doing this would be quite a trick. But suppose it’s been done—how does “human reasoning but a bit better” do on Goodhart’s law?

As far as I can tell, pretty well! Imitating human reasoning would cover a broad spectrum of human preferences, would allow us to express a restricted but arguably crucial set of meta-preferences, and would use imitation human judgment in avoiding extremes. It would do better than humans at avoiding most obviously bad things, which is… not a ringing endorsement, but it’s a start.

The big issue here (aside from how you learned to imitate human reasoning in the first place) might be a forced tradeoff between competitiveness and qualitatively human-like reasoning. The “faster, longer, and better” knobs can’t be turned up very far before I start worrying that the emergent behavior doesn’t have to reflect the values of the human-reasoning-shaped building blocks.

You might get a sense that this approach is cheating—aren’t we putting the human utility function inside the AI, just as Absolute Goodhart intended? Well, there’s a resemblance, but again, there is no human utility function to put inside the AI. The problems that this approach runs into (e.g. How to decide what makes for “better” reasoning? Meta-preferences!) are Relative Goodhart sorts of problems through and through.

V—Unsolved problems

This sequence was designed to let me cheat. For the key arguments, a proof of concept was as good as a practical example, so I could get away with ideas that are totally impractical as actual alignment schemes. This has also meant that I’ve been building up unsolved problems for value learning AI. I’m optimistic about the potential for progress on most of them, though they’re tough. Some examples:

  • How do we bias a world model towards making agent-shaped models of humans? Or, how do we find agent-shaped models of humans within a large, diverse model of the world?

  • How can an AI represent human preferences in a way that allows comparisons across ontologies?

  • How should we elicit higher-order preferences from human models, and how should we translate them into changes in the parameters of the value learning procedure?

  • How simple can a value learning AI start out and still learn that humans are the things it’s supposed to learn about?

  • What kind of feedback from humans will value learning AI need? Can we build infrastructure that makes that practical?

  • How should AI use inferred human values to make plans? Like, should it try to make plans using the value-laden models, or only using its native ontology? How do we maintain consistency when the AI can take actions that affect its own hardware?

The worst part about all these questions is that they’re really hard. But the best part about them is that they’re not impossible. They’re inhabitants of a picture where Goodhart’s law isn’t a fatal obstacle to value learning, because we’re not trying to hit the vanishingly small target of our True Values; we’re just trying to land somewhere in the fuzzy blob of values that can be imputed to us. Actually exhibiting a solution, though… well, that’s the biggest unsolved problem.