Models Modeling Models
Now that we have some more concrete thinking under our belt, it’s time to circle back on Goodhart’s law for value learners. What sorts of bad behavior are we imagining from future value-learning AI? What makes those behaviors plausible, and what makes them bad?
Let’s start with that last point first. Judgments of goodness or badness get contextualized by models, so our framing of Goodhart’s law depends on what models of humans we tolerate. When I say “I like dancing,” this is a different use of the word ‘like,’ backed by a different model of myself, than when I say “I like tasting sugar.” The model that comes to mind for dancing treats it as one of the chunks of my day, like “playing computer games” or “taking the bus.” I can know what state I’m in (the inference function of the model) based on seeing and hearing short scenes. Meanwhile, my model that has the taste of sugar in it has states like “feeling sandpaper” or “stretching my back.” States are more like short-term sensations, and the described world is tightly focused on my body and the things touching it.
The meta-model that talks about me having preferences in both of these models is the framing of competent preferences. If someone or something is observing humans, it looks for human preferences by seeing what the preferences are in “agent-shaped” models that are powerful for their size.
(At least, up to some finite amount of shuffling that’s like a choice of prior or universal Turing machine. Also note that the details of the definition of “agent-shaped” matter; we’ll come back to this under the umbrella of meta-preferences.)
So when we call certain behavior “bad,” the usage of that word might carry with it the implication of what way of thinking about the world that judgment is situated in, like how “I like dancing” makes sense when situated in a model of chunks of my day. There’s not one True Model in which the True Meaning of the word “bad” is expressed, though there can still be regularities among the different notions of badness.
What were the patterns that stood out from my previous discussions of what humans think of as bad behavior in value learning?
The most common type of failure, especially in modern-day AI, is when humans are actively wrong about what’s going to happen. They have something specific in mind when designing an AI, like training a boat to win the race, but then they run it and don’t get what they wanted. The boat crashes and is on fire. We could make the boat racing game more of a value learning problem by training on human demonstrations rather than the score, and crashing and being on fire would still be bad behavior.
For simple systems where humans are good at understanding the state space and picturing what they want, this is the only standard you need, but for more complicated systems (e.g. our galaxy) humans can only understand small parts or simple properties of the whole system, and we apply our preferences to those parts we can understand. From the inside, it can be hard to feel the distinction! We want things about tic-tac-toe or about the galaxy with the same set of emotions. What makes deciding what to do with the galaxy different is that we have these scattered preferences about different parts and patterns, and the different parts don’t stay neatly separate from each other. They can interact or overlap in ways that bring our preferences into conflict.
This is a key point. Inter-preference conflicts aren’t an issue that ever comes up if you think of humans as having a utility function, but they’re almost unavoidable if you think of humans as physical systems with different possible models. The nail in the coffin is that we humans can’t fit the whole galaxy into our heads, nor could evolution fit it into our genes, and so out of necessity we have to use simple heuristics that work well pragmatically but don’t fit together perfectly. If humans don’t resolve their preference conflicts well, this can lead to bad behavior like thinking the grass is always greener on the other side of the decision tree.
Bad preference aggregation can also lead to new-ish bad behavior on the part of a value learner. This bad behavior can look like encountering a situation where humans are conflicted or inconsistent, and then resolving that conflict using a method that humans don’t agree with. An AI that resolves every deep and thorny moral dilemma by picking whichever answer leads to the most paperclips seems bad, even if it’s hard to put your finger on what will go wrong on the object level.
That’s an extreme example, though. A value learner can fail at resolving preference conflicts even in cases where the right choice seems obvious to humans. If I like dancing, and I like tasting sugar, it might seem obvious to me that what I shouldn’t do is never go dancing so that I can stay at home and continually eat sugar. The line between different sorts of bad behavior is blurry here. The obviousness that I shouldn’t become a sugar-hermit can be thought of either as me doing preference aggregation between preferences for tasting sugar and dancing, or as an object-level preference in a slightly more complicated model of my states and actions. We want both perspectives to give similar results.
What this illuminates is that humans have meta-preferences: preferences about how we should be modeled. These preferences are inferred from humans’ words and actions, just like other preferences. On one hand, like other preferences, they’re necessarily simple and can come into conflict, making our lives harder. On the other hand, like other preferences, their limited scope allows us some wiggle room in terms of satisfying them, making our lives easier.
Unfortunately, we can’t dive too deep into how preference aggregation should be done here. It’s very hard, and I don’t know the solution, and also it’s outside the scope of this post. Just to give a taste, problems arise when we want to compare preferences in different ontologies. As with the dancing vs. sugar example, we could do this comparison by cashing out both models into one more fine-grained model. But it’s not okay to just treat the more fine-grained model on its own terms and use it to fit human preferences from their behavior. It comes back to my meta-preferences: I don’t want to be modeled in the most fine-grained way. That would lead to unpalatable positions like “whatever the human did, that’s what they wanted” or “the human wants to follow the laws of physics.” Resolving conflicts across ontologies can’t be done by looking for which one is “correct.” We have to face head-on the problem of translation between equally-valid models of me, and resolve conflicts using meta-preferential principles, e.g. fairness.
One further complication of allowing meta-preferences into value learning is that if how you balance preferences depends on your preferences, where you end up is going to depend on where you started. This can lead to certain problems (Stuart), and we might want to better understand this process and make sure it leads somewhere sensible (me). However, some amount of this dynamic is essential; for starters, picking out humans as the things whose values we want to learn (rather than e.g. evolution) and insisting that human actions are at least a little bit correlated with our preferences have exactly the type signature of meta-preference. Learning human meta-preferences can push you around in meta-preference-space, but you’ve still got to start somewhere.
How does all this connect back to Goodhart? I propose that a lot of the feeling of unease when considering trusting value learning schemes reliant on human modeling—a lot of this feeling that small perturbations might lead to bad things happening—is because we don’t think they’re satisfying our meta-preferences. Without satisfactory application of meta-preferences, it seems like getting what we want out of a value learner would be a fragile shot in the dark, where deviations that seem “small” in units of bits of random noise might have a large impact on how good the future is. If you squint: Goodhart’s law.
If human preferences live in simplified models of the world, this raises an obvious question: should we only trust these preferences within the domains of validity of those models? Does this mean that the really good futures lie within the domain of validity of our preferences?
Long story short? Yes.
The rest of this section is the long story long.
What’s a domain of validity, anyhow? One way it can be is that the domain of validity comes bundled with the model of the world. This is like Newtonian mechanics coming with a disclaimer on it saying “not valid above 0.1 c.” This way keeps things nice and simple for our limited brains. But there’s another way that’s even nicer to reason about (but impractical for human use), which is that we could have a plethora of different models of the world, and where they broadly agree we call it a “domain of validity,” and as they agree less, we trust them less. When I talk about individual preferences having a domain of validity, we can translate this to there being many similar models that use variations on this preference, and there’s some domain where they more or less agree, but as you leave that domain they start disagreeing more and more.
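The second picture, where a domain of validity is wherever a family of similar models agrees, can be sketched in a few lines of Python. This is only a toy illustration under assumptions of my own choosing: the one-dimensional world state `x`, the family of models built by `make_model`, and the `tolerance` threshold are all hypothetical and not part of the argument.

```python
from statistics import pstdev

# Hypothetical family of models of one preference. Each maps a world state
# (here just a number x) to a value estimate. They share a core shape and
# differ by a small cubic term that is negligible for everyday x.
def make_model(k):
    return lambda x: x - k * x**3

models = [make_model(k) for k in (0.000, 0.001, 0.002, 0.003)]

def disagreement(x):
    """Spread (population standard deviation) of the models' estimates at x."""
    return pstdev([m(x) for m in models])

def in_domain_of_validity(x, tolerance=0.5):
    """Treat x as inside the domain while the models roughly agree there."""
    return disagreement(x) <= tolerance

for x in (1, 5, 10, 20):
    print(x, round(disagreement(x), 3), in_domain_of_validity(x))
```

In practice there would be no sharp threshold; a graded version would simply trust the preference less as `disagreement(x)` grows.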
One more wrinkle is that our models in this case have two outputs: they make predictions about the world, and they also contain inferences about human values. Sometimes they can agree about predictions but disagree about values, or vice versa. Which domain of validity do we care about—predictions or preferences?
Turns out it’s basically always preferences. Imagine I get dumped out the airlock of a spaceship into hard vacuum. Very quickly, modeling me as a person is going to stop making useful predictions (e.g. about my future motion), and it will be more pragmatic to model me as a bag of wet meat—vacuum is outside the predictive domain of validity of many person-level models of me. But my preferences about getting dumped out the airlock have no such problem—the models that predict me in day-to-day life all tend to agree that it’s bad.
This is a strong intuition pump for using the preferential domain of validity when aggregating, and not worrying too hard about predictive accuracy. This requires our magical cross-model preference translator, but we’ve already assumed that into existence anyhow. In the reverse case, where there are models that are equally good at predicting our actions, and equally satisfy meta-preferences, but are put in a situation where they disagree about which of our internal psychological states are “preferences,” it also seems reasonable that we care about the preferential domain of validity.
What would ever incentivize a person or AI to leave the domain of validity of our preferences? Imagine you’re trying to predict the optimal meal, and you make 10 different models of your preferences about food. If nine of these models think a meal would be a 2/10, and the last model thinks a meal would be a 1,000/10, you’d probably be pretty tempted to try that meal, right? Even if these models agree on all everyday cases, and even if they make identical predictions about your experiences during and after the meal. Even if the meal is Hades’ pomegranate and all your other models are trying to warn you of moral / psychological danger.
This becomes a question about how you’re aggregating models. Avoiding going outside the domain of validity looks like using an aggregation function that puts more weight on the pessimistic answers than the optimistic ones, even if the optimistic ones have been specially selected for being really optimistic. We’ve circled back to meta-preferences again; I don’t want one of my preferences or one way of modeling me to be super-duper-satisfied at the expense of all others. This is in (non-fatal) tension with what we invoked meta-preferences for in part II, which is that there are some preferences and some ways of modeling me that I prefer to others.
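As a toy illustration of the meal example, here is a sketch comparing a plain average with one pessimism-weighted rule. The scores, the meal names, and the particular rule ("average of the `k` lowest scores") are assumptions made up for this example. Many other aggregation functions that lean toward the pessimistic answers would serve equally well.

```python
from statistics import mean

# Hypothetical scores (out of 10) that ten models of my food preferences
# assign to two candidate meals. Nine models rate the strange meal 2/10 and
# one rates it 1,000/10, as in the example above.
scores = {
    "ordinary dinner": [7, 7, 6, 7, 8, 7, 6, 7, 7, 8],
    "strange pomegranate": [2, 2, 2, 2, 2, 2, 2, 2, 2, 1000],
}

def plain_average(values):
    """Every model counts equally, so one extreme score can dominate."""
    return mean(values)

def pessimistic(values, k=3):
    """Average of the k lowest scores: the optimistic outlier is ignored."""
    return mean(sorted(values)[:k])

def best_meal(aggregate):
    """Pick the meal with the highest aggregated score."""
    return max(scores, key=lambda meal: aggregate(scores[meal]))

print(best_meal(plain_average))  # the single 1,000/10 model wins
print(best_meal(pessimistic))    # the broadly acceptable meal wins
```

The design choice worth noticing is that the pessimistic rule stays pessimistic even when the outlier was selected for being extreme, which is exactly the situation an optimizer creates.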
This ties back in to the fact that it’s not the case that there is One True way of modeling ourselves and we just don’t know which it is. Such an epistemic position on models of human preferences would necessitate certain rules for preference aggregation. (For example, if you treat extreme values in some domain as evidence against a model being the True model, then this evidence would equally affect whether we trust that model in other situations.) Because we aren’t just trying to figure out which model of us is the One True model, it’s okay to violate such simple rules in our preference aggregation.
As a final section, let’s circle back to some of the arguments from Goodhart Taxonomy and see how they’re holding up in a framing where we have to compare models to other models, rather than to the True utility function.
The different types of Goodhart in that post are different reasons why a small perturbation of the proxy is likely to lead to large divergences of score according to the True Values. We can make a fairly illuminating transformation of these arguments by replacing “proxy” with “one model,” and “True utility function” with “other plausible models.” In this view, Goodhart processes drive apparently-similar models into disagreement with each other.
Old style: When optimizing for some proxy for value, worlds in which that proxy takes an extreme value are probably very different (drawn from a different distribution) than the everyday world in which the relationship between the proxy and true value was inferred, and this big change can magnify any discrepancies between the proxy and the true values.
New style: When optimizing for one model of human preferences, worlds in which that model takes an extreme value are probably very different than the everyday world from which that model was inferred, and this big change can magnify any discrepancies between similar models that used to agree with each other. Lots of model disagreement often signals to us that the validity of the preferences is breaking down, and we have a meta-preference to avoid this.
This transformation works very neatly for Extremal Goodhart, so I took the liberty of ordering it first in the list.
Old style: If you select for high value of a proxy, you select not just for signal but also for noise. You’ll predictably get a worse outcome than the naive estimate, and if there are some parts of the domain that have more noise without totally tanking the signal, the maximum value of the proxy is more likely to be there.
New style: If you select for high value according to one model of humans, you select not just for the component that agrees with the average model, but also the component that disagrees. Other models will predictably value your choice less than the model you’re optimizing, and if there are some parts of the domain that tend to drive this model’s estimates apart from the others’ without totally tanking the average value, the maximum value is more likely to be there.
Also, if you average all your models together and select for high average value, you can still treat model disagreement like noise when it lacks obvious correlations. If there’s some region where your models disagree with each other a lot, and in uncorrelated ways, the maximum average value is more likely to be in that region. As with Extremal, we would rather not go to the part of phase space where the models of us all disagree with each other.
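The selection effect described above can be seen in a small simulation. This is a sketch under invented assumptions: the Gaussian shared value, the two disagreement levels in `spread`, the number of options and models, and the fixed random seed are all hypothetical choices for illustration.

```python
import random
from statistics import mean

random.seed(0)
N_OPTIONS, N_MODELS = 2000, 5

# Component of value that every model agrees on, one number per option.
shared = [random.gauss(0, 1) for _ in range(N_OPTIONS)]

# Hypothetical setup: models disagree mildly on the first half of the
# options and strongly on the second ("contested") half.
def spread(i):
    return 0.2 if i < N_OPTIONS // 2 else 3.0

# scores[m][i] is model m's estimate of option i: the shared value plus that
# model's own independent disagreement term.
scores = [
    [shared[i] + random.gauss(0, spread(i)) for i in range(N_OPTIONS)]
    for _ in range(N_MODELS)
]

# Optimize hard for model 0 alone, then ask the other models about its pick.
best = max(range(N_OPTIONS), key=lambda i: scores[0][i])
value_by_model_0 = scores[0][best]
value_by_others = mean([scores[m][best] for m in range(1, N_MODELS)])

# Optimize the average of all models instead.
average = [mean([scores[m][i] for m in range(N_MODELS)]) for i in range(N_OPTIONS)]
best_avg = max(range(N_OPTIONS), key=lambda i: average[i])

print("model 0's pick is contested:", best >= N_OPTIONS // 2)
print("model 0 says", round(value_by_model_0, 2),
      "others say", round(value_by_others, 2))
print("average's pick is contested:", best_avg >= N_OPTIONS // 2)
```

In setups like this one, the option model 0 likes best tends to sit in the contested half, and the other models tend to rate it well below model 0's estimate. Averaging the models weakens the pull toward the contested half but does not remove it.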
My addition of the variance-seeking pressure under the umbrella of Regressional Goodhart really highlights the similarities between it and Extremal Goodhart. Both are simplifications of the same overarching math; in the Regressional case we’re just doing even more simplification (requiring there to be a noise term with nice properties), which allows for a more specific picture of the optimization process.
Old style: If we pick a proxy to optimize that’s correlated with True Value but not sufficient to cause it, then there might be appealing ways to intervene on the proxy that don’t intervene on what we truly want.
New style: If we have two modeled preferences that are correlated, but one is actually the causal descendant of the other, then there might be appealing ways to intervene on the descendant preference that don’t intervene on the ancestor preference.
There’s a related potential issue when we have modeled preferences that are coarse-grainings or fine-grainings of each other. There can be ways to intervene on the fine-grained model that don’t intervene on the coarse-grained model.
These translated Goodhart arguments all make the same change, which is to replace failures according to particular True Values with unstable or undefined behavior. As Stuart Armstrong would put it, Goodhart’s law is model splintering for values.
Although this change may seem boring or otiose, I think it’s actually a huge opportunity. In the first post I complained that the naive framing of Goodhart’s law didn’t admit of solutions—now, this new-style framing changes something crucial. When comparing a model to the True Values, we didn’t know the True Values. But when comparing models to other models, nothing there is unknowable!
In the next and final post, the plan is to tidy this claim up a bit, see how it applies to various proposals for beating Goodhart’s law for value learning, and zoom out to talk about the bigger picture for at least a whole paragraph.