Evaluating the historical value misspecification argument

ETA: I’m not saying that MIRI thought AIs wouldn’t understand human values. If there’s only one thing you take away from this post, please don’t let it be that. Here is Linch’s attempted summary of this post, which I largely agree with.

Recently, many people have talked about whether some of the main MIRI people (Eliezer Yudkowsky, Nate Soares, and Rob Bensinger[1]) should update on whether value alignment is easier than they thought given that GPT-4 seems to follow human directions and act within moral constraints pretty well (here are two specific examples of people talking about this: 1, 2). Because these conversations are often hard to follow without much context, I’ll just provide a brief caricature of how I think this argument has gone in the places I’ve seen it, which admittedly could be unfair to MIRI[2]. Then I’ll offer my opinion that, overall, I think MIRI people should probably update in the direction of alignment being easier than they thought in light of this information, despite their objections.

Note: I encourage you to read this post carefully to understand my thesis. This topic can be confusing, and there are many ways to misread what I’m saying. Also, make sure to read the footnotes if you’re skeptical of some of my claims.

Here’s my very rough caricature of the discussion so far, plus my response:

Non-MIRI people: Yudkowsky talked a great deal in the sequences about how it was hard to get an AI to understand human values. For example, his essay on the Hidden Complexity of Wishes made it sound like it would be really hard to get an AI to understand common sense. In that essay, the genie did silly things like throwing your mother out of the building rather than safely carrying her out. Actually, it turned out that it was pretty easy to get an AI to understand common sense. LLMs are essentially safe-ish genies that do what you intend. MIRI people should update on this information.

MIRI people (Eliezer Yudkowsky, Nate Soares, and Rob Bensinger): You misunderstood the argument. The argument was never about getting an AI to understand human values, but about getting an AI to care about human values in the first place. Hence ‘The genie knows but doesn’t care’. There’s no reason to think that GPT-4 cares about human values, even if it can understand them. We always thought the hard part of the problem was about inner alignment, or, pointing the AI in a direction you want. We think figuring out how to point an AI in whatever direction you choose is like 99% of the problem; the remaining 1% of the problem is getting it to point at the “right” set of values.[2]

My response:

I agree that MIRI people never thought the problem was about getting AI to merely understand human values, and that they have generally maintained there was extra difficulty in getting an AI to care about human values. However, I distinctly recall MIRI people making a big deal about the value identification problem (AKA the value specification problem), for example in this 2016 talk from Yudkowsky.[3] The value identification problem is the problem of “pinpointing valuable outcomes to an advanced agent and distinguishing them from non-valuable outcomes”. In other words, it’s the problem of specifying a utility function that reflects the “human value function” with high fidelity, i.e. the problem of specifying a utility function that can be optimized safely. See this footnote[4] for further clarification about how I view the value identification/​specification problem.

The key difference between the value identification/​specification problem and the problem of getting an AI to understand human values is the transparency and legibility of how the values are represented: if you solve the problem of value identification, that means you have an actual function that can tell you the value of any outcome (which you could then, hypothetically, hook up to a generic function maximizer to get a benevolent AI). If you get an AI that merely understands human values, you can’t necessarily use the AI to determine the value of any outcome, because, for example, the AI might lie to you, or simply stay silent.
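To make the distinction concrete, here is a minimal Python sketch. Everything in it is a hypothetical illustration rather than anything MIRI or anyone else has proposed: `TOY_SCORES`, `toy_value_fn`, `generic_maximizer`, and `opaque_agent` are made-up names, and the scores are invented. It only shows what it would mean to have a legible function that scores any outcome you hand it, as opposed to an agent that may understand the values but gives you nothing you can plug into an optimizer.

```python
from typing import Callable, Iterable, Optional

# A "value function" in the sense used above: it returns a score for any
# outcome we hand it. Outcomes are plain strings in this toy example.
ValueFn = Callable[[str], float]


def generic_maximizer(value_fn: ValueFn, candidate_outcomes: Iterable[str]) -> str:
    """Return the candidate outcome that the value function scores highest."""
    return max(candidate_outcomes, key=value_fn)


# Hypothetical hand-written scores, standing in for a real solution to
# value identification (which would have to cover every outcome).
TOY_SCORES = {
    "mother thrown out of the window": -10.0,
    "mother rescued but badly burned": 2.0,
    "mother carried out safe and sound": 10.0,
}


def toy_value_fn(outcome: str) -> float:
    """Legible value function: we can query it and read off the answer."""
    return TOY_SCORES[outcome]


def opaque_agent(outcome: str) -> Optional[float]:
    """An agent that may understand the values but simply stays silent."""
    return None


# The legible function can be hooked up to the generic maximizer directly.
best = generic_maximizer(toy_value_fn, TOY_SCORES)
print(best)
```

The silent agent cannot be used the same way: it returns `None` for every query, so there is no score for the maximizer to rank outcomes by, no matter how much the agent understands internally.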

The primary foreseeable difficulty Yudkowsky offered for the value identification problem is that human value is complex.[5] In turn, the idea that value is complex was stated multiple times as a premise for why alignment is hard.[6] Another big foreseeable difficulty with the value identification problem is the problem of edge instantiation, which was talked about extensively in early discussions on LessWrong.

MIRI people frequently claimed that solving the value identification problem would be hard, or at least non-trivial.[7] For instance, Nate Soares wrote in his 2016 paper on value learning that “Human preferences are complex, multi-faceted, and often contradictory. Safely extracting preferences from a model of a human would be no easy task.”

I claim that GPT-4 is already pretty good at extracting preferences from human data. It exhibits common sense. If you talk to GPT-4 and ask it ethical questions, it will generally give you reasonable answers. It will also generally follow your intended directions, rather than what you literally said. Together, I think these facts indicate that GPT-4 is probably on a path towards an adequate solution to the value identification problem, where “adequate” means “about as good as humans”. And to be clear, I don’t mean that GPT-4 merely passively “understands” human values. I mean that GPT-4 literally executes your intended instructions in practice, and that asking GPT-4 to distinguish valuable and non-valuable outcomes works pretty well in practice, and this will become increasingly apparent in the near future as models get more capable and expand to more modalities.[8]

I’m not arguing that GPT-4 actually cares about maximizing human value. However, I am saying that the system is able to transparently pinpoint to us which outcomes are good and which outcomes are bad, with fidelity approaching an average human, albeit in a text format. Crucially, GPT-4 can do this visibly to us, in a legible way, rather than merely passively knowing right from wrong in some way that we can’t access. This fact is key to what I’m saying because it means that, in the near future, we can literally just query multimodal GPT-N about whether an outcome is bad or good, and use that as an adequate “human value function”. That wouldn’t solve the problem of getting an AI to care about maximizing the human value function, but it would arguably solve the problem of creating an adequate function that we can put into a machine to begin with.

Maybe you think “the problem” was always that we can’t rely on a solution to the value identification problem that only works as well as a human, and we require a much higher standard than “human-level at moral judgement” to avoid a catastrophe. But personally, I think having such a standard is both unreasonable and inconsistent with the implicit standard set by essays from Yudkowsky and other MIRI people. In Yudkowsky’s essay on the hidden complexity of wishes, he wrote,

You failed to ask for what you really wanted. You wanted your mother to go on living, but you wished for her to become more distant from the center of the building.

Except that’s not all you wanted. If your mother was rescued from the building but was horribly burned, that outcome would rank lower in your preference ordering than an outcome where she was rescued safe and sound. So you not only value your mother’s life, but also her health. [...]

Your brain is not infinitely complicated; there is only a finite Kolmogorov complexity /​ message length which suffices to describe all the judgments you would make. But just because this complexity is finite does not make it small. We value many things, and no they are not reducible to valuing happiness or valuing reproductive fitness.

I interpret this passage as saying that ‘the problem’ is extracting all the judgements that “you would make”, and putting that into a wish. I think he’s implying that these judgements are essentially fully contained in your brain. I don’t think it’s credible to insist he was referring to a hypothetical ideal human value function that ordinary humans only have limited access to, at least in this essay.[9]

Here’s another way of putting my point: In general, there are at least two ways that someone can fail to follow your intended instructions. Either your instructions aren’t well-specified and don’t fully capture your intentions, or the person doesn’t want to obey your instructions even if those instructions accurately capture what you want. Practically all the evidence that I’ve found seems to indicate that MIRI people thought that both problems would be hard to solve for AI, not merely the second problem.

For example, a straightforward reading of Nate Soares’ 2017 talk supports this interpretation. In the talk, Soares provides a fictional portrayal of value misalignment, drawing from the movie Fantasia. In the story, Mickey Mouse attempts to instruct a magical broom to fill a cauldron, but the broom follows the instructions literally rather than following what Mickey Mouse intended, and floods the room. Soares comments: “I claim that as fictional depictions of AI go, this is pretty realistic.”[10]
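The cauldron story can be sketched as a toy optimization problem. This is only an illustration under assumptions of my own: the constants, the probability model in `p_cauldron_full`, and the flood penalty are all hypothetical and do not come from Soares’ talk. The sketch compares an objective that rewards only “the cauldron is full” with one that also includes a term the instruction left out.

```python
CAULDRON_CAPACITY = 10  # buckets needed to fill the cauldron (hypothetical)
MAX_BUCKETS = 50        # most buckets the broom could carry (hypothetical)
FLOOD_PENALTY = 0.2     # how much Mickey dislikes each excess bucket (hypothetical)


def p_cauldron_full(buckets: int) -> float:
    """Hypothetical model: each bucket at or beyond capacity halves the
    remaining chance that the cauldron somehow ends up not full."""
    attempts = max(0, buckets - CAULDRON_CAPACITY + 1)
    return 1.0 - 0.5 ** attempts


def literal_objective(buckets: int) -> float:
    """What Mickey literally asked for: expected value of 'cauldron is full'."""
    return p_cauldron_full(buckets)


def intended_objective(buckets: int) -> float:
    """What Mickey meant: a full cauldron, minus a cost for flooding the room."""
    excess = max(0, buckets - CAULDRON_CAPACITY)
    return p_cauldron_full(buckets) - FLOOD_PENALTY * excess


plans = range(MAX_BUCKETS + 1)
literal_plan = max(plans, key=literal_objective)
intended_plan = max(plans, key=intended_objective)
print(literal_plan, intended_plan)
```

Under these made-up numbers, the literal objective keeps rewarding extra buckets, so the maximizer picks the largest plan available, while the intended objective settles on a plan just past capacity. The optimizer is identical in both cases; only the specified objective differs.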

Perhaps more important to my point, Soares presented a clean separation between the part where we specify an AI’s objectives, and the part where the AI tries to maximize those objectives. He draws two arrows, indicating that MIRI is concerned about both parts. He states, “My view is that the critical work is mostly in designing an effective value learning process, and in ensuring that the sorta-argmax process is correctly hooked up to the resultant objective function 𝗨:”[11]

In the talk Soares also says, “The serious question with smarter-than-human AI is how we can ensure that the objectives we’ve specified are correct, and how we can minimize costly accidents and unintended consequences in cases of misspecification.” I believe this quote refers directly to the value identification problem, rather than the problem of getting an AI to care about following the goals we’ve given it. This attitude is reflected in other MIRI essays.

The point of “the genie knows but doesn’t care” wasn’t that the AI would take your instructions, know what you want, and yet disobey the instructions because it doesn’t care about what you asked for. If you read Rob Bensinger’s essay carefully, you’ll find that he’s actually warning that the AI will care too much about the utility function you gave it, and maximize it exactly, against your intentions.[12] The sense in which the genie “doesn’t care” is that it doesn’t care what you intended; it only cares about the objectives that you gave it. That’s not the same as saying the genie doesn’t care about the objectives you specified.

Given the evidence, it seems to me that the following conclusions are probably accurate:

  1. The fact that GPT-4 can reliably follow basic instructions, is able to distinguish moral from immoral actions somewhat reliably, and generally does what I intend rather than what I literally asked, is all evidence that the value identification problem is easier than how MIRI people originally portrayed it. While I don’t think the value identification problem has been completely solved yet, I don’t expect near-future AIs will fail dramatically on the “fill a cauldron” task, or any other functionally similar tasks.

  2. MIRI people used to think that it would be hard to both (1) specify an explicit function that corresponds to the “human value function” with fidelity comparable to the judgement of an average human, and (2) separately, get an AI to care about maximizing this function. The idea that MIRI people only ever thought (2) was the hard part appears false.[13]

  3. Non-MIRI people sometimes strawman MIRI people as having said that AGI would literally lack an understanding of human values. I don’t endorse this, and I’m not saying this.

  4. The “complexity of value” argument pretty much just tells us that we need an AI to learn human values, rather than hardcoding a utility function from scratch. That’s a meaningful thing to say, but it doesn’t tell us much about whether alignment is hard, especially in the deep learning paradigm; it just means that extremely naive approaches to alignment won’t work.

As an endnote, I don’t think it really matters whether MIRI people had mistaken arguments about the difficulty of alignment ten years ago. It matters far more what their arguments are right now. However, I do care about accurately interpreting what people said on this topic, and I think it’s important for people to acknowledge when the evidence has changed.

  1. ^

    I recognize that these people are three separate individuals and that each has his own nuanced views. However, I think each of them has expressed broadly similar views on this particular topic, and I’ve seen each of them engage in a discussion about how we should update about the difficulty of alignment given what we’ve seen from LLMs.

  2. ^

    I’m not implying MIRI people would necessarily completely endorse everything I’ve written in this caricature. I’m just conveying how they’ve broadly come across to me, and I think the basic gist is what’s important here. If some MIRI people tell me that this caricature isn’t a fair summary of what they’ve said, I’ll try to edit the post later to include real quotes.

    For now, I’ll point to this post from Nate Soares in which he stated,

    I have long said that the lion’s share of the AI alignment problem seems to me to be about pointing powerful cognition at anything at all, rather than figuring out what to point it at.

    It’s recently come to my attention that some people have misunderstood this point, so I’ll attempt to clarify here.

  3. ^

    More specifically, in the talk, at one point Yudkowsky asks “Why expect that [alignment] is hard?” and goes on to tell a fable about programmers misspecifying a utility function, which then gets optimized by an AI with disastrous consequences. My best interpretation of this part of the talk is that he’s saying the value identification problem is one of the primary reasons why alignment is hard. However, I encourage you to read the transcript yourself if you are skeptical of my interpretation.

  4. ^

    I am mainly talking about the problem of how to specify (for example, write into a computer) an explicit function that reflects the human value function with high fidelity, in the sense that judgements from this function about the value of outcomes fairly accurately reflect the judgements of ordinary humans. I think this is simply a distinct concept from the idea of getting an AI to understand human values.

    I was not able to find a short and crisp definition of the value identification/​specification problem from MIRI. However, in the Arbital page on the Problem of fully updated deference, the problem is described as follows,

    One way to look at the central problem of value identification in superintelligence is that we’d ideally want some function that takes a complete but purely physical description of the universe, and spits out our true intended notion of value V in all its glory. Since superintelligences would probably be pretty darned good at collecting data and guessing the empirical state of the universe, this probably solves the whole problem.

    This is not the same problem as writing down our true V by hand. The minimum algorithmic complexity of a meta-utility function ΔU which outputs V after updating on all available evidence, seems plausibly much lower than the minimum algorithmic complexity for writing V down directly. But as of 2017, nobody has yet floated any formal proposal for a ΔU of this sort which has not been immediately shot down.

    In MIRI’s 2017 technical agenda, they described the problem as follows, which I believe roughly matches how I’m using the term,

    A highly-reliable, error-tolerant agent design does not guarantee a positive impact; the effects of the system still depend upon whether it is pursuing appropriate goals. A superintelligent system may find clever, unintended ways to achieve the specific goals that it is given. Imagine a superintelligent system designed to cure cancer which does so by stealing resources, proliferating robotic laboratories at the expense of the biosphere, and kidnapping test subjects: the intended goal may have been “cure cancer without doing anything bad,” but such a goal is rooted in cultural context and shared human knowledge.

    It is not sufficient to construct systems that are smart enough to figure out the intended goals. Human beings, upon learning that natural selection “intended” sex to be pleasurable only for purposes of reproduction, do not suddenly decide that contraceptives are abhorrent. While one should not anthropomorphize natural selection, humans are capable of understanding the process which created them while being completely unmotivated to alter their preferences. For similar reasons, when developing AI systems, it is not sufficient to develop a system intelligent enough to figure out the intended goals; the system must also somehow be deliberately constructed to pursue them (Bostrom 2014, chap. 8).

    However, the “intentions” of the operators are a complex, vague, fuzzy, context-dependent notion (Yudkowsky 2011; cf. Sotala and Yampolskiy 2017). Concretely writing out the full intentions of the operators in a machine-readable format is implausible if not impossible, even for simple tasks. An intelligent agent must be designed to learn and act according to the preferences of its operators. This is the value learning problem.

    Directly programming a rule which identifies cats in images is implausibly difficult, but specifying a system that inductively learns how to identify cats in images is possible. Similarly, while directly programming a rule capturing complex human intentions is implausibly difficult, intelligent agents could be constructed to inductively learn values from training data.

  5. ^

    To support this claim, I’ll point out that the Arbital page for the value identification problem says, “A central foreseen difficulty of value identification is Complexity of Value”.

  6. ^

    For example, in this post, Yudkowsky gave “five theses”, one of which was the “complexity of value thesis”. He wrote that the “five theses seem to imply two important lemmas”, the first lemma being “Large bounded extra difficulty of Friendliness.”, i.e. the idea that alignment is hard.

    Another example comes from this talk. I’ve linked to a part in which Yudkowsky begins by talking about how human value is complex, and moves to talking about how that fact presents challenges for aligning AI.

  7. ^

    My guess is that the perceived difficulty of specifying objectives was partly a result of MIRI people expecting that natural language understanding wouldn’t occur in AI until just barely before AGI, and at that point it would be too late to use AI language comprehension to help with alignment.

    Rob Bensinger said,

    It’s true that Eliezer and I didn’t predict AI would achieve GPT-3 or GPT-4 levels of NLP ability so early (e.g., before it can match humans in general science ability), so this is an update to some of our models of AI.

    In 2010, Eliezer Yudkowsky commented,

    > I think controlling Earth’s destiny is only modestly harder than understanding a sentence in English.

    Well said. I shall have to try to remember that tagline.

  8. ^

    If you disagree that AI systems in the near future will be capable of distinguishing valuable from non-valuable outcomes about as reliably as humans, then I may be interested in operationalizing this prediction precisely, and betting against you. I don’t think this is a very credible position to hold as of 2023, barring a pause that could slow down AI capabilities very soon.

  9. ^

    I mostly interpret Yudkowsky’s Coherent Extrapolated Volition as an aspirational goal for what we could best hope for in an ideal world where we solve every part of alignment, rather than a minimal bar for avoiding human extinction. In Yudkowsky’s post on AGI ruin, he stated,

    When I say that alignment is lethally difficult, I am not talking about ideal or perfect goals of ‘provable’ alignment, nor total alignment of superintelligences on exact human values, nor getting AIs to produce satisfactory arguments about moral dilemmas which sorta-reasonable humans disagree about, nor attaining an absolute certainty of an AI not killing everyone. When I say that alignment is difficult, I mean that in practice, using the techniques we actually have, “please don’t disassemble literally everyone with probability roughly 1” is an overly large ask that we are not on course to get.

  10. ^

    I don’t think I’m taking him out of context. Here’s a longer quote from the talk,

    When Mickey runs this program, everything goes smoothly at first. Then:

    [Image of the cauldron overflowing with water]

    I claim that as fictional depictions of AI go, this is pretty realistic.

    Why would we expect a generally intelligent system executing the above program to start overflowing the cauldron, or otherwise to go to extreme lengths to ensure the cauldron is full?

    The first difficulty is that the objective function that Mickey gave his broom left out a bunch of other terms Mickey cares about.

  11. ^

    The full quote is,

    Another common thread is “Why not just tell the AI system to (insert intuitive moral precept here)?” On this way of thinking about the problem, often (perhaps unfairly) associated with Isaac Asimov’s writing, ensuring a positive impact from AI systems is largely about coming up with natural-language instructions that are vague enough to subsume a lot of human ethical reasoning:


    In contrast, precision is a virtue in real-world safety-critical software systems. Driving down accident risk requires that we begin with limited-scope goals rather than trying to “solve” all of morality at the outset.

    My view is that the critical work is mostly in designing an effective value learning process, and in ensuring that the sorta-argmax process is correctly hooked up to the resultant objective function 𝗨:


    The better your value learning framework is, the less explicit and precise you need to be in pinpointing your value function 𝘝, and the more you can offload the problem of figuring out what you want to the AI system itself. Value learning, however, raises a number of basic difficulties that don’t crop up in ordinary machine learning tasks.

  12. ^

    This interpretation appears supported by the following quote from Rob Bensinger’s essay,

    When you write the seed’s utility function, you, the programmer, don’t understand everything about the nature of human value or meaning. That imperfect understanding remains the causal basis of the fully-grown superintelligence’s actions, long after it’s become smart enough to fully understand our values.

    Why is the superintelligence, if it’s so clever, stuck with whatever meta-ethically dumb-as-dirt utility function we gave it at the outset? Why can’t we just pass the fully-grown superintelligence the buck by instilling in the seed the instruction: ‘When you’re smart enough to understand Friendliness Theory, ditch the values you started with and just self-modify to become Friendly.’?

    Because that sentence has to actually be coded in to the AI, and when we do so, there’s no ghost in the machine to know exactly what we mean by ‘frend-lee-ness thee-ree’.

  13. ^

    It’s unclear to me whether MIRI people are claiming that they only ever thought (2) was the hard part of alignment, but here’s a quote from Nate Soares that offers some support for this interpretation IMO,

    I’d agree that one leg of possible support for this argument (namely “humanity will be completely foreign to this AI, e.g. because it is a mathematically simple seed AI that has grown with very little exposure to humanity”) won’t apply in the case of LLMs. (I don’t particularly recall past people arguing this; my impression is rather one of past people arguing that of course the AI would be able to read wikipedia and stare at some humans and figure out what it needs to about this ‘value’ concept, but the hard bit is in making it care.

    Even if I’m misinterpreting Soares here, I don’t think that would undermine the basic point that MIRI people should probably update in the direction of alignment being easier than they thought.