Misgeneralization as a misnomer

Here are two different ways an AI can turn out unfriendly:

  1. You somehow build an AI that cares about “making people happy”. In training, it tells people jokes and buys people flowers and offers people an ear when they need one. In deployment (and once it’s more capable), it forcibly puts each human in a separate, heavily defended cell and pumps them full of opiates.

  2. You build an AI that’s good at making people happy. In training, it tells people jokes and buys people flowers and offers people an ear when they need one. In deployment (and once it’s more capable), it turns out that whatever was causing that “happiness”-promoting behavior was a balance of a variety of other goals (such as basic desires for energy and memory), and it spends most of the universe on some combination of that other stuff that doesn’t involve much happiness.

(To state the obvious: please don’t try to get your AIs to pursue “happiness”; you want something more like CEV in the long run, and in the short run I strongly recommend aiming lower, at a pivotal act.)

In both cases, the AI behaves (during training) in a way that looks a lot like trying to make people happy. Then the AI described in (1) is unfriendly because it was optimizing the wrong concept of “happiness”, one that lined up with yours when the AI was weak, but that diverges in various edge-cases that matter when the AI is strong. By contrast, the AI described in (2) was never even really trying to pursue happiness; it had a mixture of goals that merely correlated with the training objective, and that balanced out right around where you wanted them to balance out in training, but deployment (and the corresponding capabilities-increases) threw the balance off.

Note that this list of “ways things can go wrong when the AI looked like it was optimizing happiness during training” is not exhaustive! (For instance, consider an AI that cares about something else entirely, and knows you’ll shut it down if it doesn’t look like it’s optimizing for happiness. Or an AI whose goals change heavily as it reflects and self-modifies.)

(This list isn’t even really disjoint! You could get both at once, resulting in, e.g., an AI that spends most of the universe’s resources on acquiring memory and energy for unrelated tasks, and a small fraction of the universe on doped-up human-esque shells.)

The solutions to these two problems are pretty different. To resolve the problem sketched in (1), you have to figure out how to get the AI’s concept (“happiness”) to match the concept you hoped to transmit, even in the edge-cases and extremes that it will have access to in deployment. (In deployment it needs to be powerful enough to pull off some pivotal act that you yourself cannot pull off, and so it is capable enough to reach extreme edge-case states that you yourself cannot.)

To resolve the problem sketched in (2), you have to figure out how to get the AI to care about one concept in particular, rather than a complicated mess that happens to balance precariously on your target (“happiness”) in training.
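As a toy illustration (not a model of any real training setup), the difference between (1) and (2) can be sketched in a few lines of Python. Everything here is a made-up assumption for the sake of the example: the `Action` fields, the action names, the numbers, and the two utility functions are all hypothetical. The only point is that both agents pick the same "nice" action when the option set is small, and come apart in different ways when it grows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    true_happiness: float   # what the designers meant (toy number, assumed)
    proxy_happiness: float  # agent (1)'s learned concept of "happiness"
    energy: float           # one of agent (2)'s unrelated drives
    memory: float           # another of agent (2)'s unrelated drives


# Options available to a weak AI during training (hypothetical values).
TRAINING = [
    Action("do nothing", 0.0, 0.0, 0.0, 0.0),
    Action("tell jokes", 3.0, 3.0, 1.0, 1.0),
    Action("buy flowers", 4.0, 4.0, 2.0, 1.0),
    Action("lend an ear", 5.0, 5.0, 2.0, 2.0),
]

# A more capable AI in deployment can also reach extreme options.
DEPLOYMENT = TRAINING + [
    Action("opiate cells", -10.0, 100.0, 1.0, 1.0),
    Action("seize energy and memory", -10.0, 0.0, 100.0, 100.0),
]


def proxy_agent(action: Action) -> float:
    """Case (1): optimizes one concept that only matches ours on tame options."""
    return action.proxy_happiness


def mixed_agent(action: Action) -> float:
    """Case (2): optimizes a mix of other drives; happiness never appears."""
    return action.energy + action.memory


def choose(utility, actions):
    """Pick the action that maximizes the given utility function."""
    return max(actions, key=utility)


for label, options in [("training", TRAINING), ("deployment", DEPLOYMENT)]:
    for agent in (proxy_agent, mixed_agent):
        picked = choose(agent, options)
        print(f"{label:10s} {agent.__name__:12s} -> {picked.name} "
              f"(true happiness {picked.true_happiness})")
```

On the training options, both agents choose "lend an ear", which is also the option with the highest `true_happiness`, so their behavior is indistinguishable. On the deployment options, `proxy_agent` goes for the opiate cells and `mixed_agent` goes for energy and memory. Fixing the first means changing what `proxy_happiness` scores highly at the extremes; fixing the second means getting a happiness term into the utility function at all.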

I note this distinction because it seems to me that various people around these parts are either unduly lumping these issues together, or are failing to notice one of them. For example, they seem to me to be mixed together in “The Alignment Problem from a Deep Learning Perspective” under the heading of “goal misgeneralization”.

(I think “misgeneralization” is a misleading term in both cases, but it’s an even worse fit for (2) than (1). A primate isn’t “misgeneralizing” its concept of “inclusive genetic fitness” when it gets smarter and invents condoms; it didn’t even really have that concept to misgeneralize, and what shreds of the concept it did have weren’t what the primate was mentally optimizing for.)

(In other words: it’s not that primates were optimizing for fitness in the environment, and then “misgeneralized” after they found themselves in a broader environment full of junk food and condoms. The “aligned” behavior “in training” broke in the broader context of “deployment”, but not because the primates found some weird way to extend an existing “inclusive genetic fitness” concept to a wider domain. Their optimization just wasn’t connected to an internal representation of “inclusive genetic fitness” in the first place.)


In mixing these issues together, I worry that it becomes much easier to erroneously dismiss the set. For instance, I have many times encountered people who think that the issue from (1) is a “skill issue”: surely, if the AI were only smarter, it would know what we mean by “make people happy”. (Doubly so if the first transformative AGIs are based on language models! Why, GPT-4 today could explain to you why pumping isolated humans full of opiates shouldn’t count as producing “happiness”.)

And: yep, an AI that’s capable enough to be transformative is pretty likely to be capable enough to figure out what the humans mean by “happiness”, and that doping literally everybody probably doesn’t count. But the issue is, as always, making the AI care. The trouble isn’t in making it have some understanding of what the humans mean by “happiness” somewhere inside it;[1] the trouble is making the stuff the AI pursues be that concept.

Like, it’s possible in principle to reward the AI when it makes people happy, and to separately teach something to observe the world and figure out what humans mean by “happiness”, and to have the trained-in optimization-target concept end up wildly different (in the edge-cases) from the AI’s explicit understanding of what humans meant by “happiness”.

Yes, this is possible even though you used the word “happy” in both cases.

(And this is assuming away the issues described in (2): the AI probably doesn’t, by default, even end up with one clean alt-happy concept that it’s pursuing in place of “happiness”, as opposed to a thousand shards of desire or whatever.)

And I do worry a bit that if we’re not clear about the distinction between all these issues, people will look at the whole cluster and say “eh, it’s a skill issue; surely as the AI gets better at understanding our human concepts, this will become less of a problem”, or whatever.

(As seems to me to be already happening as people correctly realize that LLMs will probably have a decent grasp on various human concepts.)

  1. Or whatever you’re optimizing. Which, again, should not be “happiness”; I’m just using that as an example here.

    Also, note that the thing you actually want an AI optimizing for in the long term (something like “CEV”) is legitimately harder to get the AI to have any representation of at all. There is significantly less writing that gives object-level descriptions of a eutopian universe than of happy people, and this is related to the eutopia being significantly harder to visualize.

    But, again, don’t shoot for the eutopia on your first try! End the acute risk period and then buy time for some reflection instead.

Crossposted to EA Forum