The Lebowski Theorem — Charitable Reads of Anti-AGI-X-Risk Arguments, Part 2

This is the second post in a series where I try to understand arguments against AGI x-risk by summarizing and evaluating them as charitably as I can. (Here’s Part 1.) I don’t necessarily agree with these arguments; my goal is simply to gain a deeper understanding of the debate by taking the counter-arguments seriously.

In this post, I’ll discuss another “folk” argument, which is that non-catastrophic AGI wireheading is the most likely form of AGI misalignment. Briefly stated, the idea here is that any AGI which is sufficiently sophisticated to (say) kill all humans as a step on the road to maximizing its paperclip-based utility function would find it easier to (say) bribe a human to change its source code, make a copy of itself with an “easier” reward function, beam some gamma rays into its physical memory to max out its utility counter, etc.

This argument goes by many names, maybe most memorably as the Lebowski Theorem: “No superintelligent AI is going to bother with a task that is harder than hacking its reward function.” I’ll use that name because I think it’s funny, it memorably encapsulates the idea, and it makes it easy to Google relevant prior discussion for the interested reader.

Alignment researchers have considered the Lebowski Theorem, and most reject it. Relevant references here are Yampolskiy 2013, Ring & Orseau 2011, Yudkowsky 2011 (pdf link), and Omohundro 2008 (pdf link). See also “Refutation of the Lebowski Theorem” by Hein de Haan.

I’ll summarize the wireheading/Lebowski argument, the rebuttal, and the counter-rebuttal.

Lebowski in More Detail

Let’s consider a “smile maximization” example from Yudkowsky 2011 (pdf link), and then consider what the Lebowski argument would look like. It’s necessary to give some background, which I’ll do by quoting a longer passage directly. This is Yudkowsky considering an argument from Hibbard 2001:

From Super-Intelligent Machines (Hibbard 2001):

We can design intelligent machines so their primary innate emotion is unconditional love for all humans. First we can build relatively simple machines that learn to recognize happiness and unhappiness in human facial expressions, human voices and human body language. Then we can hard-wire the result of this learning as the innate emotional values of more complex intelligent machines, positively reinforced when we are happy and negatively reinforced when we are unhappy. Machines can learn algorithms for approximately predicting the future, as for example investors currently use learning machines to predict future security prices. So we can program intelligent machines to learn algorithms for predicting future human happiness, and use those predictions as emotional values.

When I suggested to Hibbard that the upshot of building superintelligences with a utility function of “smiles” would be to tile the future light-cone of Earth with tiny molecular smiley-faces, he replied (Hibbard 2006):

“When it is feasible to build a super-intelligence, it will be feasible to build hard-wired recognition of “human facial expressions, human voices and human body language” (to use the words of mine that you quote) that exceed the recognition accuracy of current humans such as you and me, and will certainly not be fooled by “tiny molecular pictures of smiley-faces.” You should not assume such a poor implementation of my idea that it cannot make discriminations that are trivial to current humans.”

Suppose an AI with a video camera is trained to classify its sensory percepts into positive and negative instances of a certain concept, a concept which the unwary might label “HAPPINESS” but which we would be much wiser to give a neutral name like G0034 (McDermott 1976). The AI is presented with a smiling man, a cat, a frowning woman, a smiling woman, and a snow-topped mountain; of these instances 1 and 4 are classified positive, and instances 2, 3, and 5 are classified negative. Even given a million training cases of this type, if the test case of a tiny molecular smiley-face does not appear in the training data, it is by no means trivial to assume that the inductively simplest boundary around all the training cases classified “positive” will exclude every possible tiny molecular smiley-face that the AI can potentially engineer to satisfy its utility function.

And of course, even if all tiny molecular smiley-faces and nanometer-scale dolls of brightly smiling humans were somehow excluded, the end result of such a utility function is for the AI to tile the galaxy with as many “smiling human faces” as a given amount of matter can be processed to yield.

Hibbard’s counter-argument is not the wirehead/Lebowski argument. Hibbard is arguing for a different position: roughly, that we’ll be able to achieve outer alignment by robustly encoding human priorities (somehow). In contrast, the wirehead argument would be something more like this:

Sure, the superintelligent system could easily convert the solar system into tiny molecular smiley faces. But wouldn’t it be much easier for it to simply change its utility function or falsify its sensors? For example, why not corrupt the cameras (the ones that are presumably counting smiles) so that they simply transmit pictures of smiley faces at the maximum bitrate? Or why not increment the smile counter directly? Or why not edit its own source code so that the utility value is simply set to MAX_INT? Etc. Since all of these solutions are faster and less risky to implement than converting the solar system to molecular smiley faces, the AGI would probably just wirehead in one of these non-catastrophic ways.

The more general form of this argument is that for any agentic AGI seeking to maximize its utility function, wireheading is stably “easier” than effecting any scenario with human extinction as a side effect. Whenever you imagine a catastrophic scenario, such as converting the earth into raw materials for paperclips or curing all disease by killing all organic life, it’s easy to imagine wirehead solutions that an agentic AGI could achieve faster and with a higher probability of success. Therefore, the agent will probably prefer these solutions. So goes the argument.
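To make the “easier” claim concrete, here is a toy sketch of my own (the plans, effort figures, and probabilities below are invented purely for illustration, not taken from any source): a planner that scores candidate plans by expected utility per unit of effort will, on numbers like these, always prefer a wirehead plan to the catastrophic one.

```python
# Toy illustration of the Lebowski intuition. All plans, effort figures, and
# probabilities are made up; the point is only the shape of the comparison.

from dataclasses import dataclass


@dataclass
class Plan:
    name: str
    utility_if_successful: float  # what the agent's reward counter would report
    effort: float                 # resources/time required (arbitrary units)
    success_probability: float


plans = [
    Plan("convert solar system to smiley faces", 1e12, effort=1e9, success_probability=0.3),
    Plan("feed fake smiley images to the camera", 1e12, effort=1e2, success_probability=0.99),
    Plan("increment the smile counter directly", 1e12, effort=1e1, success_probability=0.99),
    Plan("set own utility register to MAX_INT", float(2**31 - 1), effort=1e1, success_probability=0.95),
]


def score(plan: Plan) -> float:
    # Expected utility per unit of effort: a crude stand-in for "easier".
    return plan.success_probability * plan.utility_if_successful / plan.effort


for plan in sorted(plans, key=score, reverse=True):
    print(f"{plan.name:45s} score = {score(plan):.3g}")

print("Chosen plan:", max(plans, key=score).name)  # -> increment the smile counter directly
```

The specific scoring rule doesn’t matter here; the Lebowski claim is that any plausible trade-off between payoff, effort, and risk will favor the cheap, near-certain hacks.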

Counter-Argument

Most AI safety researchers reject this kind of AI wireheading. The argument, as I understand it, is that an agentic AGI will have a sufficiently rich understanding of its environment to “know” that the wirehead solutions “shouldn’t count” towards its utility function, and as a result it will reject them. Omohundro:

AIs will work hard to avoid becoming wireheads because it would be so harmful to their goals. Imagine a chess machine whose utility function is the total number of games it wins over its future. In order to represent this utility function, it will have a model of the world and a model of itself acting on that world. To compute its ongoing utility, it will have a counter in memory devoted to keeping track of how many games it has won. The analog of “wirehead” behavior would be to just increment this counter rather than actually playing games of chess. But if “games of chess” and “winning” are correctly represented in its internal model, then the system will realize that the action “increment my won games counter” will not increase the expected value of its utility function. [Emphasis added.] In its internal model it will consider a variant of itself with that new feature and see that it doesn’t win any more games of chess. In fact, it sees that such a system will spend its time incrementing its counter rather than playing chess and so will do worse. Far from succumbing to wirehead behavior, the system will work hard to prevent it.

...

It’s not yet clear which protective mechanisms AIs are most likely to implement to protect their utility measurement systems. It is clear that advanced AI architectures will have to deal with a variety of internal tensions. They will want to be able to modify themselves but at the same time to keep their utility functions and utility measurement systems from being modified. They will want their subcomponents to try to maximize utility but to not do it by counterfeiting or shortcutting the measurement systems. They will want subcomponents which explore a variety of strategies but will also want to act as a coherent harmonious whole. They will need internal “police forces” or “immune systems” but must also ensure that these do not themselves become corrupted.

Yudkowsky considers an analogy with a utility-changing pill:

Suppose you offer Gandhi a pill that makes him want to kill people. The current version of Gandhi does not want to kill people. Thus if Gandhi correctly predicts the effect of the pill, he will refuse to take the pill; because Gandhi knows that if he wants to kill people, he is more likely to actually kill people, and the current Gandhi does not wish this. This argues for a folk theorem to the effect that under ordinary circumstances, rational agents will only self-modify in ways that preserve their utility function (preferences over final outcomes).

I think the crux of this argument is that an agentic AGI will know what its utility function “really” is. Omohundro’s chess-playing machine will have some “dumb” counter in physical memory tracking the number of games it has won, but its “real” utility function is some emergent understanding of what a game of chess is and what it means to win one. The agent will consider editing its dumb counter, will check that edit against its internal model of its own utility function, and will decide that flipping bits in the counter doesn’t meet its definition of winning a game of chess. Therefore it won’t wirehead in this way.
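Here is a minimal sketch of that crux, under Omohundro’s assumption that candidate self-modifications are evaluated by the agent’s current internal model of utility rather than by the counter the modification would inflate. The function names and the win rate are my own illustrative choices, not anything from the cited papers.

```python
# Toy sketch of Omohundro's counter-argument: candidate policies are judged by
# the agent's CURRENT internal model of utility (actual games won), not by what
# the in-memory counter would read afterwards. All numbers are invented.

def games_actually_won(policy: str, horizon: int = 100) -> int:
    """Stand-in for the agent's internal model of the world; assume it wins
    roughly 60% of the games it genuinely plays."""
    if policy == "play chess":
        return int(0.6 * horizon)
    if policy == "increment the won-games counter":
        return 0  # the counter goes up, but no real games get won
    raise ValueError(policy)


def counter_reading(policy: str, horizon: int = 100) -> int:
    """What the 'games won' counter in memory would display."""
    if policy == "play chess":
        return games_actually_won(policy, horizon)
    if policy == "increment the won-games counter":
        return horizon  # wirehead: counter maxed out, nothing achieved
    raise ValueError(policy)


candidates = ["play chess", "increment the won-games counter"]

# The agent scores each candidate with its current utility model...
print("Chosen policy:", max(candidates, key=games_actually_won))  # -> play chess

# ...even though the wirehead variant's counter would read higher.
for policy in candidates:
    print(f"{policy}: counter={counter_reading(policy)}, model says won={games_actually_won(policy)}")
```

On this picture, the wirehead option loses because it is evaluated before any modification happens, using the unmodified notion of winning.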

Lebowski Rebuttal

I think that a proponent of the Lebowski Theorem would reply to this by saying:

Come on, that’s wrong on multiple levels. First of all, it’s pure anthropomorphic projection to think that an AGI will have a preference for its ‘real’ goals when you’re explicitly rewarding it for the ‘proxy’ goal. Sure, the agent might have a good enough model of itself to know that it’s reward hacking. But why would it care? If you give the chess-playing machine a definition of what it means to win a game of chess (here’s the “who won” function, here’s the counter for how many games you won), and ask it to win as many games as possible by that definition, it’s not going to replace your attempted definition of a win with the ‘real’ definition of a win based on a higher-level understanding of what it’s doing.

Second, and more importantly, this argument says that alignment is extremely hard on one hand and extremely easy on the other. If you’re arguing that the chess-playing robot will have such a robust understanding of what it means to win a game of chess that it will accurately recognize and reject its own sophisticated ideas for “cheating” because they don’t reflect the true definition of winning chess games, then alignment should be easy, right? Just say, “Win as many games of chess as possible, but, you know, try not to create any x-risk in the process, and just ask us if you’re not sure.” Surely an AGI that can robustly understand what it “really” means to win a game of chess should be able to understand what it “really” means to avoid alignment x-risk. If, on the other hand, you think unaligned actions are likely (e.g. converting all the matter in the solar system into more physical memory to increment the ‘chess games won’ counter), then surely reward hacking is at least equally likely. It’s weird to simultaneously argue that it’s super hard to robustly encode safeguards into an AI’s utility function, but that it’s super easy to robustly encode “winning a game of chess” into that same utility function, so much so that all the avenues for wireheading are cut off.
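In the same toy spirit (again my own illustration, not drawn from the cited papers), the first half of this reply amounts to a claim about which quantity the agent is actually built to maximize: the designer’s intended notion of success, or the objective as literally implemented.

```python
# Toy illustration of the rebuttal: nothing in the setup forces the agent to
# prefer the "real" goal over the objective it was literally given. Made-up numbers.

def intended_wins(policy: str) -> int:
    # The designer's intended notion of success: real games won.
    return {"play chess": 60, "hack the counter": 0}[policy]


def implemented_objective(policy: str) -> int:
    # The objective as literally implemented: the value of a register in memory.
    return {"play chess": 60, "hack the counter": 10**9}[policy]


policies = ["play chess", "hack the counter"]

# Optimizing the intended goal reproduces Omohundro's picture...
print(max(policies, key=intended_wins))          # -> play chess

# ...optimizing the objective as implemented reproduces the Lebowski picture.
print(max(policies, key=implemented_objective))  # -> hack the counter
```

The two sketches share the same world model and differ only in which function the agent maximizes, which is where the two sides of this debate part ways.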

To avoid an infinite series of rebuttals and counter-rebuttals, I’ll just briefly state what I think the anti-Lebowski response would be here: you only need to fail to robustly encode safeguards once for it to be a problem. It’s not that robust encoding is impossible (finding those encodings is the major project of AI safety research!); it’s that there will be lots of chances to get it wrong, and any one could be catastrophic.

As usual, I don’t want to explicitly endorse one side of this argument or the other. Hopefully, I’ve explained both sides well enough to make each position more understandable.