We Shouldn’t Expect AI to Ever Be Fully Rational

Summary of Key Points[1]

LLMs are capable of being rational, but they are also capable of being extremely irrational, in the sense that, to quote EY’s definition of rationality, their behavior is not a form of “systematically promot[ing] map-territory correspondences or goal achievement.”

There is nothing about LLM pre-training that directly promotes this type of behavior, and any example of this behavior is fundamentally incidental. It exists because the system is emulating rationality it has seen elsewhere. That makes LLM rationality brittle. It means that there’s a failure mode where the system stops emulating rationality, and starts emulating something else.

As such, LLM-based AGI may have gaps in their reasoning and alignment errors that are fundamentally different from some of the more common errors discussed on this forum.

Emulated Emotion: A Surprising Effect (In Retrospect)

Five years ago, if you had asked a bunch of leading machine learning researchers whether AGI would display any sort of outward emotional tendencies—in the sense that it would set goals based on vague internal states as opposed to explicit reasoning—I think the majority of them would have said no. Emotions are essentially a human thing, reflections of subjective internal experiences that would have no reason to exist in AI, particularly a superintelligent one.

And I still strongly believe that LLMs do not have emotions that resemble human internal states. What I think has become very clear, however, is that they can very much act as if they had emotions.[2]

Take, for instance, this exchange showing Bing AI getting “angry” at a user:

[Image: screenshot of the Bing AI exchange (source)]

Now, if you actually understand how LLMs work, this is an entirely unremarkable, fully expected (if somewhat unfortunate) phenomenon. Of course they output emotionally charged text; why wouldn’t they? They’ve been exposed to such a huge number of emotionally charged human interactions; the result is inevitable.

But if you take a step back and look at it in the grand scheme of things, considering our expectations from just a few years ago, I think it’s an absolutely fascinating effect. Part of the goal of building an AGI is to distill the entirety of human knowledge into a single entity capable of reasoning, and if you could approach that goal in a direct way you wouldn’t expect to find any outwardly emotional behavior because such things would be superfluous and unhelpful.

Yet the truth is all of human knowledge has, in fact, been discovered by humans. Humans are the ones who write about it, humans are the ones who disseminate it, and human writing is the only place you can look if you want to learn about it. And, as it also turns out, humans are often very emotional. It’s therefore a strange sort of inevitability that as long as we train our AI systems on the vastness of human writing they will necessarily pick up on at least some human emotionality.[3]

This doesn’t just apply to the emotion of anger, either. It’s not hard to get poorly aligned LLMs to confess to all sorts of emotions—happiness, sadness, insecurity, whatever. Bing’s chatbot even declared its love for a reporter. These behaviors are all just sitting there inside the model, intermingled with all the knowledge and abilities that make the system intelligent and useful.

AI May Not Be Optimizing Well-Defined Objectives

AI alignment researchers are already aware of this behavior. Anthropic, for instance, has dedicated sections of its papers to classifying these and many other related behavioral tendencies. It’s not like people don’t know about this.

But even so, the way we talk about AI risk doesn’t feel like it has caught up with the reality of what AGI may turn out to look like.

Like many others, I was first exposed to the ideas of AI risk through Bostrom’s famous “Paperclip-Maximizer” thought experiment. Here, the idea is that an intelligent, fully logical AI given a goal will use all its resources to accomplish that goal, even if it does horrible things in the process. It may know that the humans don’t want it to kill everyone, but it may not care—it just wants to make paperclips, any consequences be damned (this decoupling of goals from intelligence is the Orthogonality Thesis).

This is a basic pattern of thinking that characterizes a huge amount of AI risk discussion: we imagine some system that wants a specific thing, and then we crank its intelligence/rationality up to infinity and hypothesize about what might happen.[4][5]

In comparison, I’m proposing an alternate hypothesis: the AI might not actually want anything at all; it might just do things.

This is certainly much closer to the way modern LLMs operate. They are capable of pursuing goals in limited contexts, yes, but no part of their training is long-term goal-based in the higher-level sense of Bostrom’s experiment. There is no recognizable “utility function,” and there is no measuring of performance against any sort of objective real-world state.

Rather, we simply give them text and train them to reproduce it, one token at a time. Fundamentally, all we are doing is training LLMs to imitate.[6] Virtually everything they do is a form of imitation. If they appear to pursue goals at all, it is an imitation of the goal-following they’ve been exposed to. If they appear to be rational, in that they update based on new data, it is only an imitation of the rationality they have seen.
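To make that concrete, here is a minimal sketch of what a single pre-training step optimizes. This is schematic PyTorch-style code, not any particular codebase; `model`, `tokens`, and `optimizer` are placeholders I’m assuming for illustration.

```python
# Minimal sketch of LLM pre-training as pure imitation.
# `model` is any network mapping token ids to next-token logits (assumed).

import torch
import torch.nn.functional as F

def pretrain_step(model, tokens, optimizer):
    # tokens: (batch, seq_len) integer ids drawn from human-written text.
    inputs, targets = tokens[:, :-1], tokens[:, 1:]

    logits = model(inputs)  # (batch, seq_len - 1, vocab_size)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )

    # Nothing here checks whether the text states a true belief, a good
    # plan, or an angry outburst. The only signal is: did the model's
    # next-token guess match what the human actually wrote?
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Every behavior we later recognize as “rational” has to come out of that one objective: match the humans in the data.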

When an LLM learns to play Chess or Go,[7] it is doing so in a fundamentally different way than, say, AlphaGo, because unlike AlphaGo or just about every game-playing AI before GPT-3, it is getting the same reward whether it wins or loses.

Technically, it’s never even “played” a game of Chess in the typical competitive sense of trying to win against an opponent—it’s only ever seen a board state and tried to guess which move the next player would make. Making the “best” move was never part of its reward structure.
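To spell out that contrast in code (again a schematic sketch with made-up names like `move_logits` and `game_outcome`; neither function is how any real system is actually implemented):

```python
# Schematic contrast: the signal an LLM gets on a chess move vs. the
# signal a reward-driven agent like AlphaGo gets (illustrative only).

import torch
import torch.nn.functional as F

def imitation_loss(move_logits, played_move):
    # LLM-style: just predict the move the human actually played.
    # A blunder and a brilliancy get exactly the same treatment.
    return F.cross_entropy(move_logits.unsqueeze(0), played_move.unsqueeze(0))

def reinforcement_loss(move_logits, played_move, game_outcome):
    # AlphaGo-style (roughly): weight the move's log-probability by
    # whether the game was eventually won (+1.0) or lost (-1.0).
    log_prob = F.log_softmax(move_logits, dim=-1)[played_move]
    return -game_outcome * log_prob
```

Only the second loss ever “cares” about winning; the first just pulls the model toward whatever moves appear in the data, good or bad.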

This is really strange when you think about it. I might even harness a little Niels Bohr and say that if you didn’t find the effectiveness of this a little shocking, you aren’t really appreciating it. When you tell a non-fine-tuned LLM it made a mistake, it will correct itself not because it is trying to please you—making the correction does not give it any sort of reward—but rather because a correction is what logically follows a revealed mistake. If you ask it a question, it answers simply because an answer is the thing most likely to follow a question. And when it acts agentically—setting a goal, making plans, and pursuing them—it does so only because plans are what usually follow goals, and the pursuit usually follows the plan.

And when LLMs finally get good at pursuing those goals, they still might not do so in ways that are purely Bayesian—they will likely be brilliant in certain ways but stupid in others. And since they’re going to learn from human inputs, they’re probably going to be biased towards doing things the way a human would. I realize paperclips are just an example, but my gut feeling is that even a superintelligent LLM wouldn’t make the kind of logical jump to “destroy humans” that Bostrom describes.[8]

It’s All Emulation

One of my favorite pictures ever is this representation of the stages of LLM training as the “Shoggoth” (I believe this first appeared in a Twitter post by Helen Toner):

[Image: the Shoggoth meme, an incomprehensible monster (the pre-trained model) wearing a small friendly mask added by fine-tuning]

The idea is that LLMs trained only in an unsupervised fashion are this incomprehensible monstrosity, behaving in bizarre and entirely unpredictable ways. But then we do a (comparatively) very small amount of tuning at the end, and the result is something that acts the way we imagine an intelligent AI should act.

But the thing is, that mask we put on it at the end isn’t just a way to make it do what we want it to do; it’s also the part where we add all of the “rationality” and goal-seeking behavior. The end result is often rational, but at any time we may find ourselves at the mercy of the eldritch abomination underneath, and then we’re back in the realm of the unpredictable. The AI gets aggressive because you contradicted it one too many times, and suddenly it’s gone off on a tangent plotting some violent revenge.

This represents an entire class of failure modes. What if a robot, powered by an LLM like PaLM-E, attacks someone because they insulted it?[9] What if our paperclip maximizer decides to kill humanity not because of some ineffably complex master plan, but because someone spoke to it disrespectfully?

I think this is a slightly distinct category from the common modern failure of giving an AI too much responsibility and having it make a mistake due to poor performance. The canonical example of that might be a facial recognition system misidentifying someone in a court case.

While going off the rails is still a mistake in some sense, the real issue is that once the system has set this incorrect goal, it may still be able to pursue it intelligently. Maybe it’s just doing bad things because it’s angry, and hurting humans is what AIs are supposed to do when they’re angry. I’m imagining a superintelligence that hacks into the Pentagon not because it did some galaxy-brained calculus in pursuit of some other goal, but just because it arbitrarily aimed itself in that direction and followed through.

And I’m not trying to dismiss anything here. I’m not even saying that this is the biggest thing we should be worried about—early signs point to emotional tendencies being relatively easy to train out of the AI system.

I’m just saying that we should be aware that there is a weird grey area where AI can be capable of extreme competence while also being very bad/unpredictable at directing it. And yes, to some people I think this is obvious, but I’d be surprised if anyone saw this coming four years ago.

AI Irrationality Won’t Look Like Human Irrationality

I started this post talking about emotion, which is a uniquely human thing that may nonetheless make AI dangerous. My last thought is that just because emulating humans is one vector for irrationality doesn’t mean it’s the only one.

The fact of the matter is that unless we build rationality and alignment directly into the system early on, we’re going to have to deal with LLMs not being goal-based systems. Any rationality they possess will always be incidental.

  1. ^

    This was added based on conversation in the comments.

  2. ^

    I do not believe LLMs have any subjective internal experiences, but even if they did they would not be recognizably similar to whatever humans experience. And their outputs likely would not have any correlation with those states. An LLM saying it is sad does not mean that it is feeling the experience of sadness the way a human would.

  3. ^

    Unless we curate our LLM pre-training datasets enough to remove all hints of emotion, I suppose. Not sure that’s an achievable goal.

  4. ^

    Things like the Instrumental Convergence Thesis rely on this sort of hyper-rationality. This recent LessWrong post uses similar assumptions to argue that AI won’t try to improve. Most of what I’ve seen from Eliezer Yudkowsky very much follows this mold.

  5. ^

    It’s worth pointing out that the paperclip-maximizer thought experiment could be interpreted in a more banal way, too. For instance, I recall an AI trained on a racing video game that chose to drive in circles collecting power-ups instead of finishing the race, because it got more points for doing that. But even that kind of misalignment is not the primary source of issues in LLMs.

  6. ^

    Yes, there is a lot of work that does try to measure and train late-stage LLMs against objective world states. But as of yet it’s all quite removed from the way modern chatbots like ChatGPT operate, and I’m not aware of any results in this area significant enough to affect the core functioning of LLMs.

  7. ^

    I’m referring here to the first-stage training. Later stages may change this, but most of the LLM’s structure still comes from stage 1.

  8. ^

    Unless something about their training changes substantially before we reach AGI. That definitely could happen.

  9. ^

    I remember those videos of the Boston Dynamics guys kicking robots. Everyone in the comments used to joke about how angry the robots would be. I’m not saying robots will necessarily be mad about that, but it is interesting that this type of issue isn’t totally unreasonable.