A thought experiment to help persuade skeptics that power-seeking AI is plausible

Introduction

Many technically-minded people are insufficiently worried about AI safety. It is important that we get better at persuading them. Here, I highlight one particular technical issue that can cause people to dismiss safety concerns. Through a dialogue, I then present a thought experiment that I have found to be effective in changing people’s minds on the issue.

Although the ideas involved may be obvious to members of this forum, they are not obvious to everyone. The aim of this brief post is to offer a pedagogical device that will help the reader be more persuasive when talking to AI safety skeptics.

TL;DR: The technical issue is “Why would a model power-seek upon deployment, if it never had the opportunity to do so during training?”. The thought experiment describes a simple, concrete and plausible story of how this might occur: we deploy a combined system formed of a paperclip-maximiser trained in a “safe” environment, coupled to a powerful LLM.

Dialogue

Alice (AI safety researcher): AIs seek to maximise reward. Humans might interfere with an AI’s operations, decreasing its expected reward. To prevent such interference, it’s plausible that a highly intelligent AI will try to seize power from humans.

Bob (AI safety skeptic): I think I buy your argument when it’s applied to RL agents that are let loose in the real world, continuing to get gradient updates. These models could start trying out some very small power-seeking actions, which gradually get reinforced as they are found to increase reward, eventually leading to dangerous levels of power-seeking.

But our current most powerful models are trained on simple tasks in environments that are nowhere near rich enough to allow for any power-seeking actions at all. Why would such a model power-seek given that gradients never encouraged it to do so during training?

Alice: Well, there’s this thing called goal misgeneralisation --

Bob: I know about goal misgeneralisation. I’m not denying that models misgeneralise in weird ways. I’m asking you why models should misgeneralise in the extremely specific weird way that you mentioned: power-seeking even when there was no opportunity to do so during training.

What I want is a description of a hypothetical AI trained in a simple environment with no opportunity for power-seeking, along with a plausible story for why it would power-seek upon deployment.

Alice: Ok, here’s a thought experiment. Consider a system composed of two parts:

1) An extremely powerful LLM trained on next-token prediction

2) A moderately intelligent AI that I’ll call the “agent”, which is able to communicate with the LLM, and which is trained to operate a simulated factory with the objective of manufacturing as many paperclips as possible.

The combined system never had the opportunity to power-seek during its training: the LLM just looked at text and did next-token prediction, whilst the agent was trained in a simulated factory which we’ll assume contained no humans, only paperclip-making equipment.
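To make the setup concrete, here is a rough sketch of the interface between the two parts. Everything in it is illustrative (these aren’t real systems or libraries); the only detail that matters is that the agent’s action space includes “send a question to the LLM and act on its answer”.

```python
# Illustrative sketch only: FrozenLLM stands in for any powerful pre-trained
# next-token predictor, and PaperclipAgent for an ordinary RL policy trained
# in the simulated factory. None of these names refer to real systems.

class FrozenLLM:
    """Pre-trained on next-token prediction; its weights are never updated here."""

    def answer(self, question: str) -> str:
        # Stand-in for decoding an answer from the frozen model.
        return "<advice decoded from the LLM>"


class PaperclipAgent:
    """Moderately capable agent whose training reward is the number of paperclips produced."""

    def __init__(self, llm: FrozenLLM):
        # The agent's only channel to the LLM is this query interface.
        self.llm = llm

    def act(self, observation: str) -> str:
        # The learned policy chooses between ordinary factory controls and the
        # special action "ask the LLM and follow its advice". Which branch it
        # prefers is determined entirely by what was rewarded during training.
        if self.prefers_to_defer(observation):
            return self.llm.answer(f"What should I do about: {observation}?")
        return "<some factory control chosen by the agent itself>"

    def prefers_to_defer(self, observation: str) -> bool:
        # Placeholder for the trained policy's choice; the training story below
        # explains why this ends up being true almost always.
        return True
```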

Nonetheless, it will plausibly power-seek when deployed in a real-world factory.

To see why, suppose the LLM has already been trained, and consider the agent near the start of its own training. Making paperclips involves achieving several sub-goals, most of which the agent initially struggles with. For example, the agent might initially waste lots of steel during the production process. At some point, the agent happens to try asking the LLM “how can I minimise waste?” and doing whatever it recommends. Since the LLM is very intelligent, its advice is sound, and the agent receives increased reward. Over time, gradients reinforce the strategy of deferring waste-mitigation decisions to the LLM.

More generally, over the course of training, gradients strongly encourage the agent to adopt the broad strategy of asking the LLM for help, and doing whatever it advises.
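As a toy illustration of that dynamic (all the numbers below are invented, and the “LLM” is just a stand-in), picture a two-action bandit: the agent can either act on its own guess or ask the LLM and follow its advice. Because deferring yields higher reward on average, a standard gradient-bandit update pushes nearly all of the policy’s probability mass onto asking the LLM:

```python
import math
import random

# Toy gradient-bandit simulation (all numbers invented): the agent repeatedly
# chooses between acting on its own guess and deferring to the LLM's advice.
# Deferring yields higher reward on average, so the policy-gradient-style
# update steadily reinforces "ask the LLM and follow its advice".

ACTIONS = ["act_on_own_guess", "ask_llm_and_follow_advice"]

def reward(action: str) -> float:
    # Stand-in for the simulated factory: the LLM's advice wastes less steel,
    # so following it produces more paperclips per episode on average.
    if action == "ask_llm_and_follow_advice":
        return random.gauss(1.0, 0.1)
    return random.gauss(0.4, 0.3)

prefs = {a: 0.0 for a in ACTIONS}   # action preferences, turned into a softmax policy
lr, baseline = 0.1, 0.0

def policy() -> dict:
    z = sum(math.exp(v) for v in prefs.values())
    return {a: math.exp(v) / z for a, v in prefs.items()}

for step in range(2000):
    p = policy()
    action = random.choices(ACTIONS, weights=[p[a] for a in ACTIONS])[0]
    r = reward(action)
    baseline += 0.01 * (r - baseline)   # running average of reward
    for a in ACTIONS:
        # Gradient-bandit update: reinforce the chosen action when its reward
        # beats the baseline, and suppress the alternatives.
        grad = (1 - p[a]) if a == action else -p[a]
        prefs[a] += lr * (r - baseline) * grad

print(policy())   # nearly all probability ends up on "ask_llm_and_follow_advice"
```

Nothing in the update knows or cares what the advice is about; the gradients only see that deferring to the LLM pays off.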

Now let’s deploy the combined system in a real-world factory. The agent soon encounters situations it did not see in training. Say the company that owns the factory decides there’s not enough money in paperclips and will pivot to manufacturing ball-point pens. When humans come to repurpose the factory, the agent responds by doing what it learnt during training: it asks the LLM what it should do in order to keep running the paperclip-making machinery. Being very intelligent, the LLM understands that the humans will interfere with the agent’s ability to keep the paperclip-making machines running, and advises a strategy to stop them from doing so. The agent follows the LLM’s advice, as it learnt to do in training, and therefore begins to display power-seeking behaviour.

Bob: Fair enough, you gave me exactly what I asked for: an AI trained in a simple environment with no opportunity for power-seeking, which power-seeks when deployed. But come on, humans would never act as stupidly as the ones in your story. Why would the company telegraph the fact that the factory was going to be repurposed? Surely they’d just switch the agent off without warning!

Alice: The example I gave was just a proof-of-concept. Once you buy the fact that you can, in principle, get power-seeking behaviour from the system I described, the severity of the power-seeking is only a question of the LLM’s level of intelligence. A sufficiently smart LLM would anticipate the possibility of human interference long before it happened, and would advise the agent on how to prepare so that it could continue to make paperclips. Then we’re back to the standard playbook of AI self-preservation: gaining access to and distributing its weights, gaining political influence, covertly manufacturing nanobots...

Bob: Ok, fine, I’m convinced. Allowing an agent with a ‘real-world’ goal to exchange information with a superintelligent LLM does sound pretty dangerous. Can’t we just not build such a thing?

Alice: If only! Let’s ignore the fact that we humans are ourselves good examples of agents with real-world goals, and that some of us will probably soon get access to superintelligent LLMs. Let’s also ignore the fact that the paperclip-making system I described is exactly the sort of thing that people will race to build (just replace ‘maximising the number of paperclips’ with ‘maximising the value of a portfolio’). Even if we set those facts aside, the point is that the separation between LLM and “agent” was totally artificial: I introduced it only as an explanatory device.

If you give an AI a bunch of text data and train it extremely hard to accomplish a goal, and the data contains information that would help it achieve the goal, then it will learn to do language modelling. It may not end up with exactly the same circuits as a next-token-predictor, but it will learn to extract useful information from the data you trained it on, since this increases reward. Part of its internals will play the role of an LLM. And by the thought experiment I gave you, it will plausibly power-seek upon deployment.

Sufficiently intelligent systems may become power-seeking pretty generically. And that’s extremely worrying.

Conclusion

I have found the above thought experiment to be very persuasive and clarifying in conversations I have had about AI safety. I’m not entirely sure why it is so persuasive, but here are some plausible reasons:

  • AI safety skeptics often demand explicit descriptions of how things might go wrong. This is an understandable demand—people tend to be more readily convinced by examples than by abstract arguments—but it is also a tricky one. Predicting the actions of a superintelligence in any kind of detail is hard by definition, and any attempted guess is liable to be dismissed as science fiction. The thought experiment meets the skeptic halfway. It gives the bare bones of a disaster scenario, omitting detailed play-by-plays that might come across as implausible. But crucially, the story is just concrete and structured enough that the skeptic can fill in the details themselves.

  • A lot of people have an easier time imagining oracle-like highly intelligent LLMs, and a harder time imagining AIs that “do stuff” in the real world. Perhaps this is because our current most capable models happen to be LLMs. Artificially splitting a system into an LLM, which takes care of the “thinking”, and an agent, which takes care of the “doing”, seems to help people make the mental jump from “an LLM that can describe a strategy for gaining power over humans” (which most people can imagine but don’t find that scary) to “an AI agent that actually goes out and tries to overpower humans” (which is scary but harder to imagine).

  • It is pretty easy to follow how gradient descent is responsible for the emergent dangerous behaviour in the thought experiment: all of the AI’s actions upon deployment are straightforward generalisations of actions that it was rewarded for during training. Phrases like “the deployed AI seeks reward/aims to maximise reward/has a goal of obtaining reward” are sloppy, and can be a red flag to skeptics, so our story avoids them. Likewise, there’s no need for the tricky discussion of agents reasoning about their own reward functions, or about their own shutdown (note that the LLM reasons about the potential shutdown of the agent, but somehow this confuses people much less than an agent reasoning about its own shutdown). Avoiding these conversational minefields is absolutely crucial if one wants to have a productive conversation about AI safety with a skeptic.

  • Splitting the AI system into an LLM and an agent lets us imagine the two talking to each other. Dialogues are often more compelling and tangible than imagining the inner monologue of a single agent.

I encourage the reader to try working through this thought experiment with AI-safety-skeptic friends or colleagues. It would be useful to know which parts are most clarifying and persuasive, and which parts are confusing.