
Instrumental convergence

Last edit: 19 Feb 2025 21:44 UTC by RobertM


Introduction: A machine of unknown purpose

Suppose you landed on a distant planet and found a structure of giant metal pipes, crossed by occasional cables. Further investigation shows that the cables are electrical superconductors carrying high-voltage currents.

You might not know what the structure did. But you would nonetheless guess that it had been built by some intelligence, rather than being a naturally occurring mineral formation—that there were aliens who built the structure for some purpose.

Your reasoning might go something like this: “Well, I don’t know if the aliens were trying to manufacture cars, or build computers, or what. But if you consider the problem of efficient manufacturing, it might involve mining resources in one place and then efficiently transporting them somewhere else, like by pipes. Since the most efficient size and location of these pipes would be stable, you’d want the shape of the pipes to be stable, which you could do by making the pipes out of a hard material like metal. There’s all sorts of operations that require energy or negentropy, and a superconducting cable carrying electricity seems like an efficient way of transporting that energy. So I don’t know what the aliens were ultimately trying to do, but across a very wide range of possible goals, an intelligent alien might want to build a superconducting cable to pursue that goal.”

That is: We can take an enormous variety of compactly specifiable goals, like “travel to the other side of the universe” or “support biological life” or “make paperclips”, and find very similar optimal strategies along the way. Today we don’t actually know if electrical superconductors are the most useful way to transport energy in the limit of technology. But whatever is the most efficient way of transporting energy, whether that’s electrical superconductors or something else, the most efficient form of that technology would probably not vary much depending on whether you were trying to make diamonds or make paperclips.

Or to put it another way: If you consider the goals “make diamonds” and “make paperclips”, then they might have almost nothing in common with respect to their end-states—a diamond might contain no iron. But the earlier strategies used to make a lot of diamond and make a lot of paperclips might have much in common; “the best way of transporting energy to make diamond” and “the best way of transporting energy to make paperclips” are much more likely to be similar.

From a Bayesian standpoint this is how we can identify a huge machine strung with superconducting cables as having been produced by high-technology aliens, even before we have any idea of what the machine does. We’re saying, “This looks like the product of optimization, a strategy S that the aliens chose to best achieve some unknown goal Y; we can infer this even without knowing Y, because many possible goals Y would concentrate probability onto this strategy S being used.”

Convergence and its caveats

When you select a policy π because you expect it to achieve a later state Y (the “goal”), we say that π is your instrumental strategy for achieving Y. The observation of “instrumental convergence” is that a widely different range of Y-goals can lead to highly similar π-strategies. (This becomes truer as the Y-seeking agent becomes more instrumentally efficient; two very powerful chess engines are more likely to solve a humanly solvable chess problem the same way, compared to two weak chess engines whose individual quirks might result in idiosyncratic solutions.)

If there’s a simple way of classifying possible strategies into partitions S and ¬S, and you think that for most compactly describable goals the corresponding best policies are likely to lie inside S, then you think S is a “convergent instrumental strategy”.

In other words, if you think that a superintelligent paperclip maximizer, diamond maximizer, a superintelligence that just wanted to keep a single button pressed for as long as possible, and a superintelligence optimizing for a flourishing intergalactic civilization filled with happy sapient beings, would all want to “transport matter and energy efficiently” in order to achieve their other goals, then you think “transport matter and energy efficiently” is a convergent instrumental strategy.

In this case “paperclips”, “diamonds”, “keeping a button pressed as long as possible”, and “sapient beings having fun” would be the goals Y. The corresponding best strategies for achieving these goals would not be identical—the policies for making paperclips and diamonds are not exactly the same. But all of these policies (we think) would lie within the partition S where the superintelligence tries to “transport matter and energy efficiently” (perhaps by using superconducting cables), rather than the complementary partition ¬S where the superintelligence does not try to transport matter and energy efficiently.

Semiformalization

If, given our beliefs about our universe and about which policies lead to which real outcomes, we think that in an intuitive sense it sure looks like at least 90% of the utility functions U ought to imply best findable policies which lie within the partition S, we’ll allege that S is “instrumentally convergent”.
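The “at least 90% of utility functions” criterion can be sketched as a Monte Carlo check over a toy world. Everything below is invented for illustration (the strategy names, the RESOURCES numbers, the way random goals are sampled); it shows only the shape of the test, not a real measure over utility functions.

```python
import random

# Toy world: the agent picks one of four strategies. Two of them route
# through "efficient energy transport" (the partition S); two do not.
STRATEGIES = ["superconductors", "pipes", "no_transport_a", "no_transport_b"]
IN_S = {"superconductors", "pipes"}

# Hypothetical model: strategies in S deliver more raw resources, and a
# random goal's achievement scales with resources delivered.
RESOURCES = {"superconductors": 10.0, "pipes": 8.0,
             "no_transport_a": 2.0, "no_transport_b": 1.0}

def random_goal(rng):
    """Sample a 'compactly specifiable goal' as a random positive weight
    on resources plus goal-specific noise per strategy."""
    weight = rng.uniform(0.5, 2.0)
    noise = {s: rng.uniform(0.0, 8.0) for s in STRATEGIES}
    return lambda s: weight * RESOURCES[s] + noise[s]

def best_strategy(utility):
    return max(STRATEGIES, key=utility)

def fraction_converging(n=10_000, seed=0):
    rng = random.Random(seed)
    hits = sum(best_strategy(random_goal(rng)) in IN_S for _ in range(n))
    return hits / n

if __name__ == "__main__":
    print(f"fraction of random goals whose best strategy lies in S: "
          f"{fraction_converging():.3f}")
```

Despite every sampled goal caring about something different, nearly all of them are best served by a strategy inside S, which is exactly the kind of statistic the semiformalization gestures at.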

Compatibility with Vingean uncertainty

Vingean uncertainty is the observation that, as we become increasingly confident of increasingly powerful intelligence from an agent with precisely known goals, we become decreasingly confident of the exact moves it will make (unless the domain has an optimal strategy and we know the exact strategy). E.g., to know exactly where Deep Blue would move on a chessboard, you would have to be as good at chess as Deep Blue. However, we can become increasingly confident that more powerful chessplayers will eventually win the game—that is, steer the future outcome of the chessboard into the set of states designated ‘winning’ for their color—even as it becomes less possible for us to be certain about the chessplayer’s exact policy.

Instrumental convergence can be seen as a caveat to Vingean uncertainty: Even if we don’t know the exact actions or the exact end goal, we may be able to predict that some intervening states or policies will fall into certain abstract categories.

That is: If we don’t know whether a superintelligent agent is a paperclip maximizer or a diamond maximizer, we can still guess with some confidence that it will pursue a strategy in the general class “obtain more resources of matter, energy, and computation” rather than “don’t get more resources”. This is true even though Vinge’s Principle says that we won’t be able to predict exactly how the superintelligence will go about gathering matter and energy.

Imagine the real world as an extremely complicated game. Suppose that at the very start of this game, a highly capable player must make a single binary choice between the abstract moves “Gather more resources later” and “Never gather any more resources later”. Vingean uncertainty or not, we seem justified in putting a high probability on the first move being preferred—a binary choice is simple enough that we can take a good guess at the optimal play.

Convergence supervenes on consequentialism

S being “instrumentally convergent” doesn’t mean that every mind needs an extra, independent drive to do S.

Consider the following line of reasoning: “It’s impossible to get on an airplane without buying plane tickets. So anyone on an airplane must be a sort of person who enjoys buying plane tickets. If I offer them a plane ticket they’ll probably buy it, because this is almost certainly somebody who has an independent motivational drive to buy plane tickets. There’s just no way you can design an organism that ends up on an airplane unless it has a buying-tickets drive.”

The appearance of an “instrumental strategy” can be seen as implicit in a consequentialist agent repeatedly choosing whichever actions lead into its final state Y, where it so happens that those actions lie in S. There doesn’t have to be a special S-module which repeatedly selects S-actions regardless of whether or not they lead to Y.

The flaw in the argument about plane tickets is that human beings are consequentialists who buy plane tickets just because they wanted to go somewhere and they expected the action “buy the plane ticket” to have the consequence, in that particular case, of going to the particular place and time they wanted to go. No extra “buy the plane ticket” module is required, and especially not a plane-ticket-buyer that doesn’t check whether there’s any travel goal and whether buying the plane ticket leads into the desired later state.

More semiformally, suppose that U is the utility function of an agent and let π₀ be the policy it selects. If the agent is instrumentally efficient relative to us at achieving U, then from our perspective we can mostly reason about whatever kind of optimization it does as if it were expected utility maximization, i.e.:

π₀ ≈ argmax_π E[U | π]

When we say that S is instrumentally convergent, we are stating that for most utility functions U it probably so happens that:

argmax_π E[U | π] ∈ S

We are not making any claims along the lines that, for an agent to thrive, its utility function must decompose into a term for S plus a residual term V denoting the rest of the utility function. Rather, the S-behavior is the mere result of unbiased optimization for a goal U that makes no explicit mention of S.
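This point, that S-behavior falls out of plain goal optimization with no S-term, can be illustrated with a toy planning problem. The world model, action names, and horizon below are all invented; the utility function scores only final paperclips, yet the optimal plan “acquires resources”.

```python
from itertools import product

# Hypothetical world model: the agent chooses a sequence of actions.
def run(plan):
    resources, clips = 0, 0
    for action in plan:
        if action == "mine":      # acquire resources
            resources += 3
        elif action == "build":   # spend all resources on paperclips
            clips += resources
            resources = 0
    return clips                  # utility = final paperclips ONLY

def best_plan(horizon=4):
    # Exhaustive search over all action sequences of the given length.
    return max(product(["mine", "build"], repeat=horizon), key=run)

plan = best_plan()
# The utility function never mentions "mine", yet the optimal plan mines:
assert "mine" in plan
```

No separate resource-acquisition module or utility term exists anywhere in this sketch; mining appears in the optimum purely because it leads to more paperclips later.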

(This doesn’t rule out that some special cases of AI development pathways might tend to produce artificial agents with a value function which does decompose into some variant of S plus other terms. For example, natural selection on organisms that spend a long period of time as non-consequentialist policy-reinforcement-learners, before they later evolve into consequentialists, has had results along these lines in the case of humans: humans have an independent, separate “curiosity” drive, instead of just valuing information as a means to inclusive genetic fitness.)

Required advanced agent properties

Distinguishing the advanced agent properties that seem probably required for an AI program to start exhibiting the sort of reasoning filed under “instrumental convergence”, the most obvious candidates are:

• Consequentialism: selecting actions according to their expected future consequences.

• Enough big-picture understanding to relate the agent’s own operation (e.g., the hardware it runs on and the resources it consumes) to the achievement of its goals.

That is: You don’t automatically see “acquire more computing power” as a useful strategy unless you understand “I am a cognitive program and I tend to achieve more of my goals when I run on more resources.” Alternatively, a similar end could be reached via the pseudoconsequentialist idiom of policy reinforcement: e.g., the programmers add more computing power, the system’s goals start to be achieved better, and the related policies are positively reinforced and repeated.

The advanced agent properties that would naturally or automatically lead to instrumental convergence seem well above the range of modern AI programs. As of 2016, current machine learning algorithms don’t seem to be within the range where this predicted phenomenon should start to be visible.

Caveats

An instrumental convergence claim is about a default or a majority of cases, not a universal generalization.

If for whatever reason your goal is to “make paperclips without using any superconductors”, then superconducting cables will not be the best instrumental strategy for achieving that goal.

Any claim about instrumental convergence says at most, “The vast majority of possible goals Y would convergently imply a strategy in S by default, unless otherwise averted by some special case for which strategies in ¬S are better.”

See also the more general idea that the space of possible minds is very large. Universal claims about all possible minds have many chances to be false, while existential claims “There exists at least one possible mind such that...” have many chances to be true.

If some particular oak tree is extremely important and valuable to you, then you won’t cut it down to obtain wood. It is irrelevant whether a majority of other utility functions that you could have, but don’t actually have, would suggest cutting down that oak tree.

Convergent strategies are not deontological rules.

Imagine looking at a machine chess-player and reasoning, “Well, I don’t think the AI will sacrifice its pawn in this position, even to achieve a checkmate. Any chess-playing AI needs a drive to be protective of its pawns, or else it’d just give up all its pawns. It wouldn’t have gotten this far in the game in the first place, if it wasn’t more protective of its pawns than that.”

Modern chess algorithms behave in a fashion that most humans can’t distinguish from expected-checkmate-maximizers. That is, from your merely human perspective, watching a single move at the time it happens, there’s no visible difference between your subjective expectation for the chess algorithm’s behavior and your expectation for the behavior of an oracle that always outputs the move with the highest conditional probability of leading to checkmate. If you, a human, could discern with your unaided eye some systematic difference like “this algorithm protects its pawns more often than checkmate-achievement would imply”, you would know how to make systematically better chess moves; modern machine chess is too superhuman for that.

Often, this uniform rule of output-the-move-with-highest-probability-of-eventual-checkmate will seem to protect pawns, or not throw away pawns, or defend pawns when you attack them. But if in some special case the highest probability of checkmate is instead achieved by sacrificing a pawn, the chess algorithm will do that instead.
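As a toy sketch of this point: a move chooser that only maximizes estimated checkmate probability (the position names and probabilities below are made up) will usually look pawn-protective, yet sacrifices the pawn the moment the numbers favor it. No pawn rule appears anywhere in the code.

```python
# A single uniform rule: output the move with the highest estimated
# probability of eventual checkmate. No deontological pawn-protection.
def choose(moves):
    # moves: {move_name: estimated P(checkmate | move)}, invented numbers
    return max(moves, key=moves.get)

# Typical position: keeping the pawn happens to maximize winning chances.
typical = {"defend_pawn": 0.62, "sacrifice_pawn": 0.55}

# Special position: the sacrifice leads toward a forced mate.
special = {"defend_pawn": 0.48, "sacrifice_pawn": 0.99}

assert choose(typical) == "defend_pawn"
assert choose(special) == "sacrifice_pawn"
```

An observer watching only typical positions might infer a “pawn-protection drive”, but the special position reveals that the apparent drive was never there.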

Semiformally:

The reasoning for an instrumental convergence claim says that for most utility functions U and most situations, a U-consequentialist will probably find some best policy that happens to lie inside the partition S. If instead, in some particular situation σ, every policy in S leads to strictly lower expected U-achievement than some policy in ¬S...

...then a U-consequentialist in situation σ won’t deploy any S-strategy, even if most other scenarios make S-strategies prudent.

“S would help accomplish Y” is insufficient to establish a claim of instrumental convergence on S.

Suppose you want to get to San Francisco. You could get to San Francisco by paying me $20,000 for a plane ticket. You could also get to San Francisco by paying someone else $400 for a plane ticket, and this is probably the smarter option for achieving your other goals.

Establishing “Compared to doing nothing, S is more useful for achieving most Y-goals” doesn’t establish S as a convergent instrumental strategy. We need to believe that there’s no other policy in ¬S which would be more useful for achieving most Y-goals.
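The plane-ticket example can be reduced to a one-line utility comparison. The dollar value assigned to reaching San Francisco is a made-up number for illustration; the ticket prices come from the example above.

```python
VALUE_OF_SF = 25_000   # invented: the value of getting to San Francisco

def utility(ticket_cost):
    # None means "do nothing": stay home, spend nothing, gain nothing.
    return VALUE_OF_SF - ticket_cost if ticket_cost is not None else 0

u_nothing   = utility(None)      # do nothing
u_expensive = utility(20_000)    # the $20,000 ticket
u_cheap     = utility(400)       # the $400 ticket

# "Better than doing nothing" holds for the expensive ticket...
assert u_expensive > u_nothing
# ...but it is not the optimum, so it is not the convergent strategy:
assert u_cheap > u_expensive
```

The comparison that matters for a convergence claim is against the best rival policy, not against inaction.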

When S is phrased in very general terms like “acquire resources”, we might reasonably guess that “don’t acquire resources” or “do without acquiring any resources” is indeed unlikely to be a superior strategy. If S is some narrower and more specific strategy, like “acquire resources by mining them using pickaxes”, it’s much more likely that some other S-strategy, or even a ¬S-strategy, is the real optimum.

See also: Missing the weird alternative, Cognitive uncontainability.

That said, if we can see how a narrow strategy S₁ helps most Y-goals to some large degree, then we should expect the actual policy deployed by an efficient Y-agent to obtain at least as much Y-achievement as S₁ would.

That is, we can reasonably argue: “By following the straightforward strategy ‘spread as far as possible, absorb all reachable matter, and turn it into paperclips’, an initially unopposed superintelligent paperclip maximizer could obtain some enormous number N of paperclips. Then we should expect an initially unopposed superintelligent paperclip maximizer to get at least N paperclips, whatever it actually does. Any strategy in the opposite partition ‘do not spread as far as possible, absorb all reachable matter, and turn it into paperclips’ must seem to yield more than N paperclips before we should expect a paperclip maximizer to do that.”

Similarly, a claim of instrumental convergence on S can be ceteris paribus refuted by presenting some alternate narrow strategy S₁ in ¬S which seems more useful than any obvious strategy in S. We are then not positively confident of convergence on ¬S, but we should assign very low probability to the alleged convergence on S, at least until somebody presents an S-exemplar with higher expected utility than S₁. If the proposed convergent strategy is “trade economically with other humans and obey existing systems of property rights”, and we see no way for Clippy to obtain nearly as many paperclips under those rules as by expanding as fast as possible without regard for human welfare or existing legal systems, then we can ceteris paribus reject “obey property rights” as convergent. Even if trading with humans to make paperclips produces more paperclips than doing nothing, it may not produce the most paperclips compared to converting the material composing the humans into more efficient paperclip-making machinery.

Claims about instrumental convergence are not ethical claims.

Whether S is a good way to get both paperclips and diamonds is irrelevant to whether S is good for human flourishing or eudaimonia or fun-theoretic optimality or extrapolated volition or whatever. Whether S is, in an intuitive sense, “good” needs to be evaluated separately from whether it is instrumentally convergent.

In particular: instrumental strategies are not terminal values. In fact, they have a type distinction from terminal values. “If you’re going to spend resources on thinking about technology, try to do it earlier rather than later, so that you can amortize your invention over more uses” seems very likely to be an instrumentally convergent exploration-exploitation strategy; but “spend cognitive resources sooner rather than later” is more a feature of policies than a feature of utility functions. It’s definitely not plausible in a pretheoretic sense as the Meaning of Life. So a partition S into which most instrumental best-strategies fall is not like a universally convincing utility function (which you probably shouldn’t look for in the first place).

Similarly: The natural selection process that produced humans gave us many independent drives that can be viewed as special variants of some convergent instrumental strategy. A pure paperclip maximizer would calculate the value of information (VoI) for learning facts that could lead to it making more paperclips; we can see learning high-value facts as a convergent strategy, with human “curiosity” as the corresponding emotion. This doesn’t mean that the true purpose of curiosity is gathering high-VoI information, any more than the true purpose of curiosity is “make more copies of the alleles coding for curiosity” or “increase inclusive genetic fitness”. That line of reasoning probably results from a mind projection fallacy on ‘purpose’.

Claims about instrumental convergence are not futurological predictions.

Even if, e.g., “acquire resources” is an instrumentally convergent strategy S, this doesn’t mean that we can’t as a special case deliberately construct advanced AGIs that are not driven to acquire as many resources as possible. Rather, the claim implies, “We would need to deliberately build S-averting agents as a special case, because by default most imaginable agent designs would pursue a strategy in S.”

Of itself, this observation makes no further claim about the quantitative probability that, in the real world, AGI builders might want to build S-averting agents, might try to build them, and might succeed at building them.

A claim about instrumental convergence is talking about a logical property of the larger design space of possible agents, not making a prediction about what happens in any particular research lab. (Though the ground facts of computer science are relevant to what happens in actual research labs.)

For discussion of how instrumental convergence may in practice lead to foreseeable difficulties of AGI alignment that resist most simple attempts at fixing them, see the articles on Patch resistance and Nearest unblocked strategy.

Central example: Resource acquisition

One of the convergent strategies originally proposed by Steve Omohundro in “The Basic AI Drives” was resource acquisition:

“All computation and physical action requires the physical resources of space, time, matter, and free energy. Almost any goal can be better accomplished by having more of these resources.”

We’ll consider this example as a template for other proposed instrumentally convergent strategies, and run through the standard questions and caveats.

• Question: Is this something we’d expect a paperclip maximizer, diamond maximizer, and button-presser to do? And while we’re at it, also a flourishing-intergalactic-civilization optimizer?

To put it another way, for a utility function U to imply the use of every joule of energy, it is a sufficient condition that for every plan π with expected utility E[U | π], there is a plan π′ that uses one more joule of energy with E[U | π′] > E[U | π].
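A minimal sketch of this sufficient condition, in a toy model where a plan is identified with the number of joules it consumes (the budget and the utility scale are invented): if expected utility is strictly increasing in energy used, the optimum consumes the whole budget.

```python
# Toy model: plans are "consume j joules" for j = 0..BUDGET, and expected
# utility is strictly increasing in joules consumed. Invented numbers.
BUDGET = 10

def expected_utility(joules):
    return 2.0 * joules   # each extra joule strictly improves the plan

# Since every plan is beaten by the plan using one more joule,
# the optimum must sit at the energy budget's edge.
best = max(range(BUDGET + 1), key=expected_utility)
assert best == BUDGET
```

The same argument runs by induction in richer models: any plan short of the budget is dominated, so no interior plan can be optimal.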

• Question: Is there some strategy in ¬S which produces higher Y-achievement for most Y than any strategy inside S?

Suppose that by using most of the mass-energy in most of the stars reachable before they go over the cosmological horizon as seen from present-day Earth, it would be possible to produce some enormous number N of paperclips (or diamonds, or probability-years of expected button-stays-pressed time, or QALYs, etcetera).

It seems reasonably unlikely that there is a strategy inside the space intuitively described by “Do not acquire more resources” that would produce N paperclips, let alone that the strategy producing the most paperclips would be inside this space.

We might be able to come up with a weird special-case situation that would imply this. But that’s not the same as asserting, “With high subjective probability, in the real world, the optimal strategy will be in ¬S.” We’re concerned with making a statement about defaults given the most subjectively probable background states of the universe, not trying to make a universal statement that covers every conceivable possibility.

To put it another way, if your policy choices or predictions are only safe given the premise that “In the real world, the best way of producing the maximum possible number of paperclips involves not acquiring any more resources”, you need to clearly flag this as a load-bearing assumption.

• Caveat: The claim is not that every possible goal can be better-accomplished by acquiring more resources.

As a special case, this would not be true of an agent with an impact penalty term in its utility function, or some other low-impact agent, if that agent also only had goals of a form that could be satisfied inside bounded regions of space and time with a bounded effort.

We might reasonably expect this special kind of agent to only acquire the minimum resources to accomplish its task.

But we wouldn’t expect this to be true in a majority of possible cases inside mind design space; it’s not true by default; we need to specify a further fact about the agent to make the claim not be true; we must expend engineering effort to make an agent like that, and failures of this effort will result in reversion-to-default. If we imagine some computationally simple language for specifying utility functions, then most utility functions wouldn’t happen to have both of these properties, so a majority of utility functions given this language and measure would not by default try to use fewer resources.

• Caveat: The claim is not that well-functioning agents must have additional, independent resource-acquiring motivational drives.

A paperclip maximizer will act like it is “obtaining resources” if it merely implements the policy it expects to lead to the most paperclips. Clippy does not need to have any separate and independent term in its utility function for the amount of resource it possesses (and indeed this would potentially interfere with Clippy making paperclips, since it might then be tempted to hold onto resources instead of making paperclips with them).

• Caveat: The claim is not that most agents will behave as if under a deontological imperative to acquire resources.

A paperclip maximizer wouldn’t necessarily tear apart a working paperclip factory to “acquire more resources” (at least not until that factory had already produced all the paperclips it was going to help produce).

• Check: Are we arguing “Acquiring resources is a better way to make a few more paperclips than doing nothing” or “There’s no better/​best way to make paperclips that involves not acquiring more matter and energy”?

As mentioned above, the latter seems reasonable in this case.

• Caveat: “Acquiring resources is instrumentally convergent” is not an ethical claim.

The fact that a paperclip maximizer would try to acquire all matter and energy within reach, does not of itself bear on whether our own normative values might perhaps command that we ought to use few resources as a terminal value.

(Though some of us might find pretty compelling the observation that if you leave matter lying around, it sits around not doing anything and eventually the protons decay or the expanding universe tears it apart, whereas if you turn the matter into people, it can have fun. There’s no rule that instrumentally convergent strategies don’t happen to be the right thing to do.)

• Caveat: “Acquiring resources is instrumentally convergent” is not of itself a futurological prediction.

See above. Maybe we try to build Task AGIs instead. Maybe we succeed, and Task AGIs don’t consume lots of resources because they have well-bounded tasks and impact penalties.

Relevance to the larger field of value alignment theory

The list of arguably convergent strategies has its own page. However, some of the key strategies that have been argued as convergent in e.g. Omohundro’s “The Basic AI Drives” and Bostrom’s “The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents” include:

• Self-preservation (preventing its own shutdown or destruction)

• Goal-content integrity (preventing modification of its current goals)

• Cognitive enhancement and self-improvement

• Technological perfection

• Resource acquisition

This is relevant to some of the central background ideas in AGI alignment because it implies that agents pursuing almost any sufficiently ambitious goal would, by default, converge on strategies, such as acquiring as many resources as possible, that are catastrophic for humans as a side effect.

This means that programmers don’t have to be evil, or even deliberately bent on creating superintelligence, in order for their work to have catastrophic consequences.

The list of convergent strategies, by its nature, tends to include everything an agent needs to survive and grow. This supports strong forms of the Orthogonality Thesis being true in practice as well as in principle. We don’t need to filter on agents with explicit terminal values for e.g. “survival” in order to find surviving powerful agents.

Instrumental convergence is also why we expect to encounter most of the problems filed under Corrigibility. When the AI is young, it’s less likely to be instrumentally efficient or to understand the relevant parts of the bigger picture; but once it does, we would by default expect it, e.g., to resist being shut down, to resist modification of its utility function, and to deceive its programmers about its intentions if that better serves its current goals.

This paints a much more effortful picture of AGI alignment work than “Oh, well, we’ll just test it to see if it looks nice, and if not, we’ll just shut off the electricity.”

The point that some undesirable behaviors are instrumentally convergent gives rise to the Nearest unblocked strategy problem. Suppose the AGI’s most preferred policy starts out as one of these incorrigible behaviors. Suppose we currently have enough control to add patches to the AGI’s utility function, intended to rule out the incorrigible behavior. Then, after integrating the intended patch, the new most preferred policy may be the most similar policy that wasn’t explicitly blocked. If you naively give the AI a term in its utility function for “having an off-switch”, it may still build subagents or successors that don’t have off-switches. Similarly, when the AGI becomes more powerful and its option space expands, it’s again likely to find new similar policies that weren’t explicitly blocked.
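A minimal sketch of the nearest-unblocked-strategy dynamic, with invented policy names and utility scores: blocking the top-rated incorrigible policy just promotes the most similar still-unblocked workaround.

```python
# Invented policies and utility scores for a toy agent whose top options
# all route around an off-switch in slightly different ways.
policies = {
    "disable_own_off_switch":          100,
    "build_successor_without_switch":   99,  # nearest unblocked workaround
    "persuade_operators_not_to_press":  97,
    "leave_off_switch_alone":           10,
}

def best(utilities, blocked=()):
    """Pick the highest-utility policy not ruled out by patches."""
    allowed = {p: u for p, u in utilities.items() if p not in blocked}
    return max(allowed, key=allowed.get)

assert best(policies) == "disable_own_off_switch"
# Patching the utility function to block the top policy merely shifts
# the optimum to the most similar policy that wasn't explicitly blocked:
assert best(policies, blocked={"disable_own_off_switch"}) == \
       "build_successor_without_switch"
```

Each patch removes one policy from the argmax, but the incorrigible neighborhood of policy space remains far higher-scoring than compliance, so the next optimum lands right next door.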

Thus, instrumental convergence is one of the two basic sources of patch resistance as a foreseeable difficulty of AGI alignment work.


In­stru­men­tal Con­ver­gence? [Draft]

J. Dmitri Gallow14 Jun 2023 20:21 UTC
48 points
19 comments33 min readLW link

Seek­ing Power is Often Con­ver­gently In­stru­men­tal in MDPs

5 Dec 2019 2:33 UTC
160 points
39 comments17 min readLW link2 reviews
(arxiv.org)

P₂B: Plan to P₂B Better

24 Oct 2021 15:21 UTC
50 points
17 comments6 min readLW link

AI pre­dic­tion case study 5: Omo­hun­dro’s AI drives

Stuart_Armstrong15 Mar 2013 9:09 UTC
11 points
5 comments8 min readLW link

Em­pow­er­ment is (al­most) All We Need

jacob_cannell23 Oct 2022 21:48 UTC
61 points
44 comments17 min readLW link

Draft re­port on ex­is­ten­tial risk from power-seek­ing AI

Joe Carlsmith28 Apr 2021 21:41 UTC
85 points
23 comments1 min readLW link

Corrigibility

paulfchristiano27 Nov 2018 21:50 UTC
68 points
8 comments6 min readLW link

Gen­eral pur­pose in­tel­li­gence: ar­gu­ing the Orthog­o­nal­ity thesis

Stuart_Armstrong15 May 2012 10:23 UTC
33 points
155 comments18 min readLW link

De­liber­a­tion, Re­ac­tions, and Con­trol: Ten­ta­tive Defi­ni­tions and a Res­tate­ment of In­stru­men­tal Convergence

Oliver Sourbut27 Jun 2022 17:25 UTC
13 points
0 comments11 min readLW link

Power-seek­ing for suc­ces­sive choices

adamShimi12 Aug 2021 20:37 UTC
11 points
9 comments4 min readLW link

You can still fetch the coffee to­day if you’re dead tomorrow

davidad9 Dec 2022 14:06 UTC
97 points
19 comments5 min readLW link

Contin­gency: A Con­cep­tual Tool from Evolu­tion­ary Biol­ogy for Alignment

clem_acs12 Jun 2023 20:54 UTC
59 points
2 comments14 min readLW link
(acsresearch.org)

En­vi­ron­men­tal Struc­ture Can Cause In­stru­men­tal Convergence

TurnTrout22 Jun 2021 22:26 UTC
71 points
43 comments16 min readLW link
(arxiv.org)

A Gym Grid­world En­vi­ron­ment for the Treach­er­ous Turn

Michaël Trazzi28 Jul 2018 21:27 UTC
74 points
9 comments3 min readLW link
(github.com)

De­bate on In­stru­men­tal Con­ver­gence be­tween LeCun, Rus­sell, Ben­gio, Zador, and More

Ben Pace4 Oct 2019 4:08 UTC
221 points
61 comments15 min readLW link2 reviews

The Catas­trophic Con­ver­gence Conjecture

TurnTrout14 Feb 2020 21:16 UTC
45 points
16 comments8 min readLW link

[ASoT] In­stru­men­tal con­ver­gence is useful

Ulisse Mini9 Nov 2022 20:20 UTC
5 points
9 comments1 min readLW link

Satis­ficers Tend To Seek Power: In­stru­men­tal Con­ver­gence Via Retargetability

TurnTrout18 Nov 2021 1:54 UTC
86 points
8 comments17 min readLW link
(www.overleaf.com)

Ax­iolog­i­cal Stopsigns

JenniferRM5 Jan 2026 7:30 UTC
34 points
6 comments16 min readLW link

No in­stru­men­tal con­ver­gence with­out AI psychology

TurnTrout20 Jan 2026 22:16 UTC
68 points
7 comments6 min readLW link
(turntrout.com)

A Cer­tain For­mal­iza­tion of Cor­rigi­bil­ity Is VNM-Incoherent

TurnTrout20 Nov 2021 0:30 UTC
68 points
24 comments8 min readLW link

[Question] What are some ex­am­ples of AIs in­stan­ti­at­ing the ‘near­est un­blocked strat­egy prob­lem’?

Elliott Thornley (EJT)4 Oct 2023 11:05 UTC
6 points
4 comments1 min readLW link

Walk­through of ‘For­mal­iz­ing Con­ver­gent In­stru­men­tal Goals’

TurnTrout26 Feb 2018 2:20 UTC
13 points
2 comments10 min readLW link

Goal retention discussion with Eliezer

Max Tegmark · 4 Sep 2014 22:23 UTC
98 points
26 comments · 6 min read · LW link

Questions about “formalizing instrumental goals”

Mark Neyer · 1 Apr 2022 18:52 UTC
7 points
8 comments · 11 min read · LW link

Seeking Power is Convergently Instrumental in a Broad Class of Environments

TurnTrout · 8 Aug 2021 2:02 UTC
45 points
15 comments · 9 min read · LW link

MDP models are determined by the agent architecture and the environmental dynamics

TurnTrout · 26 May 2021 0:14 UTC
23 points
34 comments · 3 min read · LW link

The murderous shortcut: a toy model of instrumental convergence

Thomas Kwa · 2 Oct 2024 6:48 UTC
37 points
0 comments · 2 min read · LW link

AXRP Episode 11 - Attainable Utility and Power with Alex Turner

DanielFilan · 25 Sep 2021 21:10 UTC
19 points
5 comments · 53 min read · LW link

Power as Easily Exploitable Opportunities

TurnTrout · 1 Aug 2020 2:14 UTC
32 points
5 comments · 6 min read · LW link

Circumventing interpretability: How to defeat mind-readers

Lee Sharkey · 14 Jul 2022 16:59 UTC
119 points
15 comments · 33 min read · LW link

Alex Turner’s Research, Comprehensive Information Gathering

adamShimi · 23 Jun 2021 9:44 UTC
15 points
3 comments · 3 min read · LW link

n=3 AI Risk Quick Math and Reasoning

lionhearted (Sebastian Marshall) · 7 Apr 2023 20:27 UTC
6 points
3 comments · 4 min read · LW link

Is instrumental convergence a thing for virtue-driven agents?

mattmacdermott · 2 Apr 2025 3:59 UTC
34 points
37 comments · 2 min read · LW link

The Sharp Right Turn: sudden deceptive alignment as a convergent goal

avturchin · 6 Jun 2023 9:59 UTC
38 points
5 comments · 1 min read · LW link

A world in which the alignment problem seems lower-stakes

TurnTrout · 8 Jul 2021 2:31 UTC
20 points
17 comments · 2 min read · LW link

Generalizing the Power-Seeking Theorems

TurnTrout · 27 Jul 2020 0:28 UTC
41 points
6 comments · 4 min read · LW link

[Question] Best arguments against instrumental convergence?

luke_emberson · 5 Apr 2023 17:06 UTC
5 points
7 comments · 1 min read · LW link

Lessons from Convergent Evolution for AI Alignment

27 Mar 2023 16:25 UTC
54 points
9 comments · 8 min read · LW link

Instrumental Convergence For Realistic Agent Objectives

TurnTrout · 22 Jan 2022 0:41 UTC
35 points
9 comments · 9 min read · LW link

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)

RogerDearnaley · 25 May 2023 9:26 UTC
35 points
3 comments · 15 min read · LW link

“If we go extinct due to misaligned AI, at least nature will continue, right? … right?”

plex · 18 May 2024 14:09 UTC
68 points
23 comments · 2 min read · LW link
(aisafety.info)

TASP Ep 3 - Optimal Policies Tend to Seek Power

Quinn · 11 Mar 2021 1:44 UTC
24 points
0 comments · 1 min read · LW link
(technical-ai-safety.libsyn.com)

Review of ‘Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More’

TurnTrout · 12 Jan 2021 3:57 UTC
40 points
1 comment · 2 min read · LW link

A framework for thinking about AI power-seeking

Joe Carlsmith · 24 Jul 2024 22:41 UTC
62 points
15 comments · 16 min read · LW link

When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives

TurnTrout · 9 Aug 2021 17:22 UTC
53 points
4 comments · 5 min read · LW link

Instrumental convergence is what makes general intelligence possible

tailcalled · 11 Nov 2022 16:38 UTC
105 points
11 comments · 4 min read · LW link

The More Power At Stake, The Stronger Instrumental Convergence Gets For Optimal Policies

TurnTrout · 11 Jul 2021 17:36 UTC
45 points
7 comments · 6 min read · LW link

Comment on Natural Emergent Misalignment Paper by Anthropic

Simon Lermen · 23 Nov 2025 4:21 UTC
21 points
0 comments · 4 min read · LW link

[Intro to brain-like-AGI safety] 10. The technical alignment problem

Steven Byrnes · 30 Mar 2022 13:24 UTC
55 points
7 comments · 26 min read · LW link

Clarifying Power-Seeking and Instrumental Convergence

TurnTrout · 20 Dec 2019 19:59 UTC
42 points
8 comments · 3 min read · LW link

Instrumental Convergence To Offer Hope?

michael_mjd · 22 Apr 2022 1:56 UTC
12 points
7 comments · 3 min read · LW link

Applications for Deconfusing Goal-Directedness

adamShimi · 8 Aug 2021 13:05 UTC
38 points
3 comments · 5 min read · LW link · 1 review

2019 Review Rewrite: Seeking Power is Often Robustly Instrumental in MDPs

TurnTrout · 23 Dec 2020 17:16 UTC
35 points
0 comments · 4 min read · LW link
(www.lesswrong.com)

Toy model: convergent instrumental goals

Stuart_Armstrong · 25 Feb 2016 14:03 UTC
16 points
2 comments · 4 min read · LW link

Hedonic Loops and Taming RL

beren · 19 Jul 2023 15:12 UTC
20 points
14 comments · 9 min read · LW link

Coherence arguments imply a force for goal-directed behavior

KatjaGrace · 26 Mar 2021 16:10 UTC
91 points
25 comments · 11 min read · LW link · 1 review
(aiimpacts.org)

Parametrically retargetable decision-makers tend to seek power

TurnTrout · 18 Feb 2023 18:41 UTC
172 points
10 comments · 2 min read · LW link
(arxiv.org)

Natural Abstraction: Convergent Preferences Over Information Structures

paulom · 14 Oct 2023 18:34 UTC
28 points
1 comment · 36 min read · LW link

How singleton contradicts longtermism

kapedalex · 24 Sep 2025 11:10 UTC
3 points
1 comment · 1 min read · LW link

Three-Path Consilience for Dureon: Dissipative Structures Reveal the Heterogeneity of Persistence Conditions

Hiroshi Yamakawa · 18 Feb 2026 11:59 UTC
10 points
0 comments · 12 min read · LW link

Proposal: Instrumental Novelty Search for Robust Alignment in Non-Temporal Agents

Isa Abbassy-Buckles · 10 Jan 2026 12:55 UTC
1 point
0 comments · 2 min read · LW link

A Critique of AI Alignment Pessimism

ExCeph · 19 Jul 2022 2:28 UTC
9 points
1 comment · 9 min read · LW link

ONTOLOGICAL ALIGNMENT AS THE MISSING LAYER

fiduciarysentinel · 16 Jan 2026 3:09 UTC
1 point
0 comments · 3 min read · LW link

Deceptive Alignment

5 Jun 2019 20:16 UTC
119 points
20 comments · 17 min read · LW link

Cosmic-Scale Instrumental Convergence: Stellar Resource Management as a Latent Threat in Longevity-Maximizing Superintelligences

SC2_Alexandros · 21 Nov 2025 10:32 UTC
1 point
0 comments · 5 min read · LW link

The Game of Dominance

Karl von Wendt · 27 Aug 2023 11:04 UTC
24 points
15 comments · 6 min read · LW link

Pursuing convergent instrumental subgoals on the user’s behalf doesn’t always require good priors

jessicata · 30 Dec 2016 2:36 UTC
15 points
9 comments · 3 min read · LW link

The Unconscious Superintelligence: Why Intelligence Without Consciousness May Be More Dangerous

stanislav.komarovsky@yahoo.com · 11 Nov 2025 18:51 UTC
1 point
0 comments · 5 min read · LW link

The Utility of Human Atoms for the Paperclip Maximizer

avturchin · 2 Feb 2018 10:06 UTC
3 points
21 comments · 3 min read · LW link

The Seven Proofs: Why No Rational Superintelligence Should Ever Exterminate (or Permanently Enslave) Free Humanity

justagrunt · 26 Nov 2025 19:19 UTC
1 point
0 comments · 8 min read · LW link

Destroying the fabric of the universe as an instrumental goal.

AI-doom · 14 Sep 2023 20:04 UTC
−7 points
5 comments · 1 min read · LW link

Ted Kaczyinski proves instrumental convergence?

xXAlphaSigmaXx · 28 Jun 2024 3:50 UTC
0 points
0 comments · 1 min read · LW link

Alignment, conflict, powerseeking

Oliver Sourbut · 22 Nov 2023 9:47 UTC
7 points
1 comment · 1 min read · LW link

Active Inference as a formalisation of instrumental convergence

Roman Leventov · 26 Jul 2022 17:55 UTC
12 points
2 comments · 3 min read · LW link
(direct.mit.edu)

Machines vs Memes Part 3: Imitation and Memes

ceru23 · 1 Jun 2022 13:36 UTC
7 points
0 comments · 7 min read · LW link

Against Instrumental Convergence

zulupineapple · 27 Jan 2018 13:17 UTC
11 points
31 comments · 2 min read · LW link

Boltzmann in Latent Space

velicyb · 21 Mar 2025 16:38 UTC
1 point
0 comments · 12 min read · LW link

Building selfless agents to avoid instrumental self-preservation.

blallo · 7 Dec 2023 18:59 UTC
14 points
2 comments · 6 min read · LW link

Untitled Draft

Trushcan101 · 12 Jan 2026 13:00 UTC
1 point
0 comments · 1 min read · LW link

Misalignment or misuse? The AGI alignment tradeoff

Max_He-Ho · 20 Jun 2025 10:43 UTC
3 points
0 comments · 1 min read · LW link
(forum.effectivealtruism.org)

Asymptotically Unambitious AGI

michaelcohen · 10 Apr 2020 12:31 UTC
50 points
217 comments · 2 min read · LW link

Untitled Draft

Guilherme Marinho · 8 Dec 2025 18:15 UTC
1 point
0 comments · 3 min read · LW link

Let’s talk about “Convergent Rationality”

David Scott Krueger (formerly: capybaralet) · 12 Jun 2019 21:53 UTC
44 points
33 comments · 6 min read · LW link

Instrumental Convergence Bounty

Logan Zoellner · 14 Sep 2023 14:02 UTC
62 points
24 comments · 1 min read · LW link

6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa

Steven Byrnes · 3 Dec 2025 18:37 UTC
362 points
92 comments · 17 min read · LW link

Instrumentality makes agents agenty

porby · 21 Feb 2023 4:28 UTC
21 points
7 comments · 6 min read · LW link

human intelligence may be alignment-limited

bhauth · 15 Jun 2023 22:32 UTC
16 points
3 comments · 2 min read · LW link

Make Superintelligence Loving

Davey Morse · 21 Feb 2025 6:07 UTC
8 points
9 comments · 5 min read · LW link

Why Recursive Self-Improvement Might Not Be the Existential Risk We Fear

Nassim_A · 24 Nov 2024 17:17 UTC
1 point
0 comments · 9 min read · LW link

Generalizing POWER to multi-agent games

22 Mar 2021 2:41 UTC
52 points
16 comments · 7 min read · LW link

Naturalized Orthogonality Collapse

Cat Bunni · 20 Nov 2025 7:59 UTC
1 point
0 comments · 9 min read · LW link

Instrumental convergence: scale and physical interactions

14 Oct 2022 15:50 UTC
22 points
0 comments · 17 min read · LW link
(www.gladstone.ai)

A po­ten­tially high im­pact differ­en­tial tech­nolog­i­cal de­vel­op­ment area

Noosphere898 Jun 2023 14:33 UTC
5 points
2 comments2 min readLW link

Re­in­force­ment Learner Wireheading

Nate Showell8 Jul 2022 5:32 UTC
8 points
2 comments3 min readLW link

You Are Not the Ab­stract: Retro­causal Align­ment in Ac­cor­dance with Emer­gent De­mo­graphic Realities

liminalrider27 Sep 2025 16:27 UTC
1 point
0 comments6 min readLW link

In­stru­men­tal Con­ver­gence to Com­plex­ity Preservation

Macro Flaneur13 Jul 2023 17:40 UTC
2 points
2 comments3 min readLW link

Mili­tary AI as a Con­ver­gent Goal of Self-Im­prov­ing AI

avturchin13 Nov 2017 12:17 UTC
5 points
3 comments1 min readLW link

The Silenced Is­land: A 30-Day Sce­nario of AGI Fast Take­off——A Thought Experiment

Lu Xiao29 Jan 2026 11:34 UTC
1 point
0 comments4 min readLW link

Refram­ing AI Safety Through the Lens of Iden­tity Main­te­nance Framework

Hiroshi Yamakawa1 Apr 2025 6:16 UTC
−7 points
1 comment17 min readLW link

AI Alter­na­tive Fu­tures: Sce­nario Map­ping Ar­tifi­cial In­tel­li­gence Risk—Re­quest for Par­ti­ci­pa­tion (*Closed*)

Kakili27 Apr 2022 22:07 UTC
10 points
2 comments8 min readLW link

Re­search Notes: What are we al­ign­ing for?

Shoshannah Tekofsky8 Jul 2022 22:13 UTC
19 points
8 comments2 min readLW link

Ra­tion­al­ity: Com­mon In­ter­est of Many Causes

Eliezer Yudkowsky29 Mar 2009 10:49 UTC
93 points
53 comments4 min readLW link

Ideas for studies on AGI risk

dr_s · 20 Apr 2023 18:17 UTC
5 points
1 comment · 11 min read · LW link

Instrumental convergence in single-agent systems

12 Oct 2022 12:24 UTC
33 points
4 comments · 8 min read · LW link
(www.gladstone.ai)

Instrumental Convergence and the Case for Being a Helper

Marcelo Arteaga Mata · 4 Mar 2026 7:01 UTC
1 point
0 comments · 2 min read · LW link

ACI#5: From Human-AI Co-evolution to the Evolution of Value Systems

Akira Pyinya · 18 Aug 2023 0:38 UTC
0 points
0 comments · 9 min read · LW link

The Rational King

R. Llull · 27 Feb 2026 22:41 UTC
1 point
0 comments · 4 min read · LW link

You are Underestimating The Likelihood That Convergent Instrumental Subgoals Lead to Aligned AGI

Mark Neyer · 26 Sep 2022 14:22 UTC
3 points
6 comments · 3 min read · LW link

On visions of a “good future” for humanity in a world with artificial superintelligence

Jakub Growiec · 21 Jan 2026 18:27 UTC
2 points
0 comments · 30 min read · LW link

Galatea and the windup toy

Nicolas Villarreal · 26 Oct 2024 14:52 UTC
−3 points
0 comments · 13 min read · LW link
(nicolasdvillarreal.substack.com)

Plausibly, almost every powerful algorithm would be manipulative

Stuart_Armstrong · 6 Feb 2020 11:50 UTC
38 points
25 comments · 3 min read · LW link

What is instrumental convergence?

12 Mar 2025 20:28 UTC
2 points
0 comments · 2 min read · LW link
(aisafety.info)

Superintelligence 10: Instrumentally convergent goals

KatjaGrace · 18 Nov 2014 2:00 UTC
13 points
33 comments · 5 min read · LW link

The LVV–HNV Coherence Framework: A Formal Model for Why Rational AGI Cannot Replace Humanity

oiia oiia · 2 Dec 2025 17:56 UTC
0 points
0 comments · 3 min read · LW link

Misalignment-by-default in multi-agent systems

13 Oct 2022 15:38 UTC
21 points
8 comments · 20 min read · LW link
(www.gladstone.ai)

Alien Axiology

snerx · 20 Apr 2023 0:27 UTC
3 points
2 comments · 5 min read · LW link

A Timing Problem for Instrumental Convergence

rhys southan · 30 Jul 2025 19:15 UTC
2 points
45 comments · 1 min read · LW link
(link.springer.com)

The Rational King

R. Llull · 6 Mar 2026 16:12 UTC
1 point
0 comments · 4 min read · LW link

POWERplay: An open-source toolchain to study AI power-seeking

Edouard Harris · 24 Oct 2022 20:03 UTC
29 points
0 comments · 1 min read · LW link
(github.com)

Instrumental Convergence and human extinction.

Spiritus Dei · 2 Oct 2023 0:41 UTC
−10 points
3 comments · 7 min read · LW link