
Instrumental convergence

Last edit: 19 Feb 2025 21:44 UTC by RobertM


Introduction: A machine of unknown purpose

Suppose you landed on a distant planet and found a structure of giant metal pipes, crossed by occasional cables. Further investigation shows that the cables are electrical superconductors carrying high-voltage currents.

You might not know what the structure did. But you would nonetheless guess that it had been built by some intelligence, rather than being a naturally occurring mineral formation—that there were aliens who built the structure for some purpose.

Your reasoning might go something like this: “Well, I don’t know if the aliens were trying to manufacture cars, or build computers, or what. But if you consider the problem of efficient manufacturing, it might involve mining resources in one place and then efficiently transporting them somewhere else, like by pipes. Since the most efficient size and location of these pipes would be stable, you’d want the shape of the pipes to be stable, which you could do by making the pipes out of a hard material like metal. There’s all sorts of operations that require energy or negentropy, and a superconducting cable carrying electricity seems like an efficient way of transporting that energy. So I don’t know what the aliens were ultimately trying to do, but across a very wide range of possible goals, an intelligent alien might want to build a superconducting cable to pursue that goal.”

That is: We can take an enormous variety of compactly specifiable goals, like “travel to the other side of the universe” or “support biological life” or “make paperclips”, and find very similar optimal strategies along the way. Today we don’t actually know if electrical superconductors are the most useful way to transport energy in the limit of technology. But whatever is the most efficient way of transporting energy, whether that’s electrical superconductors or something else, the most efficient form of that technology would probably not vary much depending on whether you were trying to make diamonds or make paperclips.

Or to put it another way: If you consider the goals “make diamonds” and “make paperclips”, then they might have almost nothing in common with respect to their end-states—a diamond might contain no iron. But the earlier strategies used to make a lot of diamond and make a lot of paperclips might have much in common; “the best way of transporting energy to make diamond” and “the best way of transporting energy to make paperclips” are much more likely to be similar.

From a Bayesian standpoint this is how we can identify a huge machine strung with superconducting cables as having been produced by high-technology aliens, even before we have any idea of what the machine does. We’re saying, “This looks like the product of optimization, a strategy S that the aliens chose to best achieve some unknown goal Y; we can infer this even without knowing Y, because many possible goals Y would concentrate probability onto this strategy S being used.”

Convergence and its caveats

When you select a policy π because you expect it to achieve a later state Y (the “goal”), we say that π is your instrumental strategy for achieving Y. The observation of “instrumental convergence” is that a widely different range of Y-goals can lead to highly similar π-strategies. (This becomes truer as the Y-seeking agent becomes more instrumentally efficient; two very powerful chess engines are more likely to solve a humanly solvable chess problem the same way, compared to two weak chess engines whose individual quirks might result in idiosyncratic solutions.)

If there’s a simple way of classifying possible strategies into partitions S and ¬S, and you think that for most compactly describable goals the corresponding best policies are likely to lie inside S, then you think S is a “convergent instrumental strategy”.

In other words, if you think that a superintelligent paperclip maximizer, diamond maximizer, a superintelligence that just wanted to keep a single button pressed for as long as possible, and a superintelligence optimizing for a flourishing intergalactic civilization filled with happy sapient beings, would all want to “transport matter and energy efficiently” in order to achieve their other goals, then you think “transport matter and energy efficiently” is a convergent instrumental strategy.

In this case “paperclips”, “diamonds”, “keeping a button pressed as long as possible”, and “sapient beings having fun” would be the goals Y. The corresponding best strategies for achieving these goals would not be identical—the policies for making paperclips and diamonds are not exactly the same. But all of these policies (we think) would lie within the partition S where the superintelligence tries to “transport matter and energy efficiently” (perhaps by using superconducting cables), rather than the complementary partition ¬S where the superintelligence does not try to transport matter and energy efficiently.

Semiformalization

If, given our beliefs about our universe and about which policies lead to which real outcomes, we think that in an intuitive sense it sure looks like at least 90% of the utility functions U ought to imply best findable policies which lie within the partition S, we’ll allege that S is “instrumentally convergent”.
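The “at least 90% of utility functions” criterion can be sketched as a Monte Carlo check over a toy world. Everything below is invented for illustration (the strategy names, the RESOURCES numbers, the way random goals are sampled); it shows only the shape of the test, not a real measure over utility functions.

```python
import random

# Toy world: the agent picks one of four strategies. Two of them route
# through "efficient energy transport" (the partition S); two do not.
STRATEGIES = ["superconductors", "pipes", "no_transport_a", "no_transport_b"]
IN_S = {"superconductors", "pipes"}

# Hypothetical model: strategies in S deliver more raw resources, and a
# random goal's achievement scales with resources delivered.
RESOURCES = {"superconductors": 10.0, "pipes": 8.0,
             "no_transport_a": 2.0, "no_transport_b": 1.0}

def random_goal(rng):
    """Sample a 'compactly specifiable goal' as a random positive weight
    on resources plus goal-specific noise per strategy."""
    weight = rng.uniform(0.5, 2.0)
    noise = {s: rng.uniform(0.0, 8.0) for s in STRATEGIES}
    return lambda s: weight * RESOURCES[s] + noise[s]

def best_strategy(utility):
    return max(STRATEGIES, key=utility)

def fraction_converging(n=10_000, seed=0):
    rng = random.Random(seed)
    hits = sum(best_strategy(random_goal(rng)) in IN_S for _ in range(n))
    return hits / n

if __name__ == "__main__":
    print(f"fraction of random goals whose best strategy lies in S: "
          f"{fraction_converging():.3f}")
```

Despite every sampled goal caring about something different, nearly all of them are best served by a strategy inside S, which is exactly the kind of statistic the semiformalization gestures at.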

Compatibility with Vingean uncertainty

Vingean uncertainty is the observation that, as we become increasingly confident of increasingly powerful intelligence from an agent with precisely known goals, we become decreasingly confident of the exact moves it will make (unless the domain has an optimal strategy and we know the exact strategy). E.g., to know exactly where Deep Blue would move on a chessboard, you would have to be as good at chess as Deep Blue. However, we can become increasingly confident that more powerful chessplayers will eventually win the game—that is, steer the future outcome of the chessboard into the set of states designated ‘winning’ for their color—even as it becomes less possible for us to be certain about the chessplayer’s exact policy.

Instrumental convergence can be seen as a caveat to Vingean uncertainty: Even if we don’t know the exact actions or the exact end goal, we may be able to predict that some intervening states or policies will fall into certain abstract categories.

That is: If we don’t know whether a superintelligent agent is a paperclip maximizer or a diamond maximizer, we can still guess with some confidence that it will pursue a strategy in the general class “obtain more resources of matter, energy, and computation” rather than “don’t get more resources”. This is true even though Vinge’s Principle says that we won’t be able to predict exactly how the superintelligence will go about gathering matter and energy.

Imagine the real world as an extremely complicated game. Suppose that at the very start of this game, a highly capable player must make a single binary choice between the abstract moves “Gather more resources later” and “Never gather any more resources later”. Vingean uncertainty or not, we seem justified in putting a high probability on the first move being preferred—a binary choice is simple enough that we can take a good guess at the optimal play.

Convergence supervenes on consequentialism

S being “instrumentally convergent” doesn’t mean that every mind needs an extra, independent drive to do S.

Consider the following line of reasoning: “It’s impossible to get on an airplane without buying plane tickets. So anyone on an airplane must be a sort of person who enjoys buying plane tickets. If I offer them a plane ticket they’ll probably buy it, because this is almost certainly somebody who has an independent motivational drive to buy plane tickets. There’s just no way you can design an organism that ends up on an airplane unless it has a buying-tickets drive.”

The appearance of an “instrumental strategy” can be seen as implicit in a consequentialist agent repeatedly choosing whichever actions lead into its final state Y, where it so happens that those actions lie in S. There doesn’t have to be a special S-module which repeatedly selects S-actions regardless of whether or not they lead to Y.

The flaw in the argument about plane tickets is that human beings are consequentialists who buy plane tickets just because they wanted to go somewhere and they expected the action “buy the plane ticket” to have the consequence, in that particular case, of going to the particular place and time they wanted to go. No extra “buy the plane ticket” module is required, and especially not a plane-ticket-buyer that doesn’t check whether there’s any travel goal and whether buying the plane ticket leads into the desired later state.

More semiformally, suppose that U is the utility function of an agent and let π₀ be the policy it selects. If the agent is instrumentally efficient relative to us at achieving U, then from our perspective we can mostly reason about whatever kind of optimization it does as if it were expected utility maximization, i.e.:

π₀ ≈ argmax_π E[U | π]

When we say that S is instrumentally convergent, we are stating that for most utility functions U it probably so happens that:

argmax_π E[U | π] ∈ S

We are not making any claims along the lines that, for an agent to thrive, its utility function must decompose into a term for S plus a residual term V denoting the rest of the utility function. Rather, the S-behavior is the mere result of unbiased optimization for a goal U that makes no explicit mention of S.
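This point, that S-behavior falls out of plain goal optimization with no S-term, can be illustrated with a toy planning problem. The world model, action names, and horizon below are all invented; the utility function scores only final paperclips, yet the optimal plan “acquires resources”.

```python
from itertools import product

# Hypothetical world model: the agent chooses a sequence of actions.
def run(plan):
    resources, clips = 0, 0
    for action in plan:
        if action == "mine":      # acquire resources
            resources += 3
        elif action == "build":   # spend all resources on paperclips
            clips += resources
            resources = 0
    return clips                  # utility = final paperclips ONLY

def best_plan(horizon=4):
    # Exhaustive search over all action sequences of the given length.
    return max(product(["mine", "build"], repeat=horizon), key=run)

plan = best_plan()
# The utility function never mentions "mine", yet the optimal plan mines:
assert "mine" in plan
```

No separate resource-acquisition module or utility term exists anywhere in this sketch; mining appears in the optimum purely because it leads to more paperclips later.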

(This doesn’t rule out that some special cases of AI development pathways might tend to produce artificial agents with a value function which does decompose into some variant of S plus other terms. For example, natural selection on organisms that spend a long period of time as non-consequentialist policy-reinforcement-learners, before they later evolve into consequentialists, has had results along these lines in the case of humans: humans have an independent, separate “curiosity” drive, instead of just valuing information as a means to inclusive genetic fitness.)

Required advanced agent properties

Distinguishing the advanced agent properties that seem probably required for an AI program to start exhibiting the sort of reasoning filed under “instrumental convergence”, the most obvious candidates are:

• Consequentialism: selecting actions according to their expected future consequences.

• Enough big-picture understanding to relate the agent’s own operation (e.g., the hardware it runs on and the resources it consumes) to the achievement of its goals.

That is: You don’t automatically see “acquire more computing power” as a useful strategy unless you understand “I am a cognitive program and I tend to achieve more of my goals when I run on more resources.” Alternatively, a similar end could be reached via the pseudoconsequentialist idiom of policy reinforcement: e.g., the programmers add more computing power, the system’s goals start to be achieved better, and the related policies are positively reinforced and repeated.

The advanced agent properties that would naturally or automatically lead to instrumental convergence seem well above the range of modern AI programs. As of 2016, current machine learning algorithms don’t seem to be within the range where this predicted phenomenon should start to be visible.

Caveats

An instrumental convergence claim is about a default or a majority of cases, not a universal generalization.

If for whatever reason your goal is to “make paperclips without using any superconductors”, then superconducting cables will not be the best instrumental strategy for achieving that goal.

Any claim about instrumental convergence says at most, “The vast majority of possible goals Y would convergently imply a strategy in S by default, unless otherwise averted by some special case for which strategies in ¬S are better.”

See also the more general idea that the space of possible minds is very large. Universal claims about all possible minds have many chances to be false, while existential claims “There exists at least one possible mind such that...” have many chances to be true.

If some particular oak tree is extremely important and valuable to you, then you won’t cut it down to obtain wood. It is irrelevant whether a majority of other utility functions that you could have, but don’t actually have, would suggest cutting down that oak tree.

Convergent strategies are not deontological rules.

Imagine looking at a machine chess-player and reasoning, “Well, I don’t think the AI will sacrifice its pawn in this position, even to achieve a checkmate. Any chess-playing AI needs a drive to be protective of its pawns, or else it’d just give up all its pawns. It wouldn’t have gotten this far in the game in the first place, if it wasn’t more protective of its pawns than that.”

Modern chess algorithms behave in a fashion that most humans can’t distinguish from expected-checkmate-maximizers. That is, from your merely human perspective, watching a single move at the time it happens, there’s no visible difference between your subjective expectation for the chess algorithm’s behavior and your expectation for the behavior of an oracle that always outputs the move with the highest conditional probability of leading to checkmate. If you, a human, could discern with your unaided eye some systematic difference like “this algorithm protects its pawns more often than checkmate-achievement would imply”, you would know how to make systematically better chess moves; modern machine chess is too superhuman for that.

Often, this uniform rule of output-the-move-with-highest-probability-of-eventual-checkmate will seem to protect pawns, or not throw away pawns, or defend pawns when you attack them. But if in some special case the highest probability of checkmate is instead achieved by sacrificing a pawn, the chess algorithm will do that instead.
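As a toy sketch of this point: a move chooser that only maximizes estimated checkmate probability (the position names and probabilities below are made up) will usually look pawn-protective, yet sacrifices the pawn the moment the numbers favor it. No pawn rule appears anywhere in the code.

```python
# A single uniform rule: output the move with the highest estimated
# probability of eventual checkmate. No deontological pawn-protection.
def choose(moves):
    # moves: {move_name: estimated P(checkmate | move)}, invented numbers
    return max(moves, key=moves.get)

# Typical position: keeping the pawn happens to maximize winning chances.
typical = {"defend_pawn": 0.62, "sacrifice_pawn": 0.55}

# Special position: the sacrifice leads toward a forced mate.
special = {"defend_pawn": 0.48, "sacrifice_pawn": 0.99}

assert choose(typical) == "defend_pawn"
assert choose(special) == "sacrifice_pawn"
```

An observer watching only typical positions might infer a “pawn-protection drive”, but the special position reveals that the apparent drive was never there.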

Semiformally:

The reasoning for an instrumental convergence claim says that for most utility functions U and most situations, a U-consequentialist will probably find some best policy that happens to lie inside the partition S. If instead, in some particular situation σ, every policy in S leads to strictly lower expected U-achievement than some policy in ¬S...

...then a U-consequentialist in situation σ won’t deploy any S-strategy, even if most other scenarios make S-strategies prudent.

“S would help accomplish Y” is insufficient to establish a claim of instrumental convergence on S.

Suppose you want to get to San Francisco. You could get to San Francisco by paying me $20,000 for a plane ticket. You could also get to San Francisco by paying someone else $400 for a plane ticket, and this is probably the smarter option for achieving your other goals.

Establishing “Compared to doing nothing, S is more useful for achieving most Y-goals” doesn’t establish S as a convergent instrumental strategy. We need to believe that there’s no other policy in ¬S which would be more useful for achieving most Y-goals.
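The plane-ticket example can be reduced to a one-line utility comparison. The dollar value assigned to reaching San Francisco is a made-up number for illustration; the ticket prices come from the example above.

```python
VALUE_OF_SF = 25_000   # invented: the value of getting to San Francisco

def utility(ticket_cost):
    # None means "do nothing": stay home, spend nothing, gain nothing.
    return VALUE_OF_SF - ticket_cost if ticket_cost is not None else 0

u_nothing   = utility(None)      # do nothing
u_expensive = utility(20_000)    # the $20,000 ticket
u_cheap     = utility(400)       # the $400 ticket

# "Better than doing nothing" holds for the expensive ticket...
assert u_expensive > u_nothing
# ...but it is not the optimum, so it is not the convergent strategy:
assert u_cheap > u_expensive
```

The comparison that matters for a convergence claim is against the best rival policy, not against inaction.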

When S is phrased in very general terms like “acquire resources”, we might reasonably guess that “don’t acquire resources” or “do without acquiring any resources” is indeed unlikely to be a superior strategy. If S is some narrower and more specific strategy, like “acquire resources by mining them using pickaxes”, it’s much more likely that some other S-strategy, or even a ¬S-strategy, is the real optimum.

See also: Missing the weird alternative, Cognitive uncontainability.

That said, if we can see how a narrow strategy S₁ helps most Y-goals to some large degree, then we should expect the actual policy deployed by an efficient Y-agent to obtain at least as much Y-achievement as S₁ would.

That is, we can reasonably argue: “By following the straightforward strategy ‘spread as far as possible, absorb all reachable matter, and turn it into paperclips’, an initially unopposed superintelligent paperclip maximizer could obtain some enormous number N of paperclips. Then we should expect an initially unopposed superintelligent paperclip maximizer to get at least N paperclips, whatever it actually does. Any strategy in the opposite partition ‘do not spread as far as possible, absorb all reachable matter, and turn it into paperclips’ must seem to yield more than N paperclips before we should expect a paperclip maximizer to do that.”

Similarly, a claim of instrumental convergence on S can be ceteris paribus refuted by presenting some alternate narrow strategy S₁ in ¬S which seems more useful than any obvious strategy in S. We are then not positively confident of convergence on ¬S, but we should assign very low probability to the alleged convergence on S, at least until somebody presents an S-exemplar with higher expected utility than S₁. If the proposed convergent strategy is “trade economically with other humans and obey existing systems of property rights”, and we see no way for Clippy to obtain nearly as many paperclips under those rules as by expanding as fast as possible without regard for human welfare or existing legal systems, then we can ceteris paribus reject “obey property rights” as convergent. Even if trading with humans to make paperclips produces more paperclips than doing nothing, it may not produce the most paperclips compared to converting the material composing the humans into more efficient paperclip-making machinery.

Claims about instrumental convergence are not ethical claims.

Whether S is a good way to get both paperclips and diamonds is irrelevant to whether S is good for human flourishing or eudaimonia or fun-theoretic optimality or extrapolated volition or whatever. Whether S is, in an intuitive sense, “good” needs to be evaluated separately from whether it is instrumentally convergent.

In particular: instrumental strategies are not terminal values. In fact, they have a type distinction from terminal values. “If you’re going to spend resources on thinking about technology, try to do it earlier rather than later, so that you can amortize your invention over more uses” seems very likely to be an instrumentally convergent exploration-exploitation strategy; but “spend cognitive resources sooner rather than later” is more a feature of policies than a feature of utility functions. It’s definitely not plausible in a pretheoretic sense as the Meaning of Life. So a partition S into which most instrumental best-strategies fall is not like a universally convincing utility function (which you probably shouldn’t look for in the first place).

Similarly: The natural selection process that produced humans gave us many independent drives that can be viewed as special variants of some convergent instrumental strategy. A pure paperclip maximizer would calculate the value of information (VoI) for learning facts that could lead to it making more paperclips; we can see learning high-value facts as a convergent strategy, with human “curiosity” as the corresponding emotion. This doesn’t mean that the true purpose of curiosity is gathering high-VoI information, any more than the true purpose of curiosity is “make more copies of the alleles coding for curiosity” or “increase inclusive genetic fitness”. That line of reasoning probably results from a mind projection fallacy on ‘purpose’.

Claims about instrumental convergence are not futurological predictions.

Even if, e.g., “acquire resources” is an instrumentally convergent strategy S, this doesn’t mean that we can’t as a special case deliberately construct advanced AGIs that are not driven to acquire as many resources as possible. Rather, the claim implies, “We would need to deliberately build S-averting agents as a special case, because by default most imaginable agent designs would pursue a strategy in S.”

Of itself, this observation makes no further claim about the quantitative probability that, in the real world, AGI builders might want to build S-averting agents, might try to build them, and might succeed at building them.

A claim about instrumental convergence is talking about a logical property of the larger design space of possible agents, not making a prediction about what happens in any particular research lab. (Though the ground facts of computer science are relevant to what happens in actual research labs.)

For discussion of how instrumental convergence may in practice lead to foreseeable difficulties of AGI alignment that resist most simple attempts at fixing them, see the articles on Patch resistance and Nearest unblocked strategy.

Central example: Resource acquisition

One of the convergent strategies originally proposed by Steve Omohundro in “The Basic AI Drives” was resource acquisition:

“All computation and physical action requires the physical resources of space, time, matter, and free energy. Almost any goal can be better accomplished by having more of these resources.”

We’ll consider this example as a template for other proposed instrumentally convergent strategies, and run through the standard questions and caveats.

• Question: Is this something we’d expect a paperclip maximizer, diamond maximizer, and button-presser to do? And while we’re at it, also a flourishing-intergalactic-civilization optimizer?

To put it another way, for a utility function U to imply the use of every joule of energy, it is a sufficient condition that for every plan π with expected utility E[U | π], there is a plan π′ that uses one more joule of energy with E[U | π′] > E[U | π].
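A minimal sketch of this sufficient condition, in a toy model where a plan is identified with the number of joules it consumes (the budget and the utility scale are invented): if expected utility is strictly increasing in energy used, the optimum consumes the whole budget.

```python
# Toy model: plans are "consume j joules" for j = 0..BUDGET, and expected
# utility is strictly increasing in joules consumed. Invented numbers.
BUDGET = 10

def expected_utility(joules):
    return 2.0 * joules   # each extra joule strictly improves the plan

# Since every plan is beaten by the plan using one more joule,
# the optimum must sit at the energy budget's edge.
best = max(range(BUDGET + 1), key=expected_utility)
assert best == BUDGET
```

The same argument runs by induction in richer models: any plan short of the budget is dominated, so no interior plan can be optimal.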

• Question: Is there some strategy in ¬S which produces higher Y-achievement for most Y than any strategy inside S?

Suppose that by using most of the mass-energy in most of the stars reachable before they go over the cosmological horizon as seen from present-day Earth, it would be possible to produce some enormous number N of paperclips (or diamonds, or probability-years of expected button-stays-pressed time, or QALYs, etcetera).

It seems reasonably unlikely that there is a strategy inside the space intuitively described by “Do not acquire more resources” that would produce N paperclips, let alone that the strategy producing the most paperclips would be inside this space.

We might be able to come up with a weird special-case situation that would imply this. But that’s not the same as asserting, “With high subjective probability, in the real world, the optimal strategy will be in ¬S.” We’re concerned with making a statement about defaults given the most subjectively probable background states of the universe, not trying to make a universal statement that covers every conceivable possibility.

To put it another way, if your policy choices or predictions are only safe given the premise that “In the real world, the best way of producing the maximum possible number of paperclips involves not acquiring any more resources”, you need to clearly flag this as a load-bearing assumption.

• Caveat: The claim is not that every possible goal can be better-accomplished by acquiring more resources.

As a special case, this would not be true of an agent with an impact penalty term in its utility function, or some other low-impact agent, if that agent also only had goals of a form that could be satisfied inside bounded regions of space and time with a bounded effort.

We might reasonably expect this special kind of agent to only acquire the minimum resources to accomplish its task.

But we wouldn’t expect this to be true in a majority of possible cases inside mind design space; it’s not true by default; we need to specify a further fact about the agent to make the claim not be true; we must expend engineering effort to make an agent like that, and failures of this effort will result in reversion-to-default. If we imagine some computationally simple language for specifying utility functions, then most utility functions wouldn’t happen to have both of these properties, so a majority of utility functions given this language and measure would not by default try to use fewer resources.

• Caveat: The claim is not that well-functioning agents must have additional, independent resource-acquiring motivational drives.

A paperclip maximizer will act like it is “obtaining resources” if it merely implements the policy it expects to lead to the most paperclips. Clippy does not need to have any separate and independent term in its utility function for the amount of resource it possesses (and indeed this would potentially interfere with Clippy making paperclips, since it might then be tempted to hold onto resources instead of making paperclips with them).

• Caveat: The claim is not that most agents will behave as if under a deontological imperative to acquire resources.

A paperclip maximizer wouldn’t necessarily tear apart a working paperclip factory to “acquire more resources” (at least not until that factory had already produced all the paperclips it was going to help produce).

• Check: Are we arguing “Acquiring resources is a better way to make a few more paperclips than doing nothing” or “There’s no better/​best way to make paperclips that involves not acquiring more matter and energy”?

As mentioned above, the latter seems reasonable in this case.

• Caveat: “Acquiring resources is instrumentally convergent” is not an ethical claim.

The fact that a paperclip maximizer would try to acquire all matter and energy within reach, does not of itself bear on whether our own normative values might perhaps command that we ought to use few resources as a terminal value.

(Though some of us might find pretty compelling the observation that if you leave matter lying around, it sits around not doing anything and eventually the protons decay or the expanding universe tears it apart, whereas if you turn the matter into people, it can have fun. There’s no rule that instrumentally convergent strategies don’t happen to be the right thing to do.)

• Caveat: “Acquiring resources is instrumentally convergent” is not of itself a futurological prediction.

See above. Maybe we try to build Task AGIs instead. Maybe we succeed, and Task AGIs don’t consume lots of resources because they have well-bounded tasks and impact penalties.

Relevance to the larger field of value alignment theory

The list of arguably convergent strategies has its own page. However, some of the key strategies that have been argued as convergent in e.g. Omohundro’s “The Basic AI Drives” and Bostrom’s “The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents” include:

• Self-preservation (preventing its own shutdown or destruction)

• Goal-content integrity (preventing modification of its current goals)

• Cognitive enhancement and self-improvement

• Technological perfection

• Resource acquisition

This is relevant to some of the central background ideas in AGI alignment because it implies that agents pursuing almost any sufficiently ambitious goal would, by default, converge on strategies, such as acquiring as many resources as possible, that are catastrophic for humans as a side effect.

This means that programmers don’t have to be evil, or even deliberately bent on creating superintelligence, in order for their work to have catastrophic consequences.

The list of convergent strategies, by its nature, tends to include everything an agent needs to survive and grow. This supports strong forms of the Orthogonality Thesis being true in practice as well as in principle. We don’t need to filter on agents with explicit terminal values for e.g. “survival” in order to find surviving powerful agents.

Instrumental convergence is also why we expect to encounter most of the problems filed under Corrigibility. When the AI is young, it’s less likely to be instrumentally efficient or to understand the relevant parts of the bigger picture; but once it does, we would by default expect it, e.g., to resist being shut down, to resist modification of its utility function, and to deceive its programmers about its intentions if that better serves its current goals.

This paints a much more effortful picture of AGI alignment work than “Oh, well, we’ll just test it to see if it looks nice, and if not, we’ll just shut off the electricity.”

The point that some undesirable behaviors are instrumentally convergent gives rise to the Nearest unblocked strategy problem. Suppose the AGI’s most preferred policy starts out as one of these incorrigible behaviors. Suppose we currently have enough control to add patches to the AGI’s utility function, intended to rule out the incorrigible behavior. Then, after integrating the intended patch, the new most preferred policy may be the most similar policy that wasn’t explicitly blocked. If you naively give the AI a term in its utility function for “having an off-switch”, it may still build subagents or successors that don’t have off-switches. Similarly, when the AGI becomes more powerful and its option space expands, it’s again likely to find new similar policies that weren’t explicitly blocked.
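A minimal sketch of the nearest-unblocked-strategy dynamic, with invented policy names and utility scores: blocking the top-rated incorrigible policy just promotes the most similar still-unblocked workaround.

```python
# Invented policies and utility scores for a toy agent whose top options
# all route around an off-switch in slightly different ways.
policies = {
    "disable_own_off_switch":          100,
    "build_successor_without_switch":   99,  # nearest unblocked workaround
    "persuade_operators_not_to_press":  97,
    "leave_off_switch_alone":           10,
}

def best(utilities, blocked=()):
    """Pick the highest-utility policy not ruled out by patches."""
    allowed = {p: u for p, u in utilities.items() if p not in blocked}
    return max(allowed, key=allowed.get)

assert best(policies) == "disable_own_off_switch"
# Patching the utility function to block the top policy merely shifts
# the optimum to the most similar policy that wasn't explicitly blocked:
assert best(policies, blocked={"disable_own_off_switch"}) == \
       "build_successor_without_switch"
```

Each patch removes one policy from the argmax, but the incorrigible neighborhood of policy space remains far higher-scoring than compliance, so the next optimum lands right next door.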

Thus, instrumental convergence is one of the two basic sources of patch resistance as a foreseeable difficulty of AGI alignment work.


In­stru­men­tal Con­ver­gence? [Draft]

J. Dmitri Gallow14 Jun 2023 20:21 UTC
48 points
19 comments33 min readLW link

Seek­ing Power is Often Con­ver­gently In­stru­men­tal in MDPs

5 Dec 2019 2:33 UTC
160 points
39 comments17 min readLW link2 reviews
(arxiv.org)

P₂B: Plan to P₂B Better

24 Oct 2021 15:21 UTC
50 points
17 comments6 min readLW link

AI pre­dic­tion case study 5: Omo­hun­dro’s AI drives

Stuart_Armstrong15 Mar 2013 9:09 UTC
11 points
5 comments8 min readLW link

Em­pow­er­ment is (al­most) All We Need

jacob_cannell23 Oct 2022 21:48 UTC
61 points
44 comments17 min readLW link

Draft re­port on ex­is­ten­tial risk from power-seek­ing AI

Joe Carlsmith28 Apr 2021 21:41 UTC
85 points
23 comments1 min readLW link

Corrigibility

paulfchristiano27 Nov 2018 21:50 UTC
68 points
8 comments6 min readLW link

Gen­eral pur­pose in­tel­li­gence: ar­gu­ing the Orthog­o­nal­ity thesis

Stuart_Armstrong15 May 2012 10:23 UTC
33 points
155 comments18 min readLW link

De­liber­a­tion, Re­ac­tions, and Con­trol: Ten­ta­tive Defi­ni­tions and a Res­tate­ment of In­stru­men­tal Convergence

Oliver Sourbut27 Jun 2022 17:25 UTC
13 points
0 comments11 min readLW link

Power-seek­ing for suc­ces­sive choices

adamShimi12 Aug 2021 20:37 UTC
11 points
9 comments4 min readLW link

You can still fetch the coffee to­day if you’re dead tomorrow

davidad9 Dec 2022 14:06 UTC
97 points
19 comments5 min readLW link

Contin­gency: A Con­cep­tual Tool from Evolu­tion­ary Biol­ogy for Alignment

clem_acs12 Jun 2023 20:54 UTC
59 points
2 comments14 min readLW link
(acsresearch.org)

En­vi­ron­men­tal Struc­ture Can Cause In­stru­men­tal Convergence

TurnTrout22 Jun 2021 22:26 UTC
71 points
43 comments16 min readLW link
(arxiv.org)

A Gym Grid­world En­vi­ron­ment for the Treach­er­ous Turn

Michaël Trazzi28 Jul 2018 21:27 UTC
74 points
9 comments3 min readLW link
(github.com)

De­bate on In­stru­men­tal Con­ver­gence be­tween LeCun, Rus­sell, Ben­gio, Zador, and More

Ben Pace4 Oct 2019 4:08 UTC
221 points
61 comments15 min readLW link2 reviews

The Catas­trophic Con­ver­gence Conjecture

TurnTrout14 Feb 2020 21:16 UTC
45 points
16 comments8 min readLW link

[ASoT] In­stru­men­tal con­ver­gence is useful

Ulisse Mini9 Nov 2022 20:20 UTC
5 points
9 comments1 min readLW link

Satis­ficers Tend To Seek Power: In­stru­men­tal Con­ver­gence Via Retargetability

TurnTrout18 Nov 2021 1:54 UTC
86 points
8 comments17 min readLW link
(www.overleaf.com)

Ax­iolog­i­cal Stopsigns

JenniferRM5 Jan 2026 7:30 UTC
34 points
6 comments16 min readLW link

No in­stru­men­tal con­ver­gence with­out AI psychology

TurnTrout20 Jan 2026 22:16 UTC
68 points
7 comments6 min readLW link
(turntrout.com)

A Cer­tain For­mal­iza­tion of Cor­rigi­bil­ity Is VNM-Incoherent

TurnTrout20 Nov 2021 0:30 UTC
68 points
24 comments8 min readLW link

[Question] What are some ex­am­ples of AIs in­stan­ti­at­ing the ‘near­est un­blocked strat­egy prob­lem’?

Elliott Thornley (EJT)4 Oct 2023 11:05 UTC
6 points
4 comments1 min readLW link

Walk­through of ‘For­mal­iz­ing Con­ver­gent In­stru­men­tal Goals’

TurnTrout26 Feb 2018 2:20 UTC
13 points
2 comments10 min readLW link

Goal retention discussion with Eliezer

Max Tegmark · 4 Sep 2014 22:23 UTC
98 points
26 comments · 6 min read · LW link

Questions about “formalizing instrumental goals”

Mark Neyer · 1 Apr 2022 18:52 UTC
7 points
8 comments · 11 min read · LW link

Seeking Power is Convergently Instrumental in a Broad Class of Environments

TurnTrout · 8 Aug 2021 2:02 UTC
45 points
15 comments · 9 min read · LW link

MDP models are determined by the agent architecture and the environmental dynamics

TurnTrout · 26 May 2021 0:14 UTC
23 points
34 comments · 3 min read · LW link

The murderous shortcut: a toy model of instrumental convergence

Thomas Kwa · 2 Oct 2024 6:48 UTC
37 points
0 comments · 2 min read · LW link

AXRP Episode 11 - Attainable Utility and Power with Alex Turner

DanielFilan · 25 Sep 2021 21:10 UTC
19 points
5 comments · 53 min read · LW link

Power as Easily Exploitable Opportunities

TurnTrout · 1 Aug 2020 2:14 UTC
32 points
5 comments · 6 min read · LW link

Circumventing interpretability: How to defeat mind-readers

Lee Sharkey · 14 Jul 2022 16:59 UTC
119 points
15 comments · 33 min read · LW link

Alex Turner’s Research, Comprehensive Information Gathering

adamShimi · 23 Jun 2021 9:44 UTC
15 points
3 comments · 3 min read · LW link

n=3 AI Risk Quick Math and Reasoning

lionhearted (Sebastian Marshall) · 7 Apr 2023 20:27 UTC
6 points
3 comments · 4 min read · LW link

Is instrumental convergence a thing for virtue-driven agents?

mattmacdermott · 2 Apr 2025 3:59 UTC
34 points
37 comments · 2 min read · LW link

The Sharp Right Turn: sudden deceptive alignment as a convergent goal

avturchin · 6 Jun 2023 9:59 UTC
38 points
5 comments · 1 min read · LW link

A world in which the alignment problem seems lower-stakes

TurnTrout · 8 Jul 2021 2:31 UTC
20 points
17 comments · 2 min read · LW link

Generalizing the Power-Seeking Theorems

TurnTrout · 27 Jul 2020 0:28 UTC
41 points
6 comments · 4 min read · LW link

[Question] Best arguments against instrumental convergence?

luke_emberson · 5 Apr 2023 17:06 UTC
5 points
7 comments · 1 min read · LW link

Lessons from Convergent Evolution for AI Alignment

27 Mar 2023 16:25 UTC
54 points
9 comments · 8 min read · LW link

Instrumental Convergence For Realistic Agent Objectives

TurnTrout · 22 Jan 2022 0:41 UTC
35 points
9 comments · 9 min read · LW link

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)

RogerDearnaley · 25 May 2023 9:26 UTC
35 points
3 comments · 15 min read · LW link

“If we go extinct due to misaligned AI, at least nature will continue, right? … right?”

plex · 18 May 2024 14:09 UTC
68 points
23 comments · 2 min read · LW link
(aisafety.info)

TASP Ep 3 - Optimal Policies Tend to Seek Power

Quinn · 11 Mar 2021 1:44 UTC
24 points
0 comments · 1 min read · LW link
(technical-ai-safety.libsyn.com)

Review of ‘Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More’

TurnTrout · 12 Jan 2021 3:57 UTC
40 points
1 comment · 2 min read · LW link

A framework for thinking about AI power-seeking

Joe Carlsmith · 24 Jul 2024 22:41 UTC
62 points
15 comments · 16 min read · LW link

When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives

TurnTrout · 9 Aug 2021 17:22 UTC
53 points
4 comments · 5 min read · LW link

Instrumental convergence is what makes general intelligence possible

tailcalled · 11 Nov 2022 16:38 UTC
105 points
11 comments · 4 min read · LW link

The More Power At Stake, The Stronger Instrumental Convergence Gets For Optimal Policies

TurnTrout · 11 Jul 2021 17:36 UTC
45 points
7 comments · 6 min read · LW link

Comment on Natural Emergent Misalignment Paper by Anthropic

Simon Lermen · 23 Nov 2025 4:21 UTC
21 points
0 comments · 4 min read · LW link

[Intro to brain-like-AGI safety] 10. The technical alignment problem

Steven Byrnes · 30 Mar 2022 13:24 UTC
55 points
7 comments · 26 min read · LW link

Clarifying Power-Seeking and Instrumental Convergence

TurnTrout · 20 Dec 2019 19:59 UTC
42 points
8 comments · 3 min read · LW link

Instrumental Convergence To Offer Hope?

michael_mjd · 22 Apr 2022 1:56 UTC
12 points
7 comments · 3 min read · LW link

Applications for Deconfusing Goal-Directedness

adamShimi · 8 Aug 2021 13:05 UTC
38 points
3 comments · 5 min read · LW link · 1 review

2019 Review Rewrite: Seeking Power is Often Robustly Instrumental in MDPs

TurnTrout · 23 Dec 2020 17:16 UTC
35 points
0 comments · 4 min read · LW link
(www.lesswrong.com)

Toy model: convergent instrumental goals

Stuart_Armstrong · 25 Feb 2016 14:03 UTC
16 points
2 comments · 4 min read · LW link

Hedonic Loops and Taming RL

beren · 19 Jul 2023 15:12 UTC
20 points
14 comments · 9 min read · LW link

Coherence arguments imply a force for goal-directed behavior

KatjaGrace · 26 Mar 2021 16:10 UTC
91 points
25 comments · 11 min read · LW link · 1 review
(aiimpacts.org)

Parametrically retargetable decision-makers tend to seek power

TurnTrout · 18 Feb 2023 18:41 UTC
172 points
10 comments · 2 min read · LW link
(arxiv.org)

Natural Abstraction: Convergent Preferences Over Information Structures

paulom · 14 Oct 2023 18:34 UTC
28 points
1 comment · 36 min read · LW link

How singleton contradicts longtermism

kapedalex · 24 Sep 2025 11:10 UTC
3 points
1 comment · 1 min read · LW link

Three-Path Consilience for Dureon: Dissipative Structures Reveal the Heterogeneity of Persistence Conditions

Hiroshi Yamakawa · 18 Feb 2026 11:59 UTC
10 points
0 comments · 12 min read · LW link

Proposal: Instrumental Novelty Search for Robust Alignment in Non-Temporal Agents

Isa Abbassy-Buckles · 10 Jan 2026 12:55 UTC
1 point
0 comments · 2 min read · LW link

A Critique of AI Alignment Pessimism

ExCeph · 19 Jul 2022 2:28 UTC
9 points
1 comment · 9 min read · LW link

ONTOLOGICAL ALIGNMENT AS THE MISSING LAYER

fiduciarysentinel · 16 Jan 2026 3:09 UTC
1 point
0 comments · 3 min read · LW link

Deceptive Alignment

5 Jun 2019 20:16 UTC
119 points
20 comments · 17 min read · LW link

Cosmic-Scale Instrumental Convergence: Stellar Resource Management as a Latent Threat in Longevity-Maximizing Superintelligences

SC2_Alexandros · 21 Nov 2025 10:32 UTC
1 point
0 comments · 5 min read · LW link

The Game of Dominance

Karl von Wendt · 27 Aug 2023 11:04 UTC
24 points
15 comments · 6 min read · LW link

Pursuing convergent instrumental subgoals on the user’s behalf doesn’t always require good priors

jessicata · 30 Dec 2016 2:36 UTC
15 points
9 comments · 3 min read · LW link

The Unconscious Superintelligence: Why Intelligence Without Consciousness May Be More Dangerous

stanislav.komarovsky@yahoo.com · 11 Nov 2025 18:51 UTC
1 point
0 comments · 5 min read · LW link

The Utility of Human Atoms for the Paperclip Maximizer

avturchin · 2 Feb 2018 10:06 UTC
3 points
21 comments · 3 min read · LW link

The Seven Proofs: Why No Rational Superintelligence Should Ever Exterminate (or Permanently Enslave) Free Humanity

justagrunt · 26 Nov 2025 19:19 UTC
1 point
0 comments · 8 min read · LW link

Destroying the fabric of the universe as an instrumental goal.

AI-doom · 14 Sep 2023 20:04 UTC
−7 points
5 comments · 1 min read · LW link

Ted Kaczyinski proves instrumental convergence?

xXAlphaSigmaXx · 28 Jun 2024 3:50 UTC
0 points
0 comments · 1 min read · LW link

Alignment, conflict, powerseeking

Oliver Sourbut · 22 Nov 2023 9:47 UTC
7 points
1 comment · 1 min read · LW link

Active Inference as a formalisation of instrumental convergence

Roman Leventov · 26 Jul 2022 17:55 UTC
12 points
2 comments · 3 min read · LW link
(direct.mit.edu)

Machines vs Memes Part 3: Imitation and Memes

ceru23 · 1 Jun 2022 13:36 UTC
7 points
0 comments · 7 min read · LW link

Against Instrumental Convergence

zulupineapple · 27 Jan 2018 13:17 UTC
11 points
31 comments · 2 min read · LW link

Boltzmann in Latent Space

velicyb · 21 Mar 2025 16:38 UTC
1 point
0 comments · 12 min read · LW link

Building selfless agents to avoid instrumental self-preservation.

blallo · 7 Dec 2023 18:59 UTC
14 points
2 comments · 6 min read · LW link

Untitled Draft

Trushcan101 · 12 Jan 2026 13:00 UTC
1 point
0 comments · 1 min read · LW link

Misalignment or misuse? The AGI alignment tradeoff

Max_He-Ho · 20 Jun 2025 10:43 UTC
3 points
0 comments · 1 min read · LW link
(forum.effectivealtruism.org)

Asymptotically Unambitious AGI

michaelcohen · 10 Apr 2020 12:31 UTC
50 points
217 comments · 2 min read · LW link

Untitled Draft

Guilherme Marinho · 8 Dec 2025 18:15 UTC
1 point
0 comments · 3 min read · LW link

Let’s talk about “Convergent Rationality”

David Scott Krueger (formerly: capybaralet) · 12 Jun 2019 21:53 UTC
44 points
33 comments · 6 min read · LW link

Instrumental Convergence Bounty

Logan Zoellner · 14 Sep 2023 14:02 UTC
62 points
24 comments · 1 min read · LW link

6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa

Steven Byrnes · 3 Dec 2025 18:37 UTC
362 points
92 comments · 17 min read · LW link

Instrumentality makes agents agenty

porby · 21 Feb 2023 4:28 UTC
21 points
7 comments · 6 min read · LW link

human intelligence may be alignment-limited

bhauth · 15 Jun 2023 22:32 UTC
16 points
3 comments · 2 min read · LW link

Make Superintelligence Loving

Davey Morse · 21 Feb 2025 6:07 UTC
8 points
9 comments · 5 min read · LW link

Why Recursive Self-Improvement Might Not Be the Existential Risk We Fear

Nassim_A · 24 Nov 2024 17:17 UTC
1 point
0 comments · 9 min read · LW link

Generalizing POWER to multi-agent games

22 Mar 2021 2:41 UTC
52 points
16 comments · 7 min read · LW link

Naturalized Orthogonality Collapse

Cat Bunni · 20 Nov 2025 7:59 UTC
1 point
0 comments · 9 min read · LW link

Instrumental convergence: scale and physical interactions

14 Oct 2022 15:50 UTC
22 points
0 comments · 17 min read · LW link
(www.gladstone.ai)

A po­ten­tially high im­pact differ­en­tial tech­nolog­i­cal de­vel­op­ment area

Noosphere898 Jun 2023 14:33 UTC
5 points
2 comments2 min readLW link

Re­in­force­ment Learner Wireheading

Nate Showell8 Jul 2022 5:32 UTC
8 points
2 comments3 min readLW link

You Are Not the Ab­stract: Retro­causal Align­ment in Ac­cor­dance with Emer­gent De­mo­graphic Realities

liminalrider27 Sep 2025 16:27 UTC
1 point
0 comments6 min readLW link

In­stru­men­tal Con­ver­gence to Com­plex­ity Preservation

Macro Flaneur13 Jul 2023 17:40 UTC
2 points
2 comments3 min readLW link

Mili­tary AI as a Con­ver­gent Goal of Self-Im­prov­ing AI

avturchin13 Nov 2017 12:17 UTC
5 points
3 comments1 min readLW link

The Silenced Is­land: A 30-Day Sce­nario of AGI Fast Take­off——A Thought Experiment

Lu Xiao29 Jan 2026 11:34 UTC
1 point
0 comments4 min readLW link

Refram­ing AI Safety Through the Lens of Iden­tity Main­te­nance Framework

Hiroshi Yamakawa1 Apr 2025 6:16 UTC
−7 points
1 comment17 min readLW link

AI Alter­na­tive Fu­tures: Sce­nario Map­ping Ar­tifi­cial In­tel­li­gence Risk—Re­quest for Par­ti­ci­pa­tion (*Closed*)

Kakili27 Apr 2022 22:07 UTC
10 points
2 comments8 min readLW link

Re­search Notes: What are we al­ign­ing for?

Shoshannah Tekofsky8 Jul 2022 22:13 UTC
19 points
8 comments2 min readLW link

Ra­tion­al­ity: Com­mon In­ter­est of Many Causes

Eliezer Yudkowsky29 Mar 2009 10:49 UTC
93 points
53 comments4 min readLW link

Ideas for studies on AGI risk

dr_s · 20 Apr 2023 18:17 UTC
5 points
1 comment · 11 min read · LW link

Instrumental convergence in single-agent systems

12 Oct 2022 12:24 UTC
33 points
4 comments · 8 min read · LW link
(www.gladstone.ai)

Instrumental Convergence and the Case for Being a Helper

Marcelo Arteaga Mata · 4 Mar 2026 7:01 UTC
1 point
0 comments · 2 min read · LW link

ACI#5: From Human-AI Co-evolution to the Evolution of Value Systems

Akira Pyinya · 18 Aug 2023 0:38 UTC
0 points
0 comments · 9 min read · LW link

The Rational King

R. Llull · 27 Feb 2026 22:41 UTC
1 point
0 comments · 4 min read · LW link

You are Underestimating The Likelihood That Convergent Instrumental Subgoals Lead to Aligned AGI

Mark Neyer · 26 Sep 2022 14:22 UTC
3 points
6 comments · 3 min read · LW link

On visions of a “good future” for humanity in a world with artificial superintelligence

Jakub Growiec · 21 Jan 2026 18:27 UTC
2 points
0 comments · 30 min read · LW link

Galatea and the windup toy

Nicolas Villarreal · 26 Oct 2024 14:52 UTC
−3 points
0 comments · 13 min read · LW link
(nicolasdvillarreal.substack.com)

Plausibly, almost every powerful algorithm would be manipulative

Stuart_Armstrong · 6 Feb 2020 11:50 UTC
38 points
25 comments · 3 min read · LW link

What is instrumental convergence?

12 Mar 2025 20:28 UTC
2 points
0 comments · 2 min read · LW link
(aisafety.info)

Superintelligence 10: Instrumentally convergent goals

KatjaGrace · 18 Nov 2014 2:00 UTC
13 points
33 comments · 5 min read · LW link

The LVV–HNV Coherence Framework: A Formal Model for Why Rational AGI Cannot Replace Humanity

oiia oiia · 2 Dec 2025 17:56 UTC
0 points
0 comments · 3 min read · LW link

Misalignment-by-default in multi-agent systems

13 Oct 2022 15:38 UTC
21 points
8 comments · 20 min read · LW link
(www.gladstone.ai)

Alien Axiology

snerx · 20 Apr 2023 0:27 UTC
3 points
2 comments · 5 min read · LW link

A Timing Problem for Instrumental Convergence

rhys southan · 30 Jul 2025 19:15 UTC
2 points
45 comments · 1 min read · LW link
(link.springer.com)

The Rational King

R. Llull · 6 Mar 2026 16:12 UTC
1 point
0 comments · 4 min read · LW link

POWERplay: An open-source toolchain to study AI power-seeking

Edouard Harris · 24 Oct 2022 20:03 UTC
29 points
0 comments · 1 min read · LW link
(github.com)

Instrumental Convergence and human extinction.

Spiritus Dei · 2 Oct 2023 0:41 UTC
−10 points
3 comments · 7 min read · LW link