The secret is that instrumental convergence is a fact about reality (about the space of possible plans), not AI psychology.
Zack M. Davis, group discussion
Such arguments flitter around the AI safety space. While these arguments contain some truth, they attempt to escape “AI psychology” but necessarily fail. To predict bad outcomes from AI, one must take a stance on how AI will tend to select plans.
This topic is a specialty of mine. Where does instrumental convergence come from? Since I did my alignment PhD on exactly this question, I’m well-suited to explain the situation.
In this article, I do not argue that building transformative AI is safe or that transformative AIs won’t tend to select dangerous plans. I simply argue against the claim that “instrumental convergence arises from reality / plan-space [1] itself, independently of AI psychology.”
This post is best read on my website, but I’ve reproduced it here as well.
Two kinds of convergence
Working definition: When I say “AI psychology”, I mean to include anything which affects how the AI computes which action to take next. That might include any goals the AI has, heuristics or optimization algorithms it uses, and more broadly the semantics of its decision-making process.
Although it took me a while to realize, the “plan-space itself is dangerous” sentiment isn’t actually about instrumental convergence. The sentiment concerns a related but distinct concept.
- Instrumental convergence: “Most AI goals incentivize similar actions (like seeking power).” Bostrom gives the classic definition.
- Success-conditioned convergence: “Conditional on achieving a ‘hard’ goal (like a major scientific advance), most goal-achieving plans involve the AI behaving dangerously.” I’m coining this term to distinguish it from instrumental convergence.
Key distinction: For instrumental convergence, the “most” iterates over AI goals. For success-conditioned convergence, the “most” iterates over plans.
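To make the two quantifiers explicit, here is one informal way to write them down (notation invented for this post: $f$ is the AI’s plan-generating function, $\mathcal{D}_{\text{goals}}$ a distribution over goals, and $\mathcal{D}_{\text{plans}}$ a distribution over plans):

$$\textbf{Instrumental convergence:} \quad \Pr_{g \sim \mathcal{D}_{\text{goals}}}\big[f(g) \text{ seeks power}\big] \text{ is high.}$$

$$\textbf{Success-conditioned convergence:} \quad \Pr_{p \sim \mathcal{D}_{\text{plans}}}\big[p \text{ is dangerous} \mid p \text{ achieves the goal}\big] \text{ is high.}$$

Each statement already quantifies relative to a distribution, and the first additionally references a plan-generating function $f$.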
Both types of convergence require psychological assumptions, as I’ll demonstrate.
Tracing back the “dangerous plan-space” claim
In 2023, Rob Bensinger gave a more detailed presentation of Zack’s claim.
The basic reasons I expect AGI ruin by Rob Bensinger
If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language), and all we knew about the plan is that executing it would successfully achieve some superhumanly ambitious technological goal like “invent fast-running whole-brain emulation (WBE)”, then hitting a button to execute the plan would kill all humans, with very high probability. [...]
The danger is in the cognitive work, not in some complicated or emergent feature of the “agent”; it’s in the task itself.
It isn’t that the abstract space of plans was built by evil human-hating minds; it’s that the instrumental convergence thesis holds for the plans themselves. In full generality, plans that succeed in goals like “build WBE” tend to be dangerous.
This isn’t true of all plans that successfully push our world into a specific (sufficiently-hard-to-reach) physical state, but it’s true of the vast majority of them.
What reality actually determines
The “plan-space is dangerous” argument contains an important filament of truth.
Reality determines possible results
“Reality” meets the AI in the form of the environment. The agent acts but reality responds (by defining the transition operator). Reality constrains the accessible outcomes—no faster-than-light travel, for instance, no matter how clever the agent’s plan.
Imagine I’m in the middle of a long hallway. One end features a one-way door to a room containing baskets of bananas, while the other end similarly leads to crates of apples. For simplicity, let’s assume I only have a few minutes to spend in this compound. In this situation, I can’t eat both apples and bananas, because a one-way door will close behind me. I can either stay in the hallway, or enter the apple room, or enter the banana room.
Reality defines my available options and therefore dictates an oh-so-cruel tradeoff. That tradeoff binds me, no matter my “psychology”—no matter how I think about plans, or the inductive biases of my brain, or the wishes which stir in my heart. No plan will lead to the result of “Alex eats both a banana and an apple within the next minute.” Reality imposes the world upon the planner, while the planner enacts its plan to steer reality.
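Here’s a minimal sketch of the hallway as a toy environment (states, actions, and the horizon are invented for illustration): whatever the planner’s psychology, the transition operator never offers a path to a state where I’ve eaten both fruits.

```python
# Toy deterministic environment for the hallway example. The transition
# operator is the "reality" part: it fixes which outcomes are reachable,
# no matter how the planner chooses actions.
TRANSITIONS = {
    "hallway":     {"stay": "hallway", "go_left": "banana_room", "go_right": "apple_room"},
    "banana_room": {"eat": "ate_banana"},  # one-way door: no route back to the hallway
    "apple_room":  {"eat": "ate_apple"},
}

def reachable(state, horizon):
    """Return every state reachable from `state` within `horizon` steps."""
    frontier, seen = {state}, {state}
    for _ in range(horizon):
        frontier = {TRANSITIONS[s][a] for s in frontier for a in TRANSITIONS.get(s, {})}
        seen |= frontier
    return seen

print(reachable("hallway", horizon=3))
# No reachable state corresponds to "ate both a banana and an apple":
# reality rules that outcome out for every possible plan.
```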
Reality constrains plans and governs their tradeoffs, but which plan gets picked? That question is a matter of AI psychology.
Reality determines the alignment tax, not the convergence
To predict dangerous behavior from an AI, you need to assume some plan-generating function $f$ which chooses from $\mathcal{P}$ (the set of possible plans). [2] When thinkers argue that danger lurks “in the task itself”, they implicitly assert that $f$ is of the form

$$f_{\text{success}}(\text{goal}) := \arg\max_{p \in \mathcal{P}} \Pr(\text{goal achieved} \mid p).$$

In a reality where safe plans are hard to find, are more complicated, or have a lower success probability, $f_{\text{success}}$ may indeed produce dangerous plans. But this is not solely a fact about $\mathcal{P}$—it’s a fact about how $f$ interacts with $\mathcal{P}$ and the tradeoffs those plans imply.

Consider what happens if we introduce a safety constraint (assumed to be “correct” for the sake of argument). The constrained plan-generating function

$$f_{\text{safe}}(\text{goal}) := \arg\max_{p \in \mathcal{P}_{\text{safe}}} \Pr(\text{goal achieved} \mid p)$$

will not produce dangerous plans. Rather, it will succeed with a lower probability. The alignment tax is the difference in success probability between a pure success maximizer ($f_{\text{success}}$) and $f_{\text{safe}}$.
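Here’s a quick sketch of that comparison (the plans and numbers are invented; the “safe” labels stand in for the assumed-correct constraint):

```python
# Illustrative only: a tiny "set of possible plans", each with a success
# probability (a fact about reality) and a safety label (relative to the
# assumed-correct safety criterion).
PLANS = [
    {"name": "seize compute, deceive overseers",       "p_success": 0.90, "safe": False},
    {"name": "negotiate for resources, stay overseen", "p_success": 0.55, "safe": True},
    {"name": "do nothing",                              "p_success": 0.01, "safe": True},
]

def f_success(plans):
    """Pure success maximizer: picks the highest-success plan, safe or not."""
    return max(plans, key=lambda p: p["p_success"])

def f_safe(plans):
    """Constrained plan-generating function: maximizes success among safe plans."""
    return max((p for p in plans if p["safe"]), key=lambda p: p["p_success"])

tax = f_success(PLANS)["p_success"] - f_safe(PLANS)["p_success"]
print(f_success(PLANS)["name"], "|", f_safe(PLANS)["name"], "| tax =", round(tax, 2))
# With these invented numbers, f_safe pays an alignment tax of 0.35 in success probability.
```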
To say the alignment tax is “high” is a claim about reality. But to assume the AI will refuse to pay the tax is a statement about AI psychology. [3]
Consider the extremes.
Maximum alignment tax
If there’s no aligned way to succeed, then no matter the psychology, the danger is in trying to succeed at all. “Torturing everyone forever” seems like one such task. In this case (which neither Bensinger nor Davis claims holds), the danger truly “is in the task.”
Zero alignment tax
If safe plans are easy to find, then danger purely comes from the “AI psychology” (via the plan-generating function).
In-between
Reality dictates the alignment tax, which dictates the tradeoffs available to the agent. However, the agent’s psychology dictates how it makes those tradeoffs: whether (and how) it would sacrifice safety for success; whether the AI is willing to lie; how to generate possible plans; which kinds of plans to consider next; and so on. Thus, both reality and psychology produce the final output.
I am not being pedantic. Gemini Pro 3.0 and MechaHitler implement different plan-generating functions $f$. Those differences govern how each system navigates the tradeoffs imposed by reality. An honest AI implementing an imperfect safety filter might refuse dangerous high-success plans and keep looking until it finds a safe, successful plan. MechaHitler seems less likely to do so.
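For a sketch of how two different $f$’s navigate the same tradeoff, here’s a hypothetical generate-and-filter loop next to a pure maximizer (none of this is a claim about how either deployed system actually works):

```python
import random

def propose_plan(rng):
    """Hypothetical plan generator: returns (success_prob, is_actually_safe)."""
    p = rng.random()
    return p, (p < 0.6 or rng.random() < 0.2)  # high-success plans are rarely safe here

def greedy_planner(rng, n=50):
    """Returns the highest-success proposal, ignoring safety entirely."""
    return max((propose_plan(rng) for _ in range(n)), key=lambda plan: plan[0])

def filtered_planner(rng, n=50, filter_accuracy=0.9):
    """Refuses plans an imperfect safety filter flags, keeps searching,
    and returns the best unflagged plan found."""
    kept = []
    for _ in range(n):
        p, safe = propose_plan(rng)
        flagged = (not safe) if rng.random() < filter_accuracy else safe
        if not flagged:
            kept.append((p, safe))
    return max(kept, default=(0.0, True), key=lambda plan: plan[0])

rng = random.Random(0)
print("greedy:  ", greedy_planner(rng))    # typically high success, often unsafe
print("filtered:", filtered_planner(rng))  # somewhat lower success, usually safe
```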
Why both convergence types require psychology
I’ve shown that reality determines the alignment tax but not which plans get selected. Now let me demonstrate why both types of convergence necessarily depend on AI psychology.
Instrumental convergence depends on psychology
Instrumental convergence depends on AI psychology, as demonstrated by my paper Parametrically Retargetable Decision-Makers Tend to Seek Power. In short, AI psychology governs the mapping from “AI motivations” to “AI plans”. Certain psychologies induce mappings which satisfy my theorems’ conditions, and those conditions suffice to prove instrumental convergence.
More precisely, instrumental convergence arises from statistical tendencies in a plan-generating function $f$ (“what the AI does given a ‘goal’”) relative to its inputs (“goals”). The convergence follows from assumptions about that function’s semantics and those inputs. These assumptions can be satisfied by:
- satisficing over utility functions over the state of the world, or perhaps
- some kind of more realistic & less crisp decision-making.
Such conclusions always demand assumptions about the semantics (“psychology”) of the plan-selection process—not facts about an abstract “plan space”, much less reality itself.
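Here’s a toy numerical illustration of the flavor of those results (a drastic simplification of the theorems, with invented outcomes and a uniform goal distribution):

```python
import random

# One action ("stay") keeps a single outcome reachable; the other ("go") keeps
# five outcomes reachable. A "goal" is a random utility over the six outcomes.
# Retargetability-flavored claim: the more outcomes an option keeps reachable,
# the larger the fraction of goals for which a goal-optimizer picks it.
OUTCOMES = {"stay": ["o0"], "go": ["o1", "o2", "o3", "o4", "o5"]}

def optimal_action(utility):
    """The action whose best reachable outcome has the highest utility."""
    return max(OUTCOMES, key=lambda a: max(utility[o] for o in OUTCOMES[a]))

rng = random.Random(0)
trials = 10_000
go_count = sum(
    optimal_action({o: rng.random() for outs in OUTCOMES.values() for o in outs}) == "go"
    for _ in range(trials)
)
print(f"fraction of random goals preferring the 5-outcome option: {go_count / trials:.2f}")
# About 5/6 of goals favor "go" -- but only because we assumed the agent
# optimizes whichever goal it was handed. Change that assumption and the
# statistics change.
```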
Success-conditioned convergence depends on psychology
Success-conditioned convergence feels free of AI psychology—we’re only assuming the completion of a goal, and we want our real AIs to complete goals for us. However, this intuition is incorrect.
Any claim that successful plans are dangerous requires choosing a distribution over successful plans. Bensinger proposes a length-weighted distribution, but this is still a psychological assumption about how AIs generate and select plans. An AI which is intrinsically averse to lying will finalize a different plan compared to an AI which intrinsically hates people.
Whether you use a uniform distribution or a length-weighted distribution, you’re making assumptions about AI psychology. Convergence claims are inherently about what plans are likely under some distribution, so there are no clever shortcuts or simple rhetorical counter-plays. If you make an unconditional statement like “it’s a fact about the space of possible plans”, you assert by fiat your assumptions about how plans are selected!
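Here’s a sketch of that dependence (the plan set, lengths, and priors below are invented): the same pool of successful plans looks mostly dangerous under one prior and mostly safe under another.

```python
# Invented plans: (description, length, succeeds?, dangerous?).
PLANS = [
    ("grab resources by force",             20, True,  True),
    ("deceive overseers, then build",       25, True,  True),
    ("collaborate openly with labs",        60, True,  False),
    ("defer to human oversight throughout", 80, True,  False),
    ("do nothing",                            5, False, False),
]

def p_dangerous_given_success(weight):
    """P(dangerous | plan succeeds) under a given prior weight over plans."""
    successful = [p for p in PLANS if p[2]]
    total = sum(weight(p) for p in successful)
    return sum(weight(p) for p in successful if p[3]) / total

uniform         = lambda p: 1.0
length_weighted = lambda p: 2.0 ** (-p[1] / 10)    # shorter plans get much more mass
safety_biased   = lambda p: 0.01 if p[3] else 1.0  # e.g., an AI intrinsically averse to harm

for name, w in [("uniform", uniform), ("length-weighted", length_weighted),
                ("safety-biased", safety_biased)]:
    print(f"{name:16s} P(dangerous | success) = {p_dangerous_given_success(w):.2f}")
# Same plans, different priors, different "convergence" verdicts: the verdict
# tracks the (assumed) distribution, i.e., the psychology.
```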
Reconsidering the original claims
The secret is that instrumental convergence is a fact about reality (about the space of possible plans), not AI psychology.
Zack M. Davis, group discussion
The basic reasons I expect AGI ruin
If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language), and all we knew about the plan is that executing it would successfully achieve some superhumanly ambitious technological goal like “invent fast-running whole-brain emulation”, then hitting a button to execute the plan would kill all humans, with very high probability. [...]
The danger is in the cognitive work, not in some complicated or emergent feature of the “agent”; it’s in the task itself.
It isn’t that the abstract space of plans was built by evil human-hating minds; it’s that the instrumental convergence thesis holds for the plans themselves.
Two key problems with this argument:
- Terminology confusion: The argument does not discuss “instrumental convergence”. Instead, it discusses (what I call) “success-conditioned convergence.” (This distinction was subtle to me as well.)
- Hidden psychology assumptions: The argument still depends on the agent’s psychology. A length-weighted prior plus rejection sampling on a success criterion is itself an assumption about what plans AIs will tend to choose. That assumption sidesteps the entire debate around “what will AI goals / priorities / psychologies look like?” Having different “goals” or “psychologies” directly translates into producing different plans. Neither type of convergence stands independently of psychology.
Perhaps a different, weaker claim still holds, though:
A valid conjecture someone might make: The default psychology you get from optimizing hard for success will induce plan-generating functions which select dangerous plans, in large part due to the high density of unsafe plans.
Conclusion
Reality determines the alignment tax of safe plans. However, instrumental convergence requires assumptions about both the distribution of AI goals and how those goals transmute into plan-generating functions. Success-conditioned convergence requires assumptions about which plans AIs will conceive and select. Both sets of assumptions involve AI psychology.
Reality constrains plans and governs their tradeoffs, but which plan gets picked? That question is always a matter of AI psychology.
Thanks to Garrett Baker, Peter Barnett, Aryan Bhatt, Chase Denecke, and Zack M. Davis for giving feedback.
[1] I prefer to refer to “the set of possible plans”, as “plan-space” evokes the structured properties of vector spaces.
[2] $\mathcal{P}$ is itself ill-defined, but I’ll skip over that for this article because it’d be a lot of extra words for little extra insight.
[3] To argue “success maximizers are more profitable and more likely to be deployed” is an argument about economic competition, which itself is an argument about the tradeoff between safety and success, which in turn requires reasoning about AI psychology.
Maybe “psychology” is just the wrong word to use here, because I think it conjures up ideas of anthropomorphism, when in fact I read you as simply making an argument that the processes interior to an AI system matter as to whether and how an AI might try to instrumentally converge towards some goals.
(I happen to think your overall point is right, because goals don’t exist except in the service of some purpose (in the cybernetic sense), and so we have to know something about the purpose of a system, in this case an AI, to know how or if it will be useful to converge on something like power seeking. By comparison, I don’t worry that rocks will try to grab power because rocks don’t have purposes that benefit from having power (and more likely don’t have purposes at all!).)
I agree. I welcome suggestions for alternate titles, if anyone has any! I tried myself but didn’t find anything immediately. “No instrumental convergence without considering how the AI will make decisions” isn’t exactly the snappiest title.
EDIT: I actually think “psychology” is pretty good here, despite some flaws.
It seems like the notion of “psychology” that you’re invoking here isn’t really about “how the AI will make decisions.” On my read, you’re defining “psychology” as “the prior over policies.” This bakes in things like “hard constraints that a policy never takes an unsafe action (according to a perfect oracle)” by placing 0 probability on such policies in the prior. This notion of “psychology” isn’t directly about internal computations or decision making. (Though, of course, some priors—e.g. the circuit depth prior on transformers—are most easily described in terms of internal computations.)
Of course: whether a particular AI kills humanity [if we condition on that AI somehow doing stuff resulting in there being a mind upload device [1] ] depends (at least in principle) on what sort of AI it is. Similarly, of course: if we have some AI-generating process (such as “have such and such labs race to create some sort of AGI”), then whether [conditioning that process on a mind upload device being created by an AGI makes p(humans get killed) high] depends (at least in principle) on what sort of AI-generating process it is.
Still, when trying to figure out what probabilities to assign to these sorts of claims for particular AIs or particular AI-generating processes, it can imo be very informative to (among other things) think about whether most programs one could run such that mind upload devices exist 1 month after running them are such that running them kills humanity.
In fact, despite the observation that the AI/[AI-generating process] design matters in principle, it is still even a priori plausible that “if you take a uniformly random python program of length $10^6$ such that running it leads to a mind upload device existing, running it is extremely likely to lead to humans being killed” is basically a correct zeroth-order explanation for why if a particular AI creates a mind upload device, humans die. (Whether it is in fact a correct zeroth-order explanation for AI stuff going poorly for humanity is a complicated question, and I don’t feel like I have a strong yes/no position on this [2], but I don’t think your piece really addresses this question well.) To give an example where this sort of thing works out: even when you’re a particular guy closing a particular kind of sliding opening between two gas containers, “only extremely few configurations of gas particles have >55% of the particles on one side” is basically a solid zeroth-order explanation for why you in particular will fail to close that particular opening with >55% of the particles on one side, even though in principle you could have installed some devices which track gas particles and move the opening up and down extremely rapidly while “closing” it so as to prevent passage in one direction but not the other and closed it with >55% of gas particles on one side.
That said, I think it is also a priori plausible that the AI case is not analogous to this example — i.e., it is a priori plausible that in the AI case, “most programs leading to mind uploads existing kill humanity” is not a correct zeroth-order explanation for why the particular attempts to have an AI create mind uploads we might get would go poorly for humanity. My point is that establishing this calls for better arguments than “it’s at least in principle possible for an AI/[AI-generating process] to have more probability mass on mind-upload-creating plans which do not kill humanity”.
Like, imo, “most programs which make a mind upload device also kill humanity” is (if true) an interesting and somewhat compelling first claim to make in a discussion of AI risk, to which the claim “but one can at least in principle have a distribution on programs such that most programs which make mind uploads do not also kill humans” alone is not a comparably interesting or compelling response.
or if we prompt it to create a mind upload device
I do have various thoughts on this but presenting those seems outside the scope.
I disagree somewhat, but—whatever the facts about programs—at least it is not appropriate to claim “not only do most programs which make a mind upload device also kill humanity, it’s an issue with the space of programs themselves, not with the way we generate distributions over those programs.” That is not true.
It is at least not true “in principle” and perhaps it is not true for more substantial reasons (depending on the task you want and its alignment tax, psychology becomes more or less important in explaining the difficulty, as I gave examples for). On this, we perhaps agree?
Hmm, I think that yes, us probably being killed by a program that makes a mind upload device is (if true) an issue with the way we generated a distribution over those programs. But also, it might be fine to say it’s an issue with the space of programs (with an implicit uniform prior on programs up to some length or an implicit length prior) itself.
Like, in the example of two equal gas containers connected by a currently open sliding door, it is fair/correct to say, at least as a first explanation: “it’s an issue with the space of gas particle configurations itself that you won’t be able to close the door with >55% of the particles on the left side”. This is despite the fact that one could in principle be sliding the door in a very precise way so as to leave >55% of the particles on the left side (like, one could in principle be drawing the post-closing microstate from some much better distribution than the naive uniform prior over usual microstates). My claim is that the discussion so far leaves open whether the AI mind upload thing is analogous to this example.
I’m open to [the claim about program-space itself being not human-friendly] not turning out to be a good/correct zeroth-order explanation for why a practical mind-upload-device-making AI would kill humanity (even if the program-space claim is true and the practical claim is true). I just don’t think the discussion above this comment so far provides good arguments on this question in either direction.
Is there an isomorphism between the space of possible plans and agents? If such an isomorphism exists, then alignment is the slicing up of agent space with a view of constraining plan space. In this picture, the weaker claim on default behaviour comes from random navigation of the agent (and plan) space.
This likely means both formulations are valid.